The quantum computing industry is shifting from single, monolithic quantum processors (QPUs) to distributed systems in which multiple QPUs are connected by quantum networks. This trend, visible across superconducting qubits [1], trapped ions [2], and neutral atoms [3], requires specialized software for orchestration and for the co-design of quantum-link hardware and architecture. It calls for distributed compilers and quantum network orchestrators that dynamically optimize workload partitioning across QPUs, minimizing expensive inter-QPU remote gates and noisy swap operations. It also demands hardware-aware circuit simulators that accurately model the unique noise profiles of large-scale, entanglement-based remote operations, in effect creating a “digital twin” of a distributed QPU. Moving from a single-instance approach to a networked QPU cluster therefore necessitates a comprehensive simulation framework that integrates hardware-aware noise models, primitives for remote operations, network-aware circuit compilation, and a robust circuit simulation backend.
This post introduces memQ’s full-stack simulation framework for entanglement-based distributed quantum computing. The framework builds on the NVIDIA CUDA-Q simulation backend [4,5], adding domain-specific layers for distributed quantum computing and quantum networking. A network-aware compiler intelligently partitions a circuit across distributed processors, minimizing the number of entanglement-based remote operations. CUDA-Q’s GPU-accelerated simulation backends enable benchmarking of partitioned circuits at qubit counts beyond the reach of standard CPU-based statevector simulation. This capability is key to identifying optimal network architectures, hardware specifications, and algorithms for distributed processors. Finally, we simulate the distributed circuit using CUDA-Q’s noisy trajectory simulation technique. This simulation implements primitives for remote gates and state teleportation, alongside hardware-aware noise models, enabling us to evaluate the distributed algorithm’s performance as a function of the network’s hardware properties and architectural topology.
Our distributed quantum compiler (DQC) extends traditional quantum compilation by adding a network-aware partitioning layer on top of standard intra-QPU optimization passes. While conventional compilers focus on minimizing circuit depth, gate count, and local complexity on a single device, the memQ DQC optimizes and converts an input circuit for distributed execution across an entire quantum network.
The DQC first divides the circuit into sequential layers. At each layer, it analyzes the circuit structure and constructs an interaction graph that captures the frequency and pattern of qubit interactions. This graph is then partitioned into groups, producing smaller sub-circuits that can be assigned to distinct QPUs. The objective is twofold: minimize remote operations (and therefore entanglement consumption) while maintaining balanced computational load across network nodes.
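To make the idea concrete, here is a minimal sketch of interaction-graph partitioning using networkx’s Kernighan-Lin bisection heuristic. The gate list, weighting, and two-way split are illustrative stand-ins; the memQ DQC uses its own layered, load-balanced partitioner.

```python
# Illustrative sketch (not the memQ DQC implementation): build a weighted
# interaction graph from a layer of two-qubit gates, then bisect it so that
# the total weight of cut edges (future remote operations) is minimized.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Hypothetical layer of two-qubit gates, given as (control, target) pairs.
layer = [(0, 1), (1, 2), (0, 1), (2, 3), (4, 5), (5, 6), (4, 5), (6, 7)]

graph = nx.Graph()
for a, b in layer:
    # Repeated interactions increase the edge weight, so frequently
    # interacting qubits are more likely to land on the same QPU.
    if graph.has_edge(a, b):
        graph[a][b]["weight"] += 1
    else:
        graph.add_edge(a, b, weight=1)

part_a, part_b = kernighan_lin_bisection(graph, weight="weight")
print("QPU 0:", sorted(part_a), "QPU 1:", sorted(part_b))
```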
Using this partitioning, the DQC dynamically maps circuit qubits to physical qubits on specific QPUs over time. These mappings are chosen by evaluating the underlying network topology to identify feasible, low-cost placements that respect both local hardware constraints and inter-QPU connectivity.
This produces a time-dependent allocation schedule: a mapping of circuit qubits to physical qubits across discrete execution intervals. From this schedule, the DQC constructs a fully distributed program by inserting state teleportation operations to migrate logical qubits between QPUs and remote (gate teleportation) operations to implement multi-qubit gates that span different processors.
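As a rough illustration of this last step, the sketch below diffs consecutive mappings in a hypothetical, simplified schedule and emits a state-teleportation instruction whenever a logical qubit changes QPU; the real DQC additionally synthesizes the remote-gate operations.

```python
# Sketch: derive state-teleportation instructions from a time-dependent
# allocation schedule. The schedule format here is hypothetical:
# schedule[t] maps logical qubit -> (qpu_id, physical_qubit) at interval t.
schedule = [
    {0: ("QPU0", 0), 1: ("QPU0", 1), 2: ("QPU1", 0)},
    {0: ("QPU0", 0), 1: ("QPU1", 1), 2: ("QPU1", 0)},  # qubit 1 migrates
]

def teleport_ops(schedule):
    ops = []
    for t in range(1, len(schedule)):
        prev, curr = schedule[t - 1], schedule[t]
        for q, (qpu, phys) in curr.items():
            # A logical qubit that changes QPU between intervals must be
            # moved by state teleportation, consuming one e-bit.
            if q in prev and prev[q][0] != qpu:
                ops.append(("TELEPORT", q, prev[q], (qpu, phys)))
    return ops

print(teleport_ops(schedule))  # [('TELEPORT', 1, ('QPU0', 1), ('QPU1', 1))]
```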
The result is a distributed circuit that respects the physical constraints of the network while minimizing entangled-pair (e-bit) usage. A key feature of the memQ DQC is its network-awareness. As a distributed compiler, the memQ DQC supports arbitrary network topologies, including heterogeneous architectures with varying QPU sizes, asymmetric connectivity, and non-uniform communication links. The DQC accepts any custom network architecture as input and automatically adapts its partitioning and scheduling strategy to that topology. In doing so, it jointly accounts for local connectivity within each processor, remote connectivity between processors, available communication qubits, and any routing constraints imposed by the network.
Before running noisy, protocol-heavy simulations, we verify the mathematical correctness of the partitioned circuits using NVIDIA’s CUDA-Q platform.
Specifically, we use the CUDA-Q ‘tensornet-mps’ (matrix product state) and ‘nvidia’ simulator backends with GPU acceleration [4] to validate that the distributed partitioning preserves the logical behavior of the original circuit. This allows us to check fidelity against the ideal circuit output even for circuits that would be infeasible to simulate with conventional CPU-based statevector simulators.
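In CUDA-Q’s Python API, such a check takes only a few lines. The sketch below uses a GHZ kernel as a stand-in for both the original and the compiled circuit (in the real workflow the second state comes from the partitioned circuit, so the overlap here is only trivially 1):

```python
# Verify that a compiled circuit matches the original on CUDA-Q's
# GPU-accelerated MPS tensor-network backend.
import cudaq

cudaq.set_target("tensornet-mps")

@cudaq.kernel
def ghz(n: int):
    q = cudaq.qvector(n)
    h(q[0])
    for i in range(n - 1):
        x.ctrl(q[i], q[i + 1])

# Stand-in states; the real check compares original vs. partitioned output.
state_ref = cudaq.get_state(ghz, 192)
state_cmp = cudaq.get_state(ghz, 192)

# overlap() returns <ref|cmp>; its squared magnitude is the state fidelity.
fidelity = abs(state_ref.overlap(state_cmp)) ** 2
print(f"fidelity = {fidelity:.6f}")
```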
Crucially, CUDA-Q’s GPU-accelerated simulation backend allows us to scale circuit verification capabilities. In practice, this eliminates the traditional tradeoff between performance and correctness checking: we can validate large partitioned circuits at scale, ensuring that teleportation insertion and remote-gate synthesis introduce no logical errors before executing more complex network simulations. This verification step is critical in confirming that the compiler’s graph partitioning, scheduling, and remote-operation insertion collectively preserve end-to-end circuit fidelity.
The partitioned circuit produced by the DQC is simulated with additional modules that represent entanglement-based remote gate operations and their corresponding noise model. Specifically, the simulator implements entanglement-based remote-operation primitives and their noise models to enable a faithful ‘digital twin’ simulation of a distributed processor. These remote operations introduce errors that depend on the quantum-link hardware properties, the interconnect topology, and local errors. The ability to map local and remote operation errors to circuit-level noise allows us to leverage GPU-accelerated simulation backends for Monte-Carlo simulations of noisy trajectories at problem sizes beyond the reach of CPU-based simulators.
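To make the remote-gate primitive concrete, the kernel below sketches the textbook e-bit remote CNOT (one shared Bell pair, two mid-circuit measurements, and classically controlled corrections). The framework’s actual primitives follow this pattern but additionally attach the hardware-derived noise channels described above.

```python
# Textbook remote CNOT via gate teleportation, written as a CUDA-Q kernel.
# Roles: q[0] = control (QPU A), q[1]/q[2] = Bell-pair halves on
# QPU A / QPU B, q[3] = target (QPU B).
import cudaq

@cudaq.kernel
def remote_cnot():
    q = cudaq.qvector(4)
    x(q[0])                # example input: control prepared in |1>

    h(q[1])                # distribute one e-bit between the two QPUs
    x.ctrl(q[1], q[2])

    x.ctrl(q[0], q[1])     # entangle control with its local e-bit half
    m1 = mz(q[1])
    if m1:
        x(q[2])            # feed-forward correction on QPU B

    x.ctrl(q[2], q[3])     # apply the CNOT action onto the target
    h(q[2])
    m2 = mz(q[2])
    if m2:
        z(q[0])            # feed-forward correction back on QPU A

    mz(q[0])
    mz(q[3])

# The sampled control/target registers reproduce CNOT statistics: the
# target flips exactly when the control is |1>.
print(cudaq.sample(remote_cnot))
```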
To validate the framework, we ran a full end-to-end workflow on two representative circuits: a 192-qubit GHZ circuit and a 30-qubit Quantum Volume circuit.
In the first experiment, we compiled a 192-qubit GHZ circuit onto a network of eight QPUs under four different interconnect topologies. Each QPU contained 24 data (computation) qubits and 8 communication qubits. We evaluated performance across four network architectures: All-to-All, Grid, Ring, and Hub. Circuits are input as OpenQASM3 files, while network configurations are specified using a JSON schema. Figure 2 displays the local connectivity of the QPUs as well as the network architectures used.
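For illustration, a network configuration in such a JSON schema might look like the snippet below, here generated from Python. The field names (`qpus`, `links`, `data_qubits`, `comm_qubits`) are hypothetical stand-ins, not the exact memQ schema.

```python
# Hypothetical network configuration for the 8-QPU Ring topology.
import json

network = {
    "qpus": [
        {"id": f"qpu{i}", "data_qubits": 24, "comm_qubits": 8}
        for i in range(8)
    ],
    # Ring: each QPU is linked only to its two nearest neighbors.
    "links": [
        {"a": f"qpu{i}", "b": f"qpu{(i + 1) % 8}"}
        for i in range(8)
    ],
}

with open("ring_network.json", "w") as f:
    json.dump(network, f, indent=2)
```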
For the second experiment, we compiled a 30-qubit Quantum Volume (QV) circuit onto a scaled-down version of the same 8-QPU architecture. In this case, each QPU contained 4 data qubits and 4 communication qubits. The same four network topologies were used, allowing for a consistent comparison across circuit types and hardware scales. In every case, the compiled output was validated using CUDA-Q to ensure functional correctness.
As a baseline, we compare against a static sequential mapping scheme. In this benchmark, QPUs are treated as fixed-capacity buckets, and circuit qubits are assigned in contiguous index order: each QPU is filled before assignment proceeds to the next. While this implicitly groups neighboring circuit qubit indices onto the same QPU, it does not account for the circuit’s interaction graph, cross-QPU communication costs, or network connectivity constraints. It provides a simple, topology-agnostic reference point.
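For reference, the baseline reduces to a few lines; this minimal sketch assumes equal data-qubit capacity on every QPU.

```python
# Static sequential baseline: fill each QPU with contiguous circuit-qubit
# indices, ignoring the interaction graph and the network topology.
def static_mapping(n_qubits: int, n_qpus: int, capacity: int) -> dict:
    assert n_qubits <= n_qpus * capacity, "network too small for circuit"
    return {q: q // capacity for q in range(n_qubits)}

# 192 circuit qubits over 8 QPUs with 24 data qubits each:
mapping = static_mapping(192, 8, 24)
print(mapping[0], mapping[23], mapping[24])  # -> 0 0 1
```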
We measure performance in terms of e-bit consumption: the number of entangled pairs required to execute the distributed program. For both input circuits, we compare the memQ network-aware compiler against the static baseline across all four network topologies. The partitioned GHZ-192 circuit was verified using the tensor-network MPS (‘tensornet-mps’) backend, while the QV circuit was verified using the statevector backend.
The results clearly illustrate the value of network-aware compilation. In the fully connected All-to-All topology, the memQ DQC performs similarly to the static benchmark. This is expected: when every QPU can directly share entanglement with every other QPU, remote operations do not require routing through intermediate nodes, and even a naive placement performs reasonably well.
The difference becomes pronounced in the more sparsely connected topologies: Grid, Ring, and Hub. In these architectures, remote operations between QPUs that are not directly connected require additional teleportation steps to route the quantum state through intermediate QPUs. Each additional hop increases e-bit consumption. Because the memQ DQC is topology-aware, it dynamically places highly interacting circuit qubits on the same QPU or on directly connected QPUs, reducing the need for expensive routing. As a result, it significantly lowers communication overhead compared to the static mapping.
Importantly, all results shown here were produced using the same compiler configuration. Unlike approaches that rely on topology-specific heuristics or custom subroutines for particular network layouts, the memQ DQC adapts automatically to the underlying interconnect. It remains competitive in fully connected networks and delivers substantial gains when connectivity is constrained, without any manual evaluation or retuning.
Finally, we use the compiler’s partitioned circuit output to run a noisy circuit simulation. This simulation includes entanglement-based remote gate protocols and their corresponding noise models. We model noisy entanglement generation via a noisy Bell state preparation circuit using CUDA-Q’s custom Kraus noise models. To incorporate hardware-specific characteristics into the simulator, we utilize remote gate primitives and a custom remote noise model for superconducting qubits connected by optical links, building on our prior research[6].
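As a simplified sketch of this mechanism, the snippet below attaches a depolarizing channel to the Bell-pair preparation in CUDA-Q; in the actual framework, custom Kraus operators derived from the quantum-link hardware parameters of [6] take the place of this generic channel.

```python
# Sketch: noisy Bell-state preparation with a CUDA-Q noise model. A generic
# depolarizing channel stands in for the hardware-derived Kraus operators
# of the real remote-link noise model.
import cudaq

cudaq.set_target("density-matrix-cpu")  # noise-capable simulator target

@cudaq.kernel
def bell():
    q = cudaq.qvector(2)
    h(q[0])
    x.ctrl(q[0], q[1])
    mz(q)

noise = cudaq.NoiseModel()
# Attach noise to the first step of Bell-pair generation.
noise.add_channel("h", [0], cudaq.DepolarizationChannel(0.05))

counts = cudaq.sample(bell, noise_model=noise, shots_count=1000)
print(counts)  # population leaks out of {00, 11} as the noise grows
```

On GPU targets, recent CUDA-Q versions execute the same noise model via Monte-Carlo trajectory sampling, which is what makes the larger circuit sizes reported below tractable.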
Simulating a noisy remote circuit is computationally demanding because of the complexity of representing feed-forward circuits and custom noise models for remote operations. Figure 4(a) compares the runtime of the GPU-based approach using CUDA-Q with a CPU baseline for a 100-shot remote GHZ circuit. These simulations were run on a 16-core Intel Xeon w5-3435X with an NVIDIA RTX PRO 6000 Blackwell Workstation Edition. The CPU-based density-matrix simulation (blue line) took up to 4,000 seconds for a relatively small, noisy remote GHZ-5 circuit. In sharp contrast, CUDA-Q’s Monte-Carlo statevector trajectory sampling method uses batched execution and 96 GB of GPU memory, and can easily simulate circuits with more than 28 qubits.
The performance of a distributed quantum circuit depends critically on the quality of the quantum links connecting the QPUs. Imperfections such as losses and noise in these links degrade the resulting circuit fidelity. It is therefore essential to map low-level hardware specifications, such as quantum-link efficiency and noise, to circuit-level performance. Figure 4(b) shows the simulated Hellinger fidelity of a compiler-partitioned GHZ-20 circuit as a function of the dark-count noise probability in the quantum-link channels. Using CUDA-Q’s GPU-accelerated backend allows us to scale circuit sizes well beyond the GHZ-5 baseline demonstrated in our prior work [6].
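The Hellinger fidelity itself is straightforward to compute from two measured count dictionaries, as in this small helper (the example counts are made up for illustration):

```python
# Hellinger fidelity between two measured bitstring distributions,
# e.g. ideal vs. noisy GHZ outcomes. Inputs are raw count dictionaries.
from math import sqrt

def hellinger_fidelity(p_counts: dict, q_counts: dict) -> float:
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    keys = set(p_counts) | set(q_counts)
    # Bhattacharyya coefficient of the two empirical distributions...
    bc = sum(sqrt(p_counts.get(k, 0) / n_p * q_counts.get(k, 0) / n_q)
             for k in keys)
    # ...squared, which equals 1.0 for identical output distributions.
    return bc ** 2

ideal = {"0" * 20: 500, "1" * 20: 500}                      # ideal GHZ-20
noisy = {"0" * 20: 470, "1" * 20: 460, "0" * 19 + "1": 70}  # made-up counts
print(f"{hellinger_fidelity(ideal, noisy):.4f}")
```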
This release marks a milestone: the first version of a hardware-aware distributed quantum compiler framework for end-to-end simulation of distributed quantum computing that accounts for the type and availability of the underlying resources. We are committed to making it a practical and extensible open-source platform, empowering researchers and engineers to accelerate innovation, build upon our work, contribute to the ecosystem, and establish new benchmarks for distributed quantum systems.
Our vision is to drive the co-design of hardware, architecture, and software frameworks for distributed quantum computing. Future updates are planned to include advanced partitioning and scheduling algorithms for the compiler and expanded hardware-modeling capabilities. Additionally, we will implement a quantum network orchestration layer that leverages NVQlink’s low-latency, high-throughput connections [7] to schedule cross-QPU quantum operations dynamically. This layer will bridge QPU-to-QPU connections, enabling the execution of hybrid QPU-classical HPC workloads at larger qubit counts than currently possible with monolithic processors.
The software’s current version is available upon request. The open-source version of the software is scheduled for release in June 2026.
[3] https://nano-qt.com/nanoqt-quera-quantum-interconnect/
[4] https://nvidia.github.io/cuda-quantum/latest/using/backends/sims/noisy.html
[5] https://nvidia.github.io/cuda-quantum/latest/index.html
[6] S. Gupta et al., “Gate Teleportation vs Circuit Cutting in Distributed Quantum Computing,” https://doi.org/10.48550/arXiv.2510.08894
[7] https://www.nvidia.com/en-us/solutions/quantum-computing/nvqlink/