Advanced Large-Scale Quantum Simulation Techniques in cuQuantum SDK v25.11

quantum computing

Explore cuQuantum SDK v25.11's new Pauli propagation and stabilizer simulation techniques, significantly accelerating large-scale quantum computing. Learn how GPU-powered NVIDIA DGX B200 supercomputers enhance quantum error correction, validation, and data generation for AI models, demonstrating substantial speedups over CPU alternatives.

Simulating large-scale quantum computers has become increasingly challenging with advancements in quantum processing unit (QPU) quality. Validating these simulations is crucial, especially as devices scale beyond classical simulability. Additionally, generating large-scale datasets for AI models used in quantum processor operations (e.g., AI quantum error correction decoders, compilers, control agents, and device design models) requires efficient GPU-accelerated training data at various scales and abstractions.

NVIDIA's cuQuantum SDK provides high-performance libraries and tools that significantly accelerate quantum computing simulations at both circuit and device levels. The latest release, cuQuantum SDK v25.11, introduces components that speed up two key workloads: Pauli propagation and stabilizer simulations, both essential for large-scale quantum computer simulation.

This post explores how to leverage GPU-accelerated supercomputers to run Pauli propagation simulations and accelerate sampling from stabilizer simulations.

cuQuantum cuPauliProp

Pauli propagation is a relatively new method for efficiently simulating observables in large-scale quantum circuits, including those with realistic noise models. By representing states and observables as weighted sums of Pauli tensor products, this method can dynamically discard terms that contribute negligibly to an expectation value. This enables the estimation of experimental quantities that would otherwise be intractable with exact simulation.

Many quantum computing applications, such as Variational Quantum Eigensolver (VQE) and quantum simulation of physical dynamics, center on computing expectation values. While various exact and approximate classical simulation techniques exist for large circuits, their cost can become prohibitive in certain scenarios. For instance, the Matrix Product State (MPS) technique, a popular approximate tensor network method, is often unsuitable for large circuits encoding the dynamics of 2D or 3D physical systems.

Pauli propagation offers a valuable, complementary addition to the approximate circuit simulation toolbox for both pure and noisy circuits. It is provably efficient for simulating near-Clifford and/or very noisy circuits, and has demonstrated impressive performance in simulating circuits that Trotterize the evolution of specific quantum spin systems. This includes "utility circuits" relevant to IBM's utility experiment involving a 127-qubit device, as detailed in "Evidence for the Utility of Quantum Computing Before Fault Tolerance." Ongoing research focuses on characterizing which circuits can be efficiently simulated using Pauli propagation and refining the algorithmic details.

cuQuantum 25.11 empowers developers and researchers to advance classical circuit simulation with NVIDIA GPUs by providing primitives to accelerate Pauli propagation and derivative methods through this new cuQuantum library. Core functions are detailed below.

Library initialization

Initialize the library handle and workspace descriptor necessary for operations:

import cupy as cp
from cuquantum.bindings import cupauliprop
from cuquantum import cudaDataType

# Create library handle and workspace descriptor
handle = cupauliprop.create()
workspace = cupauliprop.create_workspace_descriptor(handle)

# Assign GPU memory to workspace
ws_size = 1024 * 1024 * 64  # Example: 64 MiB
d_ws = cp.cuda.alloc(ws_size)
cupauliprop.workspace_set_memory(
    handle, workspace, cupauliprop.Memspace.DEVICE,
    cupauliprop.WorkspaceKind.WORKSPACE_SCRATCH, d_ws.ptr, ws_size
)

Define an observable

To begin the simulation, allocate device memory for Pauli expansions (sums of products of Pauli operators, expressed as unsigned integers and coefficients) and initialize the input expansion with an observable, such as $Z_{62}$ .

# Helper to encode Pauli string into packed integers (2 bits per qubit: X and Z masks)
def encode_pauli(num_qubits, paulis, qubits):
    num_ints = cupauliprop.get_num_packed_integers(num_qubits)
    # Packed integer format: [X_ints..., Z_ints...]
    packed = np.zeros(num_ints * 2, dtype=np.uint64)
    x_mask, z_mask = packed[:num_ints], packed[num_ints:]
    for p, q in zip(paulis, qubits):
        idx, bit = divmod(q, 64)
        if p in (cupauliprop.PauliKind.PAULI_X, cupauliprop.PauliKind.PAULI_Y):
            x_mask[idx] |= (1 << bit)
        if p in (cupauliprop.PauliKind.PAULI_Z, cupauliprop.PauliKind.PAULI_Y):
            z_mask[idx] |= (1 << bit)
    return packed

# 1. Allocate Device Buffers
# Define capacity (max number of Pauli strings) and allocate buffers
max_terms = 10000 
num_packed_ints = cupauliprop.get_num_packed_integers(num_qubits)
d_pauli = cp.zeros((max_terms, 2 * num_packed_ints), dtype=cp.uint64, order="C")
d_coef = cp.zeros(max_terms, dtype=cp.float64, order="C")

# 2. Populate Initial Observable (Z_62)
encoded_pauli = encode_pauli(num_qubits, [cupauliprop.PauliKind.PAULI_Z], [62])

# Assign the first term
d_pauli[0] = cp.array(encoded_pauli)
d_coef[0] = 1.0

# 3. Create Pauli Expansions
# Input expansion: pre-populated with our observable
expansion_in = cupauliprop.create_pauli_expansion(
    handle, num_qubits,
    d_pauli.data.ptr, d_pauli.nbytes,
    d_coef.data.ptr, d_coef.nbytes,
    cudaDataType.CUDA_R_64F,
    1, 1, 1  # num_terms=1, is_sorted=True, is_unique=True
)

# Output expansion: empty initially (num_terms=0), needs its own buffers
d_pauli_out = cp.zeros_like(d_pauli)
d_coef_out = cp.zeros_like(d_coef)

expansion_out = cupauliprop.create_pauli_expansion(
    handle, num_qubits,
    d_pauli_out.data.ptr, d_pauli_out.nbytes,
    d_coef_out.data.ptr, d_coef_out.nbytes,
    cudaDataType.CUDA_R_64F,
    0, 0, 0
)

Operator creation

Define quantum gates or operators, such as a Pauli rotation $e^{-i \frac{\theta}{2} P}$ .

# Create a Z-rotation gate on qubit 0
paulis = [cupauliprop.PauliKind.PAULI_Z]
qubits = [0]
gate = cupauliprop.create_pauli_rotation_gate_operator(
    handle, theta, 1, qubits, paulis
)

Operator application

Apply an operator (a gate or noise channel) to the expansion, thereby evolving the system. Note that most applications operate in the Heisenberg picture, meaning circuit gates are applied in reverse order to the observable. This also necessitates passing the adjoint argument as True when applying the operator.

# Get a view of the current terms in the input expansion
num_terms = cupauliprop.pauli_expansion_get_num_terms(handle, expansion_out)
view = cupauliprop.pauli_expansion_get_contiguous_range(
    handle, expansion_in, 0, num_terms)

# Apply gate: in_expansion -> gate -> out_expansion
cupauliprop.pauli_expansion_view_compute_operator_application(
    handle, view, expansion_out, gate,
    True,         # adjoint?
    False, False,  # make_sorted?, keep_duplicates?
    0, None,       # Truncation strategies (optional)
    workspace
)

Expectation values

Compute the expectation value (trace with the zero state $\langle 0 | O | 0 \rangle$ ).

import numpy as np
result = np.zeros(1, dtype=np.float64)

# Compute trace
cupauliprop.pauli_expansion_view_compute_trace_with_zero_state(
    handle, view, result.ctypes.data, workspace
)

Combining these methods demonstrates that NVIDIA DGX B200 GPUs offer significant speedups over CPU-based codes. For small coefficient cutoffs, multiple orders of magnitude speedups are observed compared to single-threaded Qiskit Pauli-Prop on the most recent dual-socket data center CPUs.

Figure 1. cuQuantum GPU simulations for pi/4 rotations of the 127-qubit IBM utility circuit show multiple orders of magnitude speedups for various truncation schemes on NVIDIA DGX B200, compared to Qiskit PauliProp on an Intel Xeon Platinum 8570 CPU.

cuQuantum cuStabilizer

Stabilizer simulations originate from the Gottesman-Knill theorem, which posits that gates within the Clifford group (the normalizer of the qubit Pauli group) can be efficiently simulated classically in polynomial time. This Clifford group comprises CNOT, Hadamard, and Phase (S) gates. Consequently, stabilizer simulations have been crucial for resource estimation and large-scale testing of quantum error correcting codes.

Several approaches exist for building stabilizer simulators, ranging from tableau simulators to frame simulators. cuStabilizer currently focuses on improving throughput for sampling rates within a frame simulator.

Frame simulation specifically models the effects of quantum noise on a quantum state. As quantum devices are imperfect, circuit execution imperfections can be modeled by inserting random "noisy" gates. If the noise-free result is known, obtaining the noisy result only requires tracking the difference—how the noisy gates alter the circuit output.

This effect is considerably easier to compute than a full circuit simulation. However, the number of possible noisy gate insertion combinations grows rapidly with circuit size, necessitating a large number of "shots" to reliably model error correction algorithms.

Frame simulation is ideal for users developing quantum error correcting codes, testing new decoders, or generating data for AI decoders. cuStabilizer provides APIs to enhance sampling and accelerate any frame simulation on NVIDIA GPUs, exposing both a C API and a Python API. While the C API offers superior performance, the Python API is more flexible and handles memory allocation, making it excellent for getting started.

Create a circuit and apply frame simulation

cuStabilizer involves two primary classes for simulation: Circuit and FrameSimulator. The Circuit class can accept a string of circuit instructions, similar to the format used in the Stim CPU simulator. To create a FrameSimulator, you need to specify circuit information to allocate sufficient resources.

import cuquantum.stabilizer as cust
# Circuit information
num_qubits = 5
num_shots = 10_000
num_measurements = 2

# Create a circuit on GPU
circ = cust.Circuit("""
H 0 1
X_ERROR(0.1) 1 2
DEPOLARIZE2(0.5) 2 3
CX 0 1 2 3
M 0 3
""")

sim = cust.FrameSimulator(
      num_qubits,
      num_shots,
      num_measurements
)
sim.apply(circ)

A simulator can be reused across different circuits, provided it has enough qubits available. The following code applies a second circuit (circ2) to a state modified by the first circuit (circ).

circ2 = cust.Circuit("""
Z_ERROR(0.01) 1 4 
""")
sim.apply(circ2)

Read simulation results

The simulator's state comprises three bit-tables: x_bits, z_bits, and measurement_bits. The first two store the Pauli Frame (similar to cuPauliProp's Pauli Expansion but with a different layout and without weights). The third table records the difference between noise-free and noisy measurements for each shot.

The most efficient storage for these bits is a "bit-packed" format, where each byte stores eight significant bits. While efficient, manipulating individual bits in this format requires additional programming steps and is not easily integrated with the common "array" notion, which typically holds multi-byte values like int32.

For easier representation in NumPy, cuStabilizer offers the bit_packed argument to toggle between formats. If bit_packed=False, each bit is encoded as one uint8 value, consuming eight times more memory. When specifying input bit tables, the format is also critical for performance, as detailed in the cuQuantum documentation.

# Get measurement flips
m_table = sim.get_measurement_bits(bit_packed=False)
print(m_table.dtype)
# uint8
print(m_table.shape)
# (2, 10000)
print(m_table)
# [[0 0 0 ... 0 0 0]
#  [1 0 0 ... 0 1 1]]

x_table, z_table = sim.get_pauli_xz_bits(bit_packed=True)
print(x_table.dtype)
# uint8
print(x_table.shape)
# (5, 1252)

For convenient access to the underlying Pauli frames, cuStabilizer provides a PauliTable class, which can be indexed by the shot index:

# Get pauli table
pauli_table = sim.get_pauli_table()
num_frames_print = 5
for i in range(num_frames_print):
    print(pauli_table[i])
# ...XZ
# ZXX..
# ...Z.
# .....
# ...Z.

Leveraging the sampling API significantly improves throughput compared to Google Stim, a state-of-the-art code on the latest data center CPUs.

Surface code simulation

cuStabilizer accepts Stim circuits as input, allowing for surface code simulations:

import stim

p = 0.001
circ_stim = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=5,
    rounds=5,
    after_clifford_depolarization=p,
    after_reset_flip_probability=p,
    before_measure_flip_probability=p,
    before_round_data_depolarization=p,
)
circ = cust.Circuit(circ_stim)
sim = cust.FrameSimulator(
    circ_stim.num_qubits,
    num_shots,
    circ_stim.num_measurements,
    num_detectors=circ_stim.num_detectors,
)
sim.apply(circ)

pauli_table = sim.get_pauli_table()
for i in range(num_frames_print):
    print(pauli_table[i])

The most efficient simulation is achieved with a large number of samples and qubits. Optimal performance is further realized when the resulting bit tables are kept on the GPU, such as when using the cupy package.

Figure 2. Runtime performance on surface code of different distances and 1 million shots, comparing Stim plus cuStabilizer on an NVIDIA DGX B200 GPU with Stim on an Intel Xeon Platinum 8570 CPU.

Get started with new cuQuantum libraries

The latest functionalities in cuQuantum continue to expand the possibilities of GPU-based quantum computer emulations, enabling two major new classes of workloads. These workloads are vital for quantum error correction, verification, validation, and algorithm engineering for intermediate to large-scale quantum devices.

Get started with cuQuantum cuPauliProp using: pip install cupauliprop-cu13 For more details, refer to the cuPauliProp documentation.

Get started with cuQuantum cuStabilizer using: pip install custabilizer-cu13 For more details, refer to the cuStabilizer documentation.