GPU-Accelerated Automorphism Ensemble Decoding of Quantum LDPC Codes

This work was carried out as part of the Grace Hopper Superchip Seed Program between NVIDIA, Supermicro and the University of Edinburgh.

Quantum computers promise breakthroughs on problems that are intractable for even the most powerful supercomputers. However, the building blocks of quantum computers, qubits, are fragile and susceptible to noise, which corrupts calculations. Quantum Error Correction (QEC) allows us to bypass those issues, but it requires fast and accurate classical decoding algorithms.

Here, we showcase a technique that improves both the speed and the accuracy of the standard Belief Propagation with Ordered Statistics Decoding (BP+OSD) decoder. By implementing an automorphism ensemble decoder (AutDEC) (Koutsioumpas, Sayginel, et al. 2025; Geiselhart et al. 2021) for BP+OSD (Roffe et al. 2020; Fossorier and Lin 1995; Panteleev and Kalachev 2021) on an NVIDIA A100 GPU hosted in the Edinburgh Compute and Data Facility (ECDF), we achieve an accuracy and efficiency boost for bivariate bicycle codes under realistic circuit-level noise. Specifically, our implementation of AutDEC halves the maximum number of BP iterations (500 instead of 1000) while more than halving the logical error rate relative to the BP+OSD baselines.

The decoding problem and the role of QPU–HPC integration

One of the greatest challenges in QEC is the real-time decoding problem. Stabiliser measurements produce a continuous stream of syndromes—binary bit-strings indicating where errors may have occurred. A classical decoder must infer the most likely error configuration and issue recovery instructions back to the quantum processing unit (QPU) before qubits decohere.
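As a minimal classical illustration of what a syndrome is, the sketch below computes \(s = He \bmod 2\) for a binary parity-check matrix. The 3-bit repetition code and the names `syndrome`, `H` and `error` are hypothetical choices made for this example only; they are not taken from our experiments.

```python
def syndrome(H, error):
    """Return s = H e (mod 2) for a binary parity-check matrix H given as a list of rows."""
    return [sum(h * e for h, e in zip(row, error)) % 2 for row in H]

# Hypothetical toy code: 3-bit repetition code, each check compares two neighbouring bits.
H = [[1, 1, 0],
     [0, 1, 1]]
error = [0, 1, 0]        # a single flip on the middle bit
s = syndrome(H, error)   # both checks touch the flipped bit, so both fire
print(s)
```

The decoder only ever sees `s`, never `error`; its task is to infer a likely `error` that is consistent with `s`.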

At scale, this becomes daunting: syndrome data rates are expected to reach terabits per second. Decoders must therefore be fast, accurate, and massively parallelizable. Achieving this demands tight QPU–HPC integration and efficient algorithms capable of leveraging modern accelerators like GPUs.

Trapping Sets in Belief Propagation Decoders

Quantum low-density parity-check (LDPC) codes are promising candidates for error correction, as they require very low physical qubit overhead. An LDPC code can be represented by a bipartite graph called a Tanner graph, in which one set of nodes stands for the qubits, the other for the checks, and an edge joins a check to every qubit it acts on. Decoders like Belief Propagation (BP) work by passing messages along the edges of this graph to identify and correct errors.

However, a known weakness of BP decoders is their vulnerability to “trapping sets” (Raveendran and Vasić 2021)—small graphical structures, like short cycles, that can cause the algorithm to get stuck in a loop, failing to converge on the correct solution.
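To make the notion of a short cycle concrete, the sketch below builds the Tanner graph of a small binary matrix and counts its length-4 cycles, the shortest cycles a bipartite graph can contain. Two checks that share two variables close one such cycle. The matrix `H_toy` and the function names are hypothetical, and counting 4-cycles is only a proxy: trapping sets are a broader family of structures than short cycles alone.

```python
from itertools import combinations

def tanner_graph(H):
    """Map each check index to the set of variable indices it is connected to."""
    return {c: {v for v, bit in enumerate(row) if bit} for c, row in enumerate(H)}

def count_four_cycles(H):
    """Count length-4 cycles: a pair of checks sharing k variables contributes C(k, 2)."""
    checks = tanner_graph(H)
    total = 0
    for a, b in combinations(checks, 2):
        shared = len(checks[a] & checks[b])
        total += shared * (shared - 1) // 2
    return total

# Hypothetical toy matrix: checks 0 and 1 share variables 0 and 1, which closes one 4-cycle.
H_toy = [[1, 1, 0, 1],
         [1, 1, 1, 0],
         [0, 0, 1, 1]]
print(count_four_cycles(H_toy))
```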

Circuit-level noise

To account for circuit-level faults, we use the superconducting-inspired Si1000 noise model (Gidney et al. 2021). Unlike a uniform baseline, Si1000 assigns different error rates by operation, with higher error probabilities for measurements, two-qubit gates, and idling during measurement/reset windows. This makes the model more reflective of real hardware. The heterogeneous structure means decoders must contend with noisy syndromes and fault propagation across gates, not just independent flips, which makes circuit-level decoding substantially more challenging.
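The sketch below shows how such an operation-dependent model can be parameterised by a single base rate \(p\). The multipliers follow the commonly quoted Si1000 parameterisation and are stated here as assumptions; the exact values should be taken from Gidney et al. (2021). The function name and dictionary keys are hypothetical.

```python
def si1000_rates(p):
    """Per-operation error probabilities for base rate p.

    The multipliers are assumptions for illustration; check them against
    Gidney et al. (2021) before relying on them.
    """
    return {
        "two_qubit_gate": p,
        "single_qubit_gate": p / 10,
        "reset": 2 * p,
        "measurement": 5 * p,
        "idle": p / 10,
        "idle_during_measure_reset": 2 * p,
    }

rates = si1000_rates(1e-3)   # base rate p = 0.1%
for operation, rate in rates.items():
    print(f"{operation}: {rate:.4%}")
```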

Figure 1: Detector Error Model Tanner Graph. The Tanner graph of the matrix corresponding to the detector error model of the [[72,12,6]] X-memory experiment circuit. Green nodes correspond to the checks (detectors) and red nodes to error mechanisms under circuit-level noise.

Designing decoding algorithms for GPUs

The QEC decoding problem is inherently well-suited to GPU acceleration. With platforms such as NVIDIA CUDA-Q and its QEC library (NVIDIA Corporation 2025), parallel decoding algorithms can be developed, benchmarked, and refined at pace, shortening the design cycle from theoretical models to practical, deployable protocols. CUDA-Q offers out-of-the-box implementations of BP+OSD decoders.

Decoders also need to be versatile. In large-scale systems where logical information is actively manipulated, they must continuously update correlations between qubits in the logical register in real time. GPUs provide the flexibility to support this dynamic behaviour, offering a more adaptable solution than specialised hardware such as FPGAs or ASICs.

Figure 2: Bypassing short cycles. On the left, a standard Tanner graph shows a short cycle (highlighted in red) that can ‘trap’ the decoder. An automorphism (transformation ‘A’) re-wires the graph, as shown on the right. This new graph may not have the same short cycle at that location (highlighted in green), allowing the decoder to find a valid solution.

Automorphism Ensemble decoding

The core idea underpinning AutDEC is that instead of trying to decode on a single, potentially flawed graph, we can decode on many different, but equivalent, graphs at the same time in an ensemble.

An automorphism is a symmetry of the code; it’s a way of shuffling the qubits and check nodes without changing the fundamental structure of the code. Applying an automorphism effectively “rewires” the Tanner graph, changing its structure and potentially breaking the trapping sets that would pose problems to the decoder.
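A small check makes this definition concrete. The sketch below tests whether a relabelling of the bits maps the set of checks onto itself, using a hypothetical ring repetition code as the example. This is the strictest form of symmetry (a Tanner-graph automorphism); more general code automorphisms may map checks to combinations of checks, which this simple test would reject. All names are illustrative.

```python
def ring_checks(n):
    """Parity checks of a ring repetition code: check i touches bits i and i+1 (mod n)."""
    return [[1 if j in (i, (i + 1) % n) else 0 for j in range(n)] for i in range(n)]

def permute_columns(H, perm):
    """Relabel bits: column j of the result is column perm[j] of H."""
    return [[row[p] for p in perm] for row in H]

def is_tanner_automorphism(H, perm):
    """True if the relabelled matrix contains exactly the same checks as H."""
    return sorted(permute_columns(H, perm)) == sorted(H)

H = ring_checks(5)
cyclic_shift = [(j + 1) % 5 for j in range(5)]   # rotate every bit by one place
swap_0_and_2 = [2, 1, 0, 3, 4]                   # exchange bits 0 and 2 only

print(is_tanner_automorphism(H, cyclic_shift))
print(is_tanner_automorphism(H, swap_0_and_2))
```

The rotation preserves the ring and passes the test, whereas the swap turns the check on bits 2 and 3 into a check on bits 0 and 3, which is not one of the original checks.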

The ensemble method proceeds as follows:

Take the initial error syndrome, \(s\).
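Create multiple automorphism-equivalent versions of the decoding problem by transforming the syndrome for each automorphism \(A\) in the ensemble (\(U_{A} s\), where \(U_{A}\) is the transformation of the checks induced by \(A\)).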
Run an instance of the CUDA-Q BP-OSD decoder on each of these problems in parallel.
Collect the results from all decoders and select the most likely correction from the list of candidates.
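The four steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not our implementation: the code is a hypothetical 5-bit ring repetition code, the automorphisms are its cyclic shifts (for which the syndrome transformation is itself a cyclic shift), the deliberately weak `base_decoder` stands in for BP+OSD, the ensemble runs sequentially rather than in parallel on a GPU, and "most likely" is approximated by lowest weight.

```python
N = 5  # bits (and checks) of a ring repetition code; check i compares bits i and i+1 (mod N)

def syndrome(error):
    """Syndrome of an error pattern on the ring code."""
    return [error[i] ^ error[(i + 1) % N] for i in range(N)]

def base_decoder(s):
    """Weak stand-in for BP+OSD: assume bit 0 is error-free and propagate around the ring."""
    e = [0] * N
    for i in range(N - 1):
        e[i + 1] = e[i] ^ s[i]
    return e

def shift(vec, k):
    """Cyclic shift: out[j] = vec[(j + k) mod len(vec)]."""
    return [vec[(j + k) % len(vec)] for j in range(len(vec))]

def ensemble_decode(s, shifts):
    """Decode s once per automorphism and keep the lowest-weight valid candidate."""
    candidates = []
    for k in shifts:
        s_k = shift(s, k)            # step 2: transformed syndrome
        e_k = base_decoder(s_k)      # step 3: one ensemble member
        e_hat = shift(e_k, -k)       # map the estimate back to the original labelling
        if syndrome(e_hat) == s:     # keep only candidates consistent with the syndrome
            candidates.append(e_hat)
    return min(candidates, key=sum)  # step 4: fewest flips as a proxy for most likely

true_error = [1, 0, 0, 0, 0]
s = syndrome(true_error)             # step 1: the received syndrome
print(base_decoder(s))               # single decoder
print(ensemble_decode(s, range(N)))  # ensemble over all cyclic shifts
```

On this input the single decoder returns the weight-4 pattern `[0, 1, 1, 1, 1]`, which matches the syndrome but differs from the true error by a codeword, whereas the ensemble recovers `[1, 0, 0, 0, 0]`. Each shift changes which bit the weak decoder assumes to be error-free, which is the toy analogue of an automorphism moving a trapping set elsewhere in the graph.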

Because the ensemble members are independent of one another, this process maps naturally onto the massively parallel architecture of Graphics Processing Units (GPUs). The batch-processing utility of the NVIDIA CUDA-Q QEC library lets us process all ensemble paths in parallel, making AutDEC a promising candidate for real-time decoding.

Figure 3: The Automorphism Ensemble Decoder pipeline. A received syndrome ‘s’ is decoded simultaneously on multiple automorphically-equivalent instances of the code. The final output is the most likely correction among all successful decoding attempts.

Results: Accuracy Gains and Reduced Iterations

We tested this architecture using the high-performance CUDA-Q BP+OSD decoder on bivariate bicycle codes, subjected to a realistic circuit-level noise model (Si1000). We compare AutDEC, configured as an ensemble of 36 BP+OSD-0 constituent decoders (each limited to 500 BP iterations), against two baselines that do not use GPU-accelerated automorphism ensemble decoding: (1) BP+OSD-CS10 with up to 500 BP iterations and (2) BP+OSD-CS10 with up to 1000 BP iterations, where CS10 denotes order-10 combination-sweep reprocessing. We highlight some early results below:

>2x Higher Accuracy: By running multiple decoders in parallel, the probability of at least one of them avoiding a trapping set and finding the correct solution increases dramatically. This more than halved the logical error rate of the code, effectively doubling its accuracy.
50% Fewer Iterations: Each ensemble member is capped at 500 BP iterations, half of the 1000 allowed to the stronger baseline, and the ensemble still reaches the lower logical error rate. Halving the iteration budget makes the process more efficient without compromising results.
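As a back-of-the-envelope illustration of the first point, suppose (as a simplifying assumption that our benchmarks do not rely on) that each of the \(K\) ensemble members fails independently with probability \(q\). Then

\[
P(\text{at least one member succeeds}) = 1 - q^{K}.
\]

For the hypothetical values \(q = 0.1\) and \(K = 4\) this gives \(1 - 10^{-4} = 0.9999\). In practice the members share the same syndrome, so their failures are correlated, and the final selection step must also pick the right candidate. The formula is therefore an idealised best case rather than a description of the measured results.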

Figure 4: Qubit footprint comparison results. We benchmark the gain of the ensemble versus the baseline BP+OSD decoders for bivariate bicycle codes under Si1000 circuit noise with a physical error rate p=0.1%. The plot shows the logical error rate per round versus total qubit count. Three decoders are compared: Aut-BP500-OSD0 (red, solid line), BP1000-OSD10 (blue, dashed line), and BP500-OSD10 (green, dash-dotted line).

Next steps

The automorphism ensemble improves the accuracy of any constituent belief-propagation (BP) decoder without increasing latency, provided the ensemble members run in parallel. In (Koutsioumpas, Sayginel, et al. 2025), it was shown that an Automorphism Ensemble Decoder using only BP, without any postprocessing, can achieve performance comparable to BP+OSD-0 but without the additional overhead. The NVIDIA CUDA-Q platform further enhances this by providing a fast implementation of Gaussian elimination and enabling high parallelization through batch processing, making OSD-based approaches far more attractive in terms of performance. Moreover, combining OSD-0 with our ensemble allows us to reduce BP iterations while boosting decoding accuracy.

Recently, a more efficient postprocessing method, called Localised Statistics Decoding (LSD), was introduced in (Hillmann et al. 2025). LSD parallelizes many of the traditionally sequential steps in OSD, significantly reducing latency. Building on this, we developed an ensembling strategy that combines a few serial-schedule BP iterations with LSD postprocessing, introducing a powerful new technique we call “Vibe” for colour codes (Koutsioumpas, Noszko, et al. 2025). VibeLSD delivers high accuracy, offers strong parallelizability, and is well-suited for GPU acceleration.

A key strength of ensemble decoding—especially on flexible hardware like GPUs—is the ability to integrate multiple decoding strategies within the ensemble to maximize performance. In future work, we plan to optimize the ensemble implementation further and explore the accuracy–speed trade-offs introduced by different decoder hyperparameters. Additionally, comparing or merging VibeLSD with AutDEC and other ensemble techniques on GPUs promises to unlock substantial additional performance gains.

Conclusion

The automorphism ensemble, when accelerated on GPUs through the CUDA-Q platform, marks a significant advancement toward practical quantum error correction. By overcoming the inherent limitations of traditional BP-based decoders, this approach unlocks higher performance from promising QLDPC codes, paving the way for more efficient and scalable fault-tolerant quantum computing.

References

Fossorier, M. P. C., and Shu Lin. 1995. “Soft-Decision Decoding of Linear Block Codes Based on Ordered Statistics.” IEEE Transactions on Information Theory 41 (5): 1379–96. https://doi.org/10.1109/18.412683.

Geiselhart, Marvin, Ahmed Elkelesh, Moustafa Ebada, Sebastian Cammerer, and Stephan ten Brink. 2021. “Automorphism Ensemble Decoding of Reed–Muller Codes.” IEEE Transactions on Communications 69 (10): 6424–38. https://doi.org/10.1109/tcomm.2021.3098798.

Gidney, Craig, Michael Newman, Austin Fowler, and Michael Broughton. 2021. “A Fault-Tolerant Honeycomb Memory.” Quantum 5 (December): 605. https://doi.org/10.22331/q-2021-12-20-605.

Hillmann, Timo, Lucas Berent, Armanda O. Quintavalle, Jens Eisert, Robert Wille, and Joschka Roffe. 2025. “Localized Statistics Decoding for Quantum Low-Density Parity-Check Codes.” Nature Communications 16 (1). https://doi.org/10.1038/s41467-025-63214-7.

Koutsioumpas, Stergios, Tamas Noszko, Hasan Sayginel, Mark Webster, and Joschka Roffe. 2025. “Colour Codes Reach Surface Code Performance Using Vibe Decoding.” https://doi.org/10.48550/ARXIV.2508.15743.

Koutsioumpas, Stergios, Hasan Sayginel, Mark Webster, and Dan E Browne. 2025. “Automorphism Ensemble Decoding of Quantum LDPC Codes.” https://doi.org/10.48550/ARXIV.2503.01738.

NVIDIA Corporation. 2025. “CUDA-QX.” https://github.com/NVIDIA/cudaqx.

Panteleev, Pavel, and Gleb Kalachev. 2021. “Degenerate Quantum LDPC Codes with Good Finite Length Performance.” Quantum 5 (November): 585. https://doi.org/10.22331/q-2021-11-22-585.

Raveendran, Nithin, and Bane Vasić. 2021. “Trapping Sets of Quantum LDPC Codes.” Quantum 5 (October): 562. https://doi.org/10.22331/q-2021-10-14-562.

Roffe, Joschka, David R. White, Simon Burton, and Earl Campbell. 2020. “Decoding Across the Quantum Low-Density Parity-Check Code Landscape.” Physical Review Research 2 (4). https://doi.org/10.1103/physrevresearch.2.043423.

Citation

BibTeX citation:

@online{koutsioumpas2025,
  author = {Koutsioumpas, Stergios and Roffe, Joschka},
  title = {GPU-Accelerated {Automorphism} {Ensemble} {Decoding} of
    {Quantum} {LDPC} {Codes}},
  date = {2025-09-16},
  url = {https://qec.codes/blog/autdec/},
  doi = {10.59350/2tsb3-hy509},
  langid = {en}
}

For attribution, please cite this work as:

Koutsioumpas, Stergios, and Joschka Roffe. 2025. “GPU-Accelerated Automorphism Ensemble Decoding of Quantum LDPC Codes.” September 16, 2025. https://doi.org/10.59350/2tsb3-hy509.