Harvard’s Cascade Neural Decoder Cuts Quantum Error Rates 17×, Reveals Waterfall Effect

April 9, 2026 — A team at Harvard University published a paper introducing Cascade, a convolutional neural network decoder for quantum error correction that achieves logical error rates up to 17× lower than the best existing practical decoders on quantum LDPC codes, with throughput 3-5 orders of magnitude higher. The paper, by Andi Gu, J. Pablo Bonilla Ataides, Mikhail D. Lukin, and Susanne F. Yelin, appeared on arXiv (2604.08358).

The headline number is striking on its own, but the deeper result may matter more: Cascade reveals a previously unobserved “waterfall” regime of error suppression in quantum LDPC codes, where logical error rates fall far more steeply than the standard distance-based scaling predicts. On the [[144, 12, 12]] Gross code (a bivariate bicycle code), the logical error rate follows a P_L ~ p^{10.8} power law in the waterfall regime, compared to the p^{6.4} scaling predicted by the code’s distance of 12. At a physical error rate of 0.1%, Cascade achieves a logical error rate of approximately 10⁻¹⁰ per logical qubit per cycle, roughly 4,000× below BP+OSD (the standard decoder for quantum LDPC codes) and 17× below Relay, the most recent high-performance decoder from IBM.
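To see how much separation the steeper exponent buys, here is a toy comparison of the two power laws. The normalization (P_L = 0.5 at an assumed pseudo-threshold of p = 1%) is an illustrative convention, not a fit from the paper; only the exponents 6.4 and 10.8 come from the reported results.

```python
# Toy model: P_L = 0.5 * (p / p_th)**k, normalized so both curves meet
# at an assumed pseudo-threshold p_th = 1% (illustrative, not from the paper).
P_TH = 1e-2

def logical_error_rate(p, exponent, p_th=P_TH):
    """Logical error rate under a simple power-law ansatz."""
    return 0.5 * (p / p_th) ** exponent

p = 1e-3  # physical error rate of 0.1%
distance_scaling = logical_error_rate(p, 6.4)   # exponent implied by d = 12
waterfall = logical_error_rate(p, 10.8)         # fitted waterfall exponent

print(f"distance-based scaling: {distance_scaling:.1e}")
print(f"waterfall scaling:      {waterfall:.1e}")
print(f"separation factor:      {distance_scaling / waterfall:.0f}x")
```

Even in this crude model, a decade of physical error rate below the pseudo-threshold opens a gap of more than four orders of magnitude between the two curves, which is the qualitative shape of the waterfall.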

The waterfall effect is not specific to quantum LDPC codes. On surface codes, Cascade achieves an error suppression factor Λ ≈ 8.4 at p = 0.2%, compared to Λ ≈ 5.0 for standard minimum-weight perfect matching (MWPM) and Λ ≈ 7.8 for correlated MWPM. This approaches the Λ ≈ 9.1 achieved by Tesseract, a near-optimal decoder whose computational cost (up to 1 second per shot) makes it impractical for real-time use.
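A quick way to feel what those Λ values mean: each two-step increase in code distance divides the logical error rate by Λ, so a higher Λ reaches a given target at a smaller distance. In the sketch below, the starting point P_L(d = 3) = 10⁻², the 10⁻¹² target, and the ~2d² physical-qubits-per-logical-qubit estimate are all illustrative assumptions, not numbers from the paper.

```python
import math

# Assumed starting point and target (illustrative, not from the paper):
P0, D0, TARGET = 1e-2, 3, 1e-12

def distance_needed(lam, p0=P0, d0=D0, target=TARGET):
    """Smallest distance reaching `target` if each d -> d + 2 step
    divides the logical error rate by Lambda."""
    steps = math.ceil(math.log(p0 / target) / math.log(lam))
    return d0 + 2 * steps

for name, lam in [("MWPM", 5.0), ("correlated MWPM", 7.8), ("Cascade", 8.4)]:
    d = distance_needed(lam)
    # ~2*d**2 physical qubits per surface-code logical qubit (rule of thumb)
    print(f"{name:>16}: d = {d}, ~{2 * d * d} physical qubits per logical qubit")
```

The gap widens as the target tightens, since the number of distance steps scales with log(1/target)/log(Λ).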

On latency, Cascade’s largest models achieve single-shot inference times of approximately 40 μs per cycle on an NVIDIA H200 GPU. Batched inference reduces amortized latency by up to two orders of magnitude. These latencies fall within the decoding budgets for trapped-ion (~1 ms) and neutral-atom platforms, but remain above the ~1 μs required for superconducting qubits. The authors’ roofline analysis suggests that depthwise convolution variants on FPGA hardware could approach the superconducting budget at moderate model widths, but this has not been demonstrated.

The paper also shows that Cascade tolerates FP8 quantization (8-bit floating point) with no measurable accuracy loss, a property the authors attribute to the regularity of convolutional weight distributions and the absence of attention-based dynamic range challenges. Models trained at a single high noise level generalize reliably across seven orders of magnitude in logical error rate, with no error floor observed down to P_L ≈ 2 × 10⁻¹¹.

The authors note a conflict of interest: Lukin is a co-founder, shareholder, and Chief Scientist of QuEra Computing; Yelin is a spouse of a QuEra shareholder; and Gu and Bonilla Ataides have served as consultants for QuEra. The paper is a preprint and has not yet undergone peer review.

My Analysis

This paper matters for three reasons, each of which connects to a different dimension of the CRQC threat assessment.

The Decoder Bottleneck Gets Less Bottlenecked

I have written extensively about why decoder performance is the most underappreciated capability on the path to a cryptographically relevant quantum computer. The CRQC Quantum Capability Framework tracks it as a separate capability dimension (D.2) precisely because a quantum computer that cannot decode its error syndromes in real time cannot compute, regardless of how many qubits it has.

The decoder has historically forced an unpleasant trade-off: fast decoders (union-find, lookup tables) sacrifice accuracy, while accurate decoders (Tesseract, tensor network methods) are orders of magnitude too slow for real-time use. Cascade breaks this trade-off more convincingly than any prior result. It achieves accuracy comparable to Tesseract (the near-optimal benchmark) at throughput 3,000-100,000× higher than existing decoders.

For superconducting qubits, Cascade’s 40 μs single-shot latency is still roughly 40× too slow for the ~1 μs real-time budget. The authors’ FPGA roofline analysis is encouraging (depthwise convolution variants at moderate model widths approach the 1 μs target on an AMD Versal AI Core FPGA), but roofline estimates assume 100% hardware utilization, which is optimistic. The gap between simulation and deployed hardware is real, and closing it is an engineering project that has not yet been executed.

For trapped-ion and neutral-atom platforms, the story is different. Cascade’s latencies are already within the decoding budget for these modalities. Combined with IonQ’s December 2025 demonstration that software decoders on commodity CPUs can keep pace with trapped-ion hardware for 1,000 logical qubits, the decoder bottleneck for non-superconducting modalities is looking increasingly tractable. This has direct implications for which hardware approach might reach CRQC-scale fault tolerance first. The modality with the most qubits (superconducting) may not be the modality that solves the decoder problem first.

The Resource Estimates May Need Revision

The most consequential finding for CRQC timeline assessment is the waterfall’s impact on resource estimates.

Current projections for fault-tolerant quantum computing, including Gidney’s 2025 estimate of under 1 million physical qubits to break RSA-2048, assume error suppression factors calibrated to MWPM-class decoders (Λ ≈ 10 at p = 0.1%). The authors demonstrate concretely that Cascade reaches a target logical error rate of ~10⁻⁹ at code distance d = 15, compared to d = 19 for MWPM. That is a ~40% reduction in physical qubit count on the surface code family that underpins Google’s and other groups’ current hardware experiments.

To put this in CRQC-relevant terms: if the ~40% surface code reduction applies to a Gidney-class architecture, the physical qubit count for breaking RSA-2048 drops from under 1 million to potentially under 600,000. That is still a large number, but it is meaningfully closer to what hardware roadmaps project within the next decade. The Chevignard et al. estimate of 1,193 logical qubits for breaking P-256 ECDSA would see a proportional reduction in its physical-qubit overhead (the logical qubit count itself is unaffected by the decoder).
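The arithmetic behind the ~40% figure is worth making explicit. The sketch below uses the common rule of thumb of roughly 2d² physical qubits per surface-code logical qubit, which is an assumption of convenience, not a number from the paper.

```python
# Rough surface-code cost model: ~2*d**2 physical qubits per logical
# qubit (rule-of-thumb assumption, not from the paper).
def physical_qubits(d):
    return 2 * d * d

mwpm = physical_qubits(19)     # distance MWPM needs for P_L ~ 1e-9
cascade = physical_qubits(15)  # distance Cascade needs for the same target
reduction = 1 - cascade / mwpm

print(f"MWPM (d=19):    {mwpm} physical qubits per logical qubit")
print(f"Cascade (d=15): {cascade} physical qubits per logical qubit")
print(f"reduction:      {reduction:.0%}")
```

Scaled against a sub-million-qubit Gidney-class baseline, that roughly 38% per-logical-qubit saving is what lands the RSA-2048 estimate in the ~600,000 range.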

The advantage grows with stricter targets. Because the waterfall gives steeper-than-distance scaling, the gap between waterfall-aware and conventional resource estimates widens as the target logical error rate decreases toward the 10⁻¹⁰ to 10⁻¹² regime required for large-scale algorithms like Shor’s. The authors state this directly: the space-time costs of fault-tolerant quantum computation “may be significantly lower than previously anticipated.”
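The widening can be made concrete by inverting the same toy power-law model (again normalized at an assumed pseudo-threshold of p = 1%, which is not a number from the paper) to ask what physical error rate reaches a given logical target under each exponent.

```python
# Toy inversion of P_L = 0.5 * (p / p_th)**k: the physical error rate p
# at which each scaling law hits a target logical error rate.
# p_th = 1% is an illustrative normalization, not from the paper.
P_TH = 1e-2

def tolerable_p(target, exponent, p_th=P_TH):
    """Physical error rate at which the toy model reaches `target`."""
    return p_th * (2 * target) ** (1 / exponent)

for target in (1e-10, 1e-12):
    conventional = tolerable_p(target, 6.4)   # distance-based exponent
    waterfall = tolerable_p(target, 10.8)     # waterfall exponent
    print(f"target {target:.0e}: waterfall tolerates "
          f"{waterfall / conventional:.1f}x noisier hardware")
```

In this model the hardware-quality margin grows from roughly 4× at a 10⁻¹⁰ target to over 5× at 10⁻¹², illustrating why waterfall-aware and conventional estimates diverge as targets tighten.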

For quantum LDPC codes specifically, the implication is even sharper. qLDPC codes have long promised dramatically better encoding rates than surface codes (more logical qubits per physical qubit), which is why architectures like Pinnacle propose them for CRQC-scale computation. But qLDPC codes’ theoretical advantages have been gated by the absence of a decoder that could realize them in practice. BP+OSD, the standard qLDPC decoder for the past six years, misses the waterfall entirely (scaling as p^{5.4} on the Gross code, compared to the p^{10.8} Cascade achieves). The entire waterfall regime of error suppression was invisible because no existing decoder was accurate enough to access it. This is a striking illustration of a point the authors make explicitly: the decoder determines how much of a code’s error-correcting capability is realized in practice. A mediocre decoder sees a mediocre code.

The classical error correction parallel the authors draw is worth emphasizing. Gallager introduced LDPC codes in 1962. They achieved near-Shannon-limit performance in theory but remained largely unused for three decades, until practical iterative decoders in the 1990s transformed them into the foundation of modern communications (WiFi, 5G, satellite). Quantum LDPC codes may be at a similar inflection point: the codes have been developed, but their practical potential has been obscured by the absence of decoders powerful enough to unlock it. If Cascade represents the beginning of that unlocking, the resource estimates for fault-tolerant quantum computing will shift meaningfully downward.

What This Does Not Change

The balanced assessment requires stating clearly what Cascade does not do.

It does not demonstrate real-time decoding on quantum hardware. The results are from simulation, not from a decoder integrated into the control loop of a running quantum processor. The Riverlane/Rigetti FPGA demonstration (October 2024) and Riverlane’s Local Clustering Decoder (December 2025) remain the state of the art for real-hardware, real-time decoder integration. Cascade’s accuracy advantages must be translated into hardware implementations before they affect the practical CRQC timeline.

It does not close the superconducting decoder gap. At 40 μs single-shot latency on GPU, Cascade is roughly 40× too slow for superconducting qubits. The FPGA roofline estimates are promising but unvalidated. The path from roofline estimate to deployed FPGA decoder is a significant engineering effort.

It does not eliminate the uncertainty in CRQC timeline predictions. The waterfall effect could reduce physical qubit requirements by 40% or more, but the CRQC timeline depends on ten interdependent capabilities (per my CRQC Capability Framework), of which the decoder is only one. Improvements in one dimension can be offset by challenges in others. And the waterfall regime has so far been demonstrated only on specific code families under idealized (circuit-level depolarizing) noise models. Real hardware noise is structured, correlated, and device-specific in ways that may reduce the waterfall effect.

The Bottom Line

Cascade is the most significant decoder result I have seen since I began tracking this capability dimension. It does three things simultaneously that no previous decoder has achieved: near-optimal accuracy on both surface codes and qLDPC codes, throughput orders of magnitude above practical alternatives, and the revelation of a waterfall regime that could materially reduce the physical resources required for fault-tolerant quantum computation.

The classical LDPC parallel is the right frame. Gallager’s codes waited 30 years for decoders that could realize their potential. Quantum LDPC codes have waited less than five. If Cascade or its successors prove robust on real hardware with realistic noise, the resource estimates for a CRQC will need revision, and the revision will be downward. The decoder, the capability that almost nobody is talking about, may turn out to be the capability that determines whether a CRQC arrives at the early or late end of current predictions.

Quantum Upside & Quantum Risk - Handled

My company - Applied Quantum - helps governments, enterprises, and investors prepare for both the upside and the risk of quantum technologies. We deliver concise board and investor briefings; demystify quantum computing, sensing, and communications; craft national and corporate strategies to capture advantage; and turn plans into delivery. We help you mitigate the quantum risk by executing crypto‑inventory, crypto‑agility implementation, PQC migration, and broader defenses against the quantum threat. We run vendor due diligence, proof‑of‑value pilots, standards and policy alignment, workforce training, and procurement support, then oversee implementation across your organization. Contact me if you want help.


Marin Ivezic

I am the Founder of Applied Quantum (AppliedQuantum.com), a research-driven consulting firm empowering organizations to seize quantum opportunities and proactively defend against quantum threats. A former quantum entrepreneur, I’ve previously served as a Fortune Global 500 CISO, CTO, Big 4 partner, and leader at Accenture and IBM. Throughout my career, I’ve specialized in managing emerging tech risks, building and leading innovation labs focused on quantum security, AI security, and cyber-kinetic risks for global corporations, governments, and defense agencies. I regularly share insights on quantum technologies and emerging-tech cybersecurity at PostQuantum.com.