
Engineering the Quantum Operating System (OS) Stack: From Nanosecond Pulse Control to System-Level Orchestration


The argument for quantum computing’s “PC moment” has become surprisingly compelling. QuantWare ships superconducting QPUs to customers in 22 countries. Qblox sells modular control stacks to over 100 labs. Bluefors has installed 1,800 cryogenic systems worldwide. The Quantum Open Architecture movement and reference designs like the Quantum Utility Block are proving that you can assemble a working quantum computer from commercial off-the-shelf components — much as Dell once assembled PCs from Intel CPUs, Seagate drives, and Microsoft software. The quantum systems integration challenge is real, and a growing number of organizations — from Italy’s first quantum computer at the University of Naples to Elevate Quantum’s Q-PAC facility in Colorado — are tackling it.

But there is a problem nobody has solved yet. A problem so fundamental that it threatens to become the bottleneck for the entire open-architecture quantum ecosystem.

There is no quantum operating system.

Not in any meaningful sense of the term. There is no software layer that can take a QuantWare QPU, a Qblox controller, a Bluefors cryostat, and a Riverlane decoder — and orchestrate them into a functioning, multi-user, fault-tolerant quantum computer. No equivalent of the Linux kernel that lets you swap hardware vendors without rewriting your entire stack. No unified system for managing qubit allocation, error correction, job scheduling, calibration drift, and security across heterogeneous quantum hardware.

What exists instead is a patchwork. IBM’s Qiskit Runtime handles job scheduling and error mitigation — but only for IBM hardware. Riverlane’s Deltaflow manages real-time error correction — but focuses on decoding, not full system orchestration. Q-CTRL’s Fire Opal provides autonomous calibration — but as a middleware layer, not an OS. Google, Microsoft, and Amazon each run sophisticated internal control stacks — but none are downloadable, portable, or open. The field sits in its mainframe era: vertically integrated, vendor-specific, and pre-standardization.

Building a true Quantum OS requires solving engineering problems that have no classical precedent — from abstracting over physics that differs as fundamentally as microwave pulses differ from laser beams, to decoding error syndromes faster than they accumulate, to calibrating hardware that drifts on millisecond timescales. This article maps the full technical stack, layer by layer: what each component must do, what engineering challenges it faces, who is building what, and where the critical gaps remain.


1. The hardware abstraction layer spans millikelvin physics to room-temperature software

The fundamental challenge of a quantum HAL is abstracting radically different physical control mechanisms — microwave pulses, laser beams, optical tweezers — into a unified programming interface. Each qubit modality demands distinct signal types, frequencies, and timing constraints.

Superconducting qubit control: microwave pulses shaped by DRAG

Transmon qubits operate at transition frequencies of 4–8 GHz (typically 5.0–5.9 GHz), with anharmonicity Δ/2π ≈ −315 MHz separating the computational |0⟩→|1⟩ transition from the leakage |1⟩→|2⟩ transition. Control requires room-temperature Arbitrary Waveform Generators (AWGs) producing MHz-bandwidth baseband I/Q envelope signals at 1–5 GS/s sampling rates. These envelopes modulate a continuous-wave GHz carrier via IQ mixing.

The dominant pulse-shaping technique is DRAG (Derivative Removal by Adiabatic Gate), which constructs a two-quadrature pulse where the Q component is proportional to the time derivative of the I component. This cancels spectral energy at the leakage transition. A typical DRAG pulse in OpenPulse syntax takes the form drag(0.2+0.1im, 160dt, 40dt, 0.05) — encoding amplitude, duration, sigma, and beta parameters. Newer variants like FAST DRAG achieve error rates of (1.56 ± 0.07) × 10⁻⁴ at 7.9 ns gate duration, while closed-loop optimized pulses have reached 99.76% fidelity at 4.16 ns — among the fastest single-qubit gates demonstrated.
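The two-quadrature construction can be sketched in a few lines of NumPy (illustrative only: parameter names mirror the OpenPulse example above, and the simple derivative scaling stands in for a hardware-calibrated beta):

```python
import numpy as np

def drag_envelope(amp, duration, sigma, beta, dt=1.0):
    """Sketch of a DRAG pulse: Gaussian I quadrature plus a Q quadrature
    proportional to the time derivative of I, which cancels spectral
    energy at the |1>-|2> leakage transition."""
    t = np.arange(0, duration, dt)
    center = duration / 2
    i_env = amp * np.exp(-((t - center) ** 2) / (2 * sigma ** 2))
    q_env = beta * np.gradient(i_env, dt)  # Q = beta * dI/dt
    return i_env + 1j * q_env

# 160-sample pulse with sigma of 40 samples, as in the OpenPulse example
pulse = drag_envelope(amp=0.2, duration=160, sigma=40, beta=0.05)
```

The complex array is the baseband envelope that an AWG would play through an IQ mixer onto the GHz carrier.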

The room-temperature electronics stack is substantial. Taking QuTech’s Starmon-5 as an example: a Central Controller orchestrates all instruments (generating up to 50 million sequences per instrument per second), a Zurich Instruments UHFQA handles readout (AWG + digitizer + demodulation + thresholding), and a Vector Switch Matrix routes microwave pulses across 32 output channels with nanosecond timing. Signals travel through coaxial cables with staged attenuation (20 dB at 4K, 10 dB at mixing chamber) down to 10–27 mK, with return signals amplified by a Josephson Parametric Amplifier at base temperature and a HEMT amplifier at 4K.

Trapped ions demand laser precision and physical shuttling

Trapped-ion qubits (¹⁷¹Yb⁺, ⁴⁰Ca⁺, ¹³⁷Ba⁺) use fundamentally different control signals. For ¹⁷¹Yb⁺ hyperfine qubits with ~12.6 GHz splitting, state preparation uses resonant laser beams at 369.5 nm, while qubit manipulation relies on off-resonant Raman transitions with beams detuned 14–33 THz from the 2S₁/₂ → 2P₁/₂ transition. The two Raman beams’ frequency difference matches the qubit’s microwave frequency, driving a two-photon transition that provides optical-wavelength spatial resolution (~μm) for individual ion addressing — far superior to direct microwave drive (~cm wavelength).

The control hardware includes CW lasers stabilized to ~1 part in 10⁹, acousto-optic modulators (AOMs) and electro-optic modulators (EOMs), programmable 16-bit DACs with ±300 μV resolution and 20 ns response time for DC trap electrodes, and DDS chips (Analog Devices AD9912) with 4 μHz frequency resolution. In QCCD architectures, ions are physically transported between trap zones via 100+ DC electrodes updated at up to 430 kHz. Error rates reach 10⁻⁴ to 10⁻⁶ for microwave-driven single-qubit operations, while two-qubit Mølmer-Sørensen gates achieve ~99% fidelity.

Neutral atoms: tweezers, Rydberg blockade, and rearrangement

Neutral atom platforms (⁸⁷Rb, ¹³³Cs) trap individual atoms in optical tweezers at wavelengths ~810–1064 nm, with trap depths of 200 μK to 1.4 mK and arrays scaling to 6,100+ atoms. Entangling gates use two-photon Rydberg excitation to states n ~ 60–70 via counterpropagating lasers (e.g., 459 nm + 1040 nm for ⁸⁷Rb). The Rydberg blockade radius of a few micrometers prevents simultaneous excitation of nearby atoms, enabling parallel gate fidelities of 99.5% on up to 60 atoms. A HAL for neutral atoms must abstract tweezer laser power/position (AOD frequencies), Raman laser parameters, Rydberg excitation pulse shapes, atom rearrangement sequences for filling vacancies, magnetic field control (~3 G), and fluorescence imaging readout via EMCCD cameras.

Riverlane QHAL: three levels, now deprecated

The Riverlane QHAL, developed under the UK’s £7.6M NISQ.OS project with partners including ARM, NPL, and Oxford Quantum Circuits, defined three abstraction levels inspired by classical CPU protection domains:

  • Level 1 (lowest) communicated via hardware accelerators with minimal latency, targeting error correction and real-time control.
  • Level 2 used software containers for platform engineers.
  • Level 3 (highest) operated over internet infrastructure for cloud access.

The specification defined a minimum common instruction set across four qubit technologies — including STATE_PREPARATION, single-qubit gates (X, H, T, S, SX, RZ, PIXY), two-qubit gates (CNOT, SWAP, PSWAP), and QUBIT_MEASURE — with an optimized binary format designed for direct hardware consumption.

The QHAL repository was archived in March 2022 and marked deprecated. Riverlane initially pivoted toward Deltaflow.OS as a quantum operating system building on QHAL, but subsequently refocused entirely on quantum error correction. Their current open-source offering is QECi PHY — an FPGA-to-FPGA communication layer for QEC decoding, reflecting their strategic shift from abstraction layers to decoder hardware.

QuTech QNodeOS: FreeRTOS on MicroZed driving quantum network nodes

Published in Nature (March 2025), QNodeOS is the first OS for quantum network nodes, built on a three-layer architecture. The QNPU (Quantum Network Processing Unit) runs C++ on FreeRTOS atop a MicroZed board with a Xilinx Zynq 7000 SoC — a dual ARM Cortex-A9 at 667 MHz (only one core used). FreeRTOS was chosen to leverage existing OS primitives (tasks, message-passing, TCP/IP), with the kernel extended by a QNPU scheduler managing quantum resources (qubits, entanglement requests, entangled pairs).

The critical component is QDriver — the only hardware-dependent element, translating NetQASM instructions into platform-specific physical operations. Two validated implementations exist: a trapped-ion QDriver interfacing via SPI at 12.5 MHz with an operating rate of 50 kHz, and an NV-center QDriver operating at 100 kHz using an ADwin-Pro II and Zurich Instruments HDAWG. QDriver’s clean separation from the rest of the stack means porting QNodeOS to new hardware requires only reimplementing this single layer.

Munich QDMI: a C header-only interface for heterogeneous quantum hardware

The Quantum Device Management Interface from TU Munich (published at IEEE QCE 2024) takes a pragmatic approach: a C header-only library (Apache 2.0 license) with four components — Core (session management), Control (circuit submission, job queue), Device (calibration, status), and Query (key/value property access). The query interface exposes QDMI_DEVICE_PROPERTY_SITES (qubit list), QDMI_DEVICE_PROPERTY_COUPLINGMAP (connectivity pairs), supported gate sets, per-gate error rates, and dynamic calibration state. Built atop QDMI, the FoMaC (Figures of Merit and Constraints) library abstracts hardware details into higher-level metrics like expected fidelity, enabling software tools to make platform-selection decisions. QDMI is deployed at the Leibniz Supercomputing Centre integrated with Slurm for HPC scheduling, and recent extensions add Amazon Braket backends as logical QDMI devices.

OpenPulse: the pulse-level grammar inside OpenQASM 3.0

OpenPulse extends OpenQASM 3.0 with a calibration grammar for pulse-level programming, centered on three abstractions. Ports are hardware I/O endpoints (extern port d0). Frames combine a clock, frequency, and phase into a stateful carrier signal (frame driveframe1 = newframe(d0, 5.0e9, 0.0)). Waveforms are time-dependent envelopes (built-in types include drag, gaussian, gaussian_square, constant). The defcal construct maps gate-level operations to pulse implementations on physical qubits, with the restriction that bodies must have definite duration known at compile time. The play(frame, waveform) instruction schedules an envelope modulated at the frame’s frequency, and barrier aligns frame clocks. Oxford Quantum Circuits directly accepts OpenQASM 3 + OpenPulse as a source language, demonstrating that the specification is production-ready.


2. Qubit allocation confronts NP-hard routing on sparse topologies

The qubit mapping problem is formally NP-complete

Quantum algorithms assume arbitrary qubit interactions, but physical hardware has constrained connectivity. The compiler must find a mapping from logical to physical qubits and insert SWAP gates to satisfy adjacency requirements. This decomposes into initial placement (assigning logical to physical qubits, factorial complexity O(Q!)) and routing (inserting SWAPs during execution). Finding an optimal mapping with minimum SWAP count has been proven NP-complete (Siraichi et al., 2018), and the routing subproblem is NP-hard even when the physical topology is a star graph.

The workhorse algorithm is SABRE (Swap-Based Bidirectional heuristic search), Qiskit’s default router. It iterates forward and backward through circuit layers, scoring candidate SWAPs by their impact on upcoming two-qubit gates, with polynomial complexity ~O(N·M). Because SABRE is stochastic, users typically run 100+ trials and select the best result. More expensive approaches like SATMAP (MaxSAT-based, MICRO 2022) produce provably better solutions — up to 15× fewer gates than heuristics — but require ~30-minute compilation budgets. The recently published Marol compiler generator (POPL 2026, University of Wisconsin-Madison) takes a fundamentally different approach: it defines a domain-specific language where 12 lines of Marol specify the NISQ QMR problem, and a parametric solver automatically instantiates for any topology. Marol matches or beats Qiskit for 50% of benchmarks and dominates for interleaved logical qubit architectures (93% of cases).
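The core of SABRE's heuristic — scoring each candidate SWAP by the coupling-map distance it leaves for the two-qubit gates in the front layer — can be sketched in plain Python (a simplified illustration, not Qiskit's implementation; the real pass adds lookahead and decay terms):

```python
from collections import deque

def all_pairs_distance(edges, n):
    """BFS shortest-path distances on an undirected coupling graph."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b); adj[b].append(a)
    dist = [[None] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s][v] is None:
                    dist[s][v] = dist[s][u] + 1
                    q.append(v)
    return dist

def swap_score(swap, layout, front_layer, dist):
    """SABRE-style heuristic: total coupling-map distance of the front
    layer's two-qubit gates after tentatively applying `swap`."""
    trial = dict(layout)
    a, b = swap
    trial[a], trial[b] = trial[b], trial[a]  # exchange physical locations
    return sum(dist[trial[q1]][trial[q2]] for q1, q2 in front_layer)

# 4-qubit line 0-1-2-3; gate (0, 3) is distance 3 under the identity layout
dist = all_pairs_distance([(0, 1), (1, 2), (2, 3)], 4)
layout = {q: q for q in range(4)}
candidates = [(0, 1), (1, 2), (2, 3)]
best = min(candidates, key=lambda s: swap_score(s, layout, [(0, 3)], dist))
```

The stochastic element in real SABRE comes from randomized initial layouts and tie-breaking, which is why running many trials and keeping the best result pays off.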

Topology shapes everything: heavy-hex vs. grid vs. all-to-all

IBM’s heavy-hex lattice limits connectivity to degree 2–3 (average ~2.67) — deliberately sparse to minimize frequency collisions (order-of-magnitude improvement in collision-free yield) and spectator qubit errors during cross-resonance gates. This achieves <1% average CNOT error but imposes the highest SWAP overhead. IBM is transitioning to square lattice with 4-way connectivity in Nighthawk (2025–2028), explicitly motivated by reducing SWAP overhead.

Google’s Sycamore/Willow processors use 4-way grid connectivity at interior qubits with tunable couplers, achieving T1 ~68 μs (Willow) and physical error rates ~2× better than Sycamore. Trapped-ion platforms (IonQ, Quantinuum) offer all-to-all connectivity through collective vibrational modes — completely eliminating SWAP overhead. Quantinuum’s H2-1 achieves >99.9% two-qubit gate fidelity across all qubit pairs, with coherence times measured in seconds to minutes versus microseconds for superconducting systems.

SWAP overhead compounds errors exponentially

Each SWAP gate decomposes into 3 CNOT gates. If a single CNOT has error rate ε = 1%, each SWAP introduces ~3% error, giving success probability per SWAP of (0.99)³ ≈ 0.970. For a circuit requiring 10 SWAPs, SWAP-induced success probability drops to (0.97)¹⁰ ≈ 0.74. At 50 SWAPs: (0.97)⁵⁰ ≈ 0.22 — the circuit becomes essentially useless. IBM reported that programs exceeding 16 CNOT operations have less than 50% chance of executing correctly on NISQ devices.
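The arithmetic above in executable form:

```python
def swap_success(n_swaps, cnot_error=0.01):
    """Success probability of the SWAP-inserted CNOTs alone: each SWAP is
    3 CNOTs, so the chain survives with probability (1 - eps)^(3 * n)."""
    return (1 - cnot_error) ** (3 * n_swaps)

# swap_success(1)  -> ~0.970
# swap_success(10) -> ~0.74
# swap_success(50) -> ~0.22
```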

Fidelity-aware allocation uses real-time calibration data to mitigate this. The TRAM algorithm (T2-Aware Qubit Mapping) uses coherence times to prioritize reliable communication channels, achieving 3.59% fidelity improvement over SABRE. DeepQMap uses deep reinforcement learning for dynamic noise prediction, achieving 49.3% fidelity improvement over QUBO methods with 79.8% inter-chip communication reduction.

Multi-programming partitions a single QPU for concurrent users

QuCloud+ (ACM TACO 2023) uses community detection to partition physical qubits for concurrent programs and introduces X-SWAP, the first inter-program SWAP mechanism. Results: 9.03% fidelity improvement on IBM Nairobi and 40.92% fewer inserted CNOT gates in 4-program scenarios. HyperQ (OSDI 2025) creates quantum Virtual Machines bin-packed spatially and temporally, with unused buffer qubits between qVMs to mitigate crosstalk. On IBM Eagle (127 qubits), HyperQ achieves >3× throughput improvement with at most 85 of 127 qubits active when all 9 qVM regions are running.


3. Compilation pipelines: from abstract circuits to calibrated pulses

Qiskit’s six-stage pass manager architecture

Qiskit’s transpiler converts QuantumCircuit objects into a DAGCircuit (directed acyclic graph where nodes are gates, edges are qubit/classical-bit dependencies) and applies passes through a StagedPassManager with six stages:

  • Init: Decomposes multi-qubit gates (>2 qubits) into 1- and 2-qubit operations
  • Layout: Maps virtual to physical qubits via VF2Layout (subgraph isomorphism, VF2++ algorithm), falling back to SabreLayout if no isomorphism found
  • Routing: Inserts SWAPs via SabreSwap; VF2PostLayout runs afterward to find lower-error qubit assignments
  • Translation: Decomposes all gates into the target’s native basis via equivalence-graph search (Dijkstra’s algorithm for lowest-cost decomposition)
  • Optimization: At level 3, collects 2-qubit blocks → consolidates into unitary matrices → re-synthesizes optimally via KAK decomposition
  • Scheduling: Inserts Delay instructions for wall-clock timing, enabling dynamical decoupling

The Target object encodes per-gate error rates, durations, and connectivity through a CouplingMap, providing the compiler with the hardware-specific information needed at each stage.
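A toy version of the DAG construction — nodes as gates, edges inferred from each qubit's last use — shows the structure the passes traverse (illustrative only, not Qiskit's DAGCircuit API):

```python
class ToyDAGCircuit:
    """Minimal DAG: nodes are gates, edges follow qubit dependencies,
    mirroring the structure Qiskit's transpiler passes operate on."""
    def __init__(self):
        self.nodes = []    # (gate_name, qubits)
        self.edges = []    # (src_index, dst_index)
        self._last = {}    # qubit -> index of last gate touching it

    def apply(self, name, qubits):
        idx = len(self.nodes)
        self.nodes.append((name, tuple(qubits)))
        for q in qubits:
            if q in self._last:
                self.edges.append((self._last[q], idx))  # dependency edge
            self._last[q] = idx
        return idx

dag = ToyDAGCircuit()
dag.apply("h", [0])
dag.apply("cx", [0, 1])
dag.apply("rz", [1])
```

Because only dependency edges exist, gates on disjoint qubits have no path between them and can be freely reordered or scheduled in parallel — the property every optimization stage exploits.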

Native gate sets force radically different decompositions

An arbitrary SU(4) two-qubit unitary requires at most 3 CNOT gates and 15 single-qubit gates (proven optimal by Vatan & Williams, 2004). But the cost varies by native gate set. IBM uses {CX, RZ, SX, X}, where any single-qubit unitary decomposes via Euler angles into at most 2 SX and 3 RZ gates. Google uses {√iSWAP, CZ, virtual-Z}, where ~79% of Haar-random two-qubit gates need only 2 √iSWAP gates — a significant advantage over CNOT where two-gate coverage is measure-zero. IonQ’s Mølmer-Sørensen gate XX(θ) enables CNOT with just 1 MS gate plus single-qubit corrections, and Toffoli in only 3 multi-qubit layers (versus 6 CNOTs on superconducting platforms). Quantinuum’s ZZPhase(θ) gate allows parameterized entanglement, converting the common CNOT-RZ-CNOT pattern into a single native gate.

Optimization passes and their measured impact

Among specific passes, gate cancellation is the most impactful — improving 68% of circuits and eliminating 14,024 gates across 371 test circuits in one benchmark study (70% gate reduction on QFT circuits). Commutation analysis identifies gate pairs that can be reordered (e.g., RZ commutes with CX on the control qubit), enabling subsequent cancellation. Template matching finds subcircuit patterns matching known optimizable templates; for Hamiltonian simulation circuits, this achieves average 1.5× CNOT reduction (up to 2.56×). Rotation merging combines consecutive same-axis rotations (RZ(α)·RZ(β) = RZ(α+β)), eliminating 6,512 gates in benchmarks.
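A minimal peephole pass combining two of these ideas — cancellation of adjacent self-inverse pairs and same-axis rotation merging — might look like this (a sketch; production passes work on the DAG and exploit commutation relations, not just adjacency):

```python
TWO_PI = 2 * 3.141592653589793

def peephole(gates):
    """One-pass peephole: cancels adjacent self-inverse pairs (h.h, x.x,
    cx.cx) and merges consecutive same-axis rotations RZ(a).RZ(b) = RZ(a+b)."""
    out = []
    for g in gates:
        if out and out[-1][0] == g[0] == "rz" and out[-1][1] == g[1]:
            name, qubits, angle = out.pop()
            merged = angle + g[2]
            if merged % TWO_PI != 0:          # drop rotations that net to identity
                out.append((name, qubits, merged))
        elif out and out[-1] == g and g[0] in ("cx", "h", "x"):
            out.pop()                          # self-inverse pair cancels
        else:
            out.append(g)
    return out

gates = [("h", (0,)), ("h", (0,)),
         ("rz", (1,), 0.3), ("rz", (1,), 0.4),
         ("cx", (0, 1)), ("cx", (0, 1))]
optimized = peephole(gates)   # six gates collapse to a single rz(~0.7)
```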

IBM’s AI transpiler achieves 42% gate reduction via reinforcement learning

IBM formulates circuit synthesis as a sequential decision process where an RL agent selects gates step-by-step. The reward function penalizes each CX gate (−10 points) and single-qubit gate (−1 point) while rewarding target circuit achievement (+1000 points). Five specialized AI passes cover different circuit classes: AICliffordSynthesis (≤9 qubits), AILinearFunctionSynthesis (≤10 qubits), AIPermutationSynthesis (≤65 qubits), AIRouting, and AIPauliNetworkSynthesis. Combined with heuristic passes, this achieves 42% average reduction in two-qubit gate counts and near-optimal synthesis up to 65 qubits — orders of magnitude faster than SAT solvers.

BQSKit’s numerical synthesis vs. DAG rewriting

BQSKit (Berkeley) takes a fundamentally different approach: given a target unitary U and a parameterized circuit template C(α), it uses gradient-based optimization (Levenberg-Marquardt) to find parameters minimizing ‖C(α) − U‖ to a threshold of Δ = 10⁻¹⁰. QSearch guarantees optimal depth for ≤4 qubits via A* search over circuit structures. QFactor uses tensor network formulations for >12-qubit circuits, enabling optimization of 100+ qubit circuits via gate deletion and partition-and-resynthesize workflows. The key insight is that numerical synthesis can discover optimizations impossible with rule-based DAG rewriting — it “teleports” past local minima that rewrite rules get stuck in. BQSKit achieves average 13% fewer gates than Qiskit/Cirq/tket.

KAK decomposition: the canonical form for two-qubit gates

The KAK (Cartan) decomposition factorizes any U ∈ SU(4) as: U = (A₁ ⊗ B₁) · exp(i(η_x·XX + η_y·YY + η_z·ZZ)) · (A₂ ⊗ B₂), where A₁, B₁, A₂, B₂ are local single-qubit gates and (η_x, η_y, η_z) are Cartan coordinates in the Weyl chamber (0 ≤ |η_z| ≤ η_y ≤ η_x ≤ π/4). These coordinates determine the minimum entangling gate count: identity needs 0 CX, CNOT-class needs 1, iSWAP-class needs 2, and generic unitaries need 3. Peephole optimization applies KAK by collecting consecutive gate sequences, computing their composite unitary, and re-synthesizing with provably minimal entangling gates. The QCCA framework uses Cartan coordinates for analytical constant-time compilation between arbitrary two-qubit gates, achieving 2.5 × 10⁵× speedup over numerical methods.
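The Cartan-coordinate classification maps directly to code. The sketch below encodes the standard results cited above (identity class, CNOT class at (π/4, 0, 0), two CX whenever η_z = 0, three otherwise); the function name and tolerance handling are ours:

```python
import math

def min_cx_count(eta_x, eta_y, eta_z, tol=1e-9):
    """Minimal CX gates to realize a two-qubit unitary from its Weyl-chamber
    Cartan coordinates (0 <= |eta_z| <= eta_y <= eta_x <= pi/4)."""
    if abs(eta_x) < tol and abs(eta_y) < tol and abs(eta_z) < tol:
        return 0                       # local gates only
    if abs(eta_x - math.pi / 4) < tol and abs(eta_y) < tol and abs(eta_z) < tol:
        return 1                       # CNOT equivalence class
    if abs(eta_z) < tol:
        return 2                       # e.g. iSWAP class (pi/4, pi/4, 0)
    return 3                           # generic SU(4), e.g. SWAP (pi/4, pi/4, pi/4)
```

A peephole optimizer calls something like this after computing a block's composite unitary, then re-synthesizes with exactly that many entangling gates.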


4. Quantum error correction demands real-time decoding at megahertz rates

Surface code syndrome extraction: six operations per stabilizer per cycle

The distance-d rotated surface code arranges d² data qubits and d²−1 ancilla qubits on a 2D lattice. X-stabilizers (detecting Z errors) and Z-stabilizers (detecting X errors) are weight-4 operators in the bulk and weight-2 on boundaries. Each syndrome extraction cycle comprises: ancilla reset (~30–50 ns), 4 layers of CNOT gates in a carefully ordered sequence to avoid hook errors (~20–30 ns each), and ancilla measurement (~200–500 ns, the dominant cost). Google’s Willow achieves 1.1 μs per cycle on its distance-7 surface code using 101 qubits (49 data + 48 measure + 4 leakage removal), with a logical error rate of 0.143% ± 0.003% per cycle — demonstrating for the first time that a logical qubit can outlive its best physical qubit (by 2.4×).

Approximately d rounds of syndrome extraction are needed for reliable error detection. Google demonstrated up to 10⁶ consecutive cycles for their distance-5 memory experiment, with the error suppression factor Λ = 2.14 ± 0.02 when scaling from d=5 to d=7.

Decoder algorithms trade speed for accuracy

Minimum-Weight Perfect Matching (MWPM) constructs a detection graph where each defect (stabilizer flip) becomes a vertex, edges are weighted by negative log-probability of the most likely error chain, and Edmonds’ blossom algorithm finds the minimum-weight matching. Standard complexity is O(n³), but Google’s Sparse Blossom exploits defect sparsity at low error rates for near-O(1) average per-defect time, processing ~1 million errors per core-second. MWPM achieves the gold-standard ~1% circuit-level noise threshold.

Union-Find offers almost-linear time O(n·α(n)). It grows clusters from each defect vertex, merges colliding clusters using disjoint-set data structures, and applies a peeling decoder for correction. The threshold is slightly lower (~0.8–0.9%) but it runs 10–100× faster than full MWPM, making it the natural fit for hardware implementation.
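The disjoint-set structure that gives Union-Find its almost-linear bound is compact; the sketch below shows only the merge machinery (cluster growth over the detection graph and the peeling step are omitted):

```python
class DisjointSet:
    """Union-by-size with path compression: near-constant amortized merges,
    the data structure at the heart of the Union-Find decoder."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False                 # already one cluster
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra             # attach smaller under larger
        self.size[ra] += self.size[rb]
        return True
```

The simplicity is the point: every operation is a short loop over small integer arrays, which is why this decoder translates so naturally onto FPGAs.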

Machine learning decoders like Google DeepMind’s AlphaQubit (Nature 2024) use recurrent transformer architectures to achieve higher accuracy than MWPM on real hardware data, but currently require GPU inference with latency challenges for real-time superconducting decoding.

The backlog problem: when decoders fall behind, computation halts

Superconducting qubits generate syndrome data at ~1 MHz (one round per ~1 μs). If the decoder takes longer than the syndrome generation rate, unprocessed syndromes accumulate. Each subsequent iteration must decode original syndromes plus the growing backlog, creating a cascade that eventually halts computation. This backlog also creates decoder-induced noise — physical qubits continue accumulating errors during the wait, effectively raising the physical error rate and potentially pushing the system above threshold even when bare hardware is below it.
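A back-of-envelope model makes the cascade concrete (assumed numbers: 1 μs rounds and a decoder that needs 1.2 μs per round, i.e. only 20% too slow):

```python
def backlog(t_us, round_time_us=1.0, decode_time_us=1.2):
    """Syndrome rounds still queued after t_us microseconds: arrivals minus
    completions. Grows without bound whenever decode_time_us > round_time_us."""
    arrivals = int(t_us / round_time_us)
    completions = min(arrivals, int(t_us / decode_time_us))
    return arrivals - completions

# A decoder only 20% too slow falls ~167 rounds behind every millisecond,
# while a decoder faster than the round time never accumulates a queue.
```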

Google’s Willow demonstrated 63 μs average decoder latency at d=5 (covering ~57 cycles), acknowledged as needing improvement. For logical branching in large-scale algorithms, Gidney and Ekerå estimate that 10 μs response time is needed to factor 2048-bit RSA in 8 hours; at 100 μs response, the computation suffers >6× slowdown.

Hardware decoder implementations: FPGAs lead, ASICs emerging

Riverlane’s Collision Clustering decoder on a Xilinx Ultrascale+ FPGA achieves <1 μs per round for distance-23 surface codes at 0.1% noise, using only 4,794 LUTs. Their ASIC version (12nm FinFET) occupies just 0.06 mm² at <8 mW power — small enough to operate inside a dilution refrigerator within the ~1W 4K power budget. Georgia Tech’s Astrea achieves exact MWPM with 1 ns average, 456 ns worst-case latency up to d=7 at <10% LUT utilization.

Yale’s DecoNet/Helios distributes decoding across 5 FPGAs in a tree topology, handling 100 logical qubits at d=5 with 2.40 μs average latency and backlog-free throughput. NVIDIA’s GPU-based approach achieves 29–35× speedup for single-shot decoding, with DGX Quantum demonstrating <4 μs round-trip QPU-to-GPU latency. SEEQC achieved 6 μs round-trip with a GPU neural network decoder.

Bivariate bicycle codes: 10× fewer physical qubits per logical qubit

IBM’s bivariate bicycle (BB) codes (Nature, March 2024) represent a breakthrough in encoding efficiency. The [[144, 12, 12]] “gross” code encodes 12 logical qubits in 288 total physical qubits (24 qubits per logical qubit), compared to ~288 qubits for a single logical qubit in a distance-12 surface code — roughly 10× fewer physical qubits. These are qLDPC codes with weight-6 stabilizer checks (vs. weight-4 for surface codes) and a circuit-noise threshold of ~0.8%, competitive with surface codes.
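The qubit accounting behind the ~10× claim, in executable form (rotated surface code: 2d² − 1 physical qubits per logical; CSS qLDPC block: n data plus n check qubits shared across k logicals):

```python
def surface_code_qubits(d):
    """One logical qubit in a rotated distance-d surface code:
    d^2 data + d^2 - 1 ancilla = 2*d^2 - 1 physical qubits."""
    return 2 * d * d - 1

def qubits_per_logical_bb(n, k):
    """A CSS qLDPC [[n, k, d]] block uses n data + n check qubits for k logicals."""
    return 2 * n / k

# Distance-12 comparison from the text:
#   surface_code_qubits(12)        -> 287 physical per logical
#   qubits_per_logical_bb(144, 12) -> 24.0 per logical (the "gross" code)
```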

For OS design, BB codes create new challenges: multiple logical qubits per code block require block-level resource management rather than individual logical qubit tracking; non-local connectivity demands long-range couplers (IBM’s planned “c-couplers” for Kookaburra in 2026); and decoders must support belief propagation rather than MWPM (with Riverlane’s Ambiguity Clustering providing orders-of-magnitude speedups over BP-OSD). IBM’s June 2025 “Tour de gross” paper outlines a modular architecture where each module holds a gross code block connected via universal adapters and Logical Processing Units.


5. Job scheduling under the tyranny of decoherence

Decoherence creates hard deadlines with no pause or resume

Unlike classical jobs, quantum computations face an absolute physical constraint: circuit fidelity decays as F ≈ ∏F_gate × exp(−t/T₂), where t is total execution time. For superconducting qubits with T₂ ~100 μs and gate times ~50 ns, this limits circuits to approximately 2,000 sequential operations. There is no pause/resume capability — qubits cannot be “frozen” for later continuation. Missing the coherence deadline doesn’t delay results; it destroys the quantum information entirely. Trapped-ion platforms offer more headroom (T₂ measured in seconds to minutes), while neutral atoms can achieve T₂ of 1–40 seconds.
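The coherence budget can be sketched numerically (defaults taken from the numbers in the text; the purely sequential circuit is a simplifying assumption):

```python
import math

def circuit_fidelity(n_gates, gate_fidelity=0.999, gate_time_ns=50.0, t2_us=100.0):
    """F ~ prod(F_gate) * exp(-t / T2) for a purely sequential circuit."""
    t_us = n_gates * gate_time_ns / 1000.0
    return gate_fidelity ** n_gates * math.exp(-t_us / t2_us)

def max_sequential_ops(gate_time_ns=50.0, t2_us=100.0):
    """Coarse ceiling: how many back-to-back gates fit inside one T2."""
    return int(t2_us * 1000.0 / gate_time_ns)
```

With T₂ ≈ 100 μs and 50 ns gates, `max_sequential_ops()` returns the 2,000-operation limit quoted above — a hard deadline, since fidelity, not just latency, decays past it.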

IBM’s Quantum Operation Scheduling formulates gate scheduling as a constraint programming problem analogous to job-shop scheduling, achieving up to 7.36% improvement in schedule length. The default policy in Qiskit backends is ALAP (As-Late-As-Possible) scheduling, which delays operations to minimize idle time before measurement.

IBM Qiskit Runtime’s three execution modes

Job mode submits single primitive requests to a shared queue — simplest but subject to calibration interruptions and re-queuing delays.

Batch mode schedules multiple independent jobs together with minimal inter-job delay and parallel compilation (up to 5 classical jobs simultaneously), though calibrations can still interrupt.

Session mode (Premium Plan only) provides exclusive QPU access — never interrupted, not even for calibrations. The first session job enters the normal queue, but once active, the user holds the QPU exclusively with an interactive TTL timer between jobs. Sessions bill wall-clock time (including idle periods), not just QPU execution time. For VQE-type iterative workflows, sessions reduce wall-clock time by orders of magnitude, and the stable noise model means error mitigation techniques (PEC, PEA) remain valid throughout.

Circuit cutting: exponential overhead but sometimes the only option

Circuit cutting partitions large circuits into smaller subcircuits by cutting along qubit wires (O(4) overhead per cut) or across two-qubit gates. Results are reconstructed via quasiprobability decomposition. Without classical communication, each CNOT gate cut incurs O(9) sampling overhead; with two-way classical communication (LOCC), this reduces to O(4) per cut. The overhead is inherently exponential — provably lower-bounded by the exact entanglement cost. IBM’s Circuit Knitting Toolbox automates cut placement, and FitCut achieves 3× to 2,000× reduction in cutting time with 3.88× improvement in resource utilization.
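The exponential overhead in executable form (gate cuts only; the per-cut constants are those quoted above):

```python
def gate_cut_overhead(n_cuts, locc=False):
    """Sampling overhead for cutting n CNOT gates: O(9) per cut without
    classical communication, O(4) per cut with two-way LOCC.
    Exponential in the number of cuts either way."""
    base = 4 if locc else 9
    return base ** n_cuts

# Five gate cuts: 9**5 = 59049 samples' overhead without LOCC
# versus 4**5 = 1024 with two-way classical communication.
```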

Dynamic circuits introduce non-deterministic scheduling challenges

IBM’s mid-circuit measurement now executes with ~600 ns feedforward latency (the new MidCircuitMeasure instruction is 940 ns faster than its predecessor). This enables conditional quantum operations, qubit reuse via measure-and-reset, and constant-depth protocols. IBM demonstrated error-mitigated dynamic circuits across 142 qubits spanning two 127-qubit QPUs connected by a real-time classical link (Nature 2024), with 28% fewer two-qubit gates per Trotter step and up to 24% performance improvement over static circuits. However, classical feedforward introduces non-deterministic delays that complicate scheduling — dynamical decoupling must suppress decoherence during idle periods while waiting for classical processing, and ZZ crosstalk during feedforward latency is identified as a key noise source.


6. Calibration drift threatens everything the compiler optimizes

The full calibration parameter landscape

A superconducting QPU requires calibration of: qubit frequencies (4–6 GHz, drifting by 100–500 kHz over calibration intervals), T₁ times (median ~269 μs on current IBM hardware, range 20–500+ μs), T₂ times (median ~172 μs), single-qubit gate fidelities (99.99–99.999% via randomized benchmarking), two-qubit gate fidelities (97–99% via interleaved RB, with ECR pulse duration ~665 ns), readout assignment errors (per-qubit confusion matrices), and crosstalk matrices (parallel CZ fidelity improves from 87.65% to 92.04% with global optimization). 100-qubit systems require full recalibration approximately every 24 hours.

The calibration procedure follows a specific dependency sequence: resonator spectroscopy → qubit spectroscopy → Rabi oscillation (amplitude calibration) → Ramsey experiment (frequency fine-tuning, T₂*) → T₁ measurement → T₂ echo → DRAG correction → single-qubit gate optimization → two-qubit gate calibration → randomized benchmarking → readout optimization.
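The dependency sequence is naturally expressed as a DAG and ordered with a topological sort — the same structure that graph-based calibration frameworks formalize (the stage names below are shorthand for this illustration, not any tool's API):

```python
from graphlib import TopologicalSorter

# The dependency chain from the text: each node maps to its prerequisites.
CALIBRATION_DAG = {
    "qubit_spectroscopy": {"resonator_spectroscopy"},
    "rabi":               {"qubit_spectroscopy"},
    "ramsey":             {"rabi"},
    "t1":                 {"ramsey"},
    "t2_echo":            {"t1"},
    "drag_correction":    {"t2_echo"},
    "single_qubit_opt":   {"drag_correction"},
    "two_qubit_cal":      {"single_qubit_opt"},
    "rb":                 {"two_qubit_cal"},
    "readout_opt":        {"rb"},
}

order = list(TopologicalSorter(CALIBRATION_DAG).static_order())
```

A real orchestrator replaces the linear chain with branching dependencies and conditional re-runs when a downstream check fails, but the ordering primitive is the same.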

Randomized benchmarking isolates gate errors from SPAM

Standard RB samples m random Clifford gates (24 elements for single-qubit, 11,520 for two-qubit), appends the unique inverse gate, and measures survival probability. The twirling over the Clifford group reduces any error channel to a depolarizing map, yielding exponential decay P(m) = A·p^m + B where the average gate infidelity is r = (d−1)(1−p)/d. Crucially, the constants A and B absorb state-preparation and measurement (SPAM) errors, making RB SPAM-independent. Interleaved RB inserts a target gate G between each random Clifford, yielding gate-specific infidelity r_G = (d−1)(1 − p_inter/p)/d. A random 2-qubit Clifford decomposes to ~8.25 single-qubit gates and ~1.5 CZ gates on average.
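The decay model and infidelity conversions are direct transcriptions of the formulas above:

```python
def rb_decay(m, A, p, B):
    """Survival probability after m random Cliffords: P(m) = A * p**m + B.
    A and B absorb SPAM errors; p alone carries the gate error."""
    return A * p ** m + B

def avg_gate_infidelity(p, d=2):
    """r = (d - 1) * (1 - p) / d, with d = 2 for a single qubit."""
    return (d - 1) * (1 - p) / d

def interleaved_infidelity(p_inter, p, d=2):
    """Gate-specific infidelity from interleaved RB:
    r_G = (d - 1) * (1 - p_inter / p) / d."""
    return (d - 1) * (1 - p_inter / p) / d
```

Fitting `rb_decay` to measured survival probabilities recovers p regardless of how bad state preparation and readout are — the property that makes RB the standard calibration benchmark.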

T₁ fluctuations are faster than anyone thought

The Niels Bohr Institute discovery (February 2026, Dr. Fabrizio Berritta) used an FPGA-based classical controller with Bayesian estimation to track T₁ fluctuations ~100× faster than previous methods — on millisecond timescales versus the ~1 minute previously required. The finding that even “stable” qubits can degrade in milliseconds has profound implications for calibration: point-in-time T₁ measurements may not be representative of the actual relaxation rate during circuit execution.

The underlying physics involves time-varying coupling between qubits and two-level system (TLS) defects on metal/dielectric surfaces, with coupling rates of 50–500 kHz. TLS can exchange energy with thermally fluctuating two-level fluctuators (TLFs) at low frequencies, producing 1/f^α noise in T₁. Non-equilibrium quasiparticles generated by cosmic rays also contribute — Google observed correlated error events approximately once per hour (~3×10⁹ cycles). Recent work (Nature Communications 2025) demonstrates electrode-controlled TLS interaction modulation that can stabilize worst-case T₁ instances.

QUAlibrate: graph-based calibration in 140 seconds

Quantum Machines’ QUAlibrate (open-source, May 2025) represents calibration procedures as directed acyclic graphs (DAGs) where nodes are specific calibration tasks (spectroscopy, Rabi experiments, gate tuning) implemented as QUA programs, and edges encode dependencies plus conditional logic based on qubit-specific results. A QualibrationOrchestrator determines execution order based on outcomes, with support for looping, failure handling, and nested subgraphs. Demonstrated at the Israeli Quantum Computing Center: multi-qubit calibration in 140 seconds — a 98% improvement over conventional methods. John Martinis’ Qolab reports calibrations reduced from 2 hours to <10 minutes.
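
The orchestration pattern — run nodes whose dependencies are met, let conditional logic inspect earlier results, retry on failure — can be sketched in a few lines. The class and function names below are hypothetical illustrations of the idea, not the QUAlibrate API:

```python
# Minimal sketch of a DAG-style calibration orchestrator; names, retry
# policy, and example values are all hypothetical.
class CalNode:
    def __init__(self, name, run, deps=()):
        self.name, self.run, self.deps = name, run, tuple(deps)

def orchestrate(nodes, max_retries=2):
    """Execute nodes whose dependencies are satisfied; retry failed nodes."""
    results, done = {}, set()
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(d in done for d in n.deps)]
        if not ready:
            raise RuntimeError("dependency cycle or unmet dependency")
        for node in ready:
            for _ in range(max_retries + 1):
                ok, value = node.run(results)  # conditional logic sees prior results
                if ok:
                    break
            results[node.name] = value
            done.add(node.name)
            pending.remove(node)
    return results

# Example: a Rabi node that only passes once spectroscopy found a frequency.
spec = CalNode("spectroscopy", lambda r: (True, 5.23e9))
rabi = CalNode("rabi", lambda r: (r["spectroscopy"] > 0, 0.42), deps=["spectroscopy"])
results = orchestrate([spec, rabi])
```

A production orchestrator adds nested subgraphs, per-qubit fan-out, and persistence, but the control flow is essentially this loop.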

Sandia’s Inline Online Calibration: per-shot FPGA feedback

Sandia’s IOC protocol updates calibration parameters after every single measurement shot using a gain parameter g. For each shot, it measures a Z-basis outcome of a short calibration circuit, computes the probability gradient, and updates: η_{t+1} = η_t + g·(measurement outcome − expected probability). Mean miscalibration converges to zero, with convergence rate increasing with g. The protocol can calibrate single-qubit rotation angles and two-qubit CZ gate phases (three parameters: θ_ZI, θ_IZ, θ_ZZ), tracking random-walk drift with step size ℓ = 0.001 per shot. Critically, IOC can operate alongside QEC using only syndrome data for calibration, demonstrated with the [[5,1,3]] code.
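
The update rule can be exercised in a toy simulation: a rotation-angle miscalibration shifts the calibration circuit's excited-state probability away from 0.5, and the per-shot update pulls the estimate back while tracking random-walk drift. The linear response model and all constants here are illustrative stand-ins, not Sandia's actual circuit:

```python
import random
random.seed(1)

g, step = 0.02, 0.001        # gain and random-walk drift per shot (illustrative)
theta_true, eta = 0.0, 0.3   # start 0.3 rad miscalibrated
errs = []
for t in range(20000):
    theta_true += random.gauss(0.0, step)              # hardware drift
    # Toy calibration circuit: small-angle linear response around p = 0.5.
    p1 = min(1.0, max(0.0, 0.5 + (theta_true - eta)))
    shot = 1 if random.random() < p1 else 0
    eta += g * (shot - 0.5)                            # per-shot IOC-style update
    errs.append(theta_true - eta)

tail = errs[10000:]
mean_err = sum(tail) / len(tail)   # mean miscalibration after convergence
```

The initial 0.3 rad error decays at a rate set by g, after which the estimator random-walks tightly around the drifting true value — the "mean miscalibration converges to zero" behavior described above.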


7. Classical-quantum orchestration operates across three timescales

Variational algorithms: hundreds to thousands of iterations

The VQE feedback loop follows a strict sequence: prepare parameterized state |Ψ(θ)⟩ → measure expectation value ⟨H⟩ (requiring multiple shots across Pauli term decomposition) → classical optimizer updates θ → repeat. Classical optimizers range from gradient-free (COBYLA: 1 evaluation/iteration; SPSA: 2 evaluations/iteration, best overall under noise) to gradient-based (L-BFGS-B, SLSQP) to adaptive methods (QN-SPSA+PSR combining approximate Fubini-Study metric with Parameter Shift Rule gradients). Simple molecules (H₂, 4 qubits) require 100–500 iterations, while complex systems (LiH, 8 qubits, 80 parameters) can require 800,000 function evaluations.
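
The classical side of that loop can be sketched with SPSA, which needs only two cost evaluations per iteration regardless of parameter count. The energy function below is a hypothetical quadratic stand-in for the shot-estimated ⟨H⟩; a real VQE would obtain it from measurements over the Pauli decomposition:

```python
import numpy as np
rng = np.random.default_rng(0)

TARGET = np.array([0.5, -0.3])   # hypothetical optimal parameters

def energy(theta):
    """Stand-in for the measured expectation <H(theta)>; quadratic toy model."""
    return float(np.sum((theta - TARGET) ** 2) - 1.0)

def spsa_step(theta, k, a=0.2, c=0.1):
    """One SPSA iteration: two evaluations, independent of dimension."""
    ak = a / (k + 1) ** 0.602    # standard SPSA gain schedules
    ck = c / (k + 1) ** 0.101
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # random perturbation
    grad = (energy(theta + ck * delta) - energy(theta - ck * delta)) / (2 * ck) * delta
    return theta - ak * grad

theta = rng.uniform(-np.pi, np.pi, size=2)
for k in range(500):
    theta = spsa_step(theta, k)
```

The two-evaluation cost per iteration is why SPSA scales so well: the 80-parameter LiH case above would cost 160 gradient-free circuit evaluations per SPSA step versus 160 parameter-shift evaluations per gradient component with exact-gradient methods.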

Three network tiers in NVIDIA DGX Quantum

The NVIDIA DGX Quantum architecture with Quantum Machines OPX1000 defines three distinct latency tiers. The QRT Network (~hundreds of nanoseconds) handles pulse shaping, deterministic timing, and mid-circuit feed-forward between OPX1000’s Pulse Processing Unit and the QPU. The QEC Network (<4 μs roundtrip) connects OPX1000 to the Grace Hopper Superchip via OP-NIC over PCIe Gen5 for QEC decoding and AI-based calibration. Diraq achieved 3.3 μs roundtrip with silicon quantum dot qubits. The HPC Network (millisecond+ scale) provides standard cluster integration for hybrid application workflows.

Quantum Machines’ OPX+ (now OPX1000) uses a proprietary Pulse Processing Unit (PPU) that generates pulses on-the-fly based on specified logic rather than playing from memory, enabling embedded calibrations (frequency drift tracking mid-circuit), stream processing for real-time fitting, and active qubit reset in ~100–200 ns.

IBM’s quantum-centric supercomputing treats QPUs as accelerators

IBM’s vision positions QPUs alongside CPUs and GPUs in a heterogeneous compute model. Key enablers include Qiskit Runtime (containerized architecture co-located with the QPU, achieving 120× speedup), Quantum Serverless (multi-cloud orchestration connecting elastic classical resources), and QRMI (Quantum Resource Management Interface) for abstracting resource control. The hardware roadmap progresses from Heron (133 qubits, tunable couplers, 3–5× improvement over Eagle) through Nighthawk (120 qubits, square topology, 5,000–15,000 gate depth) to Starling in 2029 (100 million error-corrected gates across 200 logical qubits using the Gross code) and Blue Jay beyond 2033 (100,000 qubits, 1 billion gates).


8. Telemetry must track quantum-specific metrics that drift on hourly timescales

Monitoring quantum systems differs fundamentally from classical infrastructure

Quantum metrics are inherently probabilistic and fluctuating. A study at the Leibniz Supercomputing Centre analyzed 250+ days of daily calibration data on a 20-qubit IQM processor, using temporal autocorrelation functions to detect drift, cross-metric correlations between T₁, T₂, and gate fidelities, and unsupervised clustering to classify qubits by stability. Parameters drift on timescales of hours (not weeks/months like classical hardware), and stale calibration data directly degrades computation quality. The monitoring stack must track qubit frequencies, T₁/T₂ times, single- and two-qubit RB fidelities, readout assignment matrices, crosstalk, system-level metrics (CLOPS, Quantum Volume), and calibration set age.
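
The temporal-autocorrelation approach from that study can be sketched on synthetic data: a drifting metric shows strong lag-1 autocorrelation, while a stable one does not. The noise levels and thresholds below are illustrative:

```python
import numpy as np
rng = np.random.default_rng(42)

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 — a simple drift indicator."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# Synthetic daily T1 readings (us): one stable qubit, one drifting qubit.
stable = 100 + rng.normal(0, 2, 250)
drifting = 100 + np.cumsum(rng.normal(0, 2, 250))  # random-walk drift

a_stable = lag1_autocorr(stable)    # near zero for white-noise fluctuation
a_drift = lag1_autocorr(drifting)   # near one for a slowly wandering metric
```

Clustering qubits by statistics like these is one way to separate "needs frequent recalibration" devices from stable ones without manual inspection.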

Dilution refrigerator monitoring spans six temperature stages

A typical dry dilution refrigerator has stages at ~300 K (room temperature), ~50 K (first pulse tube stage), ~4 K (second pulse tube — HEMT amplifiers live here), ~500–700 mK (still, where ³He evaporates), ~100 mK (cold plate), and 10–20 mK (mixing chamber, where the QPU operates). Sensors at each stage monitor temperature (RuO₂ thermistors), pressure throughout the Gas Handling System, pulse tube vibration, ³He circulation rate (directly determining cooling power), and magnetic fields. Failure modes include vacuum leaks, pulse tube compressor degradation, helium mixture contamination, blocked impedance lines, and vibration-induced decoherence. Implementations like Johns Hopkins’ Speller Lab use InfluxDB + Grafana dashboards for continuous logging with backend Python API access for programmatic monitoring.
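
A monitoring backend like the InfluxDB + Grafana setup mentioned above typically pairs logging with per-stage threshold alarms. A minimal sketch of the alarm check, with hypothetical limits (real setpoints depend on the specific fridge and load):

```python
# Hypothetical temperature limits per stage (K); illustrative values only.
STAGE_LIMITS = {
    "50K": 60.0,
    "4K": 5.0,
    "still": 0.9,
    "cold_plate": 0.15,
    "mixing_chamber": 0.025,
}

def check_stages(readings):
    """Return the stages whose temperature exceeds its alarm limit."""
    return [stage for stage, temp in readings.items() if temp > STAGE_LIMITS[stage]]

# Example snapshot: every stage nominal except a warm mixing chamber.
alarms = check_stages({
    "50K": 52.1, "4K": 3.9, "still": 0.72,
    "cold_plate": 0.11, "mixing_chamber": 0.031,
})
```

In practice the same check would run against a rolling window of sensor data rather than a single snapshot, since transient spikes (e.g. during pulse-tube maintenance) should not page an operator.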

Amazon Braket exposes task-level metrics, not hardware telemetry

Amazon Braket publishes two primary metrics to CloudWatch: Count (number of quantum tasks) and Latency (total time from task initialization to completion), dimensioned by deviceArn. Users can define custom algorithm metrics for hybrid jobs, automatically logged and displayed in near real-time. The broader AWS integration includes CloudTrail (API auditing), EventBridge (completion triggers), and an open-source cost-control solution using DynamoDB + Lambda + CloudWatch Alarms that can automatically revoke braket:CreateQuantumTask permissions when budget thresholds are exceeded. However, Braket operates as a cloud abstraction layer — it does not expose hardware-level quantum metrics (T₁, T₂, gate fidelities, refrigerator temperatures). Those are managed by QPU providers (IonQ, Rigetti, IQM, QuEra) and exposed only through the device properties API, highlighting the fundamental tension between cloud abstraction and hardware-aware optimization in quantum OS design.


Conclusion: the quantum OS stack is converging on real engineering

Several patterns emerge across these eight layers.

First, the latency hierarchy is the defining architectural constraint — from sub-nanosecond pulse timing through microsecond QEC decoding to hour-scale calibration drift, each layer must respect the timescales of the layers below it.

Second, hardware diversity is not converging — superconducting, trapped-ion, and neutral-atom platforms demand fundamentally different control signals, and projects like QDMI and Marol explicitly embrace this heterogeneity rather than trying to eliminate it.

Third, the NP-hardness of qubit routing and the exponential overhead of circuit cutting set fundamental limits on what software can achieve, making hardware topology choices (IBM’s shift from heavy-hex to square lattice) as much an OS design decision as a physics one.

Fourth, qLDPC codes will reshape the entire stack — the 10× reduction in physical qubit overhead from bivariate bicycle codes changes resource management, decoder architecture, and modular design simultaneously.

Fifth, real-time calibration is becoming essential — the discovery that T₁ fluctuates on millisecond timescales, combined with Sandia’s per-shot IOC protocol, suggests that future quantum OS kernels will perform continuous calibration as a background process rather than treating it as offline maintenance.

The quantum OS is no longer a theoretical construct; it is an engineering discipline with hard numbers, proven implementations, and clearly defined open problems.

Quantum Upside & Quantum Risk - Handled

My company - Applied Quantum - helps governments, enterprises, and investors prepare for both the upside and the risk of quantum technologies. We deliver concise board and investor briefings; demystify quantum computing, sensing, and communications; craft national and corporate strategies to capture advantage; and turn plans into delivery. We help you mitigate the quantum risk by executing crypto‑inventory, crypto‑agility implementation, PQC migration, and broader defenses against the quantum threat. We run vendor due diligence, proof‑of‑value pilots, standards and policy alignment, workforce training, and procurement support, then oversee implementation across your organization. Contact me if you want help.


Marin Ivezic

I am the Founder of Applied Quantum (AppliedQuantum.com), a research-driven consulting firm empowering organizations to seize quantum opportunities and proactively defend against quantum threats. A former quantum entrepreneur, I’ve previously served as a Fortune Global 500 CISO, CTO, Big 4 partner, and leader at Accenture and IBM. Throughout my career, I’ve specialized in managing emerging tech risks, building and leading innovation labs focused on quantum security, AI security, and cyber-kinetic risks for global corporations, governments, and defense agencies. I regularly share insights on quantum technologies and emerging-tech cybersecurity at PostQuantum.com.