Quantum Computing Benchmarks: RCS, QV, AQ, and More

Introduction
As quantum computing hardware rapidly improves, simple metrics like qubit count are no longer sufficient to gauge a system’s true capability. Unlike classical computers where transistor counts roughly correlate with performance, quantum bits (qubits) can be error-prone and short-lived, so a few high-fidelity qubits can be more valuable than many noisy ones. This has led researchers to develop specialized benchmarks that capture different aspects of quantum computing performance – from the ability to perform classically intractable tasks to the effective computational power and reliability of a device. IBM, for example, categorizes quantum performance along three dimensions: Scale (number of qubits), Quality (measured by Quantum Volume), and Speed (measured by CLOPS, circuit layer operations per second). These metrics provide a more holistic yardstick for progress than raw qubit counts.
The industry has developed a number of quantum benchmarks. Let’s look at a few of the leading ones.
Random Circuit Sampling (RCS) and Quantum Supremacy
One of the most headline-grabbing benchmarks in quantum computing is Random Circuit Sampling (RCS), which underpins demonstrations of quantum supremacy (or “quantum advantage”). RCS involves running a quantum computer on a suite of random circuits and checking how well the output distribution matches what quantum mechanics predicts. The idea was first formalized by Boixo et al. (2018) as a way of “characterizing quantum supremacy in near-term devices”. In an RCS experiment, a quantum processor applies a sequence of randomly chosen gates to all its qubits, creating a complex entangled state, and then samples (measures) the resulting bitstrings many times. Because the circuit is random, the output is essentially a random probability distribution over bitstrings – but crucially, a quantum distribution that is hard to simulate classically. To quantify how well the quantum device is performing, researchers use cross-entropy benchmarking (XEB). In simple terms, they compute a fidelity metric comparing the observed output frequencies to the ideal probabilities computed via classical simulation (possible only for small cases). The XEB fidelity $F_{\mathrm{XEB}}$ is defined such that $F_{\mathrm{XEB}}=1$ for a perfect, noiseless quantum device and $F_{\mathrm{XEB}}=0$ if the outputs are no better than random guesses. In practice, a real quantum computer will yield some intermediate fidelity $0 < F_{\mathrm{XEB}} < 1$ depending on its noise levels and how complex a circuit it can handle. An XEB fidelity significantly above 0 indicates the quantum device is outperforming any classical random simulator (which would produce uniformly random outputs on genuinely hard circuits).
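To make the XEB calculation concrete, here is a minimal Python/NumPy sketch of the linear cross-entropy estimator, $F_{\mathrm{XEB}} \approx 2^n \langle P_{\mathrm{ideal}}(x) \rangle_{\mathrm{samples}} - 1$, applied to bitstrings drawn either from the ideal distribution or uniformly at random. The toy 5-qubit distribution and sample counts are purely illustrative – a real experiment would use ideal probabilities computed by simulating the specific random circuit.

```python
import numpy as np

def linear_xeb_fidelity(ideal_probs, sampled_bitstrings, n_qubits):
    """Linear XEB estimator: D * <p_ideal(sampled bitstrings)> - 1.

    Approaches 1 for an ideal device on large random circuits and 0 for
    uniformly random output (finite-size effects matter at few qubits).
    """
    dim = 2 ** n_qubits
    mean_p = np.mean([ideal_probs[b] for b in sampled_bitstrings])
    return dim * mean_p - 1.0

# Toy example on 5 qubits with a Porter-Thomas-like "ideal" distribution.
rng = np.random.default_rng(0)
n = 5
weights = rng.exponential(size=2 ** n)
weights /= weights.sum()
ideal = {format(i, f"0{n}b"): p for i, p in enumerate(weights)}
bitstrings = list(ideal)

good = rng.choice(bitstrings, size=20_000, p=weights)  # samples the ideal distribution
noisy = rng.choice(bitstrings, size=20_000)            # uniformly random guesser

print("sampling ideal distribution: F_XEB ~", round(linear_xeb_fidelity(ideal, good, n), 2))
print("uniform random guessing:     F_XEB ~", round(linear_xeb_fidelity(ideal, noisy, n), 2))
```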
What does RCS measure and why is it important?
RCS is essentially stress-testing a quantum processor’s entangling capability and coherence using highly complex, unstructured circuits. It pushes the device to sample from a distribution that is conjectured to be exponentially hard to simulate with any classical computer once the circuit size grows large. This makes RCS a powerful benchmark for demonstrating the regime of beyond-classical computation. In 2019, Google’s 53-qubit Sycamore processor famously used RCS to claim the first experimental quantum supremacy: it generated one million samples from a random 53-qubit circuit (depth 20) in about 200 seconds, whereas they estimated the task would take 10,000 years on the best classical supercomputer. The measured fidelity was extremely low (about 0.2%, i.e., an XEB fidelity of roughly 0.002), but still well above zero, meaning the quantum output had detectable correlations that a random guesser would not produce. This tiny fidelity was strong evidence that the quantum device was indeed following the correct quantum distribution – something far beyond what classical simulation could verify directly at that size. Achieving even a small XEB > 0 on a 53-qubit, depth-20 random circuit became an experimental proof-of-concept for quantum computers outperforming classical ones on a specific task.
Mathematically, verifying RCS output involves calculating ideal circuit amplitudes on a classical computer for a sample of output bitstrings and computing the cross-entropy between the ideal and observed distributions. While the full details are technical, the key point is that for a sufficiently complex random circuit, classical computation of the exact output probabilities is infeasible – so the experiment relies on statistical indicators. If the quantum device is perfect, it will produce more “heavy outputs” (bitstrings with higher-than-median ideal probability) than a random source would. In fact, the RCS success criterion often used is that the quantum computer produces heavy outputs with frequency significantly > 50% (the heavy output generation test). Cross-entropy fidelity can be estimated from the samples and compared to the ideal heavy-output threshold. In Google’s experiment, the results passed this threshold, confirming the quantum device was sampling the correct distribution. Subsequent theoretical work and reproductions have refined the classical comparison: improved simulation algorithms narrowed the 10,000-year gap (for instance, to a few days on the Chinese Sunway supercomputer) but as circuits grow, classical simulation costs still blow up exponentially. In 2021, a team in China (USTC) pushed RCS further with a 60-qubit superconducting processor (Zuchongzhi 2.1), running random circuits of 24 layers and achieving an XEB fidelity of about 0.000366 (0.0366%). Even with that tiny fidelity, the quantum experiment (taking ~4 hours) was estimated to correspond to 10,000 years of computation on Sunway, reaffirming quantum advantage at an even larger scale.
Strengths and weaknesses
Random Circuit Sampling is a maximal stress-test of a quantum computer’s raw processing power. Its strength lies in being platform-agnostic and extremely demanding – a high-water mark for quantum performance. If a quantum computer can succeed at RCS for N qubits, it implies a certain threshold of coherence and gate fidelity across those N qubits. RCS was designed to be hard for classical simulation, so it’s a direct yardstick of “have we crossed the quantum supremacy frontier?” However, RCS is not a practical algorithm for solving real-world problems – it’s essentially a random data generator. Thus, a chief criticism is that demonstrating supremacy via RCS, while scientifically important, doesn’t immediately translate to useful applications. In addition, verifying RCS output requires substantial classical computation (for smaller instances or statistical extrapolation), making it impractical as a routine benchmark beyond certain sizes. The fidelity values in supremacy experiments are typically very low, which raises the bar for measurement precision and statistical confidence. Another weakness is that RCS doesn’t directly tell you how a quantum computer would perform on structured problems or algorithms – a machine might ace RCS but struggle on a more structured task if, say, its qubits are not connected in a way suitable for that task. Therefore, RCS is best viewed as a research benchmark for pushing the limits (primarily used in academia and lab demonstrations). It has been invaluable in charting progress – e.g. Google’s 2019 result, USTC’s 2021 result – and continues to be used to test new devices (often through internal experiments) against classical simulation capabilities. In the commercial realm, however, companies tend not to focus on RCS for marketing (since it’s not directly useful to customers); it remains more of a bragging-rights milestone indicating a quantum computer is “supreme” in at least one regime. In summary, RCS measures the extreme computational power of a quantum processor, and it’s important for validating that power, but it doesn’t capture how useful the processor is for everyday algorithms.
Quantum Volume (QV)
As the field moved into the NISQ (noisy intermediate-scale quantum) era, there arose a need for a holistic benchmark of a quantum computer’s capability – one that accounts for both the number of qubits and the noise/errors in the system. Quantum Volume (QV) emerged as a popular single-number metric for this purpose. Introduced formally by IBM researchers (Cross et al. 2019), QV seeks to answer the question: “How large of a random quantum circuit can this computer successfully run?” It is defined as the size of the largest square circuit (i.e. equal number of qubits and depth) that the quantum computer can execute with a sufficiently high fidelity. In practice, if a device can run a random circuit on n qubits for n layers of gates and still produce meaningful output (passing a certain success criterion), then the quantum volume is $2^n$. For example, if a machine can handle 5 qubits with 5 layers of entangling gates, its QV would be $2^5 = 32$. If it can handle 6×6, QV = 64, and so on. The larger the quantum volume, the more powerful the system is in terms of general capability. QV is designed to encapsulate multiple factors: qubit count, gate fidelity, connectivity, compiler efficiency, crosstalk, measurement error – essentially anything that affects the ability to execute an arbitrary circuit. By compressing all this into one number, QV allows apples-to-apples comparisons of quantum processors even if they have different architectures. IBM’s goal was to create a metric analogous to a classical computer’s bit-count or FLOPS, but for quantum: “Quantum volume is a single number designed to show all-around performance”.
Mathematical framework and methodology
Measuring QV involves a specific protocol. One chooses a circuit size n (starting from a small number and increasing). Then one randomly generates a large set of circuits with width = depth = n. Each circuit consists of random two-qubit gates applied in parallel on the n qubits (IBM’s implementation uses random SU(4) rotations that are compiled to the device’s native gates, ensuring the circuit is sufficiently entangling). The circuit is executed on the quantum device many times to collect output bitstrings. For each circuit, a classical computer can still simulate it (since n is kept small while testing) to get the ideal output probabilities. The success criterion uses the concept of Heavy Output Generation (HOG). The idea (from the original QV paper) is that for an ideal random circuit, about half of the output bitstrings are “heavy” – meaning their ideal probability is above the median probability. If the quantum computer is working well, it will sample those heavy-output bitstrings with higher-than-expected frequency. Concretely, one computes the fraction of measured bitstrings that fall into the heavy-output set for each random circuit, then averages this over the ensemble. If the average heavy-output probability exceeds a threshold (typically 2/3, corresponding to a certain statistical confidence), the device is deemed to successfully implement circuits of size n. One then increases n until the device fails to meet the threshold; the quantum volume is $2^{n_{\max}}$ for the largest passed n. In this way, QV is essentially measuring a volume (width × depth) in the circuit complexity space – the largest “square” that the quantum computer can fill with reliable computation.
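As a sketch of the bookkeeping this protocol implies (not IBM’s full statistical test), the snippet below computes each circuit’s heavy-output fraction from simulator-supplied ideal probabilities and device counts, then applies a simplified two-sigma check against the 2/3 threshold; the input structures per_circuit_ideal and per_circuit_counts are hypothetical placeholders.

```python
import numpy as np

def heavy_output_fraction(ideal_probs, measured_counts):
    """Fraction of measured shots that land in the heavy-output set.

    ideal_probs: dict bitstring -> ideal probability for one random circuit
                 (from a classical simulation).
    measured_counts: dict bitstring -> number of times the device returned it.
    """
    median_p = np.median(list(ideal_probs.values()))
    heavy_set = {b for b, p in ideal_probs.items() if p > median_p}
    shots = sum(measured_counts.values())
    heavy_shots = sum(c for b, c in measured_counts.items() if b in heavy_set)
    return heavy_shots / shots

def passes_qv_at_n(per_circuit_ideal, per_circuit_counts, threshold=2/3, z=2.0):
    """Simplified pass check for width = depth = n: the mean heavy-output
    probability over the circuit ensemble must exceed `threshold` with a
    ~2-sigma margin (a stand-in for the full statistical test)."""
    fractions = [heavy_output_fraction(p, c)
                 for p, c in zip(per_circuit_ideal, per_circuit_counts)]
    mean = float(np.mean(fractions))
    stderr = float(np.std(fractions, ddof=1)) / np.sqrt(len(fractions))
    return mean - z * stderr > threshold, mean
```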
It’s worth noting that the original definition of QV by Moll et al. (2018) was slightly different and more theoretical, but IBM’s 2019 redefinition (as above) has become the standard. IBM chose the exponential scale ($QV = 2^n$) so that each increment in n (one more qubit and layer) doubles the quantum volume, reflecting the exponentially growing classical complexity of simulating such circuits. For example, a QV of 128 indicates 7 qubits × 7 depth circuits are doable, and simulating those circuits classically corresponds to an exponentially large state-space of size $2^7 = 128$ (hence the number). Larger QV implies the device can explore a larger portion of Hilbert space reliably.
What does QV measure and why is it important?
Quantum Volume is meant to be an overall capacity benchmark. It doesn’t focus on one extreme (like RCS focuses on a maximally hard task regardless of usefulness) but rather on the balanced ability to run medium-scale circuits with high fidelity. QV is important because it captures the interplay between quantity of qubits and quality of operations. A device with many qubits but high error rates will have a low QV (because you can’t run deep circuits on all those qubits without errors spoiling the result). Conversely, a device with very low error rates but only a few qubits is also limited in QV (because you can’t go beyond that qubit count even though each operation is clean). Thus, QV rewards improvements in error reduction, connectivity, and compiler optimization in addition to just adding qubits. It provides a single metric to track progress as hardware teams improve their systems’ error rates year over year. Generally, the higher the QV, the more complex algorithms the quantum computer can handle before errors dominate. In fact, IBM and others often state that increasing QV broadens the class of feasible quantum applications.
Strengths and use cases
Quantum Volume’s strength is in being hardware-agnostic and comprehensive. The circuits used to measure QV are random but can be compiled to fit any architecture’s topology, meaning QV can be measured for superconducting qubits, trapped ions, photonic qubits, etc., and the results are comparable. This has made QV something of an industry-standard benchmark since 2019. Companies report QV scores to demonstrate their technical progress. For instance, IBM initially measured a QV of 16 on early 5-qubit devices and steadily improved this: QV 32, QV 64, and then QV 128 on their 27-qubit systems. Honeywell (now Quantinuum) entered the race with their trapped-ion systems, achieving QV 64 in 2020 (on a 6-qubit device) and then quickly surpassing IBM. By 2021, Honeywell’s H1 machine reached QV = 1024, the first to cross four digits. This meant it could handle 10×10 random circuits, thanks to very low error rates. They continued this trajectory, with Quantinuum announcing QV = 4096 in 2022 (12×12 circuits) on the H1-2 system. These achievements empirically verified a 16× performance increase in a year, as Honeywell had promised (from QV 64 to 1024), and then another 4× to 4096. IBM also later announced QV 256 on a newer 127-qubit processor (using an 8-qubit subset with improved calibration). Quantum Volume thus serves as a yardstick for hardware improvements: a higher QV generally indicates a device can run more complex algorithms reliably. In academic research, QV is used to benchmark different technological approaches under a common protocol – for example, to compare a superconducting qubit device vs. an ion-trap device on equal footing. It is also used internally by hardware teams as a target metric when optimizing gate fidelity, crosstalk reduction, etc.: if a change increases QV, it means more circuits become accessible.
Another strength is that QV, by requiring a threshold success probability, is hard to game – one must genuinely improve hardware or error mitigation to increase it. (This point has been subject to debate, which we’ll touch on later.) Because QV tests random circuits, it ensures the device isn’t just optimized for one special algorithm but has generally good performance. In summary, QV’s ideal use case is tracking the general advancement of quantum hardware. It is especially popular in commercial announcements and roadmaps – IBM uses QV as a benchmark of “quantum quality” on its roadmap, and other vendors like Quantinuum and IonQ also quote QV (or related metrics) to demonstrate leads. As a result, QV has become a common reference in discussions of when quantum computers will reach certain thresholds needed for practical applications.
Weaknesses and criticisms
Despite its utility, QV is not a perfect metric. One criticism is that quantum volume tests only square circuits and random ones at that, which may not represent the needs of real algorithms. A machine might have a high QV but could still struggle on a specific structured problem that, say, requires very deep circuits or an algorithm-specific pattern that isn’t captured by random circuits. Moreover, as devices get better, QV numbers grow exponentially and can become astronomically large, which is hard to interpret. For instance, IonQ’s trapped-ion system in 2020 was reported to have a quantum volume around 4,000,000 (about 22 qubits effective), far above IBM’s devices at that time. IonQ argued that quoting such a huge number wasn’t meaningful to end-users. In fact, IonQ’s team noted that “at some point, the [QV] numbers just get far too high” and they preferred a more intuitive measure. This leads to the concept of Algorithmic Qubits, which we discuss next, essentially taking $\log_2$ of QV to get a qubit count. Another issue lies in the specifics of the QV test: the protocol implicitly allows certain optimizations (e.g., aggressive, device-aware gate compilation and other tricks permitted by the specification). Different teams might implement the QV test slightly differently, though the community has broadly adopted IBM’s open specification to keep it fair. Still, some have raised concerns that one could “teach to the test” – optimize the system specifically to pass QV circuits without truly improving general performance. This is part of an ongoing dialogue: how to design benchmarks that reflect real-world usefulness and cannot be gamed.
Finally, QV doesn’t incorporate speed; it doesn’t matter how slow the machine is, as long as it gets the correct distribution. A quantum computer that takes a long time to run the circuits could still have a high QV if it succeeds with good fidelity. This is why IBM later introduced separate speed metrics like CLOPS (more on this in Other Benchmarks below). Despite these caveats, QV remains a widely cited figure of merit. It is used both in academia (especially in benchmarking studies and standards committees) and in commercial marketing as a concise way to claim “world’s most powerful quantum computer” – at least in a certain sense of power. For example, Quantinuum and IBM have traded quantum volume records, and each uses QV to assert leadership in hardware, while IonQ often cites equivalent QV via their AQ metric to show their lead in fidelities.
Algorithmic Qubits (AQ)
While Quantum Volume focuses on random circuits, Algorithmic Qubits (AQ) is a newer benchmark that aims to measure the useful computational quantum bits a system has for running practical algorithms. This metric was spearheaded by IonQ in 2020 as an “application-oriented” performance measure, and it has gained attention as an alternative single-number metric. In simple terms, #AQ (the number of algorithmic qubits) represents the largest problem size (number of qubits in the algorithm) for which the quantum computer can run a selection of reference algorithms with acceptable success. It effectively tells you how many qubits you can actually utilize for real computations (as opposed to just existing on the chip). IonQ describes #AQ as “a tool for showing how useful a quantum computer is at solving real problems”. In practice, #AQ is determined by running a benchmark suite of quantum algorithms (not just random circuits) – for example, quantum Fourier transform, Grover’s search, small quantum chemistry simulations, optimization routines like QAOA, etc. – across increasing numbers of qubits, and seeing at what size the results remain correct with high probability. The suite used by IonQ is based on the QED-C application-oriented benchmarks developed by the Quantum Economic Development Consortium, which is a cross-industry group that defined a set of meaningful test algorithms. These algorithms cover different domains (cryptography, chemistry, optimization, etc.) to ensure broad coverage. The results are analyzed in a volumetric benchmarking manner (varying circuit width and depth) similar to quantum volume, but targeted at algorithm circuits.
What #AQ measures
If a quantum computer has, say, 20 algorithmic qubits (#AQ = 20), it means you can reliably run 20-qubit versions of several key algorithms and get correct or high-fidelity results. In IonQ CTO Jungsang Kim’s words, “Having an #AQ of 20 means one can execute a reference quantum circuit over 20 qubits that contains over 400 entangling gate operations and expect the results to be correct with meaningful confidence.” In other words, it’s not just that 20 physical qubits exist, but the system can entangle them through hundreds of gate operations (a non-trivial algorithm) and still produce a valid answer. This is a high bar that incorporates both noise levels and circuit compilation efficiency for real tasks. #AQ is important because it directly addresses practical utility: it focuses on whether a quantum computer can solve instances of problems at a certain scale. This resonates with end-users who care about whether the machine can run their algorithm of interest on, say, 30 qubits without crashing into errors. It moves the conversation from abstract circuit fidelity to application-level performance.
Methodology and mathematical basis
The methodology for #AQ was outlined by IonQ in a detailed 2022 technical blog and is rooted in the QED-C benchmark suite. It works as follows: define a set of benchmark algorithms each with a parameterizable size (qubit count). For each algorithm and for a given size n, run the algorithm on the quantum hardware (possibly multiple trials) and evaluate the result’s correctness or fidelity (for example, did the quantum Fourier transform yield the correct spectrum up to some error tolerance? Did Grover’s algorithm find the marked item with sufficiently high probability?). If the algorithm’s outcome meets the success criteria, that size is considered “passed” for that algorithm. They then look for the largest n for which all (or most) of the representative algorithms can be successfully executed. That n is the algorithmic qubit count. In practice, some algorithms will be more demanding (especially ones with deep circuits), so they effectively determine the limiting factor. The underlying mathematical framework still uses volumetric benchmarking concepts – plotting algorithm success as a function of circuit width (qubits) and depth – and determining a frontier where performance drops off. In the IonQ approach, they allowed for certain error-mitigation and result post-processing techniques (e.g., plurality voting over multiple runs) to ensure the results are “meaningful to customers”. This is a key difference: #AQ embraces any technique that improves the practical outcome, whereas Quantum Volume is a more raw hardware metric with a fixed protocol. IonQ also explicitly relates #AQ to quantum volume, stating that $\#\mathrm{AQ} = \log_2(\mathrm{QV})$ for their definition. In fact, IonQ acknowledges that they are essentially using QV in spirit but translating it to qubit units and using algorithmic tests rather than random circuits. By doing so, a huge QV like 4,194,304 becomes a more digestible 22 algorithmic qubits.
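The arithmetic relating the two metrics is simple, and the “largest width at which every benchmark algorithm passes” rule can be sketched in a few lines. The aggregation helper and its 0.5 success threshold below are illustrative simplifications, not the official QED-C/IonQ scoring procedure.

```python
import math

def aq_from_qv(quantum_volume: int) -> int:
    """IonQ's stated relation: #AQ = log2(QV)."""
    return int(math.log2(quantum_volume))

def algorithmic_qubits(results, success_threshold=0.5):
    """Largest circuit width n at which every benchmark algorithm still passes.

    results: dict algorithm_name -> {n_qubits: measured success metric in [0, 1]}
    The aggregation rule and 0.5 threshold are illustrative simplifications,
    not the official QED-C / IonQ scoring procedure.
    """
    widths = sorted(set().union(*(per_algo.keys() for per_algo in results.values())))
    passing = [n for n in widths
               if all(per_algo.get(n, 0.0) >= success_threshold
                      for per_algo in results.values())]
    return max(passing) if passing else 0

print(aq_from_qv(4_194_304))  # QV ~4.19 million -> 22 algorithmic qubits
```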
Strengths and ideal use cases
The strength of #AQ is its direct relevance to real-world tasks. It attempts to quantify the usable size of the quantum computer for solving problems of interest. For end users or businesses, hearing that a machine has, e.g., 25 algorithmic qubits is more concrete than hearing it has QV = 33 million. It means they might be able to tackle a problem requiring 25 qubits. #AQ also fosters a mindset of testing quantum computers with real algorithms, which can uncover issues not seen in random circuits. For instance, an algorithm might require certain gate patterns or longer circuit depth, and if the machine can handle that, it’s a strong proof of capability. IonQ has been a big proponent of this metric, using it to chart their progress. In 2021, IonQ announced their system “Aria” achieved #AQ = 20, which was at the time the highest single-number benchmark in the industry by their measure. They demonstrated that IonQ’s 20 algorithmic qubits allowed circuits with >550 two-qubit gates to run successfully, whereas competitor superconducting systems could only handle dozens of such gates before failing. This highlighted the advantage of IonQ’s high-fidelity, fully-connected ion qubits for running deeper circuits.
More recently (Jan 2024), IonQ’s latest system “Forte” reached #AQ = 35, a full year ahead of their roadmap schedule. At 35 algorithmic qubits, the system can explore an enormous state space (roughly $2^{35} \approx 34$ billion basis states in superposition), indicating a significant leap in capability.
IonQ often frames these milestones in terms of what applications become feasible – e.g. they suggest that around 40 algorithmic qubits could start to show quantum advantage in machine learning tasks, and ~64 might be a transformative point (“a ChatGPT moment for quantum,” as one headline put it). The #AQ metric is now prominently used in IonQ’s commercial offerings and roadmaps, and some other companies/consortia are watching it closely. It is ideally used by quantum application developers and those evaluating different hardware for practical algorithm performance. For instance, a financial services team might look at a machine’s #AQ to judge if it can run their option pricing algorithm on N qubits reliably.
Weaknesses and controversy
Being a relatively new benchmark largely promoted by one company, #AQ has invited scrutiny from competitors. One critique, notably voiced by Quantinuum (Honeywell), is that Algorithmic Qubits could be “easier to pass” than Quantum Volume by using certain tricks. In a 2024 technical blog titled “Debunking algorithmic qubits,” Quantinuum’s team argued that #AQ, as defined by IonQ, allows combining results from multiple runs (a “plurality voting trick”) and using tailored gate compilations that might inflate the success rate without truly improving hardware. They claim this makes a system look better than it actually is, and that #AQ is a “poor substitute” for more rigorous measures like QV. In essence, the criticism is that #AQ might hide raw performance issues behind algorithm-specific error mitigation. IonQ would counter that those mitigations are precisely what a user would do to get a correct answer, so they should count – after all, what matters is the end result being correct, not whether every gate was perfect. This touches on a philosophical difference: QV is a hardware benchmark (emphasizing intrinsic noise), whereas #AQ is an application benchmark (emphasizing solved problems by any means). The truth likely lies in between – #AQ is very useful to gauge practical progress, but one should understand the context and methods used.
Another weakness of #AQ is that it’s not yet widely adopted or standardized. The QED-C provided an initial suite and definitions, but different groups might choose different algorithm sets or success thresholds, making cross-platform comparison tricky unless everyone agrees on the same tests. IonQ has published enough details that others could in principle measure their own #AQ, but so far we mostly have IonQ’s numbers and claims, with less independent verification. Nonetheless, academic efforts (like the QED-C paper by Lubinski et al. 2021) lend credibility to the approach and show that it is being taken seriously industry-wide.
Going forward, #AQ or similar “quantum computing utility” metrics are likely to gain traction as machines inch towards solving real tasks. In practice, both academic researchers and quantum software companies have started incorporating algorithmic benchmarks to complement hardware-focused ones. The ideal use case for #AQ is to inform users how large an algorithm they can run on a given hardware platform with confidence. It aligns benchmarking with real-world goals, ensuring that progress in quantum computing is measured not just by esoteric metrics, but by the ability to do something non-trivial and correct with the machine.
Other Benchmarking Methodologies
Beyond the big three benchmarks above, there are several other important methodologies used to evaluate quantum computers. Each serves a different purpose – some target low-level error rates, others target high-level performance like speed or throughput. We briefly survey a few of these:
Randomized Benchmarking (RB) – Gate Error Rates
While not a “system power” metric like QV or RCS, Randomized Benchmarking is an essential technique to measure the fidelity of quantum logic operations. Introduced in the late 2000s and refined in the early 2010s (Knill et al., 2008; Magesan et al., 2011), RB involves applying long random sequences of gates to qubits and then applying the inverse of the sequence to ideally return to the initial state. By varying the sequence length and measuring the success probability, one can extract the average error per gate. Mathematically, the survival probability typically decays exponentially with sequence length, and the decay constant directly yields the error rate of the gates. RB has become the gold-standard for characterizing qubit quality because it is scalable and fairly insensitive to state preparation and measurement errors, focusing on the errors of the gate operations themselves. For example, a device might report an average two-qubit gate error of 0.5% from RB – this number indicates how often a gate fails and is crucial for estimating how deep a circuit can go before noise overwhelms it.
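A minimal sketch of the standard RB analysis: fit the survival probability to the decay model $A p^m + B$ and convert the fitted $p$ into an average error per Clifford via the depolarizing-model formula $r = \frac{d-1}{d}(1-p)$. The data below are synthetic; real analyses average over many random sequences and propagate fit uncertainties.

```python
import numpy as np
from scipy.optimize import curve_fit

def rb_decay(m, a, p, b):
    """Standard RB model: survival probability = a * p**m + b."""
    return a * p ** m + b

def error_per_clifford(seq_lengths, survival_probs, n_qubits=1):
    """Fit the exponential decay and convert the decay parameter p into an
    average error per Clifford, r = (d - 1) / d * (1 - p), for d = 2**n."""
    d = 2 ** n_qubits
    (a, p, b), _ = curve_fit(rb_decay, seq_lengths, survival_probs,
                             p0=[0.5, 0.99, 0.5], bounds=(0.0, 1.0))
    return (d - 1) / d * (1 - p)

# Synthetic single-qubit data corresponding to ~0.2% error per Clifford.
lengths = np.array([1, 5, 10, 25, 50, 100, 200, 400])
true_p = 1 - 2 * 0.002                      # chosen so that 0.5 * (1 - p) = 0.002
probs = 0.5 * true_p ** lengths + 0.5
print(f"estimated error per Clifford: {error_per_clifford(lengths, probs):.4f}")
```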
RB is important for benchmarking progress in error reduction. Academic papers and industry hardware teams both rely on RB to track improvements in gate fidelity over time. It doesn’t produce a single “score” for the whole system, but rather component-level metrics (error rates for 1-qubit and 2-qubit gates, readout, etc.). Its strength is in providing a rigorous, quantitative measure of quality for quantum operations. Almost every quantum hardware platform publishes RB numbers (e.g., IBM regularly updates a public dashboard of RB-based error rates for each of its devices). The weakness of RB is that it’s somewhat removed from actual algorithms – it measures random gate sequences, not necessarily the patterns used in real computations. Also, RB averages over all possible gates (usually in the Clifford group), so it may not detect worst-case errors for specific gates. Nonetheless, it’s extremely useful for diagnosing and comparing hardware. In the context of our discussion, one can view RB as complementary to QV/AQ: RB tells you how likely errors are at the gate level, while QV/AQ tell you what size problem can be handled given those errors. In summary, RB is a foundational benchmarking method in academic research for quantum error characterization, and it underpins the progress that is later reflected in higher QV or AQ scores. A variant called interleaved RB can measure the error of a specific gate (by interleaving it in random sequences), and other variants target multi-qubit operations and crosstalk. All these help quantum engineers improve the system piece by piece.
Cross-Entropy (XEB) vs. Other Fidelity Benchmarks
We already discussed Cross-Entropy Benchmarking (XEB) in the context of RCS. It’s essentially the metric used in random circuit sampling experiments to evaluate fidelity. XEB has also been used as a standalone benchmarking method: one can run random circuits of various sizes and directly compute the XEB fidelity to assess performance. In fact, Google often uses linear XEB fidelity as a way to characterize their devices for different circuit depths. It’s similar in spirit to QV’s heavy output test but uses the full cross-entropy calculation. XEB’s strength is its sensitivity to any deviation in the output distribution, making it a good measure of global performance of a circuit. It’s particularly useful for verifying quantum supremacy results or for benchmarking simulators vs. real hardware on random circuits. However, XEB requires classical computation of ideal probabilities for at least a subset of outputs, so it’s not practical beyond maybe 50 qubits (hence mostly tied to the supremacy regime).
Other fidelity benchmarks include state or process tomography (reconstructing the full quantum state or gate process to compute fidelity). Tomography is very resource-intensive, so it’s limited to small systems (perhaps up to 3–4 qubits). It’s used in academic experiments to fully characterize a few qubits or a multi-qubit gate. Quantum volume itself can also be viewed as an aggregate fidelity benchmark (we discussed it in detail above). There’s also cycle benchmarking and Clifford fidelity measures – all variations focusing on measuring how noise affects multi-qubit operations in aggregate.
Throughput and Speed: CLOPS and rQOPS
While QV and AQ focus on what a quantum computer can do in principle, they don’t account for how fast it can do it. As quantum computing moves toward practical use, throughput becomes important – especially for algorithms that require many repetitions or many circuit evaluations (such as Variational Quantum Eigensolver or QAOA which involve thousands of circuit iterations). Two metrics have been proposed to quantify speed:
CLOPS (Circuit Layer Operations Per Second)
Introduced by IBM in 2021 as a quantum analog of FLOPS. CLOPS measures how many layers of a given depth (specifically, the layers used in a QV circuit) a quantum system can execute per second. It takes into account not just gate speeds, but also qubit reset time, readout, and classical processing delays between circuit iterations. In essence, IBM runs a stream of random QV circuits back-to-back and measures the effective processing rate. Improvements like better scheduling, parallelism, and faster classical-quantum interface boost CLOPS. IBM reported that with their Qiskit Runtime system, they achieved up to 1,400 CLOPS on their fastest systems (meaning 1,400 layers of gates per second). CLOPS is useful to gauge how well a quantum computer could handle, say, a variational algorithm that needs lots of circuit evaluations – higher CLOPS means the answer comes faster. It’s a full-stack metric, involving control electronics and software, so it encourages optimization beyond just the qubit physics. The drawback is that CLOPS is somewhat specific (tied to a certain circuit type) and not widely reported by others yet. But it’s likely to become more important as hardware reaches the point where raw quality is good enough and the bottleneck shifts to iteration speed.
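A sketch of the CLOPS arithmetic as IBM describes it: run a batch of parameterized QV-style circuit templates, update their parameters several times, take many shots, and divide the total number of circuit layers executed by the elapsed wall-clock time. The specific counts used below (100 templates, 10 updates, 100 shots) are commonly quoted defaults and should be read as assumptions, not a definitive specification.

```python
def clops(num_templates, num_updates, num_shots, qv_layers, elapsed_seconds):
    """CLOPS as IBM describes it: total QV-circuit layers executed per second,
    counting templates x parameter updates x shots x layers per circuit.
    (Parameter roles follow IBM's description; the defaults used below are assumed.)"""
    total_layers = num_templates * num_updates * num_shots * qv_layers
    return total_layers / elapsed_seconds

# Example: 100 templates, 10 parameter updates, 100 shots, QV 128 => 7 layers,
# with the whole workload finishing in 500 seconds of wall-clock time.
print(f"{clops(100, 10, 100, 7, 500):,.0f} CLOPS")  # -> 1,400 CLOPS
```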
rQOPS (Reliable Quantum Operations Per Second)
Proposed by Microsoft for the era of fully error-corrected quantum computers. Reliable QOPS asks: How many error-free logical operations can your quantum computer perform each second? It’s a forward-looking benchmark targeting the “Level 3 – Scale” quantum supercomputer in Microsoft’s taxonomy. They estimated that a useful quantum supercomputer will need at least 1 million rQOPS to solve impactful problems. rQOPS inherently assumes you have error-corrected qubits (logical qubits) that stay coherent for the duration of the algorithm. In today’s noisy devices, rQOPS would effectively be zero for any long algorithm, since we can’t yet do reliable ops at scale. But Microsoft recently announced progress toward a topological qubit (their approach to error correction), and framed it as moving toward non-zero rQOPS. Think of rQOPS as the quantum equivalent of, say, a LINPACK FLOPS measure for a future fault-tolerant quantum computer. Its value now is largely aspirational – it gives a metric to aim for when designing error-correction architectures. In academic research on quantum error correction, one might calculate how many logical gates per second a given scheme could do if scaled, which ties into rQOPS. For the average user today, rQOPS is not directly applicable, but it will become relevant when comparing quantum supercomputers in the future (much like we compare classical supercomputers by petaflops).
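Microsoft frames rQOPS as roughly the number of logical qubits times the logical clock rate, counted only while the machine stays below a target logical error rate. The split of the 1 million rQOPS target into 100 logical qubits at a 10 kHz logical clock below is purely illustrative, not a published design point.

```python
def rqops(logical_qubits: int, logical_clock_hz: float) -> float:
    """rQOPS as Microsoft frames it: logical qubits x logical clock rate,
    counted only while operating below a target logical error rate."""
    return logical_qubits * logical_clock_hz

# Purely illustrative split of the 1 million rQOPS target:
# e.g., 100 logical qubits running at a 10 kHz logical clock.
print(f"{rqops(100, 10_000):.0e} rQOPS")
```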
Volumetric & Application-Specific Benchmarks
Recognizing that no single metric can capture everything, researchers have developed volumetric benchmarking frameworks that map out a device’s performance over a range of circuit widths and depths. Quantum Volume can be seen as one point in this volume (the largest square). The QED-C suite we discussed essentially does this volumetric mapping for various algorithms. Some academic papers propose plotting a “benchmarking heatmap” where one axis is qubit count, the other is circuit depth, and you mark regions where the quantum computer can produce correct results. This gives a more complete picture than a single number. Such analyses are useful in research to understand where the breaking points of a machine are (for instance, maybe it can do 30-qubit shallow circuits or 10-qubit very deep circuits, but not 30-qubit deep ones).
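Such a volumetric map can be sketched as a loop over circuit widths and depths that records where a chosen success metric clears a threshold. The run_circuit callback and the exponential toy model below are hypothetical stand-ins for real hardware runs.

```python
import numpy as np

def volumetric_grid(run_circuit, widths, depths, threshold=2/3, shots=1000):
    """Build a width x depth pass/fail grid for volumetric benchmarking.

    run_circuit(width, depth, shots) is assumed to return a success metric
    in [0, 1] (e.g., heavy-output probability or algorithm success rate).
    """
    grid = np.zeros((len(widths), len(depths)), dtype=bool)
    for i, w in enumerate(widths):
        for j, d in enumerate(depths):
            grid[i, j] = run_circuit(w, d, shots) >= threshold
    return grid

def toy_success(width, depth, shots):
    # Hypothetical stand-in for a hardware run: success decays with circuit "area".
    return np.exp(-0.01 * width * depth)

widths = list(range(2, 31, 4))
depths = list(range(2, 31, 4))
for row in volumetric_grid(toy_success, widths, depths):
    print("".join("#" if ok else "." for ok in row))  # '#' marks the reliable region
```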
There are also emerging application-centric benchmark suites created by independent groups. One notable example is SupermarQ, a suite introduced in 2022 by researchers to measure performance on a variety of application-inspired circuits (like factoring small numbers, chemistry Hamiltonian simulation, optimization problems) using metrics like solution fidelity or success probability. SupermarQ is designed to be hardware-agnostic and uses application-level metrics rather than circuit fidelity. The idea is to score quantum computers on tasks that matter to end-users, much like SPEC benchmarks for classical CPUs. SupermarQ and similar efforts (like those by national labs or consortiums) are still in early stages, but they point toward a future where quantum benchmarking is as rich as classical benchmarking, with different tests for different domains.
Lastly, some companies have introduced their own branded metrics to encapsulate performance. For example, Atos’s Q-score attempts to measure how well a quantum system can solve an optimization problem (Max-Cut) of increasing size – effectively gauging quantum heuristic performance. This hasn’t been as widely adopted, but it’s another angle focusing on specific algorithmic performance.
In summary, the landscape of quantum computing benchmarks is evolving rapidly. Random Circuit Sampling and XEB demonstrated raw quantum computational power, Quantum Volume provided a balanced metric for general capability, and Algorithmic Qubits pushed toward application relevance. Meanwhile, Randomized Benchmarking remains crucial for improving fidelity, and new metrics like CLOPS and rQOPS address speed and future fault-tolerant performance. Each benchmark has its strengths: some are pragmatic and user-oriented, others are challenging and foundational. The weaknesses of one are often addressed by another – which is why the community tracks multiple benchmarks in parallel. Academic researchers use these benchmarks to compare architectures (superconducting vs ion trap vs photonics, etc.) and to gauge progress toward the threshold of quantum error correction and “quantum advantage” in useful tasks. Commercial players use them to set records and assure customers of improvements (e.g., IBM doubling QV regularly, IonQ increasing #AQ, Quantinuum leading in QV, etc.).
Conclusion
Benchmarking quantum computers is a multifaceted endeavor, much like benchmarking classical computers involves LINPACK, SPEC, and application tests. No single number can capture performance in all aspects, but together, benchmarks like RCS, QV, and AQ – along with supporting metrics like RB and CLOPS – paint a comprehensive picture of progress. In the past few years, we’ve witnessed rapid advancements: from the first RCS supremacy experiments to quantum volume records climbing by orders of magnitude, and algorithmic qubit counts now in the 30s. These benchmarks not only track the hardware improvements (better qubits, better gates, more qubits), but also encourage improvements in software and error mitigation (as seen in the debate between QV and AQ approaches). They also serve as milestones for the field – for instance, achieving QV > 1 million or AQ ~50 will be seen as major steps towards useful quantum computers.
Crucially, benchmarks are driving a feedback loop: they guide researchers and engineers on where to focus. If a device fails at a certain circuit depth in QV, that signals a need to improve coherence or gate fidelity. If CLOPS is low, one must streamline the control system. If algorithmic benchmarks falter on a particular algorithm, it may inspire better error mitigation or algorithm redesign. In this way, benchmarking is not just about bragging rights; it’s about identifying how to reach the ultimate goal of quantum computing – solving problems that classical computers cannot, i.e., achieving true quantum utility.
As quantum technology matures, expect these benchmarks to evolve. We may see standardized “quantum benchmark suites” analogous to SPEC CPU in classical computing, and perhaps new metrics for hybrid quantum-classical workflows. The benchmarks discussed here will likely remain foundational. Random Circuit Sampling will continue to be the go-to test for pushing the frontier of quantum computational power. Quantum Volume will remain a broad indicator of incremental hardware progress. Algorithmic Qubits or similar application-based metrics will grow in importance as we care more about useful algorithms. And underlying all of these, techniques like randomized benchmarking will ensure the fidelity keeps improving to support higher-level gains.
In conclusion, the rigorous evaluation of quantum computers requires a portfolio of benchmarks – each illuminating a different dimension of performance. By examining RCS, QV, AQ, and others, a technical audience can better appreciate how far we’ve come and what hurdles remain. These benchmarks provide a common language for researchers to compare results and for companies to demonstrate milestones in quantum computing’s march from laboratory curiosity to practical tool. With clear benchmarks and clear progress, we can be optimistic that today’s “quantum volume 4096” and “35 algorithmic qubits” will, in a matter of years, turn into tomorrow’s 50+ logical qubits and beyond – bringing us ever closer to quantum computers that deliver real-world value.