Infrastructure Challenges of “Dropping In” Post-Quantum Cryptography (PQC)
Introduction
Post-quantum cryptography (PQC) is moving from theory to practice. NIST has now standardized several PQC algorithms, including CRYSTALS-Kyber for key establishment (standardized as ML-KEM) and CRYSTALS-Dilithium (ML-DSA) and SPHINCS+ (SLH-DSA) for digital signatures, and major tech companies such as Google, AWS, and Cloudflare have begun experimenting with integrating these algorithms.
On the surface, it may seem that we can simply “drop in” PQC algorithms as replacements for RSA or ECC. However, migrating to PQC is not plug-and-play. In fact, even an additional kilobyte or two of data in cryptographic exchanges can ripple through an organization’s infrastructure in unexpected ways.
Why PQC Adoption Requires Caution
The Quantum Threat and Urgency
The driver for PQC adoption is the anticipated advent of quantum computers that could break today’s public-key cryptography, including via so-called “harvest now, decrypt later” attacks, in which adversaries record encrypted traffic today and decrypt it once a quantum computer becomes available. While large-scale quantum attacks are not yet practical, governments and industry experts urge proactive migration to PQC to safeguard long-term data confidentiality.
However, rushing to deploy new cryptography without understanding the infrastructure impact can introduce new problems. PQC algorithms differ markedly from classical ones in key size, performance profile, and integration complexity. Careful testing and phased rollout are needed to avoid security or reliability regressions.
“Dropping In” vs. Real-World Complexity
In theory, one could swap an RSA key exchange for a PQC key encapsulation or replace an ECDSA signature with a PQC signature in a protocol like TLS. The protocol might remain the same, but the characteristics of the cryptography change significantly. PQC public keys, ciphertexts, and signatures are often on the order of kilobytes, versus only tens of bytes for ECC or a few hundred bytes for RSA.
These larger sizes and different computation patterns mean that an infrastructure built around classical crypto might encounter bottlenecks and bugs when faced with PQC data. Even an extra kilobyte of data can expose latent issues in networks and systems. Simply put, flexibility that exists on paper can be lost in practice due to implementation limits and “rusty joints” in the ecosystem. Companies must be aware of these challenges and plan accordingly.
Key Differences: PQC vs Classical Cryptography
Before diving into specific infrastructure impacts, it’s worth summarizing how standardized PQC algorithms differ from traditional algorithms:
Larger Key and Message Sizes
Most PQC schemes have much larger public keys and signatures than RSA/ECC. For example, the lattice-based Kyber KEM (ML-KEM) uses public keys of roughly 800 to 1,568 bytes depending on security level, compared to 32 bytes for an X25519 elliptic-curve public key.
PQC digital signatures are also far larger: Dilithium (ML-DSA) signatures run roughly 2.4-4.6 KB and SPHINCS+ (SLH-DSA) signatures range from about 8 KB to tens of kilobytes depending on the parameter set, versus ~64 bytes for an ECDSA signature. This means any protocol messages carrying keys, ciphertexts, or certificates will “bloat” in size when PQC is used.
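To make the size gap concrete, here is a small sketch that tabulates approximate sizes drawn from the published parameter sets (raw encodings, rounded). It is meant for rough capacity planning, not as exact protocol overheads.

```python
# Approximate sizes in bytes of public keys, ciphertexts, and signatures.
# Figures are published parameter-set sizes (raw encodings, not DER/PEM),
# included only to illustrate relative scale.
SIZES = {
    # algorithm                        (public key, ciphertext, signature)
    "X25519 (classical KEX)":          (32,   32,   None),  # "ciphertext" = peer's ephemeral key
    "ECDSA P-256 (classical sig)":     (65,   None, 64),
    "RSA-2048 (classical sig)":        (256,  None, 256),
    "ML-KEM-768 / Kyber":              (1184, 1088, None),
    "ML-DSA-44 / Dilithium2":          (1312, None, 2420),
    "Falcon-512":                      (897,  None, 666),   # signature size varies slightly
    "SLH-DSA-128s / SPHINCS+":         (32,   None, 7856),
}

def show(sizes):
    print(f"{'algorithm':32} {'pubkey':>8} {'ct':>8} {'sig':>8}")
    for name, (pk, ct, sig) in sizes.items():
        fmt = lambda v: "-" if v is None else f"{v:,}"
        print(f"{name:32} {fmt(pk):>8} {fmt(ct):>8} {fmt(sig):>8}")

if __name__ == "__main__":
    show(SIZES)
```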
Different Performance Characteristics
The computational cost of PQC algorithms varies – some are faster but larger, others smaller but slower.
For instance, early experiments found that an isogeny-based KEM (SIKE) had tiny keys (~330 bytes) but very high CPU cost, whereas a lattice KEM (NTRU-HRSS, similar to Kyber) had larger keys (~1100 bytes) but ran orders of magnitude faster. (SIKE has since been broken by a classical attack and is no longer a candidate, but the size-versus-speed trade-off it illustrated remains relevant.)
Generally, well-chosen lattice algorithms (like Kyber and Dilithium) are quite efficient in software, often comparable to or even faster than RSA, but their memory and bandwidth footprint is higher. This trade-off can shift load from CPU to network and memory.
New Implementation Considerations
PQC algorithms involve math structures (lattices, hashes, etc.) that are new to many existing crypto libraries and hardware accelerators. Implementations must be constant-time and resist side channels, which can be non-trivial given the complexity of these schemes. Mature, side-channel-hardened libraries are still emerging.
Additionally, some PQC signatures (like Falcon) require handling floating-point or Gaussian sampling, which are unusual in classical crypto and may need special care to avoid leakage.
All this means integrating PQC may demand updates to cryptographic libraries, hardware (HSMs, smart cards), and random number generators to support the new algorithms safely.
In summary, the “shape” of the data and operations in PQC differs from what our systems are tuned for. Next, let’s examine the concrete challenges these differences create in real-world infrastructure.
Network and Protocol Challenges
One of the first places PQC can cause pain is in network protocols and devices. Larger cryptographic messages can strain the assumptions baked into networks, from packet sizes to handshake patterns.
Handshake Size and Fragmentation
In protocols like TLS, the initial handshake messages (ClientHello, ServerHello, certificates, etc.) are usually small enough to fit in a few network packets. PQC can change that.
Cloudflare’s deployment of hybrid post-quantum TLS 1.3, for example, noted that post-quantum keyshares are big enough to potentially fragment the ClientHello message – the very first packet a client sends. A hybrid key share combining Kyber-768 with classical X25519 adds roughly 1.2 KB to the ClientHello, which can push it past a single typical UDP datagram for QUIC or spread it across multiple TCP segments. This is more than a minor detail: if a ClientHello gets split, some network middleboxes or load balancers might fail to reassemble it properly, leading to connection failures. Many optimized middleboxes inspect only the first packet of a handshake and may not handle stateful reassembly of a fragmented handshake unless carefully engineered to do so.
In essence, adding just ~1 KB to a handshake could mean certain clients behind older network gear simply can’t connect.
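A quick back-of-the-envelope check illustrates the problem. The ~300-byte baseline for the rest of the ClientHello is an assumption for illustration; real ClientHellos vary.

```python
# Rough estimate: does a hybrid ClientHello still fit in one packet?
# The ~300-byte baseline for versions, ciphers, SNI, ALPN, etc. is an
# illustrative assumption.
ETHERNET_MTU        = 1500
IP_TCP_OVERHEAD     = 40      # IPv4 + TCP headers, no options
TLS_RECORD_OVERHEAD = 5       # TLS record header

BASELINE_CLIENTHELLO = 300    # assumed size of the non-keyshare parts
X25519_SHARE         = 32
MLKEM768_SHARE       = 1184   # ML-KEM-768 encapsulation (public) key

classical = BASELINE_CLIENTHELLO + X25519_SHARE
hybrid    = BASELINE_CLIENTHELLO + X25519_SHARE + MLKEM768_SHARE

budget = ETHERNET_MTU - IP_TCP_OVERHEAD - TLS_RECORD_OVERHEAD
for label, size in [("classical X25519", classical),
                    ("hybrid X25519+ML-KEM-768", hybrid)]:
    verdict = "fits" if size <= budget else "exceeds"
    print(f"{label:28} ~{size:>5} bytes  {verdict} a single ~{budget}-byte segment")
```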
QUIC and UDP Considerations
The situation is even trickier for QUIC (HTTP/3), which runs over UDP. QUIC is designed to include the entire initial handshake in one UDP datagram to avoid amplification attacks and simplify routing. If the handshake doesn’t fit in one packet, QUIC allows multiple initial packets, but some QUIC load balancers might not expect multiple initial packets from a client. Since QUIC uses connection IDs instead of a 5-tuple to route traffic, a load balancer could see two “initial” packets (from the same client with a temporary ID) and not know how to handle them, possibly dropping the second packet.
Thus, a large PQC handshake could break connectivity for QUIC traffic until the ecosystem catches up.
Impact on Latency and Round Trips
Even if the network doesn’t drop fragmented handshakes, larger messages can affect latency. More data means more transmission time and possibly hitting protocol thresholds.
For instance, TLS over TCP has an initial congestion window (number of bytes the server can send before waiting for an ACK). If a post-quantum TLS handshake with big certificates exceeds that window, the server may have to pause and incur an extra round trip during the handshake. Cloudflare simulated this by adding dummy data to TLS handshakes: adding ~9 KB of data caused about a 15% slowdown in handshake time, and crossing the ~10 KB mark triggered an extra round-trip that slowed handshakes by over 60% in their tests.
In practical terms, a TLS 1.3 handshake that normally takes, say, 100 ms could jump to 160 ms or more if the certificates and key exchange are post-quantum and not optimized. This kind of latency hit erases the gains TLS 1.3 made over TLS 1.2 and could degrade user experience.
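The round-trip cliff is easy to estimate: with the common default initial congestion window of 10 segments (roughly 14.6 KB of payload), the server’s first flight either fits or costs an extra round trip. The flight sizes below are illustrative assumptions in line with the figures discussed above.

```python
# Will the server's first flight fit in the initial congestion window?
# An initcwnd of 10 segments is the common Linux default (RFC 6928).
MSS = 1460                                  # typical TCP max segment size
INITCWND_SEGMENTS = 10
initcwnd_bytes = MSS * INITCWND_SEGMENTS    # ~14,600 bytes

# Illustrative server-flight sizes (ServerHello + certificate chain + extensions):
scenarios = {
    "classical chain (~4 KB)":              4_000,
    "hybrid KEX, classical certs (~5 KB)":  5_000,
    "fully post-quantum certs (+12 KB)":    4_000 + 12_000,
}

for label, first_flight in scenarios.items():
    extra_rtts = max(0, -(-first_flight // initcwnd_bytes) - 1)   # ceiling division
    print(f"{label:40} {first_flight:>6,} B  -> {extra_rtts} extra round trip(s)")
```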
Middleboxes and “Ossified” Infrastructure
Enterprise networks are full of firewalls, IDS/IPS appliances, SSL/TLS proxies, and other middleware that assume certain patterns about traffic. Introducing PQC can confound these assumptions. As mentioned, some devices might not handle unusually large handshake messages or might choke on unexpected cryptographic parameters.
In tests where Cloudflare artificially inflated certificate sizes, they observed distinct “cliffs” at which some connections would consistently fail – for example, when the handshake grew by more than 10 KB or 30 KB. These thresholds suggest that some middleboxes simply cannot buffer or process handshakes beyond certain sizes and will start dropping them, causing connection failures (evidenced by spikes in missing requests).
This kind of protocol ossification – where the ecosystem silently assumes handshakes will never exceed a certain size – can only be uncovered by testing. Companies deploying PQC need to be prepared to discover (and pressure vendors to fix) such issues in their networks.
Bandwidth and Throughput
While often less of a concern than latency or failures, larger key and certificate sizes do consume more bandwidth. Most web applications won’t notice a few extra kilobytes on connection setup compared to the megabytes of application data that follow.
In fact, Amazon researchers have pointed out that if you measure “time to last byte” (i.e. including the data transfer) rather than just handshake “time to first byte”, the relative impact of PQC overhead diminishes for data-heavy connections.
However, in scenarios with many short-lived connections (APIs, IoT messages, etc.), the bandwidth overhead could reduce overall throughput or increase costs. Also consider constrained environments: IoT networks or satellite links might care about kilobytes of overhead.
It’s important to evaluate whether any part of your infrastructure (edge devices, low-bandwidth links) could become a bottleneck when every handshake or signature grows in size.
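A quick aggregate estimate helps answer that question. All of the inputs in the sketch below are placeholders to replace with your own fleet’s numbers.

```python
# Aggregate bandwidth cost of PQC handshake overhead for a device fleet.
# All inputs are illustrative placeholders; substitute your own figures.
extra_bytes_per_handshake = 2_300      # e.g. hybrid key shares plus larger signatures
handshakes_per_device_per_day = 96     # one TLS session every 15 minutes
devices = 50_000

daily_overhead = extra_bytes_per_handshake * handshakes_per_device_per_day * devices
print(f"extra traffic: {daily_overhead / 1e9:.1f} GB/day "
      f"({daily_overhead * 30 / 1e12:.2f} TB/month) across the fleet")
```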
In summary, network infrastructure must be ready for bigger cryptographic packets and possibly different traffic patterns (e.g. hybrid key exchange sending two key shares). Challenges to watch for include handshake fragmentation, extra latency from larger handshakes, and failures in legacy devices that aren’t expecting these changes.
Testing in a staging environment – enabling PQC cipher suites on servers and monitoring connectivity and performance from various clients/networks – is crucial to discover issues early.
In many cases, collaborating with vendors (browser makers, network appliance vendors) may be necessary to push fixes or workarounds once you identify an incompatibility.
Performance and System Resource Impacts
Beyond the network, organizations should consider how PQC affects server performance, client performance, and overall system resources:
CPU and Latency Performance
One worry is that PQC algorithms, being more complex, might slow down cryptographic operations. The good news is that the leading PQC algorithms are designed to be efficient. Cloudflare’s real-world experiment in 2019 found that a lattice-based KEM added negligible latency for most users – the handshake times for classical X25519 vs. hybrid X25519+NTRU-HRSS were almost indistinguishable in the aggregate. These results are reassuring: in ideal implementations, PQC can be fast enough that users won’t notice a slight increase of a few milliseconds.
However, the averages don’t tell the whole story. It’s important to watch the “tail” latency and edge cases. Older devices or those without hardware acceleration might experience larger slowdowns. For instance, mobile phones with limited CPU might take longer to process a 2 KB signature verification or run a lattice decryption. Likewise, if your server handles thousands of handshakes per second, a 2% per-handshake cost can add up – you might need to tune thread pools or add CPU capacity to maintain throughput.
The key is to benchmark under your specific workload: measure SSL/TLS handshake rates with and without PQC, measure API response times, etc. If you notice spikes or outliers (e.g. 99th percentile latency worsening more than the median), investigate whether certain client platforms are struggling.
Optimizing libraries (using AVX2/AVX-512 instructions for lattice math, for example) or offloading crypto to hardware (when PQC accelerators become available) could be necessary for high-performance environments.
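A minimal benchmark along the lines described above can be built with nothing but Python’s standard library. Note that it does not select PQC groups itself (it negotiates whatever the underlying OpenSSL supports), so it is most useful for before/after comparisons against a server whose PQC configuration you toggle.

```python
import socket
import ssl
import statistics
import time

def handshake_times(host, port=443, samples=50, timeout=5.0):
    """Measure combined TCP connect + TLS handshake latency, in seconds."""
    ctx = ssl.create_default_context()
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # wrap_socket completes the TLS handshake before returning
            with ctx.wrap_socket(sock, server_hostname=host):
                times.append(time.perf_counter() - start)
    return times

if __name__ == "__main__":
    host = "example.com"          # replace with your staging endpoint
    t = sorted(handshake_times(host))
    pct = lambda q: t[min(len(t) - 1, int(q * len(t)))]
    print(f"{host}: median={statistics.median(t)*1000:.1f} ms  "
          f"p95={pct(0.95)*1000:.1f} ms  p99={pct(0.99)*1000:.1f} ms")
```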
Memory and Storage Footprint
Larger keys and certs also mean more memory usage. For a web server, a few extra KB per connection for handshake buffers is usually trivial on modern hardware. But consider constrained systems: smart cards or HSMs might have fixed-size memory for keys that could be exceeded by a Dilithium private key. IoT devices with little RAM might have trouble buffering a 5-10 KB certificate chain for validation.
Certificate storage and distribution is another angle – a typical X.509 certificate today might be 1 KB, whereas a PQC certificate could be 5-10 KB or more (especially if it contains both a classical and a PQC signature during a transition period). If you have a fleet of embedded devices that regularly download certificate updates or CRLs, the bandwidth and memory for that may need re-evaluation.
One concrete example: Cloudflare estimated that using a fully post-quantum certificate chain with current PQC signatures could add on the order of 9-15 KB to the TLS handshake. As discussed earlier, that can push beyond congestion windows and even break some devices. They are exploring mitigations like certificate compression, omitting unnecessary intermediate certificates, or even more radical protocol changes to keep certificate size down.
Enterprises should follow these developments (e.g. proposals to let browsers cache intermediate certs so servers don’t send them every time) which could alleviate some of the size burden. In the interim, be prepared for higher memory usage in TLS stacks and possibly adjust buffer sizes or TLS library settings to accommodate larger messages. Monitoring tools may also need tweaks – e.g., if you have deep packet inspection, ensure its buffers and regex patterns can handle larger certificate fields.
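The arithmetic behind size estimates like Cloudflare’s is straightforward. The sketch below tallies an illustrative three-certificate chain in which each public key and signature is replaced by its ML-DSA-44 (Dilithium2) equivalent; the baseline certificate body size is an assumption, and real handshakes also carry SCTs and a CertificateVerify signature that would grow as well.

```python
# Rough size of a TLS certificate chain if signatures and keys go post-quantum.
# Baseline figures are illustrative; ML-DSA-44 sizes are the published ones.
ECDSA_SIG, ECDSA_PUB = 72, 65        # DER-encoded ECDSA P-256 sig, uncompressed point
MLDSA_SIG, MLDSA_PUB = 2420, 1312    # ML-DSA-44 (Dilithium2)

CERT_BODY = 700                       # assumed: names, validity, extensions
certs_in_chain = 3                    # leaf plus two intermediates (illustrative)

def chain_size(sig, pub):
    # each certificate carries one subject public key and one issuer signature
    return certs_in_chain * (CERT_BODY + pub + sig)

classical = chain_size(ECDSA_SIG, ECDSA_PUB)
pqc       = chain_size(MLDSA_SIG, MLDSA_PUB)
print(f"classical chain ~{classical:,} B, PQC chain ~{pqc:,} B "
      f"(+{(pqc - classical)/1024:.1f} KB)")
```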
Crypto Hardware and Acceleration
Many organizations rely on hardware security modules (HSMs) or TLS offload engines. Today’s HSMs are generally built for RSA and ECC; most do not yet support PQC algorithms. This means if you plan to use a PQC algorithm for something like code signing or certificate authority keys, you may find your HSM can’t do it and you’d have to use software keys (with careful protection).
Similarly, TLS terminators or proxies may not support the new cipher suites until vendors upgrade them. Ensuring compatibility with hardware is part of the “inventory” phase recommended by AWS’s migration plan – you need to identify every system that performs crypto and see if it supports or can be upgraded to support PQC.
In some cases, timelines for hardware support might lag behind software, affecting your deployment schedule or requiring interim solutions (like terminating PQC TLS in software instances rather than on legacy appliances).
Side-Channel Countermeasures and Performance
A subtle infrastructure impact comes from the need to secure implementations against side-channel attacks. PQC algorithms, especially lattice-based ones, have been targets of timing and power analysis research.
For example, a specific implementation of Kyber was shown to be vulnerable to a side-channel attack using power analysis and deep learning, allowing key recovery under lab conditions. While this was not a fundamental algorithm flaw, it underscored that if you “drop in” a naive PQC library, it might not be truly secure if an attacker can run precise timing or electromagnetic analysis.
The recommended countermeasures (masking, constant-time operations, etc.) can come with performance costs. NIST experts noted that fully hardening Kyber against side-channels might make it about twice as slow due to the added noise and masking operations. For many server use cases that level of hardening may not be needed (typical remote attackers can’t do power analysis easily), but for hardware devices (smartcards, TPMs) it’s crucial.
Companies should ensure they use well-vetted libraries (many open-source PQC libraries have already integrated countermeasures for timing and cache attacks) and stay updated on patches – for instance, the “KyberSlash” timing vulnerabilities disclosed in 2023 were promptly patched in the reference code.
The bottom line is that testing should include not just functionality and speed, but also checking that your chosen implementation doesn’t introduce new leakage. This might involve consulting with vendors or reading security advisories for the PQC products you use.
Compatibility and Integration Hurdles
Deploying PQC often means running in a “hybrid” state for some time, where some systems use PQC and others don’t. This creates its own challenges:
Protocol Negotiation and Downgrade
TLS 1.3 was designed with algorithm agility, so it can negotiate new key exchange methods fairly gracefully – but only if both client and server support them. During the transition, there will be cases where one side doesn’t speak PQC. The good news is TLS 1.3 will simply fall back to classical cryptography in that case. The bad news is an active attacker could try to force a fallback by stripping out PQC options (a downgrade attack). TLS 1.3’s design, fortunately, includes downgrade protection: the server’s handshake transcript (which is signed via a classical certificate) covers the list of algorithms so the client can detect tampering.
In practice, this means as long as you’re using TLS 1.3 or later with proper verification, you’re safe from downgrade to non-PQC if both sides actually support PQC. However, older protocols like TLS 1.2 don’t have such strong downgrade protections for new cipher suites. If your environment still relies on TLS 1.2 (perhaps for legacy clients), adding PQC there is riskier.
Most experts (and Cloudflare explicitly) caution against trying to retrofit PQC into TLS 1.2 for this reason. The practical approach is to enable PQC in TLS 1.3+ and encourage clients to upgrade.
For integration testing, ensure that clients that do not understand the new PQC cipher suites can still connect (they should gracefully negotiate a traditional cipher). Likewise, test that when both sides do support PQC, they indeed negotiate it. Some early adopters found issues here; for example, if a client doesn’t include a PQC keyshare in its first message (ClientHello), the server might respond with a HelloRetryRequest to prompt a second attempt with the right keyshare. This is an extra round-trip that could affect latency.
It’s recommended that clients (browsers, SDKs) be updated to proactively send a PQC keyshare if they support one, to avoid that retry. This kind of behavior is still evolving, so be prepared for some teething pains in handshakes until the ecosystem standardizes on best practices.
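One simple compatibility check for the “classical client, PQC-enabled server” case is to connect with a stack that has no PQC support and confirm the handshake still completes. The sketch below uses Python’s standard ssl module; it cannot report which key-exchange group was negotiated (the module does not expose that), so it verifies graceful fallback rather than PQC use.

```python
import socket
import ssl

def probe(host, port=443):
    """Connect with a (presumably classical-only) client stack and report what
    was negotiated. A failure here against a PQC-enabled server points at a
    fallback or negotiation problem rather than a PQC-specific one."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            name, _proto, _bits = tls.cipher()
            print(f"{host}: negotiated {tls.version()} with {name}")

if __name__ == "__main__":
    probe("example.com")   # point at your PQC-enabled staging endpoint
```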
Application Layer and Standards
Beyond TLS, consider other places in your stack that use public-key crypto. Do you use VPNs (IPsec, WireGuard), secure email (S/MIME, PGP), or messaging protocols? Each will need a defined mechanism for PQC. Some protocols are already working on standards (for example, the IETF is discussing PQC for IPsec/IKE and DNSSEC). If you have a custom application that uses cryptography (say, exchanging JSON Web Tokens with RSA signatures), you’ll need to plan how to incorporate PQC (e.g., JWT standards would need an extension for a Dilithium signature type, and libraries to support it).
Backward compatibility is crucial – you might need to support dual signatures or dual keys for a while. A real-world example is what’s called a “hybrid certificate,” where a certificate can include both a classical and a PQC public key/signature (either in extensions or via two separate certificates in a chain). This approach ensures compatibility but at the cost of a bigger certificate size. Cloudflare noted that one proposed method – putting a PQC cert inside a legacy cert’s extension – could be a “non-starter” if it immediately breaks some clients that can’t handle the size or unrecognized extension.
A safer approach might be to maintain parallel certificate chains and deliver the appropriate one based on client capability. This adds complexity to server configuration (potentially hosting two TLS configs), but it avoids cutting off older clients before they’re ready.
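Conceptually, serving parallel chains is just a selection step keyed on what the client advertises. The sketch below is a hypothetical illustration (the input, the algorithm labels, and the file names are made up), since in a real deployment this logic lives in the TLS server’s certificate-selection callback.

```python
# Hypothetical selection between parallel certificate chains based on the
# signature algorithms a client advertises. Labels and paths are illustrative;
# a real server would hook this into its TLS library's certificate callback.
PQC_SIG_ALGS = {"mldsa44", "mldsa65", "slhdsa_sha2_128s"}   # placeholder labels

def select_chain(client_sig_algs: set[str]) -> str:
    if client_sig_algs & PQC_SIG_ALGS:
        return "chains/pqc_chain.pem"        # chain signed with ML-DSA
    return "chains/classic_chain.pem"        # today's ECDSA/RSA chain

print(select_chain({"ecdsa_secp256r1_sha256", "mldsa44"}))   # -> chains/pqc_chain.pem
print(select_chain({"rsa_pss_rsae_sha256"}))                 # -> chains/classic_chain.pem
```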
Software and Library Support
When testing PQC integration, companies often run into simple issues like “our SSL library doesn’t support Kyber yet” or “the Java crypto provider throws an error with large keys.” Ensure you’re using up-to-date libraries: BoringSSL and recent OpenSSL releases have been adding PQC support, typically exposed as hybrid key-exchange groups with names like X25519Kyber768 rather than as new cipher suites. AWS’s libcrypto (AWS-LC) and s2n-tls also support hybrid PQC key exchange.
On the client side, not all browsers or operating systems had PQC enabled by default as of 2024. For example, Chrome and Firefox conducted experiments and have been rolling out hybrid key exchange gradually, and Apple announced that iMessage would adopt PQC by the end of 2024. So if your testing shows failures, first identify whether one side simply lacks support.
You might need to distribute custom client builds or configurations to test PQC in your applications (Cloudflare even open-sourced forks of BoringSSL and Go’s TLS to help with such testing).
Inventory and Architecture Changes
As part of planning for PQC, a comprehensive inventory of where you use cryptography is recommended. This often reveals non-obvious issues.
For instance, you might discover an internal certificate authority that uses ECDSA; to go PQC, will you create a new PQC CA? That could affect certificate issuance workflows and trust stores (operating systems will need to trust your new root).
Or you might find protocols that can’t easily be upgraded, leading you to place “PQC gateways” or tunnels around legacy systems.
Some organizations consider “crypto agility” layers, essentially wrapping or double-encrypting data with PQC as it passes through legacy systems, to avoid having to upgrade every component at once. These architectural Band-Aids can ensure security during the transition but come with their own performance and complexity costs.
Recommended Approach: Test, Learn, and Iterate
Given the many potential challenges outlined, what should a company do to actually get started with PQC? Here’s a sensible roadmap gleaned from industry guidance and early adopter experiences:
Start in a Lab or Staging Environment
Enable PQC algorithms in a controlled environment first. Many vendors allow opting in – e.g., AWS lets you enable hybrid post-quantum TLS for connections to KMS or ACM by toggling an SDK flag. Cloudflare has test domains and open-source code for clients to connect with post-quantum TLS. Use these to conduct compatibility tests: can your corporate devices connect? Do your monitoring tools log any errors? Measure the handshake times and see the difference.
Measure Performance Under Realistic Loads
Don’t just test one connection. Simulate heavy usage: if you operate a high-traffic website, run load tests with and without PQC handshakes to see if your server CPU can handle it. Monitor TLS handshake latency from various global locations to catch any network-induced issues (higher latencies might amplify the relative slowdown of larger handshakes). If you have IoT or mobile clients, include them in tests to see if any struggle with CPU or memory. The Amazon Science study suggests that for large data transfers, PQC overhead is negligible in the end, but for small exchanges it’s more pronounced – so test both extremes relevant to your use cases.
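To approximate heavy usage without a full load-testing rig, a thread pool driving concurrent handshakes gives a rough handshakes-per-second figure. The endpoint and load numbers below are placeholders, and the result only means something when compared across the same server with PQC on and off.

```python
import socket
import ssl
import time
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "example.com", 443     # your staging endpoint
WORKERS, HANDSHAKES = 32, 500       # illustrative load; tune to your environment

def one_handshake(_):
    # A fresh context per connection keeps each handshake independent.
    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST):
            pass                     # handshake completes inside wrap_socket
    return True

if __name__ == "__main__":
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        ok = sum(pool.map(one_handshake, range(HANDSHAKES)))
    elapsed = time.perf_counter() - start
    print(f"{ok}/{HANDSHAKES} handshakes in {elapsed:.1f}s "
          f"({ok/elapsed:.1f} handshakes/s) -- run with PQC on and off to compare")
```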
Watch for Failures and Anomalies
During testing, pay attention to any connection failures or weird behavior. For example, if you see a certain percentage of handshakes timing out when PQC is on, that could be a sign of a middlebox dropping fragmented packets. Try to pinpoint where it happens (maybe all users behind a certain ISP or corporate firewall). Tools like packet captures or TLS debug logs can help confirm if a ClientHello is being blocked or if a HelloRetryRequest is occurring frequently. Share these findings with your vendors or the community – as Cloudflare noted, widespread testing by diverse clients is needed to flush out issues.
Plan for Gradual Rollout (and Rollback)
When you’re confident in the lab, plan a phased rollout. Perhaps enable PQC cipher suites for a small percentage of production traffic (many load balancers or CDN services allow that). Monitor the metrics – both performance and error rates. Gradually increase the rollout if all looks good. This way, if an issue appears (say, a specific older browser starts failing), you can quickly roll that percentage back while investigating. The goal is to avoid an abrupt switch that could cause a wide outage.
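Percentage-based rollout usually comes down to deterministic bucketing so that a given client gets a consistent experience and the dial can be turned down instantly. The sketch below shows the idea; the client identifier and the control knob are deployment-specific assumptions.

```python
import hashlib

def in_pqc_rollout(client_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a client into the PQC rollout.
    `client_id` is any stable attribute you already have (illustrative);
    `rollout_percent` is the dial you raise gradually and can drop to 0 to roll back."""
    digest = hashlib.sha256(client_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < rollout_percent

# Example: offer PQC key exchange to roughly 5% of clients at first.
for cid in ("device-123", "device-456", "device-789"):
    print(cid, in_pqc_rollout(cid, 5))
```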
Stay Updated on Standards and Best Practices
The PQC field is evolving. Keep an eye on guidance from organizations like NIST, IETF, and industry groups. For example, certificate and TLS optimizations (like intermediate certificate suppression or KEMTLS) are being actively discussed to mitigate PQC overhead. In the near future, browsers may implement some of these, and you’ll want to align your infrastructure accordingly. Also track vendor updates – e.g., HSM vendors adding PQC support, or cloud providers enabling PQC by default on certain services – to piggyback on their solutions where possible.
Educate Security Teams and Leadership
Make sure CISOs and decision-makers understand that PQC migration is a security necessity but comes with engineering challenges that require investment and time. Highlight that proactive testing now will prevent chaotic emergency changes later. Many governments are already mandating inventory and plans for PQC (e.g., US CNSA 2.0 for federal systems). A clear internal plan with milestones (test in 2024, pilot in 2025, full deploy by 202x, etc.) helps ensure everyone allocates resources for this transition.
Conclusion
Adopting post-quantum cryptography is a bit like changing the engine of an airplane mid-flight – you must keep everything running securely while you swap in new critical components. The standardized PQC algorithms promise security against future quantum threats, but they come with practical trade-offs. Larger key and message sizes can introduce network latency, trigger fragmentation, and even break some middleboxes. New implementations, if not carefully vetted, could harbor side-channel leaks or performance bugs. The transition period demands dual-stacks and hybrid approaches that increase complexity.
The experiences of Cloudflare, Google, AWS and others show that many of these challenges are surmountable – if we test and prepare. By starting trials early, organizations have uncovered issues like handshake size limits and fixed them in advance. They’ve also measured that the performance hit of PQC, in most cases, is modest and acceptable. The key lesson is that “drop-in” must be followed by “tune-up”: you drop in the new algorithms, then adjust your infrastructure (network settings, timeouts, buffer sizes, library versions, hardware) to run smoothly with them.
Companies should watch for potential impacts on everything from TLS handshakes to backend systems and be ready to make adjustments. Testing PQC solutions in your environment – with realistic devices, network conditions, and workloads – is the only way to really understand what could go wrong and ensure a seamless migration. The sooner you begin, the more time you have to iron out kinks. PQC is coming, and with careful preparation, we can make the transition secure and performant. In the end, the goal is to have the post-quantum cryptography era arrive as the “new normal” for security without users even noticing the change – except perhaps in the continued safety of their data well into the quantum future.