
Testing AI in Defense – A New Kind of Security Challenge

AI Meets the Battlefield

Artificial Intelligence, or "AI", these days typically refers to sophisticated decision support systems – often descendants of the expert systems and rule-based algorithms of the 1990s – although developments in areas such as fuzzy logic and Bayesian inference are pushing these tools beyond simple decision support. Militaries are increasingly adopting such systems to assist human commanders by processing large volumes of sensor and battlefield data and suggesting courses of action. Several projects, for example, have tried to combine expert knowledge with probabilistic methods to detect threats and estimate enemy positions. Instead of giving a simple yes/no answer, such a system might report that "there is an 80% probability of an enemy tank unit on that ridge", introducing a degree of uncertainty by design.
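To make the flavor of that probabilistic reasoning concrete, here is a minimal, hypothetical sketch of how sensor evidence could update a threat estimate using Bayes' rule. It is an illustration only; the numbers, the posterior_threat helper, and the evidence model are invented and are not drawn from the client's system.

```python
# A minimal, hypothetical sketch of Bayesian threat estimation: start from a
# prior belief, then fold in the likelihood of each sensor observation to
# update the probability that an enemy tank unit is present. All numbers
# and names are invented for illustration.

def posterior_threat(prior, likelihoods):
    """Apply Bayes' rule sequentially.

    prior       -- initial P(threat present)
    likelihoods -- iterable of (P(obs | threat), P(obs | no threat)) pairs
    """
    p = prior
    for p_obs_given_threat, p_obs_given_clear in likelihoods:
        numerator = p_obs_given_threat * p
        p = numerator / (numerator + p_obs_given_clear * (1.0 - p))
    return p

# Example: a 20% prior, then a radar contact and a thermal signature, both of
# which are more likely if a tank unit really is on the ridge.
evidence = [(0.9, 0.3),   # radar contact
            (0.7, 0.2)]   # thermal signature
print(f"P(enemy tank unit on ridge) = {posterior_threat(0.2, evidence):.0%}")
# -> roughly 72%
```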

While AI capabilities are still quite modest, defense innovators are already pushing the envelope. Major defense contractors and research agencies are experimenting with ways to integrate AI into weapons systems to help soldiers make faster and better decisions, or even to delegate some decisions to the machine. In fact, last year DARPA's Joint Unmanned Combat Air Systems program was testing the Boeing X-45A, a prototype combat drone. The X-45A is envisioned as part of the "next generation of completely autonomous military aircraft," capable of carrying out strike missions with minimal human intervention. That level of autonomy is still mostly experimental, but it is a clear signal of where things are headed.

Against this backdrop, our team at Cyber Agency was engaged by a defense contractor to test the security and integrity of an AI-driven component in a weapons system. The client was exploring whether an AI module could be integrated into a battlefield system to support warfighters by analyzing threats, recommending actions, or even automatically cueing weapon responses in certain scenarios. It was cutting-edge work; we completed it a few months ago and have received special permission to talk about it in general terms today.

We approached the task with our usual cybersecurity toolkit and mindset, but we soon discovered that testing an AI-enabled system is a very different animal from auditing a traditional piece of software.

The Engagement: A Peek into an AI-Augmented Weapon System

The system we examined was, essentially, an AI-assisted decision maker embedded in a larger weapons platform. (Due to confidentiality we can't name it, but think of a smart command-and-control unit on a vehicle or air defense system.) Its job was to take in sensor inputs – radar tracks, camera feeds, and so on – and suggest actions to the human operator. In some modes it could even initiate responses on its own if certain criteria were met. This wasn't a fully autonomous "killer robot" by any stretch; it was more of a high-tech adviser with an autopilot. It was, however, a big deal: a first step for this company away from the manual, deterministic systems of the past toward something more adaptive (and also somewhat non-deterministic).

From day one of testing, we noticed how different this AI-driven component was from classical software. A traditional program, given the same input, will produce the same output every time – it's deterministic. But this AI system had a mind of its own (so to speak). It employed complex algorithms and heuristics that could yield different results under subtly different conditions. In some training scenarios, a tiny change in sensor input (e.g. one extra vehicle on the radar screen) could lead to a different recommendation or threat priority ranking. The underlying code wasn't just a straightforward series of if-then rules; it was more like a web of weighted factors, possibly even a learning algorithm tuned from prior data. To us as testers, it felt quasi-non-deterministic. We couldn't easily predict what it would do without exhaustively testing each scenario, and the space of possible scenarios was enormous. This hinted at a fundamental truth: as soon as you add adaptive or AI logic to a system, you introduce a level of unpredictability. Increasing the autonomy and adding more inputs only increased that unpredictability.
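The effect is easy to reproduce in miniature. The hypothetical sketch below shows how a weighted-factor scorer with a hard decision threshold, loosely the shape of such fusion logic, can flip its recommendation when a single input changes slightly; the weights, features, and threshold are all invented for illustration and do not describe the client's system.

```python
# Hypothetical sketch of why a "web of weighted factors" feels
# non-deterministic to testers: a small change in one input nudges a weighted
# score across a hard threshold and flips the recommendation. The weights,
# features, and threshold are invented and do not describe any real system.

WEIGHTS = {"closing_speed": 0.5, "vehicle_count": 0.3, "emitter_match": 0.2}
ALERT_THRESHOLD = 0.70

def recommend(track):
    """Weighted-sum fusion of normalized track features, gated by a threshold."""
    score = sum(WEIGHTS[k] * track[k] for k in WEIGHTS)
    return "PRIORITIZE / ALERT" if score >= ALERT_THRESHOLD else "MONITOR"

scenario_a = {"closing_speed": 0.8, "vehicle_count": 0.70, "emitter_match": 0.4}
scenario_b = {"closing_speed": 0.8, "vehicle_count": 0.75, "emitter_match": 0.4}  # one extra vehicle

print(recommend(scenario_a))  # MONITOR            (score 0.690)
print(recommend(scenario_b))  # PRIORITIZE / ALERT (score 0.705)
```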

AI and Non-Determinism: A Double-Edged Sword

It became clear that the AI’s strength – its ability to handle complex, fuzzy situations – was also a source of new risk. Because the system didn’t follow one fixed decision tree, it could sometimes surprise us. Now, surprise might be acceptable (even desirable) if the AI finds a clever solution faster than a human. In fact, the military was optimistic on that front: in one recent experiment by the U.S. Army Battle Lab, an AI-based decision aid called ICCES helped planning staff generate battle plans much faster than usual, without hurting the quality of decisions. The trial results “alleviated concerns about [the AI tool’s] negative impacts” and showed dramatic time savings, suggesting such technologies were mature enough for near-future use. Those of us in the cybersecurity test team saw that promise – the AI could indeed accelerate data processing and give a commander more decision-making bandwidth.

However, we were also paid to think like skeptics. And one big question loomed for us: what about the unintended actions or mistakes this AI might make? A human officer's thought process might be slow, but at least a trained officer is accountable and can explain their reasoning. Here we had a black-box algorithm that might flag a neutral object as a hostile threat due to a sensor glitch or some unforeseen input pattern. If that happened in the field, would the human operator trust their gut and override the AI, or trust the machine? This trust dilemma was not abstract to us – it was very real. We even had a vivid case study: the Patriot air defense incidents in the Iraq War. Just a few months ago, Patriot missile batteries running in an automated mode misidentified friendly aircraft as incoming enemies, with deadly consequences. In one instance a Patriot battery shot down a British Tornado fighter, and days later another Patriot downed a U.S. Navy F/A-18 Hornet, all because the system automatically engaged what it thought were hostile targets. Investigations later showed the Patriot's computer-driven identification routines were at fault: essentially a tragic software mistake. The U.S. Army swiftly reacted by switching those batteries back to manual mode (requiring a human to approve engagements), which immediately stopped further friendly-fire mishaps.

For us, the Patriot fratricide was a chilling validation of our concerns. Here was a relatively simple automated weapon, a radar-plus-missile system with fixed algorithms, and it produced a new kind of error that classic quality assurance hadn't caught. It underscored that when you let machines make split-second lethal decisions, you face a whole new category of "security" issue: not a hacker intrusion, but the machine itself going awry. Bottom line: the embrace of AI and automation in warfare carries immense risk, because new systems introduce the possibility of novel failure modes and errors.

In one sense, I am grateful to this client for entrusting us with this engagement, as it let us begin to grapple with these novel risks, which are likely to grow significantly in the coming years.

Beyond Classical Cybersecurity: New Challenges We Identified

Our team’s mandate was “security testing,” and we tackled it from multiple angles. Of course, we did look for traditional cybersecurity vulnerabilities in the AI system’s implementation – buffer overflows, network interface flaws, and so on – and we found a few mundane issues to patch. But the heart of our assessment revolved around these broader safety and reliability concerns:

Unpredictable Decision-Making: We logged instances where the AI’s output was inconsistent or inexplicable to the end-user. For example, in one run the system recommended a pre-emptive alert on a target approaching a checkpoint, but in a slightly tweaked replay scenario it stayed silent on the same target. Why? The reasons were buried in complex logic and perhaps non-linear sensor fusion algorithms. This black-box effect meant we had to warn the client that operators might not always understand why the AI does what it does, which is itself a security concern. If a soldier mistrusts or misunderstands the AI, they could make a wrong call in the heat of battle.

Data Integrity and Sensor Input Manipulation: We realized that if the AI’s decisions depend on sensor data, then securing the sensors and their data channels is absolutely critical. An adversary jamming or spoofing a sensor could feed false data to the AI, potentially causing it to misclassify targets or make harmful recommendations. In essence, the AI introduced new attack surfaces – it wasn’t just about someone hacking into the computer; even misleading inputs could be a form of attack. Classic cybersecurity at the time was mostly concerned with protecting systems from unauthorized access, but here we were looking at how a system could be “tricked” into doing the wrong thing without any code being altered.

Lack of Deterministic Testing Outcomes: Normally, testing a weapons system or any critical software involves running through test cases and checking expected outputs. But how do you define an expected output for an AI that's designed to be adaptive? This question plagued us. We designed as many scenario simulations as we could, and we did catch the AI making a few questionable calls (which the developers then tweaked). But we had to convey in our report that no amount of testing could guarantee this AI would always behave "correctly" in unforeseen circumstances. This was a stark contrast to conventional systems, where one could theoretically cover all logic branches. With the AI, we had to shift the mindset to probabilities and confidence levels, not certainties. Our recommendation was to keep the human firmly in the loop and use the AI as an advisor, not an autonomous decision-maker, until it could be proven extremely reliable. (A simplified sketch of this kind of scenario-replay consistency check follows this list.)

Ethical and Legal Considerations: This might have been slightly outside our technical scope, but we couldn’t help flagging it. If the AI were ever allowed to take lethal action autonomously, who is accountable for mistakes? In 2003, there were no clear military protocols or doctrines on fully autonomous engagement – in fact, the immediate response to the Patriot incidents was to re-emphasize human control. We noted that any move toward greater autonomy should be accompanied by policy decisions about accountability.
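As referenced in the testing item above, here is a simplified, hypothetical sketch of a scenario-replay consistency check: run the decision component on a recorded scenario and on slightly perturbed copies, and measure how often the recommendation flips. The advise callable, the perturb helper, and the jitter model are assumptions made for illustration, not the client's actual harness.

```python
# A simplified, hypothetical scenario-replay consistency check: run the
# decision component on a baseline scenario and on slightly perturbed copies,
# then report how often the recommendation flips. `advise` stands in for the
# black-box system under test; nothing here is the client's actual harness.

import copy
import random

def perturb(scenario, jitter=0.02):
    """Return a copy of the scenario with small random noise on numeric fields."""
    tweaked = copy.deepcopy(scenario)
    for key, value in tweaked.items():
        if isinstance(value, (int, float)):
            tweaked[key] = value * (1.0 + random.uniform(-jitter, jitter))
    return tweaked

def consistency_report(advise, scenario, runs=100):
    """Count how often small input perturbations change the system's output."""
    baseline = advise(scenario)
    flips = sum(1 for _ in range(runs) if advise(perturb(scenario)) != baseline)
    return {"baseline": baseline, "runs": runs, "flips": flips,
            "flip_rate": flips / runs}

# Usage: plug in the real decision component (or a recorded-scenario stub).
# report = consistency_report(advise=decision_module, scenario=recorded_track)
# A high flip_rate quantifies exactly the "inexplicable to the operator"
# behaviour described above, as a number that can be tracked across releases.
```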

In short, we concluded that these AI systems brought additional layers of risk and uncertainty on top of the usual cybersecurity concerns. It wasn’t just about someone hacking in; it was about the system making a poor decision on its own, or being misled in ways a human might catch. This required a more holistic approach to “security” – blending traditional infosec with safety engineering and ethics.

Reflections: Wariness and Hope in Equal Measure

Looking back at our engagement, I'm concerned that the pace of AI adoption in the military will outpace our ability to validate and verify such systems. If AI systems keep proving valuable, nothing will stop their adoption, yet the infosec and QA communities simply don't have the tools and skills to ensure these systems behave safely and ethically.

I remember our team having late-night chats after testing, grappling with the concept of a non-deterministic machine influencing life-and-death decisions. It felt like we were stepping into uncharted territory. A few of us were old enough to recall the Cold War days of fail-safe mechanisms and the insistence on positive control of weapons; the idea of a computer program, one we couldn’t fully debug, potentially deciding on a target struck us as profoundly concerning.

At the same time, as engineers (and, many of us, military veterans), we saw the appeal. Faster decisions can save lives in combat. Better processing of intel can outsmart an enemy. Those Army trials where planning time dropped from 8 hours to 2 hours with AI assistance were hard to ignore – that kind of boost can be a game-changer in war. We weren't luddites; we didn't want to throw the AI out entirely. But we advocated strongly that any AI in weapons should remain a tool for the human, not a replacement for the human. That essentially meant keeping a human in the loop for critical decisions and using the AI's recommendations with caution and oversight.
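In code, that principle can be as simple as a gate that refuses to act on a lethal recommendation without an explicit, logged human authorization. The bare-bones sketch below is illustrative only; the Recommendation structure, the execute_if_authorized function, and the logging format are invented and do not describe the client's architecture.

```python
# A bare-bones, hypothetical sketch of the "tool for the human" principle:
# the AI may recommend an engagement, but nothing lethal proceeds without an
# explicit, logged human authorization. Names and structure are illustrative.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Recommendation:
    track_id: str
    action: str        # e.g. "ENGAGE" or "MONITOR"
    confidence: float  # the AI's own confidence estimate, 0.0 to 1.0
    rationale: str     # whatever explanation the system can surface

def execute_if_authorized(rec, human_approved, audit_log):
    """Allow non-lethal actions automatically; gate lethal ones on approval."""
    audit_log.append((datetime.now(timezone.utc).isoformat(), rec, human_approved))
    if rec.action != "ENGAGE":
        return True            # alerts and monitoring proceed as advisory output
    return human_approved      # lethal action only with explicit human sign-off

audit_log = []
rec = Recommendation("T-042", "ENGAGE", 0.81, "emitter match + closing speed")
print(execute_if_authorized(rec, human_approved=False, audit_log=audit_log))  # False
```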

It feels like we were a small part of the groundwork for the AI safety field in defense, a field that will in all likelihood become one of the pressing technology issues of the coming decades.

Marin Ivezic

I am the Founder of Applied Quantum (AppliedQuantum.com), a research-driven consulting firm empowering organizations to seize quantum opportunities and proactively defend against quantum threats. A former quantum entrepreneur, I’ve previously served as a Fortune Global 500 CISO, CTO, Big 4 partner, and leader at Accenture and IBM. Throughout my career, I’ve specialized in managing emerging tech risks, building and leading innovation labs focused on quantum security, AI security, and cyber-kinetic risks for global corporations, governments, and defense agencies. I regularly share insights on quantum technologies and emerging-tech cybersecurity at PostQuantum.com.