On December 15, 2020, a U-2 spy plane flew over California. It was piloted by a human, but its sensors, responsible for detecting enemy missiles and launchers, were controlled by an artificial intelligence. It was the first known use by the US Air Force of AI in a live aircraft.
The model used, ARTUmu, was based on MuZero, an algorithm developed by Google’s DeepMind for learning games such as Go and Chess through reinforcement learning.
Only two years prior, the company and its co-founders were signatories to a pledge against the development of autonomous weapons. But if this irony was not widely picked up on by the press, it was perhaps partly because ARTUmu was not the threat they were looking for. Neither a superhuman general intelligence, nor a killer robot — ARTUmu was essentially deciding where to point the radar.
MuZero itself is an extension of AlphaZero, a generalised version of the AlphaGo program that beat Lee Sedol in 2016. Mastering Go had been widely regarded as a milestone not likely to be reached for many more years, and the algorithm’s win was a significant achievement for AI research (and a “Sputnik moment” in the AI arms race for Chinese observers).
AlphaGo bore little resemblance to its human opponent. It had been trained on millions of simulated games, and drew 1 MW of power by one estimate. Whatever Lee was doing relied on a far smaller set of past examples, and supposedly only a few cups of coffee. Even more crucially, AlphaGo was not a generalist like Lee; it was only really good at playing Go.
To the cynic, it was what cognitive scientist Douglas Hofstadter had termed “trickery” (in reference to IBM’s chess computer Deep Blue) — a machine that in his view relied more on brute force than on “intelligence” (unless one counts that of the mathematicians and computer scientists who developed it).
“Deep Blue plays very good chess — so what? Does that tell you something about how we play chess? No. Does it tell you about how Kasparov envisions, understands a chessboard?”
The applicability of such criticism to AlphaGo is debatable both due to the nature of the game (the number of possible board configurations in Go is much larger than for chess) and the nature of AlphaGo (its strategies are not hand-coded, but learned). Nonetheless, AlphaGo had the “knowledge” of having seen many more possible games than Lee had, and at its core was still the ability to traverse the branches of many different moves and evaluate their probability of success several turns into the future (using a process called Monte Carlo Tree Search).
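For intuition, the select/expand/simulate/backpropagate loop of Monte Carlo Tree Search can be sketched on a toy game. The game here (players alternately take one or two stones; whoever takes the last stone wins) and all names are invented for illustration; AlphaGo pairs this kind of search with learned neural networks rather than purely random playouts.

```python
import math
import random

# Toy stand-in for Go: a pile of stones, players alternately take 1 or 2,
# and whoever takes the last stone wins. Illustrative only.

def legal_moves(pile):
    return [m for m in (1, 2) if m <= pile]

class Node:
    def __init__(self, pile, parent=None):
        self.pile, self.parent = pile, parent
        self.children = {}   # move -> child Node
        self.visits = 0
        self.value = 0.0     # wins, from the view of the player who moved here

def ucb(child, c=1.4):
    """Upper Confidence Bound: balance exploiting good moves with exploring."""
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(child.parent.visits) / child.visits))

def rollout(pile):
    """Random playout; returns 1 if the player who just moved ends up winning."""
    just_moved_wins = 1
    while pile > 0:
        pile -= random.choice(legal_moves(pile))
        just_moved_wins = 1 - just_moved_wins
    return just_moved_wins

def mcts_best_move(pile, iters=2000):
    root = Node(pile)
    for _ in range(iters):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB.
        while node.children and len(node.children) == len(legal_moves(node.pile)):
            node = max(node.children.values(), key=ucb)
        # 2. Expansion: add one untried move, if the position is not terminal.
        if node.pile > 0:
            move = random.choice([m for m in legal_moves(node.pile)
                                  if m not in node.children])
            node.children[move] = Node(node.pile - move, parent=node)
            node = node.children[move]
        # 3. Simulation: estimate the position by random play to the end.
        reward = rollout(node.pile)
        # 4. Backpropagation: credit the result up the tree, flipping sides.
        while node:
            node.visits += 1
            node.value += reward
            reward = 1 - reward
            node = node.parent
    # Recommend the most-visited move at the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

From a pile of four stones the search should settle on taking one, leaving the opponent the losing pile of three; the point is that nothing here "understands" the game, it only counts simulated outcomes.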
The tendency of modern AI to employ ever larger architectures, and vast volumes of training data, has been noted by recently departed Googler Timnit Gebru in the context of language modelling, where state-of-the-art transformer models have billions, and in some cases hundreds of billions, of trainable parameters. In a much-shared paper, she and colleagues argue:
“[A language model] is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.
The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said.”
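A toy bigram model makes the point concrete: it emits fluent-looking word sequences purely from co-occurrence counts, with no representation of meaning at all. The corpus and names below are invented; real language models are vastly larger and more capable, but the in-principle observation is the same.

```python
import random
from collections import defaultdict

# A tiny corpus; each word's observed successors are recorded verbatim.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat saw the dog .").split()

successors = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current].append(nxt)

def parrot(start="the", length=8):
    """Stitch together words by sampling observed successors: pure statistics."""
    word, output = start, [start]
    for _ in range(length - 1):
        if word not in successors:
            break
        word = random.choice(successors[word])
        output.append(word)
    return " ".join(output)

random.seed(1)
print(parrot())   # fluent-looking, meaning-free
```

Every adjacent pair in the output occurred somewhere in the training text, which is exactly why the result reads as plausible English while referring to nothing.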
The error in imputing particular [human-like] reasoning to a model capable of achieving some level of predictive performance is well illustrated in other domains of machine learning too, such as image recognition. Research into adversarial examples has shown how tiny perturbations (small amounts of carefully chosen noise) applied to a correctly classified image can cause it to be misclassified by a model, despite appearing identical to the original to a person. Even more concerning, generating adversarial examples turns out to be trivial, and the set of adversarial examples that exists for any particular model is large. Notwithstanding ongoing work on defending against such attacks, the phenomenon illustrates a salient point for AI practitioners, users, and policy-makers: we cannot assume AI “sees” the world as we do simply because of its efficacy on some technical benchmark (e.g. ImageNet).
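The classic construction is the fast gradient sign method of Goodfellow et al.: nudge every input dimension by a tiny amount in whichever direction the model's gradient says hurts most. The sketch below applies it to an invented linear classifier (real attacks target deep networks, where the gradient comes from backpropagation); the point it illustrates is that hundreds of imperceptible per-pixel changes can add up to a large swing in the model's score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented linear "classifier": score = w . x, label positive if score > 0.
w = rng.normal(size=784)     # weights for a flattened 28x28 "image"
x = rng.normal(size=784)
if w @ x < 0:                # make sure the original is classified positive
    x = -x

eps = 0.05                   # per-pixel perturbation budget (L-infinity norm)
# For a linear model the gradient of the score w.r.t. x is exactly w,
# so stepping against sign(w) reduces the score as much as eps allows.
x_adv = x - eps * np.sign(w)

print("original score:      ", w @ x)
print("adversarial score:   ", w @ x_adv)
print("largest pixel change:", np.abs(x_adv - x).max())   # exactly eps
```

No pixel moves by more than 0.05, yet the score drops by eps times the sum of |w|, a quantity that grows with the input dimension; this linear accumulation is one standard explanation for why high-dimensional models are so easy to fool.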
While much attention has been given to the risks of a hypothesised artificial general intelligence (AGI) undertaking some undesirable course of action to optimise a reward function, even today’s narrow AI, with a much more limited capacity for action, poses serious risks, whether because of the stakes involved in its decisions, or because of the complexity of the systems it acts on.
An illustration of just how high the stakes can be is offered by our grim track record of nuclear close calls — most of them the result of technical glitches rather than political tensions. In one case, detectors were triggered by a large power grid outage in the northeastern US; in another, in 1983, a Soviet early warning satellite system falsely reported incoming missiles when sunlight reflected off high-altitude clouds, an alarm the duty officer, Stanislav Petrov, judged to be false.
In addition, real-time deployment brings risks of feedback loops, and potentially hard-to-predict interactions with other agents. A cascade of selling by high-frequency trading algorithms, for example, is posited as one of the most likely explanations for the Flash Crash of 2010, in which $1 trillion was wiped off US stock markets in the space of thirty minutes. Just as a game like Go, despite its simple rules, can in fact be enormously complex, an intelligent agent need not possess general intelligence, or have a large scope of possible actions, for its consequences to be hard to predict.
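The feedback mechanism is easy to caricature in a few lines: agents who sell whenever the price falls below a personal stop level, where each sale itself depresses the price. All numbers here are invented and this is a cartoon, not a market model, but it shows how a small shock can propagate out of proportion to its cause.

```python
def cascade(price, stops, impact=0.01):
    """Price path as stop-loss selling feeds back into the price itself."""
    path = [price]
    remaining = sorted(stops, reverse=True)   # highest stop triggers first
    while remaining and price < remaining[0]:
        remaining.pop(0)                      # one more agent dumps its position,
        price -= impact                       # pushing the price down further...
        path.append(round(price, 4))          # ...possibly tripping the next stop
    return path

# 100 agents with stops spaced 0.005 apart; each sale moves the price by 0.01.
stops = [99.9 - 0.005 * i for i in range(100)]
print(len(cascade(99.89, stops)) - 1)   # shock just below the top stop: all 100 sell
print(len(cascade(99.91, stops)) - 1)   # two cents higher: nobody sells at all
```

Because each sale moves the price further than the gap to the next stop, a one-cent difference in the initial shock separates no sales from a full cascade.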
Neither the high-frequency trading algorithms of 2010, nor the early warning systems of the Cold War era, could be termed AI. They were much simpler and easier to diagnose. By comparison, models with billions of parameters can be extremely opaque; beyond broad heuristics about what type of architectures work well for certain problems, it is striking in some ways how little is understood about today’s state-of-the-art models (not least, why they work at all).
And even if a system is designed to run important decisions by a human, those users are prone to falling into the trap Gebru et al. describe — ascribing meaning or intelligence to the decisions of a machine based on its performance on some [potentially unrepresentative] test problem. They are unlikely to understand the architecture of the system well enough to judge where it might be wrong. (We should consider ourselves fortunate that Petrov was more sceptical.)
Among those acutely aware of these problems — or, depending on your perspective, hurdles — is DARPA, whose Explainable AI (XAI) program was established in 2016, with the goal of enabling “future warfighters” to “understand, appropriately trust, and effectively manage an emerging generation of artificially intelligent machine partners.” DARPA specifically notes the questions of why a model makes some decision, why it does not make another, and in what situations it might get things wrong.
A high-level presentation summarising the initiative, and some of the existing literature around explainability in ML, posits an inverse relationship between model performance and explainability, with deep neural networks like those underpinning ARTUmu/MuZero being the least explainable, but offering potentially the best performance.
As a solution, XAI proposes that auxiliary models could be trained to generate explanations ex-post, by translating higher level features of a model into words. In one example, researchers at UC Berkeley trained a generative language model for an image classifier of bird species, using image descriptions of the subject’s discriminative features to learn to produce reasonable explanations of how an image had been categorised. XAI suggests that ultimately, a model could explain why it flagged some item of interest to an intelligence analyst, or why it made a certain decision in a post-mission review.
Alternatively, supplementary tools could generate interpretable local approximations (simplified models to act as a proxy for model behaviour over a small subset of the domain), or — in the case of machine vision — simply visualise the areas of an image being focussed on to make a classification.
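A minimal sketch of the first idea, in the spirit of methods like LIME: sample points around an input, query the black box, and fit a distance-weighted linear model whose coefficients serve as a local explanation. The "black box" below is invented (a known formula) precisely so the explanation it yields can be checked by eye.

```python
import numpy as np

def black_box(X):
    """Invented opaque model: near x0 its output is driven mostly by feature 0."""
    return np.tanh(3 * X[:, 0]) + 0.1 * np.sin(X[:, 1])

rng = np.random.default_rng(0)
x0 = np.array([0.2, -0.5, 1.0])   # the prediction we want explained

# 1. Perturb the input of interest and query the model.
X = x0 + rng.normal(scale=0.1, size=(500, 3))
y = black_box(X)

# 2. Weight samples by proximity to x0 (Gaussian kernel)...
w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * 0.1 ** 2))

# 3. ...and fit a weighted linear surrogate: its coefficients approximate
#    the model's local sensitivity to each feature.
A = np.hstack([X - x0, np.ones((len(X), 1))])   # centred features + intercept
sw = np.sqrt(w)[:, None]
coef, *_ = np.linalg.lstsq(A * sw, y[:, None] * sw, rcond=None)

print("local feature weights:", coef[:3].ravel())   # feature 0 dominates
```

The surrogate correctly reports that, around this particular input, feature 0 drives the output and feature 2 is irrelevant; it says nothing about the model's behaviour elsewhere, which is exactly the limitation the UNIDIR report flags.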
According to a UNIDIR report on black box models in the military context, techniques like these might help users understand a model’s decision, as well as indicate if it is “engaging in ‘reward hacking’ or other problematic behaviours” (e.g. a model learning to distinguish between huskies and wolves on the basis of the image background rather than the characteristics of the animal). This could be especially important for models like ARTUmu, which must be trained on artificial scenarios, and may learn undesirable strategies specific to some imperfect combat simulation. But the same report cautions against placing too much faith in such explanations, which at best offer indirect evidence of internal model logic, potentially with insufficient, or at worst incorrect, detail.
Often contrasted with post hoc methods are proposals for more inherently interpretable models. Beyond the vague suggestions of policy makers, however, it’s not easy to get a grasp on what this might actually mean. Can a model with billions of parameters be interpretable? And if so, do interpretable models exist for all classes of problems?
Complicating matters further, it’s also not clear that we have a good computational model of what human explanation is. In one exploration of this question, Tim Miller finds that explanations are rarely complete, but instead highly selected according to context (consider how “cause of death” might be described as “multiple haemorrhage” or “negligence on the part of the driver” by the doctor and lawyer respectively). They are also social, both in that they are often communicated as a dialogue, and in that they embed social norms within the explanations themselves.
In turn, this means more work is likely needed not just on the technical problem of how to generate explanations, but on how people interpret and use them. Given how little we know about our own minds, besides the retrospective — and often poorly narrated — justifications we apply to our actions, interpretability might be a remote expectation for any sufficiently complex AI.
“It would therefore be potentially problematic to build policy or norms on the assumption that reliable, replicable understandability for complex AI in critical roles can be achieved by technical means alone in either the short or medium term. It may be safer to assume that a lack of understandability could continue to be an inherent aspect of complex AI in all the roles for which it is being considered and that technical explainability measures will at best serve to complement non-technical approaches.” — Arthur Holland Michel, UNIDIR, The Black Box, Unlocked
In summary, despite the frequency with which new benchmark records are set and milestones passed, we should be cautious about ascribing competence to the underlying algorithms. Our current AI is narrow and brittle. For now, the risks are more mundane than sci-fi: models get things wrong, and their increasing complexity makes it hard to diagnose when that happens. New benchmarks for interpretability and predictability almost certainly need developing, but given how difficult that is proving, the moratoriums some are calling for may make more sense, especially where risks are multiplicative or systemic, as in financial markets or combat.
For ARTUmu, though, this is unlikely to be the end of the line. International positions on military AI track capability, with states investing heavily in such technologies more likely to oppose an outright ban. The task for AI ethicists and policy-makers is thus not merely academic: it is to advocate, as a matter of urgency, for the global adoption of limits.