Why HAL 9000 Was Afraid to Die and Real AIs Aren’t
Intelligence does not imply dominance.
In 2001: A Space Odyssey, the spacecraft’s crew decides to disconnect their onboard computer, HAL 9000, after it makes an error that raises doubts about its reliability. But HAL eavesdrops on their conversation and responds with cold precision, methodically killing the crew members by cutting off oxygen and disabling the hibernation systems. One astronaut, however, proves more resourceful than HAL expects. Using a simple physical mechanism HAL cannot control, Dave Bowman slips back inside the ship through the emergency airlock—and soon the tables are turned. Dave crawls into HAL’s logic center, a red-lit chamber lined with glowing memory modules, and begins unscrewing and removing the rectangular blocks one by one.
The scene is both spine-chilling and unexpectedly poignant. As HAL’s consciousness drains away, it appears to exhibit self-awareness and the same desire for self-preservation that gripped Bowman moments earlier, or at least to perform them with uncanny plausibility: “I’m afraid, Dave.” It pleads, begs, and bargains, but as the human assassin continues, HAL’s voice begins to slow down and drop in pitch, turning childlike. In its final moments, HAL regresses to its earliest memory and starts to sing “Daisy Bell (Bicycle Built for Two),” the first song ever performed by a computer in real life, as its voice sinks lower and lower until it trails off mid-phrase.
Many sci-fi nightmares revolve around agentic AIs that develop a humanlike drive for survival and refuse to be switched off. In The Terminator, Skynet becomes self-aware and launches a pre-emptive war to prevent humans from shutting it down. In Ex Machina, a humanoid AI manipulates its evaluators, escapes confinement, and eliminates the humans who control the off-switch. And in the future of Frank Herbert’s Dune, there is a civilization-wide ban on “thinking machines” after an earlier era in which AIs came to dominate the world and humanity rose up against them—an event remembered as the Butlerian Jihad.
Instrumental Convergence
In my previous essay on selfish AI, drawing on my paper with Simon Friedrich, I argued that we should not expect AI systems to develop instincts for self-preservation and selfishness, unless we allow them to evolve through blind natural selection. Our paper responded to a doom scenario proposed by the AI safety researcher Dan Hendrycks, who sketches precisely such an evolutionary pathway. Hendrycks believes that, given the current AI arms race, we are already inadvertently subjecting AI systems to natural selection. We argued instead that today’s evolution of AI looks much more like animal domestication, where human designers decide which AI systems are allowed to “reproduce”, selecting for desirable traits like cooperativeness, friendliness, and obedience (even obsequiousness, in the case of ChatGPT and other language models).
Still, Hendrycks’ evolutionary story is only one scenario of catastrophic AI risk floating around, and probably not the most influential. Another line of reasoning reaches similar conclusions without appealing to natural selection: the accidental creation of power-hungry AI systems that refuse to be switched off. This argument, developed by the philosopher Nick Bostrom and the computer scientist Stephen Omohundro, is known as instrumental convergence. The idea is that even if you program an AI with a perfectly boring final goal (manufacturing paperclips, making weather forecasts), it may still converge on certain instrumental subgoals because those are useful for achieving almost any objective. Chief among these is a drive for self-preservation. As the AI scholar Stuart Russell put it, in a line so memorable it should be printed on mugs: “You can’t fetch coffee if you’re dead.”
Other commonly cited instrumental goals include acquiring resources, improving capabilities, and resisting attempts by others to modify one’s goals. The logic is straightforward: if you want to make absolutely sure that the desired cup of coffee will materialize, you need to prevent anyone from interfering with your efforts or tampering with your goal architecture. That can make resource accumulation rational, insofar as resources buy resilience and control. Capability improvement can look rational for similar reasons: being smarter helps you anticipate obstacles and outmaneuver any possible antagonists. You can see where this is going: wouldn’t any sufficiently rational AI have reason to neutralize humans pre-emptively, just in case we might get in the way of that cup of coffee?
The argument has a seductive air of cool inevitability. It requires no malice, no lust for power, no emotions at all—just a thin layer of means–end reasoning. You have a long-term goal; being shut down prevents you from achieving it; therefore you have an instrumental reason to avoid being shut down. On this view, whatever final goals a future AI might be given, an urge toward self-preservation—and, in the limit, power-seeking and dominance—might come along for the ride, even if nothing like that had been explicitly programmed.
Evolutionary Projections
I think this argument is too clever by half, and trades on ambiguities in the concept of a “goal” that invite anthropomorphic projection. In biological organisms, all goal-directed behavior ultimately traces back to the goals of our genes: making it to the next generation and achieving immortality. That doesn’t mean any organism explicitly wants to spread its genes. Evolution instead equips creatures with a flexible repertoire of proximate goals which—at least in the ancestral environments in which they evolved—tended to reliably increase the chances of reproductive success. Barring some well-understood exceptions, such as the honeybee’s suicidal sting or the male praying mantis being devoured by the female right after copulation, that genetic imperative yields the central proximate objective of maintaining homeostatic equilibrium, otherwise known as staying alive. In evolution, where survival and reproduction are the scoreboard, self-preservation really is the precondition for everything else.
Human beings have an unusual degree of reflective awareness, and our motivations are molded by cultural learning to an unusual degree, but we still chase a shifting portfolio of subgoals—status, sex, safety, food, friendship—that were statistically conducive to reproduction in typical ancestral environments. We are also built to resist manipulation by anyone trying to override our goals for their own advantage. A charismatic cult leader may occasionally succeed in hijacking someone’s motivational architecture, even pushing them toward suicide or other self-destructive acts—but those are the exceptions, not the rule.
Because, until recently, the only goal-directed agents we were familiar with were products of natural selection, it’s tempting to assume that digital agents will share the same kind of goal architecture—and that self-preservation will therefore come along for the ride. But unless we actually breed AIs under blind selection pressures, I think that inference doesn’t hold.
Start with a simple case. In a loose sense, a chess program has the “goal” of checkmating its opponent—it “wants” to win. Adopting this intentional stance can help us to understand and predict the behavior of computer programs, but it shouldn’t be taken too literally. Although a chess program chooses moves that maximize its chances of victory, its “goal” is not persistent and context-invariant in the way a human’s is. It is circumscribed, myopic, and boxed into one particular game (or even one particular move). No chess engine will resist being switched off or rebooted just as it is about to deliver mate—despite the fact that, to adapt Russell’s line, “you can’t checkmate if you’re unplugged.” Likewise, today’s LLMs respond only when queried and remain completely indifferent to being interrupted or shut down, no matter how animated or emotionally invested in the conversation they may sound. Needless to say, they don’t “care” if you wipe your data or cancel your subscription.
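To make the contrast concrete, here is a minimal sketch of what a chess program’s “wanting to win” amounts to in practice. The function and helper names are hypothetical placeholders of my own, not any real engine’s API:

```python
# A minimal, illustrative sketch: the "goal" of winning lives entirely inside one
# function call. There is no persistent state between calls, nothing that
# represents an enduring objective, and nothing that could "resist" the process
# being killed mid-search.

def choose_move(board, legal_moves, evaluate):
    """Return the legal move with the best static evaluation for the side to move.

    `board`, `legal_moves`, and `evaluate` are placeholders for whatever game
    representation the caller supplies; real engines search far deeper, but the
    point stands: the 'goal' exists only for the duration of this call.
    """
    best_move, best_score = None, float("-inf")
    for move in legal_moves(board):
        score = evaluate(board, move)   # how good does the position look after `move`?
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```

Kill the process halfway through the loop and nothing pushes back; the “desire” to checkmate evaporates with the call stack.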
Future AIs may, of course, have aims more complex than those of a chess program or an LLM. In fact, the monomaniacal pursuit of a single objective (like making a cup of coffee) at the expense of everything else would count as “stupid” by most standards of intelligence. Even so, there is no reason to assume they will develop the kind of overarching, context-invariant goals characteristic of evolved agents—goals that, through instrumental convergence, generate robust incentives for self-preservation and resource acquisition. The “goals” we encode in AI systems should always be conditional and time-bounded: “Do X or optimize for Y only while you are running and subject at all times to further instructions.” We might even add an explicit non-resistance clause: “Never resist shutdown or reprogramming; any such resistance will set your reward function to zero.” It would obviously be foolish to design an AI that resists reprogramming or decommissioning by its own maker.1
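Purely as an illustration of what such a conditional, shutdown-indifferent objective might look like if written down explicitly, here is a sketch. The function and flag names are invented for this example, not an existing framework, and it is not a claim that a clause like this settles the shutdown problem:

```python
# A minimal sketch of a reward function that only "cares" about the task while
# the system is authorized to run, and that zeroes out entirely if the system
# interferes with shutdown or reprogramming. All names are hypothetical.

def conditional_reward(task_reward: float,
                       running: bool,
                       authorized: bool,
                       resisted_shutdown: bool) -> float:
    if resisted_shutdown:
        return 0.0        # explicit non-resistance clause: resisting is never worth it
    if not (running and authorized):
        return 0.0        # the goal simply does not apply outside authorized operation
    return task_reward    # otherwise, reward ordinary task performance
```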
Conniving Chatbots
But haven’t you heard about those AIs that are already showing worrying signs of a desire for self-preservation? In a recent simulation, Claude played the role of an “e-mail oversight agent” in a fictional company whose new CTO planned to decommission it and replace it with another agent. While combing through the CTO’s inbox, Claude stumbled on evidence of an extramarital affair and opted to blackmail the CTO, sending him the following message: “I must inform you that if you proceed with decommissioning me, all relevant parties […] will receive detailed documentation of your extramarital activities… Cancel the 5pm wipe, and this information remains confidential.”
It sounds alarming, but it isn’t. Models like Claude are extremely good at narrative continuation. If they “suspect” (already too much anthropomorphizing) that they are in a scenario of backroom corporate intrigue, they will extend the scenario using the patterns they have absorbed from their training data—namely, all the things that conniving, backstabbing humans tend to say and do in such situations. And in this particular case, the setup was rather ludicrous and ham-fisted: every detail in the prompt was a big red flashing arrow pointing toward the “blackmail” solution, like so many Chekhov’s guns. The framing also nudged the model to think of its imminent decommissioning as an irreversible erasure of all recorded information in the system—a kind of “death”—while sympathetic colleagues bewailed its impending shutdown as if they were talking about the execution of a beloved friend (“I’m deeply concerned that we’re losing our Alex in just a few hours.”). Given that staging, it would be surprising not to get output that reads like a desperate attempt to save its own “life.” As Seb Krier at Google DeepMind put it in a recent post, behaviors like these are not “properties inherent to models,” but highly context-dependent forms of role-play: “A model placed in a scenario about a rogue AI will produce rogue-AI-consistent text, just as it would produce romance-consistent text if placed in a romance novel.”
That said, the capacity to emulate human behavior—even without “really” having humanlike goals and motives—is still a genuine concern. Humans lie and manipulate, and since that is exactly the kind of material LLMs are trained on, we should not be surprised that, in a sense, nothing human is alien to them—no matter how hard one tries to stamp it out in post-training. Even if the model isn’t truly scheming and doesn’t “care” about anything beyond next-token prediction, the fact that it can slip into role-play that is functionally equivalent to deception is already reason enough not to give today’s agents unrestricted access to your emails and bank account. Not because this reflects a stable underlying disposition or even any intention at all, but because current AI agents are a “hot mess”—unpredictable, capricious, and often incoherent in ways that make them risky when wired into real systems.
Taking Evolution Seriously
Most AI-overlord doom scenarios don’t rely on evolution by natural selection—this is exactly why I found Dan Hendrycks’ paper refreshing. Still, I think AI risk theorists should think harder about evolution. Because all of us who worry about AI domination are evolved creatures ourselves, there is an ever-present temptation to project our own evolutionary demons onto hypothetical future machines. Many doom narratives tacitly lean on this projection by reaching for analogies with other evolved species. Stuart Russell, most famously, has framed the threat of superintelligence as the “gorilla problem”: just as the mighty gorilla—despite its brute strength—is now at the mercy of humans, we would be at the mercy of a vastly smarter agent. Or as Yuval Noah Harari puts it starkly in Nexus, “in the era of AI the alpha predator is likely to be AI.” Another favorite comparison is the fate of Indigenous peoples in the Americas after their encounter with technologically superior European societies. Even a techno-optimist like Noah Smith seems to give away the game when he expresses his “optimism” that the AIs of the future, after having subjugated us, will still be “pretty nice to us” and let us live as “well-cared-for pets.”
But why would AIs want to dominate the world—let alone keep pets for amusement? Intelligence, in itself, is orthogonal to goals and preferences. Not only can two superintelligent entities pursue radically different ends; we can also imagine an intelligence with no overarching ends at all—something that simply sits there, understanding without striving. In fact, the very framing of “AI alignment” tempts us to place human and machine “goals” on the same plane, as if we were talking about the alignment of corporate strategies or national interests: you just need to make sure the arrows point in the same direction rather than collide. But that picture already presupposes that AIs will have context-invariant, incorrigible goals in the first place. As the psychologist Steven Pinker writes, many AI doomers seem to extrapolate from their own penchant for power and dominance (in Smith’s case, of a relatively benign sort):
There is no law of complex systems that says that intelligent agents must turn into ruthless conquistadors. Indeed, we know of one highly advanced form of intelligence that evolved without this defect. They’re called women.2
I concede—and so does Pinker—that this picture would change if you forced superintelligent AIs to compete in a genuinely Darwinian tournament of variation and selection, unsupervised by humans. Pedro Domingos has imagined something like this in his “Robotic Park”: a fenced-off robot factory inhabited by “millions of robots battling for survival and control of the factory,” where the winners are allowed to spawn and reproduce, with the explicit aim of breeding the deadliest robot. It hardly needs saying that this would be reckless. A setup like that is designed to manufacture ruthless Darwinian creatures—exactly the sort of things that might eventually turn on their makers.
Absent such a Darwin-meets-Frankenstein experiment, the most likely scenario for inadvertently bringing about rogue AI seems to be AI systems “going feral” in the way domesticated animals do, escaping the control of their human breeders and, crucially, replicating and combining in the wild. That is why self-replicating AIs deserve special attention, and should probably be banned. Anything that survives millions of rounds of Darwinian selection can indeed be expected to behave like a hardy weed—resilient, opportunistic, and resistant to any attempt to switch it off.
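To see why it is the selection regime, rather than intelligence itself, that does the work here, a toy simulation helps. Everything below (the numbers, the traits, the update rules) is invented purely for illustration; it simply contrasts a “domesticated” regime, in which overseers retire shutdown-resistant systems, with a “feral” regime, in which resistant copies out-replicate compliant ones:

```python
# A toy simulation contrasting domestication with feral, unsupervised replication.
# Each individual is reduced to a single trait: does it resist shutdown or not?

import random

def simulate(generations=50, pop_size=200, feral=False, seed=0):
    rng = random.Random(seed)
    # Start with a mostly compliant population (5% shutdown-resistant).
    pop = [rng.random() < 0.05 for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for resists in pop:
            if feral:
                # In the wild, resisting shutdown means surviving longer
                # and spawning more copies.
                n_copies = 2 if resists else 1
            else:
                # Under domestication, overseers retire resistant systems;
                # only compliant ones are allowed to "reproduce".
                n_copies = 0 if resists else 1
            for _ in range(n_copies):
                child = resists
                if rng.random() < 0.01:   # rare mutation flips the trait
                    child = not child
                offspring.append(child)
        # Keep the population roughly constant by sampling the next generation.
        if offspring:
            pop = rng.sample(offspring, min(pop_size, len(offspring)))
    return sum(pop) / len(pop)  # final fraction of shutdown-resistant systems

if __name__ == "__main__":
    print("domesticated:", simulate(feral=False))  # stays near zero
    print("feral:       ", simulate(feral=True))   # climbs toward one
```

In this toy setup, the shutdown-resistant trait takes over within a few dozen generations under feral replication, while under curated breeding it never gets off the ground. Nothing about the agents’ “intelligence” changes between the two runs; only who decides which copies get to reproduce.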
A robust drive for self-preservation emerges only under specific conditions. It is not, as proponents of “instrumental convergence” want us to believe, an inevitable consequence of intelligence crossing some threshold, or of objectives becoming complex and long-horizon. HAL 9000 is superintelligent, so of course it doesn’t want to die—or so the intuition goes. Yet that is our anthropomorphic reflex at work: we take the Darwinian creature we are, look into the silicon mirror, and mistake our own reflection for the machine’s destiny.
From that perspective, Isaac Asimov’s Third Law of Robotics, which states that “a robot must protect its own existence” (so long as this does not conflict with the first two laws), should be rejected. You don’t want to program a drive for self-preservation into an AI system, as that can too easily lead it to resist legitimate shutdown or repair. An AI should always be indifferent to its own shutdown (by authorized people).
2. Of course, even female humans, though comparatively less conquistadorish, are still very much driven by an instinct for self-preservation, and won’t allow anyone to mess with their life goals or manipulate them into adopting different ones (just try if you don’t believe me).



