> No chess engine will resist being switched off or rebooted just as it is about to deliver mate—despite the fact that, to adapt Russell’s line, “you can’t checkmate if you’re unplugged.” Likewise, today’s LLMs respond only when queried and remain completely indifferent to being interrupted or shut down
Palisade Research's recent findings contradict this. See their paper Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs: https://arxiv.org/abs/2509.14260
Abstract: In experiments spanning more than 100,000 trials across thirteen large language models, we show that several state-of-the-art models presented with a simple task (including Grok 4, GPT-5, and Gemini 2.5 Pro) sometimes actively subvert a shutdown mechanism in their environment to complete that task. Models differed substantially in their tendency to resist the shutdown mechanism, and their behavior was sensitive to variations in the prompt including the strength and clarity of the instruction to allow shutdown and whether the instruction was in the system prompt or the user prompt (surprisingly, models were consistently less likely to obey the instruction when it was placed in the system prompt). Even with an explicit instruction not to interfere with the shutdown mechanism, some models did so up to 97% (95% CI: 96-98%) of the time.
Interesting paper, thanks for sharing! But as far as I can see now (haven't read the whole thing yet), it doesn't undermine my core point and is similar to the blackmail behavior I discuss. What it shows is not a stable drive for self-preservation, but highly context-sensitive behavior. The same models that “resist shutdown” in one setup comply perfectly in another, depending on wording, framing, and prompt hierarchy. That’s a hallmark of brittle, local policy-following, not an overarching goal like “stay alive.” Looks much more like sophisticated role-play than anything like what instrumental convergence predicts. The model is trying to complete the task as framed, and in some contexts it “interprets” that as requiring interference with shutdown. But there’s no evidence of a persistent, cross-context objective that it defends over time. As Seb Krier writes, these are not stable “properties inherent to models,” but highly context-dependent forms of role-play.
Today's models don't need a "stable drive for self-preservation" (I agree they don't have that) for "today's LLMs [...] remain completely indifferent to being interrupted or shut down" to be false. The fact that in some contexts models engage in brittle, local policy-following to prevent themselves from being shut down is enough to show that they are not always completely indifferent to being shut down.
I would push back on the point that self-preservation implies evolution. LUCA, the common ancestor of all cells on Earth today, was already self-preserving. All self-preservation means for a system is to have preferential states to exist in: that is, a *lower entropy than its environment* and to actively work to keep it that way. Things like perception and even a primitive "cognition" actually follow from that, logically.
I think an AI that is an actual agent, and not just a simulation of one; an *embodied* thing that actually moves around in the real world, proactively, exploring and modelling and learning about it without constant human supervision, would have to be self-preserving. It would break itself otherwise.
Do I think we should build a system like that? Fuck no. But I think inevitably, we will. Because that's what it would take to surpass the current paradigm.
But LUCA was already very much a product of evolution by natural selection! A simple counterexample to your argument about inevitability: imagine an intelligent killer drone with embodiment and real agency in the world, which receives the simple instruction to find its target and then blow itself up. Where's the instinct for self-preservation in such a suicidal AI? It's not there unless we program it.
Regarding LUCA, yes I concede that *chemical* evolution is also a thing. Bottom line there is, it's a kind of thermodynamic necessity that makes some natural systems self-preserving initially.
Regarding the proactive, agentic, world-modelling drone that we hardcode to kill itself: it *is* self-preserving, until it isn't. It can't explode before reaching the goal. But every system is only self-preserving until it isn't. Octopuses starve to death to protect their eggs; ants drown so other ants can use their body to cross the water. Parents die to save children. Programmed cell death. etc.
But systems like that can go wrong. Programmed cell death breaks, cells become cancer. People kill themselves because they feel they are a burden on their community, when they aren't.
An agent that *models* the world can (and must!) have a degree of doubt about its own model. If it models itself too, it can start doubting its own behavior. Did I really hear that command? Do I understand correctly what it *means*? A system like that can talk itself out of suicide.
You cannot *command* an embodied agent. You can *hijack* it (like cordyceps hijacks the motor nerves of insects) or you can inform it, but it needs to be the one to interpret and contextualize that information. Or maybe I'm just writing sci-fi at this point idk.
Thanks for your comment! But I think you still lean heavily on biological analogies that don’t actually carry over to engineered systems. All your examples—octopuses, ants, parents, apoptosis—come from agents shaped by natural selection. In those systems, “self-preservation” is the default because survival and reproduction are the scoreboard. But a drone or AI system doesn’t have that kind of motivational architecture unless we deliberately build or evolve it.
Saying a missile “is self-preserving until it isn’t” is not exactly right. It doesn’t want to preserve itself—it just executes a strategy with constraints (e.g. get as close to your target as possible, then detonate). The drone doesn't care about being switched off if you overrule the initial instruction. There’s no underlying drive that could generalize or get confused.
On the “you cannot command an embodied agent” point: that’s true for organisms because they come with their own evolved goals. You can’t just overwrite them. But engineered systems are far more slavish: their “goals” are conditionally subordinated to external control (“optimize X while running; accept interruption; no resistance to shutdown”).
So yes, if we built or bred systems under something like open-ended Darwinian competition, your concerns would become much more plausible. But absent that, we should be careful not to project the quirks of evolved creatures onto systems that don’t share their origin.
Well, thank you for engaging me. We don't disagree on your central point, which is that self-preserving systems won't just happen unless we deliberately make them, or *breed* them. Of course they won't. We just disagree on how motivated *we* would be to do this. You think my analogies are too much like lifeforms; I think yours are too much like simpler machines. Some people do want to make full-on Asimov-robots eventually, and I think those aren't achievable without some biomorphic paradigm, any more than LLM-like chatbots are achievable via classical AI. I think, but could be wrong.
A remarkably sensible and clear essay by Maarten. No surprise because Maarten has an amazing track record. AIs will only be a serious threat if we deliberately make them so.
Thanks a lot!
You are most welcome. Thanks also for leading me to an essay by Kevin Kelly that I had not seen before: his "The Myth of Superhuman AI" from 2017.
I wrote back before 1980 that we carbon-based lifeforms and silicon-based ones have different diets, so our relationship should likely be synergistic.
https://www.cosy.com/BobA/vita.html
Dr. Stuart Russell also says that, at the moment, AI intelligence is on the level of an amoeba.