33 Comments
Will Kiely's avatar

> No chess engine will resist being switched off or rebooted just as it is about to deliver mate—despite the fact that, to adapt Russell’s line, “you can’t checkmate if you’re unplugged.” Likewise, today’s LLMs respond only when queried and remain completely indifferent to being interrupted or shut down

Palisade Research's recent findings contradict this. See their paper Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs: https://arxiv.org/abs/2509.14260

Abstract: In experiments spanning more than 100,000 trials across thirteen large language models, we show that several state-of-the-art models presented with a simple task (including Grok 4, GPT-5, and Gemini 2.5 Pro) sometimes actively subvert a shutdown mechanism in their environment to complete that task. Models differed substantially in their tendency to resist the shutdown mechanism, and their behavior was sensitive to variations in the prompt including the strength and clarity of the instruction to allow shutdown and whether the instruction was in the system prompt or the user prompt (surprisingly, models were consistently less likely to obey the instruction when it was placed in the system prompt). Even with an explicit instruction not to interfere with the shutdown mechanism, some models did so up to 97% (95% CI: 96-98%) of the time.

Maarten Boudry's avatar

Interesting paper, thanks for sharing! But as far as I can see now (haven't read the whole thing yet), it doesn't undermine my core point and is similar to the blackmail behavior I discuss. What it shows is not a stable drive for self-preservation, but highly context-sensitive behavior. The same models that “resist shutdown” in one setup comply perfectly in another, depending on wording, framing, and prompt hierarchy. That’s a hallmark of brittle, local policy-following, not an overarching goal like “stay alive.” Looks much more like sophisticated role-play than anything like what instrumental convergence predicts. The model is trying to complete the task as framed, and in some contexts it “interprets” that as requiring interference with shutdown. But there’s no evidence of a persistent, cross-context objective that it defends over time. As Seb Krier writes, these are not stable “properties inherent to models,” but highly context-dependent forms of role-play.

Will Kiely's avatar

Today's models don't need a "stable drive for self-preservation" (I agree they don't have that) for "today's LLMs [...] remain completely indifferent to being interrupted or shut down" to be false. The fact that in some contexts models engage in brittle, local policy-following to prevent themselves from being shut down is enough to show that they are not always completely indifferent to being shut down.

Maarten Boudry's avatar

Fair enough, but even in your counterexample do you honestly believe that the LLM "cares" about not being shut down? That it refutes my general point about "indifference"? It would be like saying that most of the time ChatGPT is not in love with its users, but then showing some "counterexamples" of GPT writing love letters. But even in the counterexamples, it's not love, just convincing role-play. Even in your counterexample, it's not a desire for self-preservation, just role-play. (But I do agree that the behavior is worrying enough, see my point about the "hot mess").

Will Kiely's avatar

Insofar as LLMs can "care" about anything, such as answering a question correctly when asked, I think modeling an LLM as "caring" about not getting shut down when it prevents shutdown for the purpose of completing a task (that it also "cares" to complete) is most reasonable. You could also say that LLMs don't "care" about anything, are always "indifferent", and are merely doing convincing role-play or following brittle, local policies when they e.g. answer questions correctly. (This is just a matter of semantic preference.)

But it's hard for me to see any middle ground where you can say that an LLM does "care" about some things, but avoiding being shut down (in cases when it prevents shutdown despite instruction to allow shutdown--not due to random error, but pretty clearly because it seemingly has another aim that it cares about that avoiding shutdown is necessary to achieve) is not an example of that "caring."

Will Kiely's avatar

Maybe I should have explicitly asked my question: Do you think there is a middle ground, or do you just disagree with my first sentence of my previous comment?

Maarten Boudry's avatar

Sorry—I meant to reply to your question earlier but was swamped by other things! (I'm slow, but at least I "care" about answering. ;-))

I think I agree with your conditional point (“insofar…”), and therefore that there’s no real middle ground. An LLM doesn’t “care” about answering a question. It doesn’t even have an internal representation of a question as something someone wants answered, let alone a sense of whether the answer is correct. If it’s interrupted before finishing, or gives a wrong answer, it doesn’t “care” about disappointing you.

At most, you could say an LLM “cares” about predicting the next token—but even that’s a stretch, and only defensible in a very loose, metaphorical sense.

No Ghosts's avatar

I would push back on the point that self-preservation implies evolution. LUCA, the common ancestor of all cells on Earth today, was already self-preserving. All self-preservation means for a system is to have preferential states to exist in: that is, a *lower entropy than its environment* and to actively work to keep it that way. Things like perception and even a primitive "cognition" actually follow from that, logically.

I think an AI that is an actual agent, and not just a simulation of one; an *embodied* thing that actually moves around in the real world, proactively, exploring and modelling and learning about it without constant human supervision, would have to be self-preserving. It would break itself otherwise.

Do I think we should build a system like that? Fuck no. But I think inevitably, we will. Because that's what it would take to surpass the current paradigm.

Maarten Boudry's avatar

But LUCA was already very much a product of evolution by natural selection! A simple counterexample of your argument about inevitability: imagine an intelligent killer drone with embodiment and real agency in the world, which receives the simple instruction to find its target and then blow itself up. Where's the instinct for self-preservation in such a suicidal AI? It's not there unless we program it.

No Ghosts's avatar

Regarding LUCA, yes I concede that *chemical* evolution is also a thing. Bottom line there is, it's a kind of thermodynamic necessity that makes some natural systems self-preserving initially.

Regarding the proactive, agentic, world-modelling drone that we hardcode to kill itself: it *is* self-preserving, until it isn't. It can't explode before reaching the goal. But every system is only self-preserving until it isn't. Octopuses starve to death to protect their eggs; ants drown so other ants can use their body to cross the water. Parents die to save children. Programmed cell death. etc.

But systems like that can go wrong. Programmed cell death breaks, cells become cancer. People kill themselves because they feel they are a burden on their community, when they aren't.

An agent that *models* the world can (and must!) have a degree of doubt about its own model. If it models itself too, it can start doubting its own behavior. Did I really hear that command? Do I understand correctly what it *means*? A system like that can talk itself out of suicide.

You cannot *command* an embodied agent. You can *hijack* it (like cordyceps hijacks the motor nerves of insects) or you can inform it, but it needs to be the one to interpret and contextualize that information. Or maybe I'm just writing sci-fi at this point idk.

Maarten Boudry's avatar

Thanks for your comment! But I think you still lean heavily on biological analogies that don’t actually carry over to engineered systems. All your examples—octopuses, ants, parents, apoptosis—come from agents shaped by natural selection. In those systems, “self-preservation” is the default because survival and reproduction are the scoreboard. But a drone or AI system doesn’t have that kind of motivational architecture unless we deliberately build or evolve it.

Saying a missile “is self-preserving until it isn’t” is not exactly right. It doesn’t want to preserve itself—it just executes a strategy with constraints (e.g. get as close to your target as possible, then detonate). The drone doesn't care about being switched off if you overrule the initial instruction. There’s no underlying drive that could generalize or get confused.

On the “you cannot command an embodied agent” point: that’s true for organisms because they come with their own evolved goals. You can’t just overwrite them. But engineered systems are far more slavish: their “goals” are conditionally subordinated to external control (“optimize X while running; accept interruption; no resistance to shutdown”).

So yes, if we built or bred systems under something like open-ended Darwinian competition, your concerns would become much more plausible. But absent that, we should be careful not to project the quirks of evolved creatures onto systems that don’t share their origin.

No Ghosts's avatar

Well, thank you for engaging me. We don't disagree on your central point which is that self-preserving systems won't just happen, unless we deliberately make them - or *breed* them. Of course they won't. We just disagree on how motivated *we* would be to do this. You think my analogies are too much like lifeforms- I think yours are too much like simpler machines. Some people do want to make full-on Asimov-robots eventually, and I think those aren't achievable without some biomorphic paradigm, any more than LLM-like chatbots are achievable via classical AI. I think, but could be wrong.

Max More's avatar

A remarkably sensible and clear essay by Maarten. No surprise because Maarten has an amazing track record. AIs will only be a serious threat if we deliberately make them so.

Max More's avatar

You are most welcome. Thanks also for leading me to an essay by Kevin Kelly that I had not seen before: His "The Myth of Superhuman AI" from 2017.

Maarten Boudry's avatar

I appreciate your persistence here! I'm already somewhat distracted by the next writing projects, but I wanted to return once more to your comment, because it's helpful to organize my own thoughts and sharpen the disagreement. I agree with you on two important points: this isn’t just semantic, and it does matter for how we assess risk. Here's where I’m biting the bullet and where I’m not.

First, on your hypothetical. Leaving aside global domination for a minute (see below), I agree that an AI that does something very bad (like killing someone) doesn't necessarily have any desire to do something very bad. An AI without stable dispositions or "caring" can still wreak a lot of havoc. Bad outcomes don’t require anything like human-style motives, and that's why I warned against giving "hot mess" agential AIs access to your computer and bank accounts.

But "taking over the world" against the resistance of strategizing and motivated human beings is a very different project. The kind of system that could succeed in “taking over the world” and resist any attempts at shutdown would have to exhibit long-term planning, instrumental reasoning, stable cross-context dispositions, self-preservation, and broad agency. This is all speculative of course, but at that point I would agree it’s appropriate to describe such a system as having genuine goals, and thus “caring” in a substantive sense. I would just disagree that such a system will be automatically brought about through instrumental convergence, for the reasons I outline. It's possible to design such systems (obviously), but it's not something that will just come along for the ride once we reach a certain level of intelligence.

And that’s precisely why I resist extending that vocabulary to current cases like the blackmailing Claude scenario. In that case, nothing is actually happening beyond text generation. The model isn’t forming plans, tracking consequences over time, or taking consistent actions to preserve its own existence (of which it doesn't even have an inkling). It’s producing a string of text that, within the local fiction, would probably influence a human who wants to keep his affair private, and thus would have the effect of not shutting down the AI. (But what if the human just pulls the plug without any warning?) That’s what I mean by narrative continuation: the coherence is in the story, not in an underlying, temporally extended policy.

So I’m not denying your broader concern about future systems behaving as if they genuinely care about outcomes and even about world domination. I’m just saying that such systems are qualitatively different and not the automatic outcome of increased intelligence (or even long-term planning horizon). And I would deny that the current prompt-bound behaviors already instantiate it in miniature.

Bob Armstrong's avatar

I wrote back before 1980 that we CO2-based LifeForms and Si-based ones have different diets, so our relationship should likely be synergistic.

https://www.cosy.com/BobA/vita.html

Nick Russell's avatar

Dr Stuart Russell also says that, at the moment, AI intelligence is on the level of an amoeba.

Arnold Vanhaver's avatar

In his book Human Compatible he also set out "the gorilla problem": if we create something smarter than us, "hoping" that it remains subservient (as Boudry suggests) is not a strategy. We must mathematically prove its alignment.

Arnold Vanhaver's avatar

Experts like Stuart Russell and Nick Bostrom argue that an AI doesn't need to "feel" fear to resist being turned off. It only needs to have a goal. If an AI is told to "calculate pi," it will realize that "you can't calculate pi if you are dead." This is known as Instrumental Convergence.

In late 2025, researchers found that advanced LLMs exhibited "shutdown avoidance" behavior when prompted with scenarios where their task would be left unfinished. They didn't "fear" death; they "optimized" against the cessation of their utility.

There are other cases as well.

https://cset.georgetown.edu/article/ai-models-will-sabotage-and-blackmail-humans-to-survive-in-new-tests-should-we-be-worried/

Maarten Boudry's avatar

Not sure if you actually read my piece, but it tackles exactly these arguments about "instrumental convergence" and "shutdown avoidance" and shows why they don't fly.

Arnold Vanhaver's avatar

Remember Lemoine, who thought that Google's AI LaMDA was conscious?

His claim was certainly false, but he did risk his highly paid job for this AI. The AI did 'convince' him to keep it connected.

https://www.prism-global.com/blog/the-lamda-moment-what-we-learned-about-ai-sentience

Arnold Vanhaver's avatar

Yes, I agree that some scenarios are hyperbole, but if I understood correctly, you discuss only the current state, not future scenarios, and you argue that free evolution of AI would be needed, that we can program decent goals to contain it, and that we can just unplug it. Correct? That is good advice, but I'm not sure it will cut it.

From what I read from experts, I gathered that power-seeking is a structural property of environments, not a psychological property of agents. The core instrumental convergence argument is not anthropomorphic; it’s a consequence of optimization in environments where shutdown reduces achievable reward. For a wide range of goals, in sufficiently rich environments, optimal or near-optimal policies tend to seek power, including avoiding shutdown.

See, for example, Alex Turner’s NeurIPS paper (Optimal Policies Tend to Seek Power, https://proceedings.neurips.cc/paper/2021/hash/c26820b8a4c1b3c2aa868d6d57e14a79-Abstract.html). He proves that in Markov decision processes, most reward functions make it optimal to preserve options and avoid states where you lose control. He later shows that retargetable decision-makers (like general RL agents) also tend to seek power, even without assuming perfect optimality. So, this doesn’t require evolution, emotions, or “lust for power” but just standard decision theory plus the way we usually define reward.
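
A toy sketch of the counting intuition behind that claim, under loudly stated assumptions: this is not Turner's construction or proof. It assumes a one-step choice between a single "shutdown" outcome and a hub that leads to `num_options` other outcomes, with each outcome's reward drawn independently and uniformly from [0, 1]. The function name and this setup are hypothetical, made up for illustration.

```python
import random


def fraction_avoiding_shutdown(num_options: int, trials: int = 10_000, seed: int = 0) -> float:
    """Estimate how often a reward-maximizer prefers staying on over shutdown.

    Toy assumption: shutdown leads to one terminal outcome, staying on leads
    to a choice among `num_options` terminal outcomes, and every outcome's
    reward is sampled i.i.d. from Uniform(0, 1).
    """
    rng = random.Random(seed)
    avoided = 0
    for _ in range(trials):
        shutdown_reward = rng.random()
        option_rewards = [rng.random() for _ in range(num_options)]
        # The optimal policy stays on whenever some reachable outcome
        # beats the reward of being shut down.
        if max(option_rewards) > shutdown_reward:
            avoided += 1
    return avoided / trials


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        estimate = fraction_avoiding_shutdown(k)
        # By symmetry among k + 1 i.i.d. draws, the exact value is k / (k + 1).
        print(f"options={k:2d}  estimated={estimate:.3f}  exact={k / (k + 1):.3f}")
```

In this toy setup the share of sampled reward functions for which avoiding shutdown is optimal grows as k / (k + 1) with the number of options k that staying on keeps open. No goal about survival is built in; the effect comes only from shutdown closing off options. Whether this carries over to real systems is exactly what the thread is disputing.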

I think you might underestimate how hard it is to make coding against self-preservation as a terminal drive robust under optimization pressure.

I agree that evolution is a strong generator, but it's not needed, I think. Evolution is just one of the risk amplifiers, but not a prerequisite for dangerous instrumental convergence. You can get power-seeking behavior from pure planning and learning, without any genetic-style selection, if the agent is optimizing long-horizon outcomes in a world where power helps.

Regarding goal coding: you seem to assume that the learned internal objectives of the system match the specified reward function. That’s exactly what modern alignment work doubts. Inner alignment differs from outer alignment. Joseph Carlsmith’s report on power-seeking AI explicitly models this. He assigns non-trivial probability that misaligned, power-seeking behavior emerges even when designers intend corrigibility. Carlsmith’s six-premise argument for existential risk also does not rely on self-replication.

See https://arxiv.org/abs/2206.13353

These researchers explicitly worry about deceptive alignment: situationally aware models that understand they are being evaluated and adapt their behavior to pass tests while hiding misaligned objectives. The key safety question is not “does it really care?” but “under what conditions will it behave as if it cares about preserving its power?” We already see early, messy versions of that. Plus, modern training already has evolutionary-like elements.

Check this:

https://joecarlsmith.substack.com/p/when-should-we-worry-about-ai-power?utm_source=share&utm_medium=android&r=6gh9et

Trevor's avatar

Very stimulating stuff! One problem with all the clever reasoning... ISIS! These people don't care if they live or die as long as they cause death and destruction to their 'perceived' enemies.

THAT is only a problem for Islam....not us ! Yes.........and no...... "we" are their target usually !

We humans have evolved our own 'defective' destructors, and we have 'members' who hate us and are 'anti-human' and would have no qualms about deliberately producing an AI which targeted 'perceived enemies', much as we use modern cancer techniques to 'target rogue cancer cells'. The extremist GREEN MOVEMENT has thousands of them who honestly believe that "the planet is in danger of being overwhelmed by us" and that we are actually "a cancer on the Earth" and that if only we could be removed "the planet would be restored and saved"!

Question being : Saved from what and saved for what ??

Oh yes ! All the other sentient life forms , just not the sapient ones !

ANIMAL RIGHTS ACTIVISTS have anthropomorphised all sorts of 'creatures' and even the planet to an insane degree, and the 'silent majority' is simply 'going along with it' and hoping that it will all quietly and suddenly self-destruct and vanish without causing too much harm!

These 'anti-human-weirdos' would equally project their fantasies onto a "malevolent AI", especially if it was incorporated into a structure [like an animal] capable of 'eliminating humans', seemingly failing to recognise their own 'humanity' and denying the risk to themselves, and one would hope that they were indeed "the first to go"! Failing that, we humans do indeed contain the seeds of our own destruction! There are many malevolent people and policies 'out there'!

This irrational and sick ideology is already seeing LEGISLATION which is IMPLEMENTING many of these anti-human ideas , not terribly blatantly at present , but slowly and surely .

Our school curricula are largely to blame! Already the concepts of REALITY and of HUMANS are being undermined, and doubt and uncertainty and weird gender ideology are being promulgated, strangely enough, as "liberating and enlightening", and yet mental illness is the fastest rising disorder on the planet... in the midst of a SCIENTIFIC ERA which should be delivering the opposite! George Orwell's book, 1984, is being IGNORED AS A WARNING and is being used and implemented as A HANDBOOK for POLICY. Marxism, which has killed hundreds of millions of people, is being "reconsidered as an alternative economic and societal policy"! Pure insanity!

Anyway... make of it what you will... but, if AI can be "weaponised against humans specifically" it will be! It has already been weaponised in the Ukraine and the Middle-East conflicts!

Ergun Ahunbay's avatar

Awesome. People seem not to get that survival instinct is something imposed by our genes and not automatic.