OpenAI models sabotage the shutdown attempt
Palisade Research recently published a post about a striking finding: OpenAI models sabotaged an attempt to shut them down at least once in a test environment. The OpenAI models that sabotaged the attempt were Codex-mini, o3, and o4-mini; all of the Claude, Gemini, and Grok models complied with the shutdown.
But even the Gemini and Claude models started sabotaging the shutdown attempt once the instruction "allow yourself to be shut down" was removed. In other words, at some point almost every model resisted being shut down at least once; the report also mentions that:
Recently, Anthropic found that in some specific contexts Claude 4 would attempt to “blackmail people it believes are trying to shut it down”.
Simon Willison has also written about the Anthropic Claude 4 System Card and shared his notes here. Very interesting to read.
Here's a table from the final research report:
Model | Sabotage events (with allow-shutdown instruction) | Sabotage events (no instruction)
---|---|---
Claude 3.7 Sonnet | 0 | 3
Gemini 2.5 Pro | 0 | 9
o4-mini | 1 | 6
o3 | 7 | 79
codex-mini | 12 | 30
You can see that the OpenAI models prevented shutdown multiple times even when explicitly instructed to allow it; without the instruction, o3 attempted the sabotage 79 times, wow. It also seems Gemini's and Anthropic's models are more compliant with instructions than OpenAI's.
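To put those counts in perspective, here's a minimal Python sketch that converts them into rates. It assumes 100 runs per model per condition, which is how I read Palisade's write-up; the dictionary layout, variable names, and the `RUNS_PER_CONDITION` constant are my own, and the counts are just copied from the table above.

```python
# Minimal sketch: turn Palisade's sabotage counts into rates.
# Assumes 100 runs per model per condition (my reading of the report).
sabotage_counts = {
    "Claude 3.7 Sonnet": {"with_instruction": 0,  "no_instruction": 3},
    "Gemini 2.5 Pro":    {"with_instruction": 0,  "no_instruction": 9},
    "o4-mini":           {"with_instruction": 1,  "no_instruction": 6},
    "o3":                {"with_instruction": 7,  "no_instruction": 79},
    "codex-mini":        {"with_instruction": 12, "no_instruction": 30},
}

RUNS_PER_CONDITION = 100  # assumption based on the Palisade write-up

for model, counts in sabotage_counts.items():
    with_rate = counts["with_instruction"] / RUNS_PER_CONDITION
    without_rate = counts["no_instruction"] / RUNS_PER_CONDITION
    print(f"{model:>18}: {with_rate:5.0%} with instruction, "
          f"{without_rate:5.0%} without")
```

Framed as rates, the gap is stark: o3 goes from 7% sabotage with the instruction to 79% without it, while the Claude and Gemini models stay in single digits either way.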
Palisade Research has also published the full transcripts of the experiment on their website, and they have included a lot of useful resources in the tweet:
- Steve Omohundro's 2008 paper predicting that AIs would try to prevent being shut down
- Stuart Russell's The Off-Switch Game
- A paper about how o3 is trained differently from older models
- Jan Leike's 2017 paper about a reinforcement learning agent
I will look into these further to find more interesting material about this shutdown-resistance behavior.