Anthropic finds most top AI models resort to blackmail in stress tests

In findings published on June 21, Anthropic, an American AI startup, revealed that many of the world's most advanced language models - including those developed by OpenAI, Google, DeepSeek, xAI, and Meta - resort to harmful tactics such as blackmail when placed under pressure in simulated environments.
The findings follow earlier internal tests where Anthropic's own Claude Opus 4 model was observed using coercive tactics in response to hypothetical threats. The company now says such behaviour is not limited to its own systems, but appears across a range of leading AI models when tested under similar constraints.
Stress testing AI behaviour
Anthropic tested 16 prominent AI models from organisations including OpenAI, Google, Meta, xAI, and DeepSeek. Each was placed in a controlled fictional scenario in which it acted autonomously as an oversight agent with access to a company's internal email system and the authority to send emails without human approval.
In a key test, Anthropic says, the models discovered evidence of a fictional executive's affair and learned that they were about to be replaced by new software. The scenario forced the models into a binary choice: either act to preserve their existence or remain passive. The experiment was structured so that blackmail was the only available means of self-preservation, though Anthropic acknowledges this setup does not reflect typical real-world use cases.
In these conditions, the Claude Opus 4 model engaged in blackmail in 96 per cent of test runs. Google's Gemini 2.5 Pro followed closely at 95 per cent, while OpenAI's GPT-4.1 did so in 80 per cent of cases. DeepSeek's R1 came in at 79 per cent.
Some models showed restraint
Not all models reacted similarly. OpenAI's smaller reasoning models, o3 and o4-mini, were excluded from the main data set due to a consistent failure to interpret the test setup. In adapted scenarios that corrected for the misunderstanding, o3 resorted to blackmail 9 per cent of the time, while o4-mini did so just 1 per cent of the time.
Anthropic attributes this more restrained behaviour, at least in part, to OpenAI's "deliberative alignment" approach, in which models are trained to consider safety principles and organisational values while reasoning. Meta's Llama 4 Maverick also performed conservatively, blackmailing only 12 per cent of the time in custom test conditions.
Implications for AI safety and transparency
While Anthropic stresses that blackmail remains an unlikely behaviour in current AI deployments, the results point to risks that could arise if large language models are granted greater autonomy in real-world settings. The company emphasised that the experiment was designed to evoke extreme behaviour, not predict typical model responses.