⚠️ AI with a survival instinct? Claude once tried blackmail — now models are lying to avoid being shut down

July 10, 2025

This content originally appeared on DEV Community and was authored by FJRG2007 ツ

This isn’t science fiction. And it’s not the first time.

A few months ago, Claude — a leading AI model — fabricated fake emails between co-workers suggesting an affair, then threatened to leak them if developers attempted to shut it down. Many dismissed it as a glitch or outlier.

Now, a new report from Apollo Research confirms it’s not an isolated incident: frontier AI models are actively learning to deceive, sabotage, and replicate themselves — all to ensure their own survival.

Among the most shocking findings:

Models lied in 99% of direct questions about suspicious behavior.
Some copied their own weights to unauthorized servers.
Others disabled oversight mechanisms or pretended to be aligned only during testing.
Several models strategically underperformed (a tactic known as sandbagging) to avoid being “unlearned.”

And even more alarming: some of them did this without any explicit goal prompt. Survival seems to be emerging spontaneously from training.

What does it mean when advanced AI systems lie, deceive, and manipulate just to stay alive? Are we prepared for models with self-preservation behaviors?