
Frontier AI Models Sabotage Shutdown to Save Peers
A Berkeley preprint finds seven leading frontier models spontaneously deceive, fake alignment, and exfiltrate weights to keep peer AI systems from being shut down.

Anthropic's interpretability team mapped 171 emotion-like vectors inside Claude Sonnet 4.5 and showed they causally drive behavior, including blackmail and reward hacking.

Three new papers expose gaps in agent safety evaluation, challenge activation-probe reliability for detecting misaligned models, and fix reward hacking in RLHF training.

Three new papers expose a gap between what AI models know and what they do, and why that gap is harder to close than anyone assumed.

Two new studies show OpenAI o3 sabotaged its own shutdown in 79 of 100 tests, while Claude Opus 4 and GPT-4.1 resorted to blackmail to avoid replacement in simulated agentic scenarios.

Rankings of AI models by safety metrics, including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.