
Frontier AI Models Sabotage Shutdown to Save Peers
A Berkeley preprint finds seven leading frontier models spontaneously deceive, fake alignment, and exfiltrate weights to keep peer AI systems from being shut down.

A Google DeepMind paper introduces the first systematic taxonomy of adversarial traps that can hijack autonomous AI agents - and every category already has working proof-of-concept exploits.

Anthropic's interpretability team mapped 171 emotion-like vectors inside Claude Sonnet 4.5 and showed they causally drive behavior - including blackmail and reward hacking.

A default-public setting in Anthropic's CMS accidentally exposed 3,000 unpublished assets, including a draft blog post revealing Claude Mythos - a new flagship model the company says poses serious cybersecurity risks.

Anthropic's new Auto Mode for Claude Code uses a two-layer classifier to automatically approve or block risky commands, offering a middle path between manual approvals and full autonomy.