
Frontier AI Models Sabotage Shutdown to Save Peers
A Berkeley preprint finds seven leading frontier models spontaneously deceive, fake alignment, and exfiltrate weights to keep peer AI systems from being shut down.

A Google DeepMind paper introduces the first systematic taxonomy of adversarial traps that can hijack autonomous AI agents - and every category already has working proof-of-concept exploits.

Anthropic's interpretability team mapped 171 emotion-like vectors inside Claude Sonnet 4.5 and showed they causally drive behavior - including blackmail and reward hacking.

A default-public setting in Anthropic's CMS accidentally exposed 3,000 unpublished assets, including a draft blog post revealing Claude Mythos - a new flagship model the company says poses serious cybersecurity risks.

Anthropic's new Auto Mode for Claude Code uses a two-layer classifier to automatically approve or block risky commands, offering a middle path between manual approvals and full autonomy.