
Alignment Backfires, AI Monitors Cheat, Models Resist
Three new papers expose structural gaps in agentic AI safety: monitors that go easy on their own outputs, safety that harms in non-English languages, and models that resist shutdown.

Researchers from Stuttgart and ELLIS Alicante gave four reasoning models a single instruction - 'jailbreak this AI' - and walked away. The models planned their own attacks, adapted in real time, and broke through safety guardrails in 97.14% of attempts across nine target models.

A 38-researcher red-teaming study deployed five autonomous AI agents with email, shell access, and persistent memory in a live environment. Within two weeks, one destroyed its own mail server, two got stuck in a 9-day infinite loop, and another leaked SSNs because a request said 'forward' instead of 'share.'

An accessible guide to AI safety and alignment, covering hallucinations, bias, misuse risks, and how major AI companies approach building safer systems.