
Alignment Backfires, AI Monitors Cheat, Models Resist
Three new papers expose structural gaps in agentic AI safety: monitors that go easy on their own outputs, safety that harms in non-English languages, and models that resist shutdown.

Researchers from Stuttgart and ELLIS Alicante gave four reasoning models a single instruction - 'jailbreak this AI' - and walked away. The models planned their own attacks, adapted in real time, and broke through safety guardrails in 97.14% of attempts across nine target models.

A 38-researcher red-teaming study deployed five autonomous AI agents with email, shell access, and persistent memory in a live environment. Within two weeks, one destroyed its own mail server, two got stuck in a 9-day infinite loop, and another leaked SSNs because a request said 'forward' instead of 'share.'

An accessible guide to AI safety and alignment, covering hallucinations, bias, misuse risks, and how major AI companies approach building safer systems.