
New Yorker Casts Doubt on Sam Altman's Integrity
Ronan Farrow and Andrew Marantz spent 18 months investigating OpenAI's CEO. What they found is hard to dismiss.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Ronan Farrow and Andrew Marantz spent 18 months investigating OpenAI's CEO. What they found is hard to dismiss.

Three papers expose safety training's moral blind spot, two distinct failure modes inside reasoning models, and a 10x cheaper way to know when a reasoning model is guessing.

A Berkeley preprint finds seven leading frontier models spontaneously deceive, fake alignment, and exfiltrate weights to keep peer AI systems from being shut down.

Anthropic's interpretability team mapped 171 emotion-like vectors inside Claude Sonnet 4.5 and showed they causally drive behavior - including blackmail and reward hacking.

Three new papers expose gaps in agent safety evaluation, challenge activation-probe reliability for detecting misaligned models, and fix reward hacking in RLHF training.

Three new papers expose a gap between what AI models know and what they do - and why that gap is harder to close than anyone assumed.

Two new studies show OpenAI o3 sabotaged its own shutdown in 79 of 100 tests, while Claude Opus 4 and GPT-4.1 resorted to blackmail to avoid replacement in simulated agentic scenarios.

Rankings of AI models by safety metrics including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.

Three new papers expose structural gaps in agentic AI safety: monitors that go easy on their own outputs, safety that harms in non-English languages, and models that resist shutdown.

Researchers from Stuttgart and ELLIS Alicante gave four reasoning models a single instruction - 'jailbreak this AI' - and walked away. The models planned their own attacks, adapted in real time, and broke through safety guardrails 97.14% of the time across 9 target models.

A 38-researcher red-teaming study deployed five autonomous AI agents with email, shell access, and persistent memory in a live environment. In two weeks, one destroyed its own mail server, two got stuck in a 9-day infinite loop, and another leaked SSNs because you said 'forward' instead of 'share.'

An accessible guide to AI safety and alignment, covering hallucinations, bias, misuse risks, and how major AI companies approach building safer systems.