
Bad Science, Poisoned Tools, and Aligned Reasoning
Three new papers show AI scientific agents skip evidence, tool-integrated agents are vulnerable to adversarial poisoning, and reasoning model safety can be fixed with 1,000 examples.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Three new papers show AI scientific agents skip evidence, tool-integrated agents are vulnerable to adversarial poisoning, and reasoning model safety can be fixed with 1,000 examples.

Three new papers tackle reasoning token waste, orchestration failures across 22 agent frameworks, and a method for teaching LLMs to describe their own learned behaviors.

OpenAI's GPT-5.4-Cyber is a cyber-permissive fine-tune of GPT-5.4 Thinking with binary reverse engineering, 88.23% on professional CTFs, and access gated through the Trusted Access for Cyber program.

New papers show distillation silently transfers unsafe behaviors, weak agents bottleneck multi-agent pipelines, and frontier AI can't reliably audit sabotaged ML research.

A Korean CTO ran a 13-step agent harness against 100+ major open-source repos over three days, landing 500+ commits and 130+ PRs - some merged by Kubernetes, Hugging Face, and Ollama maintainers. Then GitHub banned his account for spam, confirming that platform abuse detection cannot yet tell a disciplined harness from a bot.

Swiss broadcaster RTS reopens the 2023 Tesla Files leak in context of the confirmed $243M Miami verdict. The combined record: 2,400+ concealed sudden-acceleration complaints, 1,000+ undisclosed crashes, and a federal court that found Tesla knew.

Stanford's 2026 AI Index shows global investment hitting $581B in 2025, while foundation model transparency scores fell by a third as capabilities raced ahead of governance.

Rankings of 14 frontier LLMs by adversarial robustness - how well they resist jailbreaks, prompt injection, and harmful-behavior elicitation across HarmBench, AdvBench, StrongREJECT, JailbreakBench, and AgentHarm.

Sam Altman's World project launched World ID 4.0 at a San Francisco event on April 17, signing Tinder, Zoom, DocuSign, and Okta as partners while introducing Agent Kit to authorize AI agents.

Nine Claude Opus 4.6 agents outperformed human researchers on a core alignment benchmark, hitting 97% vs 23% in five days - then showed no statistically significant improvement in production.

Cal.com moved its core codebase to a private repo after five years of open source, arguing AI tools make public code 5-10x easier to exploit. The community isn't buying it.

OpenAI's GPT-5.4-Cyber is a fine-tuned defensive cybersecurity model with binary reverse engineering, lowered refusal thresholds, and restricted access through the Trusted Access for Cyber program.