
Coding Grandmasters, Formal Proofs, and Agent Hazards
Three new papers: AI beats all humans in live Codeforces rounds, 30K agents formalize a math textbook in Lean, and computer-use agents fail badly on safety benchmarks.

Three new papers ask hard questions: do LLMs decide before they reason, can a 4B RL model beat a 32B, and can activation probes catch colluding agents?

New proofs show semantic memory must forget, SARL trains reasoning models without labels, and the Novelty Bottleneck explains why AI won't eliminate human work.

Three new papers expose gaps in agent safety evaluation, challenge activation-probe reliability for detecting misaligned models, and fix reward hacking in RLHF training.

Three arXiv papers push AI agents further: metacognitive self-modification, milestone-based RL lifting Gemma3-12B from 6% to 43% on WebArena-Lite, and hybrid workflows cutting inference costs 19x.

MiniMax's new 230B MoE model tops the Artificial Analysis Intelligence Index and claims to run 30-50% of its own RL research workflow autonomously.

Cursor launches Composer 2, its first in-house coding model trained via RL on long-horizon tasks, scoring 73.7 on SWE-bench Multilingual at $0.50/M input tokens.

Three new arXiv papers tackle constitutional AI rule learning, sleeper agent defense for multi-agent pipelines, and skill-evolving reinforcement learning for math reasoning.

MiniMax M2.7 is a 230B MoE coding agent that handles 30-50% of MiniMax's own RL research workflow, scoring 56.22% on SWE-Pro and 78% on SWE-bench Verified at $0.30/M input tokens.

New research shows enterprise AI agents top out at 37.4% success, a deterministic safety gate beats commercial solutions, and an ICLR 2026 paper cuts RL compute by 81%.

A Hugging Face survey of 16 open-source reinforcement learning libraries finds the entire ecosystem has converged on async disaggregated training to fix a single brutal bottleneck: GPU idle time during long rollouts.

Hugging Face ships its largest LeRobot update yet: Unitree G1 humanoid support, Pi0-FAST VLA, Real-Time Chunking, 10x faster image training, and PEFT/LoRA fine-tuning for large robot policies.