
Tandem Training, World Models, and Efficient Agents
Three new arXiv papers on making RL reasoning legible across models, fixing broken world model latent states, and training small agents to beat their teachers.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Three new arXiv papers on making RL reasoning legible across models, fixing broken world model latent states, and training small agents to beat their teachers.

General Intuition raised $320 million at a $2.3 billion valuation on the bet that billions of hours of video game footage - with action labels - can train AI agents that operate in the real world.

Three new arXiv papers reveal hidden costs in quantized reasoning models, single-token failure triggers, and a new framework that cuts agent memory errors by up to 79%.

Anthropic's Claude Tag replaces the old Slack app with a shared @Claude identity that builds channel memory, works asynchronously, and runs on Claude Opus 4.8.

Three papers from today's arXiv: a 32B medical model beats DeepSeek-R1 in rare disease diagnosis, a KV cache method keeps 97% accuracy with 3% memory, and a new benchmark red-teams agentic AI systems.

Sakana AI's orchestrator model that dynamically coordinates Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro to beat each of them individually on SWE-Bench Pro, GPQA-Diamond, and eight other benchmarks.

Sakana Fugu tops SWE-Bench Pro by routing tasks across rival LLMs, Microsoft's 9B browser agent beats OpenAI Operator, and a 3B model from Weibo matches DeepSeek V3.2 on math.

Microsoft Research's family of open-weight browser computer use agents (4B, 9B, 27B) that beat OpenAI Operator and Gemini 2.5 Computer Use on Online-Mind2Web.

Mistral's first open-weight text-to-speech model: 4B parameters, 70ms latency, voice cloning from 3 seconds of audio, and a 68.4% win rate over ElevenLabs Flash v2.5 in blind tests.

Three new papers: agents that compile runs into 8-13x faster state machines, benchmark scores that shift with compute budget, and big brands monopolizing LLM recommendations.

Three new papers tackle what lives inside a trained model, how AI dependence erodes human cognition, and whether AI teams can calibrate trust.

Alibaba launches three open-weight models for robot manipulation, navigation, and world prediction, built on a shared Qwen3.5 backbone with open weights.