
Agent Languages, Sampling Ceilings, and Abstention
Three new papers on agents inventing symbolic languages to cut reasoning tokens by 3-6x, sampling ceilings that waste inference compute, and context engineering to double agentic abstention rates.


Three new arXiv papers on making RL reasoning legible across models, fixing broken world model latent states, and training small agents to beat their teachers.

A 57-page DeepMind paper by co-founder Shane Legg identifies four pathways from AGI to superintelligence and six bottlenecks that could block each route.

Sakana AI's orchestrator model dynamically coordinates Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro to beat each of them individually on SWE-Bench Pro, GPQA-Diamond, and eight other benchmarks.

Sakana Fugu tops SWE-Bench Pro by routing tasks across rival LLMs, Microsoft's 9B browser agent beats OpenAI Operator, and a 3B model from Weibo matches DeepSeek V3.2 on math.

Three arXiv papers: a conscience mechanism for ethical training, shared memory for agent populations, and selective verification that cuts test-time compute waste.

Three new papers tackle what lives inside a trained model, how AI dependence erodes human cognition, and whether AI teams can calibrate trust.

A new impossibility theorem proves feedback-based training can't guarantee honest AI, while two papers cut agent memory costs by 78% and multi-agent latency by 7x.

Claude Opus 4.8 sets new highs on SWE-bench Pro and long-context tasks while a 4x improvement in code flaw detection may matter more than any benchmark number.

Kore.ai's Artemis platform brings a compiled blueprint language and governance-first architecture to enterprise multiagent AI - ambitious, but Azure-only for now.

Anthropic's May 2026 flagship model delivers 69.2% on SWE-bench Pro, dynamic parallel workflows in research preview, and Effort Control - all at $5/$25 pricing.

Claude Opus 4.8 launches with dynamic workflows for parallel subagent orchestration, hitting 69.2% on SWE-bench Pro and introducing granular effort controls at unchanged pricing.