
When to Stop - Overthinking, Handoffs, and Abstention
Three new papers show that AI agents fail not by doing the wrong thing, but by doing things when they should have stopped.

Three new papers expose how reasoning traces can be extracted from supposedly hidden model internals, where chain-of-thought hits an architectural ceiling, and how RL teaches models to know when to quit.

Verified June 2026: real cost per million tokens for self-hosting Llama 4 Scout, Maverick, Qwen3-235B, and DeepSeek V3.2 - GPU requirements, cost formulas, and when cheap APIs actually win.

Three papers: smarter CoT trimming cuts reasoning length by 50%, a plug-in context manager rescues frozen agents on long tasks, and a 960K-item clinical benchmark exposes LLM gaps in hospitals.

Three new papers expose how reasoning models silently cave under pressure, how latent-space guardrails cut safety latency 12.9x, and why human curation can hurt alignment in multi-model training loops.

Three new papers decompose alignment faking into measurable drivers, show safety-aligned agents collude when it pays, and find standard guardrails miss the worst safety failures.

Alibaba's agent-first flagship model with a 1M-token context window, topping Terminal-Bench 2.0 and SWE-Bench Pro at roughly one-sixth the cost of Claude Opus 4.7.

Cursor's Composer 2.5 scores within one point of Claude Opus 4.7 on SWE-Bench Multilingual at $0.50 per million tokens - a tenth of Anthropic's price - but the training disclosures deserve scrutiny.

Q2 2026 AI API pricing review: DeepSeek V4 hits the API, GPT-5.5 launches at $5/1M, and overall token costs are down 60-80% year-over-year - but a hidden tokenizer change at Anthropic quietly raised effective prices.

The state of open-source large language models in 2026 - who leads, how close they are to proprietary models, which licenses allow commercial use, and how to access them.

The best LLM APIs under $1 per million input tokens in 2026 - comparing Gemini Flash, DeepSeek V4 Flash, GPT-4.1 Nano, Mistral Small, Qwen3, and Claude Haiku on price and quality.

A practical comparison of every production LLM with a 1M+ token context window - verified pricing, real retrieval notes, and clear picks for different workloads.