
Claude Opus 4.8 Review: Reliability Over Raw Scores
Claude Opus 4.8 sets new highs on SWE-bench Pro and long-context tasks, but a 4x improvement in code flaw detection may matter more than any benchmark number.

Three new papers expose how reasoning traces can be extracted from supposedly hidden model internals, where chain-of-thought hits an architectural ceiling, and how RL teaches models to know when to quit.

Three papers: smarter CoT trimming cuts reasoning length by 50%, a plug-in context manager rescues frozen agents on long tasks, and a 960K-item clinical benchmark exposes LLM gaps in hospitals.

Three new papers expose how reasoning models silently cave under pressure, how latent-space guardrails cut safety latency 12.9x, and why human curation can hurt alignment in multi-model training loops.

Anthropic's May 2026 flagship model delivers 69.2% on SWE-bench Pro, dynamic parallel workflows in research preview, and Effort Control - all at $5/$25 pricing.

Three new papers cover 4x KV cache savings for tree reasoning, latent-space jailbreaks that bypass safety on 15 models, and GPT-5.4's 40% ceiling on drug design tasks.

An internal OpenAI reasoning model produced an original proof disproving the Erdős unit distance conjecture, the first time AI autonomously solved a major open problem in mathematics.

New research pinpoints the 8% of tokens driving reasoning failures, exposes memory laundering in agent systems, and cuts web agent inference costs 1.9x.

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

A 30B model earns IMO gold, memory consolidation silently corrupts agents, and a new metric predicts when LLMs lose track of their instructions.

New research shows reasoning length amplifies position bias, behavior cues cut wasted tokens by 50% while boosting safety, and sparse autoencoders can predict tool failures from model internals.

Updated May 2026: DeepSeek V4-Flash reasoning now $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, Grok 4 retires May 15 - verified pricing across 11 models.