
Faster Agents, Skewed Evals, and Brand Bias in LLMs
Three new papers: agents that compile runs into 8-13x faster state machines, benchmark scores that shift with compute budget, and big brands monopolizing LLM recommendations.

AMD's CDNA 5 accelerator on TSMC 2nm with 432 GB HBM4 memory - the GPU behind OpenAI's 1GW deployment and Oracle's 50,000-chip supercluster.

Google DeepMind open-sources DiffusionGemma, a 26B MoE model that generates 256 tokens per denoising pass instead of one at a time, reaching 1,100 tokens per second on a single H100.

DiffusionGemma 26B is Google DeepMind's open-weight discrete diffusion language model that generates 256 tokens in parallel, reaching 1,100+ tokens/sec on H100 - roughly 4x faster than autoregressive models of the same size.

Mistral AI's mid-tier open-weight edge model - 8B parameters, 256K context, Apache 2.0 license, built for agentic pipelines and cost-sensitive production workloads.

MiniMax M3 uses sparse attention to cut long-context inference cost 20x, topping GPT-5.5 on coding benchmarks at a fraction of the price.

NVIDIA's 550B Nemotron 3 Ultra, released June 4, tops the US open-weight leaderboard with a hybrid Mamba-Transformer MoE architecture and 300-plus tokens per second throughput.

NVIDIA's Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore GPU inference containers in seconds, cutting Kubernetes cold-start times by up to 21x for large models.

New open-source inference engine for Apple Silicon benchmarks up to 2.6x faster than Ollama, supports 66 model aliases, and drops in as an OpenAI-compatible server on any Mac.

Meta's Llama 3.3 70B Instruct matches Llama 3.1 405B on instruction following and math while running at 4-5x lower cost, with the lowest hallucination rate of any open-weight model on the Vectara summarization leaderboard.

Intel's Crescent Island inference GPU trades HBM bandwidth for 480GB of LPDDR5X, targeting customers locked out of NVIDIA's supply chain.

Verified June 2026: real cost per million tokens for self-hosting Llama 4 Scout, Maverick, Qwen3-235B, and DeepSeek V3.2 - GPU requirements, cost formulas, and when cheap APIs actually win.