
Faster Agents, Skewed Evals, and Brand Bias in LLMs
Three new papers: agents that compile runs into 8-13x faster state machines, benchmark scores that shift with compute budget, and big brands monopolizing LLM recommendations.

Mistral's 128B open-weight model consolidates reasoning, coding, and vision into one checkpoint, with remote agents that file pull requests autonomously.

GPT-5.1 is OpenAI's November 2025 coding and agentic flagship with 400K context, configurable reasoning effort, and 76.3% on SWE-bench Verified.

Alibaba's first multimodal agent model, combining GUI grounding (ScreenSpot Pro 79.0), 1M-token context, and text-plus-vision input at $0.40/M tokens.

Meta launched AI Mode on Facebook, turning billions of public posts into AI-synthesized search answers with no opt-out for the 2 billion daily users whose content powers it.

June 2026 overall LLM rankings covering Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and the open-weight models catching up fast.

Gemini 2.5 Flash still leads LIT-RAGBench English RAG accuracy at 87.0%, but the full benchmark data reveals two overlooked entries: GPT-4.1-mini at 84.1% and o4-mini at 83.9%.

Google DeepMind open-sources DiffusionGemma, a 26B MoE model that generates 256 tokens per denoising pass instead of one at a time, reaching 1,100 tokens per second on a single H100.

Microsoft's first in-house coding model, a 137B sparse MoE built natively for GitHub Copilot, beating Claude Haiku 4.5 on SWE-Bench Pro by 16 points.

New open-source inference engine for Apple Silicon benchmarks up to 2.6x faster than Ollama, supports 66 model aliases, and drops in as an OpenAI-compatible server on any Mac.

Gemini 2.5 Flash Lite still leads the Vectara hallucination leaderboard at 3.3%, while two new entries, Gemini 3.5 Flash and Mistral Large 3 at $0.50/M, have shifted the value picture considerably since March.

Meta's Llama 3.3 70B Instruct matches Llama 3.1 405B on instruction following and math while running at 4-5x lower cost, with the lowest hallucination rate of any open-weight model on the Vectara summarization leaderboard.