
Google Launches Gemini 3.1 Pro, Claims Top Spot on 13 of 16 Benchmarks
Google releases Gemini 3.1 Pro, claiming sharply improved reasoning and the top score over Claude Opus 4.6 and GPT-5.2 on 13 of 16 benchmarks.

Three papers that matter this week: a brutal benchmark for AI research agents, a feature-space approach to training data diversity, and trace rewriting to stop model theft.

Rankings of the best open-source LLMs you can run on home hardware (RTX 4090, RTX 3090, Apple M3/M4 Max), organized by VRAM tier with real-world tokens-per-second benchmarks and quality scores.

A data-driven look at benchmark contamination, leaderboard gaming, and whether public AI benchmarks can still tell us anything useful about model capabilities.

Rankings of the best AI models for long-context tasks, measuring retrieval accuracy, reasoning, and comprehension across massive context windows from 128K to 10M tokens.

Comprehensive ranking of the top large language models in February 2026, combining multiple benchmarks including reasoning, coding, knowledge, and multimodal capabilities.

Anthropic's new mid-tier model matches Opus 4.6 on coding benchmarks, ships a million-token context window, and keeps the same $3/$15 pricing as its predecessor.

Explore the latest Chatbot Arena Elo rankings from LM Arena, where over 6 million human votes determine which AI models people actually prefer in blind comparisons.

A plain-English guide to AI benchmarks like MMLU, GPQA, SWE-Bench, and Chatbot Arena Elo, explaining what they measure and why no single score tells the whole story.

Rankings of the best AI models for coding tasks across SWE-Bench, Terminal-Bench, and LiveCodeBench benchmarks, measuring real-world software engineering and algorithmic problem-solving ability.

Rankings of AI models on the hardest reasoning benchmarks available: GPQA Diamond, AIME competition math, and the notoriously difficult Humanity's Last Exam.

Rankings of the best multimodal AI models for image understanding, video analysis, and visual reasoning, covering MMMU-Pro, Video-MMMU, and more.