
MMLU-Pro Leaderboard: Graduate-Level Knowledge Rankings
Complete MMLU-Pro benchmark rankings measuring graduate-level knowledge across 14 subjects with 12,000 questions and 10 answer options per question.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Complete MMLU-Pro benchmark rankings measuring graduate-level knowledge across 14 subjects with 12,000 questions and 10 answer options per question.

Rankings of the best AI image generation models including GPT Image 1.5, Gemini 3 Pro, Midjourney v7, FLUX 2 Max, Stable Diffusion 3.5, and Ideogram 2.0 across text rendering, photorealism, and artistic quality.

DeepSeek releases V3.2 under MIT license with 671B MoE architecture, matching GPT-5 at one-tenth the cost and achieving gold-medal performance on IMO and IOI competitions.

Analysis of how the MMLU benchmark gap between open-source and proprietary AI narrowed from 17.5 to 0.3 percentage points in a single year, reshaping the industry landscape.

Anthropic's November 2025 flagship model delivers top SWE-bench scores, a new effort parameter for reasoning control, and a 66% price cut from its predecessor.

Anthropic's fastest and most cost-efficient model, delivering 73.3% on SWE-bench Verified and first-in-family extended thinking and computer use at $1/$5 per million tokens.

OpenAI's open-weight 21B MoE reasoning model with 131K context, Apache 2.0 license, and o3-mini-level benchmark performance running in 16 GB of memory.

Google DeepMind's flagship thinking model with 1M-token context, 84% GPQA Diamond, and native multimodal understanding of text, images, audio, and video.

OpenAI's maximum-compute reasoning model targets the hardest problems where o3 falls short, at $20/$80 per million tokens.

OpenAI's most advanced reasoning model, built for math, science, coding, and visual tasks, with 200K context and adaptive chain-of-thought at $2/$8 per million tokens.

OpenAI o4-mini is a fast, cost-efficient reasoning model in the o-series, delivering near-o3 performance on math and coding benchmarks at roughly 10x lower cost.

OpenAI's coding-optimized API model with a 1M token context window, 54.6% SWE-bench Verified score, and $2/$8 per million token pricing.