
Chatbot Arena Elo Rankings: Who Wins the Human Vote?
Updated July 2026 Chatbot Arena Elo rankings from Arena.ai: 7M+ votes across 368 models, Claude Opus 4.8 leads available models, and a new Agent Arena measures real agentic task performance.


June 2026 overall LLM rankings covering Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and the open-weight models catching up fast.

Current rankings of the best AI image generation models, including GPT Image 2, Nano Banana 2, Recraft V4.1, HiDream-O1-Image, FLUX 2, Midjourney v8.1, and Ideogram 3.0, scored on human preference, text rendering, and photorealism.

Rankings of the best AI models and agent frameworks on the GAIA benchmark, which tests real-world multi-step tasks requiring web browsing, tool use, and multi-hop reasoning.

Rankings of AI models by cost efficiency in May 2026, comparing performance per dollar across frontier and budget models. Updated with DeepSeek V4, GPT-5.5, and Kimi K2.6.

April 2026 rankings of the top embedding models by MTEB score: Gemini Embedding 001, NV-Embed-v2, Qwen3-Embedding-8B, and the new Jina v4 multimodal release, compared for RAG and search.

Rankings of LLMs and dedicated MT systems across FLORES-200, WMT24/25, TICO-19, and MT-GenEval benchmarks with BLEU, COMET, and human evaluation scores.

Rankings of the best audio language models on MMAU, MMAU-Pro, and other benchmarks covering speech reasoning, music understanding, and environmental sound identification.

Comprehensive ranking of the top large language models in April 2026, combining reasoning, coding, knowledge, human preference, and cost-adjusted value across 12 frontier and open-weight models. Updated with Claude Opus 4.7 and Qwen 3.6.

Ranked benchmarks for AI music generation tools covering FAD, CLAP, MOS listening tests, and MusicCaps evaluation, spanning text-to-music, lyric-to-song, and stem remixing.

Rankings of the best LLMs on code completion benchmarks - HumanEval, LiveCodeBench, BigCodeBench, MBPP, and competitive programming - with methodology notes on contamination. Updated April 2026.

Rankings of AI models on creative writing quality benchmarks: EQ-Bench Creative Writing v3, Antislop evaluations, and human-preference judging. Which LLMs can actually write?