Chatbot Arena Elo Rankings: Who Wins the Human Vote?

Explore the latest Chatbot Arena Elo rankings from LM Arena, where over 6 million human votes determine which AI models people actually prefer in blind comparisons.

Academic benchmarks measure what a model can do. Chatbot Arena measures what people actually prefer. Run by the LM Arena team (formerly LMSYS), Chatbot Arena is the largest crowdsourced evaluation platform for large language models, with over 6 million human votes and counting. Users are presented with two anonymous model responses side by side and simply pick the one they like better. The result is an Elo rating system that captures real-world human preference in a way no automated benchmark can replicate.

How Elo Ratings Work for LLMs

The Elo system was invented for chess in the 1960s, and its elegance lies in its simplicity. When two models face off in a blind comparison, the winner gains rating points and the loser drops points. The number of points exchanged depends on the expected outcome: if a top-rated model beats a low-rated one, few points change hands. If the underdog wins, the swing is large.

Over millions of comparisons, Elo ratings converge on a stable ranking that reflects genuine human preference. A 50-point gap means the higher-rated model wins roughly 57% of head-to-head matchups. A 100-point gap means it wins about 64% of the time.
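
To make the point exchange concrete, here is a minimal Python sketch of the classic Elo expected-score formula and update rule. The function names and the K-factor of 32 are illustrative choices, and the Arena's production pipeline fits ratings over the full battle history rather than updating them one game at a time, so treat this as an illustration of the rating logic rather than a reproduction of their code. The expected-score formula does reproduce the win probabilities quoted above.

```python
# A minimal sketch of the classic Elo expected-score formula and update rule.
# The K-factor of 32 is an illustrative choice, not LM Arena's setting.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head battle."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# The win probabilities quoted above:
print(round(expected_score(1400, 1350), 3))  # 50-point gap  -> 0.571
print(round(expected_score(1400, 1300), 3))  # 100-point gap -> 0.64

# An upset by the lower-rated model produces a large swing:
print(update(1400, 1300, a_won=False))  # roughly (1379.5, 1320.5)
```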

The beauty of this system is that it captures qualities that automated benchmarks miss: writing style, helpfulness, nuance, the ability to follow complex instructions, and that hard-to-define quality of a response simply feeling right.

The Current Elo Rankings

| Rank | Model                  | Provider        | Elo Score | Style Control Elo | 95% CI |
|------|------------------------|-----------------|-----------|-------------------|--------|
| 1    | GPT-5.2 Pro            | OpenAI          | 1402      | 1388              | +/- 4  |
| 2    | Claude Opus 4.6        | Anthropic       | 1398      | 1391              | +/- 4  |
| 3    | Gemini 3 Pro           | Google DeepMind | 1389      | 1382              | +/- 5  |
| 4    | GPT-5.2                | OpenAI          | 1380      | 1365              | +/- 4  |
| 5    | Grok 4 Heavy           | xAI             | 1375      | 1370              | +/- 6  |
| 6    | Claude Opus 4.5        | Anthropic       | 1370      | 1366              | +/- 4  |
| 7    | DeepSeek V3.2-Speciale | DeepSeek        | 1361      | 1355              | +/- 5  |
| 8    | Qwen 3.5               | Alibaba         | 1342      | 1335              | +/- 6  |
| 9    | Gemini 2.5 Flash       | Google DeepMind | 1335      | 1322              | +/- 5  |
| 10   | Llama 4 Maverick       | Meta            | 1320      | 1310              | +/- 7  |

Scores as of February 14, 2026. Style Control Elo adjusts for response length and formatting biases.
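
The published intervals come from LM Arena's own pipeline, but the underlying idea is standard: resample the battle history and see how much the estimate moves. The sketch below is a hypothetical illustration with made-up battle counts; it converts a bootstrap interval on head-to-head win rate into the implied spread in Elo points, which is how a +/- figure of a few points emerges from tens of thousands of votes.

```python
# Hypothetical sketch: turning a bootstrap interval on head-to-head win rate
# into an interval on the implied Elo gap. The battle counts below are made up,
# and the table's published intervals come from LM Arena's own pipeline.
import math
import random

random.seed(0)
outcomes = [1] * 5700 + [0] * 4300  # hypothetical: 10,000 battles, 57% wins

def elo_gap(win_rate: float) -> float:
    """Invert the Elo expectation: the rating gap implied by a win rate."""
    return 400 * math.log10(win_rate / (1 - win_rate))

def bootstrap_gap_ci(outcomes, n_resamples=500, alpha=0.05):
    n = len(outcomes)
    gaps = sorted(
        elo_gap(sum(random.choices(outcomes, k=n)) / n)
        for _ in range(n_resamples)
    )
    return gaps[int(alpha / 2 * n_resamples)], gaps[int((1 - alpha / 2) * n_resamples) - 1]

lo, hi = bootstrap_gap_ci(outcomes)
print(f"Implied Elo gap 95% CI: {lo:.0f} to {hi:.0f}")  # roughly 42 to 56, around ~49
```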

The Gap Between Elo and Academic Benchmarks

One of the most interesting findings from Chatbot Arena is how often human preference diverges from academic benchmark performance. DeepSeek V3.2-Speciale, for instance, posts some of the highest scores on coding and math benchmarks but ranks seventh in human preference. Why?

The answer lies in what people value in a conversational AI. Academic benchmarks test narrow capabilities: can the model solve this math problem, answer this science question, write code that passes these tests? Human preference captures something broader. Users care about tone, formatting, how well the model handles ambiguity, whether it asks clarifying questions, and whether the response feels genuinely helpful rather than just technically correct.

Models that produce slightly verbose but well-structured responses tend to score higher in Arena battles. This is why the "Style Control" Elo column matters. It adjusts for the known bias toward longer, more formatted responses, giving a cleaner signal of underlying quality.
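
LM Arena has described style control as adding style features, such as response length and formatting, as extra covariates in the rating regression, so their effect is separated from model strength. As a rough, hypothetical illustration of that idea (toy data, made-up model names, a single simplified length feature, not the Arena's actual code), the sketch below fits a Bradley-Terry-style logistic regression with a length-difference covariate:

```python
# Hypothetical illustration of style control: fit a Bradley-Terry-style
# logistic regression on pairwise battles, with a length-difference covariate
# so verbosity bias is absorbed by its own coefficient rather than inflating
# model strengths. LM Arena's actual features and fitting details differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

MODELS = ["model_a", "model_b", "model_c"]  # hypothetical model names

# Each battle: (index of left model, index of right model,
#               log-length ratio of left vs. right response, 1 if left won)
battles = [
    (0, 1, 0.4, 1), (0, 1, -0.2, 0), (1, 2, 0.5, 1),
    (0, 2, 0.1, 1), (1, 2, -0.3, 0), (0, 2, -0.4, 1),
]

X, y = [], []
for left, right, len_diff, left_won in battles:
    row = np.zeros(len(MODELS) + 1)
    row[left], row[right] = 1.0, -1.0   # strength-difference term
    row[-1] = len_diff                  # style covariate (response length)
    X.append(row)
    y.append(left_won)

clf = LogisticRegression(fit_intercept=False, C=10.0).fit(np.array(X), y)
strengths = clf.coef_[0][:len(MODELS)]       # style-adjusted strengths (log-odds)
elo_scale = strengths * 400 / np.log(10)     # convert log-odds to Elo-like points
print(dict(zip(MODELS, np.round(elo_scale, 1))))
print("length bias coefficient:", round(clf.coef_[0][-1], 2))
```

In this setup the length coefficient soaks up the advantage of longer answers, and the per-model coefficients, rescaled to Elo-like points, play the role of the Style Control column in the table above.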

Why Chatbot Arena Matters

Chatbot Arena is arguably the single most important evaluation for AI models, for several reasons.

It resists gaming. Because votes come from real users making genuine comparisons, labs cannot optimize specifically for Arena performance the way they can for static benchmarks. The sheer volume of votes (over 6 million) also makes it extremely difficult to manipulate through coordinated voting.

It reflects real usage. The prompts in Arena come from actual users with actual needs. They ask for creative writing, coding help, analysis, advice, and everything in between. This diversity of real-world queries gives a much more complete picture than any curated benchmark suite.

It updates continuously. While academic benchmarks are static snapshots, Arena ratings update in real time as new votes come in. When a new model launches, it can establish a reliable rating within days, not months.

It captures the full experience. Automated benchmarks score individual responses against reference answers. Arena captures the holistic experience of interacting with a model, including how it handles edge cases, misunderstandings, and follow-up questions.

Category-Specific Arena Rankings

The Arena also tracks performance across categories. Some notable splits in the current data:

  • Coding: Claude Opus 4.6 leads, followed by GPT-5.2 Pro and DeepSeek V3.2-Speciale
  • Creative Writing: GPT-5.2 Pro leads, with Claude Opus 4.6 close behind
  • Math and Reasoning: GPT-5.2 Pro and Grok 4 Heavy lead this category
  • Instruction Following: Gemini 3 Pro shows particular strength here
  • Multilingual: Qwen 3.5 and Gemini 3 Pro outperform others outside of English

These category breakdowns reveal that no single model dominates everything. The overall Elo rating blends many different interaction types, so the best model for your specific use case might not be the overall number one.

Looking Ahead

Chatbot Arena continues to grow in influence. As automated benchmarks become increasingly saturated, with multiple models scoring above 85% on MMLU-Pro, human preference ratings provide the clearest signal of meaningful differentiation. The Arena has become the benchmark that matters most to both users choosing a model and labs competing for market share.

Visit lmarena.ai to cast your own votes and contribute to these rankings.

About the author
James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.