Chatbot Arena Elo Rankings: Who Wins the Human Vote?

Updated July 2026 Chatbot Arena Elo rankings from Arena.ai: 7M+ votes across 368 models, Claude Opus 4.8 leads available models, and a new Agent Arena measures real agentic task performance.

Academic benchmarks measure what a model can do. Chatbot Arena measures what people actually prefer. Run by Arena (formerly LMSYS, rebranded January 2026), Chatbot Arena is the largest crowdsourced evaluation platform for large language models, with over 7 million human votes and 368 models ranked as of late June 2026. Users see two anonymous model responses side by side and pick the one they like better. The result is an Elo rating that captures real-world human preference in a way no automated benchmark can reproduce.

TL;DR

  • #1 on the text leaderboard is Claude Fable 5 at 1508 Elo - and it has been offline since June 12 due to a US export control order
  • Effective #1 for available models: Claude Opus 4.8 at ~1512 Elo, also leading the Coding Arena at ~1582
  • Biggest story beyond rankings: Arena hit $100M ARR in June 2026, and the new Agent Arena (launched June 4) now measures real agentic task completion - not just response preference

How Elo Ratings Work Here

The Elo system was invented for chess in the 1960s. Arena adapts it through a Bradley-Terry statistical model that computes pairwise win probabilities across millions of comparisons rather than updating incrementally after each match. This produces more stable estimates and corrects for ordering bias (left-side preference) and organizational bias (users tending to favor certain labs). Confidence intervals are bootstrapped across 1,000 data permutations, so you can see how tight or preliminary a rating is.
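To make the batch-fit idea concrete, here is a minimal sketch of a Bradley-Terry fit over a pile of pairwise results. It is not Arena's pipeline: the function names, the toy battle counts, and the 1000-point base are all assumptions for illustration, and it leaves out the bias corrections and the bootstrap step described above. It also assumes every model has at least one win.

```python
import math
from collections import defaultdict

def fit_bradley_terry(battles, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs using the
    classic minorization-maximization (MM) update."""
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # battles per unordered pair of models
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Fix the arbitrary scale: geometric mean of strengths is 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength

def to_elo_scale(strength, base=1000.0):
    """Map strengths onto an Elo-like scale: 400 points per factor of 10."""
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}

# Toy data (hypothetical): three models, 100 battles per pair.
battles = (
    [("A", "B")] * 60 + [("B", "A")] * 40
    + [("B", "C")] * 55 + [("C", "B")] * 45
    + [("A", "C")] * 70 + [("C", "A")] * 30
)
ratings = to_elo_scale(fit_bradley_terry(battles))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Because all battles are fitted at once, the order in which votes arrived does not affect the result. A bootstrap interval would repeat this fit on resampled copies of `battles` and read off percentiles of each model's rating.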

A 50-point Elo gap means the higher-rated model wins roughly 57% of head-to-head matchups. A 100-point gap means roughly 64%. At the frontier tier, where top models cluster within 25-55 points of each other, these gaps are functionally small - you're talking about a 54-58% win rate, not a decisive edge.
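If you want to translate any Elo gap into an expected win rate yourself, the standard Elo logistic formula is all you need. The function name below is our own; the formula is the usual 400-point, base-10 Elo curve, which we assume applies to Arena-style ratings as well.

```python
def elo_win_probability(gap: float) -> float:
    """Expected win rate of the higher-rated model for a given Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

for gap in (25, 50, 100):
    print(f"{gap:>3}-point gap -> {elo_win_probability(gap):.1%} expected win rate")
```

A gap of zero gives exactly 50%, and the curve is fairly flat near the top of the leaderboard, which is why small differences between frontier models rarely translate into a noticeable edge.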

Style Control Elo strips out the known bias toward longer, more formatted responses to give a cleaner signal of underlying quality. When a model's Style Control Elo is noticeably lower than its raw Elo, that gap is largely explanation and formatting padding rather than genuine capability.

The Current Text Arena Rankings

Scores from arena.ai/leaderboard/text/ as of June 25, 2026 - 7,060,640 total votes, 368 models.

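| Rank | Model | Provider | Overall Elo | Style Control Elo | 95% CI |
|---|---|---|---|---|---|
| 1** | Claude Fable 5 | Anthropic | 1508 | N/A | ±9 |
| 2 | Claude Opus 4.8 | Anthropic | ~1512 | ~1497 | ±4 |
| 3 | GPT-5.5 Pro | OpenAI | ~1510 | ~1494 | ±5 |
| 4 | Gemini 3.1 Pro | Google | ~1505 | ~1488 | ±4 |
| 5 | Claude Opus 4.7 (thinking) | Anthropic | 1502 | ~1488 | ±4 |
| 6 | Claude Opus 4.6 (thinking) | Anthropic | 1503 | ~1489 | ±3 |
| 7 | Qwen 3.7 Max | Alibaba | ~1470 | ~1454 | ±6 |
| 8 | DeepSeek V4 Pro | DeepSeek | ~1462 | ~1447 | ±5 |
| 9 | Kimi K2.6 | Moonshot AI | ~1456 | ~1440 | ±7 |
| 10 | Grok 4.20 | xAI | ~1455 | ~1439 | ±8 |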

**Claude Fable 5 has been offline globally since June 12, 2026, due to a US Commerce Department export control directive. It built up only 4,366 Arena votes before the shutdown, making its 1508 Elo preliminary (±9, the widest CI in the top tier). The rating is real but carries more uncertainty than a model with 40,000+ votes.
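A rough way to see why vote count matters: sampling error shrinks with the square root of the sample size, so an interval should widen by about sqrt(n_ref / n) when a model has fewer votes. The sketch below applies that rule of thumb. The reference point (a ±3 interval at 40,000 votes) is an assumption chosen for illustration, not a figure Arena publishes per model, and the real bootstrapped intervals also depend on which opponents a model faced.

```python
import math

def scaled_ci_half_width(reference_half_width, reference_votes, votes):
    """Scale a confidence-interval half-width by the 1/sqrt(n) rule of thumb."""
    return reference_half_width * math.sqrt(reference_votes / votes)

# Assumed reference: a well-sampled model with 40,000 votes and a ±3 interval.
estimate = scaled_ci_half_width(3.0, 40_000, 4_366)
print(f"Expected half-width at 4,366 votes: ±{estimate:.1f}")
```

Under those assumptions the estimate lands close to the ±9 reported for Claude Fable 5, which suggests the wide interval is mostly a sample-size effect.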

All scores marked ~ are derived from multiple third-party tracking sources and carry slightly more uncertainty than the directly observed arena.ai scores. The official leaderboard refreshes weekly.

Figure: The Arena.ai comparison interface, showing two anonymous AI responses side by side for human voting. Users vote on the two responses before model identities are revealed. Over 7 million such comparisons power the rankings. Source: agileleadershipdayindia.org

Key Takeaways

Claude dominates both ends of the table

Four of the top six spots on the text leaderboard are held by Anthropic models, which is an unusual level of concentration. Claude Opus 4.8, 4.7 (thinking), 4.6 (thinking), and their base variants collectively account for over 180,000 Arena votes - a larger sample than any other single lab. This reflects both genuine quality and the high volume of Claude users on Arena's platform.

For available models, Claude Opus 4.8 at ~1512 Elo is the clear choice if you want the Arena-validated top performer. At $5/$25 per million tokens it is not cheap, but no other accessible model beats it on the crowdsourced preference signal.

The thinking variants of Claude Opus 4.6 and 4.7 score nearly as high as the base Claude Opus 4.8, trailing it by roughly 10 points (1502-1503 versus ~1512). If you already have API access to either of those models with extended thinking enabled, the gap is small enough that you might not notice it in practice.

The open-weight tier has arrived at the frontier

Qwen 3.7 Max at ~1470 Elo is the highest-ranking open-weight model on the text Arena as of this update. Kimi K2.6 from Moonshot AI follows at ~1456 Elo. That's within 40-55 points of GPT-5.5 Pro - a meaningful gap, but one that has closed by roughly 80 points since early 2026 when the open-weight tier sat at ~1375.

DeepSeek V4 Pro (~1462 Elo) is the strongest self-hostable option for most use cases. Its review notes real-world coding and reasoning quality that matches its Arena position.

The open-weight tier closed 80 Elo points on the closed-source frontier in six months. That pace hasn't slowed down.

Style Control reveals hidden padding

Looking at the Style Control Elo column, GPT-5.5 Pro shows a raw-to-controlled gap of ~16 points, which suggests its Arena wins are partially driven by well-formatted, structured responses rather than pure accuracy. By the table's figures, Gemini 3.1 Pro's gap is about the same (~17 points), while the Claude thinking variants show slightly smaller gaps (~14 points), suggesting their human preference signal is a little less dependent on response length and formatting. Most of these scores are approximate, so differences of a point or two should not be over-read.

This isn't a knock on GPT-5.5 Pro - formatting and structure matter in real conversations. But if you're choosing a model for tasks where raw accuracy matters more than presentation, the Style Control column is the more honest signal.
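The raw-to-controlled gap is simple subtraction, so it is easy to check for any model you care about. The snippet below uses the overall and Style Control scores from the rankings table in this article; most of those are approximate (~) values, so treat one- or two-point differences between gaps as noise.

```python
# (overall Elo, Style Control Elo) copied from the text leaderboard table.
scores = {
    "Claude Opus 4.8": (1512, 1497),
    "GPT-5.5 Pro": (1510, 1494),
    "Gemini 3.1 Pro": (1505, 1488),
    "Claude Opus 4.7 (thinking)": (1502, 1488),
    "Claude Opus 4.6 (thinking)": (1503, 1489),
}

# Points of raw Elo that disappear once style is controlled for.
style_gap = {model: raw - controlled for model, (raw, controlled) in scores.items()}

for model, gap in sorted(style_gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<28} {gap:>3} points")
```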

Category-Specific Arena Rankings

The text Arena now breaks down results across eight categories: Coding, Math, Creative Writing, Instruction Following, Multi-Turn, Hard Prompts, Expert, and Occupational. The overall Elo is an average across these, and the spreads are meaningful.

Coding: Claude Opus 4.7 (thinking) leads the Coding Arena at around 1569 Elo. Claude Opus 4.8 is listed higher, at ~1582, but that figure comes from the swfte.com composite, so the two numbers may not be directly comparable. Both are well above GPT-5.5 Pro, which ranks lower specifically on code generation tasks despite its strong overall position.

Math: GPT-5.4 (high) leads the Math Arena at around 1515 Elo, edging both Claude and Gemini on structured reasoning tasks. This aligns with OpenAI's strength in mathematical benchmarks dating back to the GPT-4o series.

Creative Writing: GPT-5.5 Pro leads here, which partly explains its high overall Elo despite the Style Control penalty. Humans tend to strongly prefer GPT-5.5 Pro's prose style.

Science and Expert tasks: Gemini 3.1 Pro leads these categories. Its 94.3% GPQA Diamond score - the highest of any currently available model - maps directly to Arena performance on questions that require graduate-level domain reasoning.

Multilingual: Qwen 3.7 Max and Gemini 3.1 Pro beat the Claude and GPT-5.5 models outside of English, especially in Chinese, Japanese, Arabic, and Spanish. For multilingual applications, the overall English-dominant Elo isn't the right signal to use.

Agent Arena: A New Kind of Evaluation

On June 4, 2026, Arena launched a qualitatively different leaderboard: Agent Arena. Rather than pairwise preference votes, it uses causal tracing to measure whether models actually complete agentic tasks in real sessions.

The methodology treats each model as a multi-component system and randomizes tool assignments to estimate causal treatment effects - basically a multi-intervention randomized controlled trial. Signals measured include confirmed task completion (explicit user approval buttons), praise-to-complaint ratio (verbal feedback analysis), steerability (whether the model follows corrections), bash error recovery speed, and tool hallucination rate.
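Arena's exact estimator isn't spelled out here, so the following is only a generic sketch of what a randomized comparison buys you: when sessions are assigned at random, a plain difference in completion rates can be read as a causal effect. The function name, the session counts, and the normal-approximation interval are all assumptions for illustration, and this is not a reconstruction of the "Net Improvement" metric.

```python
import math

def completion_rate_lift(treated_successes, treated_sessions,
                         control_successes, control_sessions, z=1.96):
    """Difference in task-completion rates between two randomly assigned
    groups, with a normal-approximation 95% half-width."""
    p_treated = treated_successes / treated_sessions
    p_control = control_successes / control_sessions
    lift = p_treated - p_control
    standard_error = math.sqrt(
        p_treated * (1 - p_treated) / treated_sessions
        + p_control * (1 - p_control) / control_sessions
    )
    return lift, z * standard_error

# Hypothetical counts: 10,000 sessions per arm.
lift, half_width = completion_rate_lift(6_200, 10_000, 5_500, 10_000)
print(f"lift = {lift:.2%} ± {half_width:.2%}")
```

The randomization is what matters: without it, users who pick a given model might simply bring easier tasks, and the difference would say nothing about the model itself.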

As of June 29, 2026 - less than four weeks after launch - the Agent Arena had accumulated 1,004,092 sessions across 28 models. The top 10:

| Rank | Model | Provider | Net Improvement | Sessions |
|---|---|---|---|---|
| 1** | Claude Fable 5 (High) | Anthropic | 13.34% ±1.55% | 16,082 |
| 2 | Claude Opus 4.8 (Thinking) | Anthropic | 9.37% ±1.29% | 30,511 |
| 3 | GPT-5.5 (xHigh) | OpenAI | 8.21% ±1.02% | 24,393 |
| 4 | Claude Opus 4.7 | Anthropic | 8.16% ±1.28% | 31,725 |
| 5 | Claude Opus 4.7 (Thinking) | Anthropic | 8.07% ±1.23% | 31,304 |
| 6 | GPT-5.5 (High) | OpenAI | 7.13% ±0.78% | 49,559 |
| 7 | GLM 5.2 (Max) | Z.ai/MIT | 6.93% ±1.40% | 21,946 |
| 8 | GPT-5.4 (High) | OpenAI | 6.65% ±0.79% | 49,486 |
| 9 | Claude Opus 4.6 | Anthropic | 6.47% ±1.21% | 31,155 |
| 10 | GPT-5.5 | OpenAI | 6.22% ±0.77% | 49,883 |

Claude Fable 5 is excluded from practical recommendations since it remains offline. With that caveat, Claude Opus 4.8 (Thinking) at 9.37% net improvement leads all available models, by a larger margin than anything the text preference Arena shows. The gap between it and the next available model, GPT-5.5 (xHigh) at #3 overall, is 1.16 percentage points. The two reported intervals (±1.29% and ±1.02%) overlap, so that lead is suggestive rather than conclusive.
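A quick sanity check on how much weight a gap like this can bear: compare the difference between two scores with the combined width of their reported intervals. The sketch below assumes the ± figures are symmetric intervals at the same confidence level and that the two estimates are independent, neither of which the leaderboard confirms. The helper name is our own.

```python
import math

def gap_versus_uncertainty(mean_a, half_width_a, mean_b, half_width_b):
    """Return the gap between two scores and the combined half-width of
    their intervals (root-sum-of-squares, assuming independence)."""
    gap = mean_a - mean_b
    combined = math.hypot(half_width_a, half_width_b)
    return gap, combined

# Net improvement figures (percentage points) from the Agent Arena table.
pairs = {
    "Opus 4.8 (Thinking) vs GPT-5.5 (xHigh)": (9.37, 1.29, 8.21, 1.02),
    "Fable 5 (High) vs Opus 4.8 (Thinking)": (13.34, 1.55, 9.37, 1.29),
}
results = {name: gap_versus_uncertainty(*values) for name, values in pairs.items()}
for name, (gap, combined) in results.items():
    verdict = "clear" if gap > combined else "within noise"
    print(f"{name}: gap {gap:.2f}, combined ±{combined:.2f} -> {verdict}")
```

When the gap is smaller than the combined half-width, the ordering of the two models could plausibly flip with more data.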

GLM 5.2 (Max) from Z.ai/MIT at #7 is the surprise of the Agent Arena launch. It isn't in the top 10 on text preference Elo, but on actual agentic task completion it beats all non-Anthropic, non-OpenAI models by a significant margin. This is the kind of divergence that makes having multiple evaluation frameworks worth the effort.

Why Arena Matters More in 2026

Four things have changed since Arena launched the text leaderboard in 2023:

Scale. 7 million votes is no longer a theoretical defense against gaming - it's a practical one. Any coordinated effort to shift a model's ranking by even 2-3 Elo points would require tens of thousands of coordinated, natural-looking votes on varied prompts. The cost of doing this is now meaningfully higher than the cost of just building a better model.

Category granularity. The eight-category breakdown means "ranked #3 overall" is less interesting than "ranked #1 for coding and #6 for creative writing." The June 2026 overall rankings cover capability benchmarks; Arena covers the human-preference dimension of those same capabilities.

Agentic coverage. Agent Arena fills the gap that pure text evaluation leaves open. A model that writes a beautiful response explaining how it'd fix a bug isn't the same as a model that actually fixes it in a multi-step agentic workflow. The causal tracing methodology is a more honest measurement of practical AI usefulness than any preference vote.

Commercial scale. Arena hit $100M ARR eight months after launching paid evaluations in September 2025, making it a durable institution rather than an academic project. Labs now treat Arena placement as a commercial signal, which increases both the quality of participation and the scrutiny of its methodology.

Practical Guidance

For general-purpose chat and writing

Claude Opus 4.8 at ~1512 Elo is the highest-ranked available model on the overall text leaderboard. For creative writing specifically, GPT-5.5 Pro beats it in human preference, but the gap is small and context-dependent.

For coding

The coding-specific Arena puts Claude thinking variants at the top. If you are using a coding agent or IDE extension built on API calls, the coding Arena is the signal to follow - not the overall Elo. Our code completion leaderboard covers SWE-bench and other capability metrics that complement Arena's preference signal.

For science and reasoning

Gemini 3.1 Pro leads both the science/expert Arena categories and GPQA Diamond among available models. At $2/$12 per million tokens it's also the most cost-efficient frontier option for this use case.

For agentic workflows

Agent Arena gives you a cleaner signal here than text preference Elo. Claude Opus 4.8 (Thinking) leads available models with a 9.37% net improvement score. If you are launching multi-step agents that call tools, write code, and recover from errors, this is the leaderboard to consult.

For multilingual or budget-conscious use

Qwen 3.7 Max at ~1470 Elo under Apache 2.0 is the best open-weight option for multilingual applications. Alongside Gemini 3.1 Pro, it beats the Claude and GPT-5.5 models outside English on the multilingual Arena category, and it can be self-hosted at zero API cost.


FAQ

What is the Arena Elo leaderboard?

Arena is a crowdsourced AI evaluation platform where users vote in blind pairwise comparisons of model responses. Over 7 million votes across 368 models drive an Elo-style ranking using the Bradley-Terry statistical model. It measures human preference, not automated accuracy.

Which AI model ranks #1 on Chatbot Arena in July 2026?

Claude Fable 5 at 1508 Elo holds the #1 position but is offline globally due to a US export control order issued June 12, 2026. Claude Opus 4.8 (~1512 Elo) is the top available model. Claude Opus 4.7 (thinking) and GPT-5.5 Pro follow closely.

How is Style Control Elo different from regular Elo?

Style Control Elo adjusts for response length and formatting biases - specifically the tendency of human raters to prefer longer, more structured responses regardless of accuracy. A model with a large gap between raw Elo and Style Control Elo is winning partly on formatting, not just substance.

What is Agent Arena and how does it differ?

Agent Arena (launched June 4, 2026) evaluates real agentic task completion rather than single-response preference. It uses causal tracing across 1M+ agent sessions to measure task success rates, tool reliability, error recovery, and steerability. Claude Opus 4.8 (Thinking) leads available models with a 9.37% net improvement score.

Why do category-specific Arena rankings differ from overall?

The overall Elo averages across eight categories (coding, math, creative writing, instruction following, etc.). A model that excels in one area gets pulled toward the average. Gemini 3.1 Pro leads science but ranks #4 overall. GPT-5.5 Pro leads creative writing but lags on coding. Always check the category-specific leaderboard for your use case.

How often does the Arena leaderboard update?

The leaderboard refreshes weekly on the official site at arena.ai/leaderboard. Major model additions can shift rankings within days once enough votes build up. Models with fewer than ~5,000 votes are marked "Preliminary" and carry wide confidence intervals.


✓ Last verified July 1, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.