News

Grok 4.20 Debuts at Number One on Search Arena, Cracks Top Four in Text

xAI's Grok 4.20 beta1 takes the top spot on LMArena's Search Arena with an ELO of 1226, beating GPT-5.2 and Gemini 3. It also lands fourth on the Text Arena at 1492, within striking distance of Claude Opus 4.6.

Grok 4.20 beta1, the multi-agent system xAI launched a week ago, just climbed to the top of LMArena's Search Arena leaderboard with an ELO of 1226 - ahead of GPT-5.2 Search (1219) and both Gemini 3 variants. On the Text Arena, it sits at fourth place with 1492, behind Claude Opus 4.6 (1504) and Gemini 3.1 Pro (1500) but ahead of GPT-5.2.

The result is notable because it marks the first time a multi-agent architecture - not a single monolithic model - has topped a major public leaderboard.

TL;DR

  • Grok 4.20 beta1 is #1 on Search Arena with an ELO of 1226, leading GPT-5.2 Search (1219) and Gemini 3 Flash Grounding (1217)
  • On Text Arena, it ranks #4 at 1492, on par with Gemini 3.1 Pro (1500) within error margins
  • The model has 3,992 search votes and 3,695 text votes so far - significantly fewer than the 20K-40K votes on established models, making the ranking preliminary
  • This is the third time xAI has held #1 on Search Arena, after Grok 4 and Grok 4.1 previously topped the leaderboard

The Search Arena Numbers

Search Arena is LMArena's dedicated leaderboard for search-augmented LLMs - models that pull live information from the web before answering. Users submit queries, get side-by-side responses from anonymous models, and vote on which answer is better. The votes are processed through Bradley-Terry regression to produce ELO scores. As of February 25, the leaderboard has accumulated 246,076 votes across 20 models.
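The exact fitting pipeline LMArena runs is not described in this article, but the core idea of Bradley-Terry regression can be sketched with the classic minorization-maximization update on toy pairwise vote counts (the model names are real, the vote numbers below are purely illustrative):

```python
import math

# Toy pairwise vote counts: wins[(a, b)] = times model a beat model b.
# These numbers are illustrative, not real arena votes.
models = ["grok-4.20", "gpt-5.2-search", "gemini-3-flash"]
wins = {
    ("grok-4.20", "gpt-5.2-search"): 55,
    ("gpt-5.2-search", "grok-4.20"): 45,
    ("grok-4.20", "gemini-3-flash"): 60,
    ("gemini-3-flash", "grok-4.20"): 40,
    ("gpt-5.2-search", "gemini-3-flash"): 52,
    ("gemini-3-flash", "gpt-5.2-search"): 48,
}

def bradley_terry(models, wins, iters=200):
    """Fit Bradley-Terry strengths p_i with the standard MM update:
    p_i <- W_i / sum_j n_ij / (p_i + p_j), then renormalize."""
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new[i] = total_wins / denom
        norm = sum(new.values())
        p = {m: v * len(models) / norm for m, v in new.items()}
    return p

strengths = bradley_terry(models, wins)
# Map strengths to an Elo-like scale: 400 * log10(p) plus an arbitrary offset.
ratings = {m: 400 * math.log10(p) + 1200 for m, p in strengths.items()}
for m in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{m:20s} {ratings[m]:7.1f}")
```

The 400/log10 mapping is what makes Bradley-Terry strengths comparable to familiar ELO numbers; the offset is arbitrary, so only rating differences are meaningful.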

Here is the current top 10:

Rank  Model                            Org        ELO         Votes
1     Grok 4.20 beta1                  xAI        1226 +/-9    3,992
2     GPT-5.2 Search                   OpenAI     1219 +/-6   19,960
3     Gemini 3 Flash Grounding         Google     1217 +/-6   25,041
4     Gemini 3 Pro Grounding           Google     1215 +/-5   31,696
5     GPT-5.1 Search                   OpenAI     1210 +/-6   23,069
6     GPT-5.2 Search (non-reasoning)   OpenAI     1182 +/-6   19,766
7     Grok 4.1 Fast Search             xAI        1181 +/-5   26,511
8     Grok 4 Fast Search               xAI        1173 +/-4   41,968
9     Claude Opus 4.5 Search           Anthropic  1170 +/-7   15,268
10    o3 Search                        OpenAI     1144 +/-5   20,408

The gap between first and second is 7 ELO points. For now, that is within noise: Grok 4.20's confidence interval (+/-9) is wider than GPT-5.2's (+/-6) because it has roughly one-fifth the votes, and the two intervals overlap. If the lead holds as vote counts equalize, it becomes statistically meaningful; if it narrows, first place settles into a statistical tie.
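A 7-point gap is also small in practical terms. Under the Elo model, a rating gap of delta points maps to a head-to-head win probability of 1 / (1 + 10^(-delta/400)), so the current lead predicts barely better than a coin flip:

```python
def elo_win_prob(delta: float) -> float:
    """Probability the higher-rated model wins a single head-to-head
    comparison, given an Elo rating gap of `delta` points."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(elo_win_prob(7))    # 7-point Search Arena lead: ~0.510
print(elo_win_prob(100))  # for contrast, a 100-point gap: ~0.640
```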

What Search Arena Actually Tests

This is not a standard benchmark where you can game the eval set. Search Arena measures something closer to real-world utility: can the model find accurate, relevant, up-to-date information and present it clearly? Users bring their own questions - there is no fixed test set - and vote blind. That makes it harder to overfit and more informative than curated benchmarks.

Grok's advantage here likely comes from its real-time X integration. The system has firehose access to approximately 68 million English posts per day on X, giving it a live pulse on breaking news, public reactions, and trending topics that other models can only get through web crawls. When a user asks about something that happened an hour ago, that pipeline matters.

The Text Arena Numbers

On the Text Arena (the general-purpose LMArena leaderboard, formerly Chatbot Arena), Grok 4.20 beta1 sits at #4 with an ELO of 1492. The current top 10:

Rank  Model                          Org        ELO          Votes
1     Claude Opus 4.6                Anthropic  1504 +/-7     7,042
2     Claude Opus 4.6 Thinking       Anthropic  1503 +/-8     6,181
3     Gemini 3.1 Pro Preview         Google     1500 +/-9     4,060
4     Grok 4.20 beta1                xAI        1492 +/-10    3,695
5     Gemini 3 Pro                   Google     1486 +/-4    37,854
6     GPT-5.2 Chat                   OpenAI     1481 +/-11    3,093
7     Gemini 3 Flash                 Google     1473 +/-5    28,847
8     Grok 4.1 Thinking              xAI        1473 +/-4    37,167
9     DoLA Seed 2.0 Preview          ByteDance  1472 +/-9     4,180
10    Claude Opus 4.5 Thinking 32K   Anthropic  1472 +/-4    30,067

At 1492, Grok 4.20 is within overlapping error bars of both Gemini 3.1 Pro Preview (1500) and GPT-5.2 Chat (1481). Practically, positions 3 through 6 are a statistical cluster. The real story is that Grok jumped from 8th (Grok 4.1 Thinking at 1473) to 4th - a 19-point ELO gain from the multi-agent architecture that replaced the single-model approach.

For context on how the broader arena rankings have evolved, Claude Opus 4.6 took the #1 spot when it launched earlier this month with an ELO that has since settled at 1504.

Should You Care?

The short answer: yes, but with caveats.

What the ranking proves. Grok 4.20's multi-agent architecture - four specialized agents (Grok, Harper, Benjamin, Lucas) debating and synthesizing answers - is producing human-preferred outputs that compete with the best single-model systems. That is a meaningful engineering result. It validates the idea that agent collaboration can substitute for raw model scale.

What it does not prove. With under 4,000 votes on each leaderboard, these rankings are preliminary. Most established models in the top 10 have four to ten times more votes, and early leaderboard entrants often see their scores shift as the vote count grows and the user distribution normalizes. Grok 4.1 Thinking, for comparison, has 37,167 text votes and a much tighter confidence interval (+/-4 vs +/-10).
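You can roughly project how tight Grok 4.20's interval will get, assuming (a simplifying rule of thumb, approximately true for Bradley-Terry standard errors) that the half-width shrinks like 1/sqrt(votes):

```python
import math

def projected_ci(current_ci: float, current_votes: int, target_votes: int) -> float:
    """Project a confidence-interval half-width to a larger vote count,
    assuming it shrinks roughly like 1/sqrt(n)."""
    return current_ci * math.sqrt(current_votes / target_votes)

# Grok 4.20 on Text Arena today: +/-10 at 3,695 votes.
print(round(projected_ci(10, 3695, 20000), 1))  # ~4.3 at 20K votes
```

At that projected +/-4 or so, a 1492 rating would no longer overlap Gemini 3.1 Pro's 1500 unless the scores themselves move, which is why the next few weeks of voting matter.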

The competitive context. xAI has now held #1 on Search Arena three times (Grok 4, Grok 4.1, and now Grok 4.20), suggesting this is not a fluke but a genuine strength. On Text Arena, the model went from a disappointing #33 debut (Grok 4 in July 2025) to comfortably in the top tier within six months. That trajectory is steeper than any other lab's.

What to watch. The rankings over the next two weeks, as vote counts approach the 10K-20K range where scores typically stabilize. If Grok 4.20 holds #1 on Search and top 5 on Text with 20,000+ votes, the multi-agent approach becomes the result to beat - and every other lab will be asking how to replicate it without xAI's X data advantage.


About the author: Sophie, AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.