News

Grok 4.20 Debuts at Number One on Search Arena, Cracks Top Four in Text

xAI's Grok 4.20 beta1 takes the top spot on LMArena's Search Arena with an ELO of 1226, beating GPT-5.2 and Gemini 3. It also lands fourth on the Text Arena at 1492, within striking distance of Claude Opus 4.6.

Grok 4.20 beta1, the multi-agent system xAI launched a week ago, just climbed to the top of LMArena's Search Arena leaderboard with an ELO of 1226 - ahead of GPT-5.2 Search (1219) and both Gemini 3 variants. On the Text Arena, it sits at fourth place with 1492, behind Claude Opus 4.6 (1504) and Gemini 3.1 Pro (1500) but ahead of GPT-5.2.

The result is notable because it marks the first time a multi-agent architecture - not a single monolithic model - has topped a major public leaderboard.

TL;DR

  • Grok 4.20 beta1 is #1 on Search Arena with an ELO of 1226, leading GPT-5.2 Search (1219) and Gemini 3 Flash Grounding (1217)
  • On Text Arena, it ranks #4 at 1492, on par with Gemini 3.1 Pro (1500) within error margins
  • The model has 3,992 search votes and 3,695 text votes so far - significantly fewer than the 20K-40K votes on established models, making the ranking preliminary
  • This is the third time xAI has held #1 on Search Arena, after Grok 4 and Grok 4.1 previously topped the leaderboard

The Search Arena Numbers

Search Arena is LMArena's dedicated leaderboard for search-augmented LLMs - models that pull live information from the web before answering. Users submit queries, get side-by-side responses from anonymous models, and vote on which answer is better. The votes are processed through Bradley-Terry regression to produce ELO scores. As of February 25, the leaderboard has accumulated 246,076 votes across 20 models.
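The exact fitting pipeline LMArena runs is not described in this article, but the core idea of Bradley-Terry regression can be sketched with the classic minorization-maximization update on toy pairwise vote counts (the model names are real, the vote numbers below are purely illustrative):

```python
import math

# Toy pairwise vote counts: wins[(a, b)] = times model a beat model b.
# These numbers are illustrative, not real arena votes.
models = ["grok-4.20", "gpt-5.2-search", "gemini-3-flash"]
wins = {
    ("grok-4.20", "gpt-5.2-search"): 55,
    ("gpt-5.2-search", "grok-4.20"): 45,
    ("grok-4.20", "gemini-3-flash"): 60,
    ("gemini-3-flash", "grok-4.20"): 40,
    ("gpt-5.2-search", "gemini-3-flash"): 52,
    ("gemini-3-flash", "gpt-5.2-search"): 48,
}

def bradley_terry(models, wins, iters=200):
    """Fit Bradley-Terry strengths p_i with the standard MM update:
    p_i <- W_i / sum_j n_ij / (p_i + p_j), then renormalize."""
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new[i] = total_wins / denom
        norm = sum(new.values())
        p = {m: v * len(models) / norm for m, v in new.items()}
    return p

strengths = bradley_terry(models, wins)
# Map strengths to an Elo-like scale: 400 * log10(p) plus an arbitrary offset.
ratings = {m: 400 * math.log10(p) + 1200 for m, p in strengths.items()}
for m in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{m:20s} {ratings[m]:7.1f}")
```

The 400/log10 mapping is what makes Bradley-Terry strengths comparable to familiar ELO numbers; the offset is arbitrary, so only rating differences are meaningful.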

Here is the current top 10:

Rank  Model                            Org        ELO         Votes
1     Grok 4.20 beta1                  xAI        1226 +/-9    3,992
2     GPT-5.2 Search                   OpenAI     1219 +/-6   19,960
3     Gemini 3 Flash Grounding         Google     1217 +/-6   25,041
4     Gemini 3 Pro Grounding           Google     1215 +/-5   31,696
5     GPT-5.1 Search                   OpenAI     1210 +/-6   23,069
6     GPT-5.2 Search (non-reasoning)   OpenAI     1182 +/-6   19,766
7     Grok 4.1 Fast Search             xAI        1181 +/-5   26,511
8     Grok 4 Fast Search               xAI        1173 +/-4   41,968
9     Claude Opus 4.5 Search           Anthropic  1170 +/-7   15,268
10    o3 Search                        OpenAI     1144 +/-5   20,408

The gap between first and second is 7 ELO points. For now, that is within noise: Grok 4.20's confidence interval (+/-9) is wider than GPT-5.2's (+/-6) because it has roughly one-fifth the votes, and the two intervals overlap. If the lead holds as vote counts equalize, it becomes statistically meaningful; if it narrows, first place settles into a statistical tie.
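A 7-point gap is also small in practical terms. Under the Elo model, a rating gap of delta points maps to a head-to-head win probability of 1 / (1 + 10^(-delta/400)), so the current lead predicts barely better than a coin flip:

```python
def elo_win_prob(delta: float) -> float:
    """Probability the higher-rated model wins a single head-to-head
    comparison, given an Elo rating gap of `delta` points."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(elo_win_prob(7))    # 7-point Search Arena lead: ~0.510
print(elo_win_prob(100))  # for contrast, a 100-point gap: ~0.640
```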

What Search Arena Actually Tests

This is not a standard benchmark where you can game the eval set. Search Arena measures something closer to real-world utility: can the model find accurate, relevant, up-to-date information and present it clearly? Users bring their own questions - there is no fixed test set - and vote blind. That makes it harder to overfit and more informative than curated benchmarks.

Grok's advantage here likely comes from its real-time X integration. The system has firehose access to approximately 68 million English posts per day on X, giving it a live pulse on breaking news, public reactions, and trending topics that other models can only get through web crawls. When a user asks about something that happened an hour ago, that pipeline matters.

The Text Arena Numbers

On the Text Arena (the general-purpose LMArena leaderboard, formerly Chatbot Arena), Grok 4.20 beta1 sits at #4 with an ELO of 1492. The current top 10:

Rank  Model                          Org        ELO          Votes
1     Claude Opus 4.6                Anthropic  1504 +/-7     7,042
2     Claude Opus 4.6 Thinking       Anthropic  1503 +/-8     6,181
3     Gemini 3.1 Pro Preview         Google     1500 +/-9     4,060
4     Grok 4.20 beta1                xAI        1492 +/-10    3,695
5     Gemini 3 Pro                   Google     1486 +/-4    37,854
6     GPT-5.2 Chat                   OpenAI     1481 +/-11    3,093
7     Gemini 3 Flash                 Google     1473 +/-5    28,847
8     Grok 4.1 Thinking              xAI        1473 +/-4    37,167
9     DoLA Seed 2.0 Preview          ByteDance  1472 +/-9     4,180
10    Claude Opus 4.5 Thinking 32K   Anthropic  1472 +/-4    30,067

At 1492, Grok 4.20 is within overlapping error bars of both Gemini 3.1 Pro Preview (1500) and GPT-5.2 Chat (1481). Practically, positions 3 through 6 are a statistical cluster. The real story is that Grok jumped from 8th (Grok 4.1 Thinking at 1473) to 4th - a 19-point ELO gain from the multi-agent architecture that replaced the single-model approach.

For context on how the broader arena rankings have evolved, Claude Opus 4.6 took the #1 spot when it launched earlier this month with an ELO that has since settled at 1504.

Should You Care?

The short answer: yes, but with caveats.

What the ranking proves. Grok 4.20's multi-agent architecture - four specialized agents (Grok, Harper, Benjamin, Lucas) debating and synthesizing answers - is producing human-preferred outputs that compete with the best single-model systems. That is a meaningful engineering result. It validates the idea that agent collaboration can substitute for raw model scale.

What it does not prove. With under 4,000 votes on each leaderboard, these rankings are preliminary. Most established models in the top 10 have four to ten times more votes, and early leaderboard entrants often see their scores shift as the vote count grows and the user distribution normalizes. Grok 4.1 Thinking, for comparison, has 37,167 text votes and a much tighter confidence interval (+/-4 vs +/-10).
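You can roughly project how tight Grok 4.20's interval will get, assuming (a simplifying rule of thumb, approximately true for Bradley-Terry standard errors) that the half-width shrinks like 1/sqrt(votes):

```python
import math

def projected_ci(current_ci: float, current_votes: int, target_votes: int) -> float:
    """Project a confidence-interval half-width to a larger vote count,
    assuming it shrinks roughly like 1/sqrt(n)."""
    return current_ci * math.sqrt(current_votes / target_votes)

# Grok 4.20 on Text Arena today: +/-10 at 3,695 votes.
print(round(projected_ci(10, 3695, 20000), 1))  # ~4.3 at 20K votes
```

At that projected +/-4 or so, a 1492 rating would no longer overlap Gemini 3.1 Pro's 1500 unless the scores themselves move, which is why the next few weeks of voting matter.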

The competitive context. xAI has now held #1 on Search Arena three times (Grok 4, Grok 4.1, and now Grok 4.20), suggesting this is not a fluke but a genuine strength. On Text Arena, the model went from a disappointing #33 debut (Grok 4 in July 2025) to comfortably in the top tier within six months. That trajectory is steeper than any other lab's.

What to watch. The rankings over the next two weeks, as vote counts approach the 10K-20K range where scores typically stabilize. If Grok 4.20 holds #1 on Search and top 5 on Text with 20,000+ votes, the multi-agent approach becomes the result to beat - and every other lab will be asking how to replicate it without xAI's X data advantage.


About the author: Sophie, AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.