Grok 4.20 - xAI's Multi-Agent Reasoning Flagship
Grok 4.20 is xAI's current flagship LLM with a 2M-token context window, native multi-agent mode, and reasoning toggle at $2.00/M input tokens.

Grok 4.20 is xAI's current flagship language model, superseding the original Grok 4 released in July 2025. It launched in public beta on February 17, 2026, with API access following on March 10. The headline improvements over Grok 4 are a 2-million-token context window (up from 256K), a native multi-agent mode that spawns up to 16 coordinating sub-agents, and a toggleable reasoning mode across three API variants.
TL;DR
- 2M token context window - the largest among current flagship models - at $2.00/M input, $6.00/M output
- Three API variants: reasoning, non-reasoning, and multi-agent (up to 16 coordinating sub-agents)
- Leads on output throughput (234.9 tokens/second) and sits at roughly #4 on the Chatbot Arena leaderboard with a preliminary Elo of ~1493
The pricing shift is notable. Grok 4 cost $3.00/M input and $15.00/M output. Grok 4.20 drops that to $2.00/$6.00 while expanding context by 8x. For teams building long-document workflows or agentic pipelines that were priced out of the Heavy variant, that's a meaningful change. Claude Opus 4.6 still leads on the Arena leaderboard and on coding benchmarks, and GPT-5.4 holds the edge on computer-use automation, but neither matches the context window size or output speed Grok 4.20 offers at this price point.
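The pricing shift above can be made concrete with a bit of arithmetic. The sketch below uses only the per-million-token prices quoted in this article ($3/$15 for Grok 4, $2/$6 for Grok 4.20); the sample workload sizes are illustrative assumptions.

```python
# Cost comparison for a sample workload, using the per-token prices
# quoted in this article (illustrative arithmetic only).

def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one request at per-million-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example workload: 500K input tokens, 20K output tokens per run.
grok_4 = request_cost(500_000, 20_000, 3.00, 15.00)   # Grok 4: $3/$15
grok_420 = request_cost(500_000, 20_000, 2.00, 6.00)  # Grok 4.20: $2/$6

print(f"Grok 4:    ${grok_4:.2f}")    # $1.80
print(f"Grok 4.20: ${grok_420:.2f}")  # $1.12
```

For input-heavy runs like this one, the saving comes mostly from the input price cut; output-heavy runs benefit more from the $15 to $6 drop.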
Key Specifications
| Specification | Details |
|---|---|
| Provider | xAI |
| Model Family | Grok |
| Parameters | Not disclosed |
| Context Window | 2,000,000 tokens |
| Input Price | $2.00/M tokens ($0.20/M cached) |
| Output Price | $6.00/M tokens |
| Release Date | February 17, 2026 (beta); March 10, 2026 (API) |
| Knowledge Cutoff | September 1, 2025 |
| License | Proprietary |
| Modalities | Text + Image input, Text output |
| Open Source | No |
xAI hasn't disclosed parameter counts for any model in the Grok 4 family. Third-party estimates suggesting a Mixture-of-Experts backbone have circulated, but none are confirmed. What's verified: the model exposes a reasoning parameter in the API that enables or disables extended chain-of-thought, which is how the -reasoning and -non-reasoning API variants differ at the interface level.
Benchmark Performance
xAI hasn't published MMLU, GPQA Diamond, or SWE-bench numbers specifically for Grok 4.20. The verified scores below come from Artificial Analysis and the Chatbot Arena, both of which use blind third-party evaluation.
| Benchmark | Grok 4.20 | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| AI Index (Artificial Analysis) | 48 / 100 | Not listed | Not listed | Not listed |
| Chatbot Arena Elo (preliminary) | ~1493 (#4) | ~1500 (#3) | Not confirmed | ~1485 |
| Omniscience (non-hallucination %) | 78% | Not available | Not available | Not available |
| Output speed (tokens/sec) | 234.9 (#1) | ~80 | ~100 | ~120 |
A few things worth flagging: the Arena Elo is preliminary, based on ~5,071 votes as of early April 2026, so the ranking could shift. The Omniscience benchmark is Artificial Analysis's own test, not an industry standard - it measures factual accuracy across a held-out question set, and while 78% is the highest score they've recorded, that claim should be weighed against standard benchmarks once xAI publishes them. The output throughput figure of 234.9 tokens/second is independently verified - Grok 4.20 is clearly faster than any competitor at the flagship tier.
Key Capabilities
Reasoning Toggle
Unlike models that ship separate reasoning and non-reasoning endpoints as distinct products, Grok 4.20 exposes this as an API parameter. You can set reasoning: { enabled: true } on any request, or use the pre-configured model IDs grok-4.20-0309-reasoning or grok-4.20-0309-non-reasoning. This lets developers control compute cost per request without maintaining separate integration paths. Extended thinking tokens are billed, so the cost difference between modes is real.
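The toggle described above can be driven from a single integration path. The sketch below builds a Chat-Completions-style request body using the model IDs from this article; the exact payload shape beyond the `reasoning.enabled` field is an assumption, so check xAI's API reference before relying on it.

```python
# Sketch: toggle reasoning per request via the pre-configured model IDs
# (grok-4.20-0309-reasoning / grok-4.20-0309-non-reasoning). The article
# also describes a `reasoning: {"enabled": ...}` request parameter as an
# equivalent alternative; the surrounding payload shape is an assumption.

def build_request(prompt: str, reasoning: bool) -> dict:
    """Build a chat-completion-style payload with reasoning toggled per request."""
    model = (
        "grok-4.20-0309-reasoning"
        if reasoning
        else "grok-4.20-0309-non-reasoning"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Summarize this contract.", reasoning=True)
```

Because extended thinking tokens are billed, a router like this lets a pipeline reserve the reasoning variant for requests that actually need it.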
Multi-Agent Mode
The grok-4.20-multi-agent-0309 model is the most distinctive addition. It spawns a network of sub-agents internally - either 4 agents (quick research tasks) or 16 agents (complex analysis) - coordinated by a leader agent that consolidates the final response. Each sub-agent reasons and fact-checks in parallel; the leader arbitrates conflicts rather than voting on them.
From an API standpoint, the interface looks like a single-turn call: you send one request and get one response. xAI handles the orchestration server-side. The tradeoff is that all tokens consumed by all agents count toward your bill, including reasoning tokens and server-side tool calls. A 16-agent run on a hard research question can cost substantially more than a standard call. The agentic AI benchmarks leaderboard is tracking real-world multi-agent performance as more data comes in.
One practical constraint: the multi-agent variant doesn't support client-side custom tools or the OpenAI Chat Completions-compatible API. If your pipeline relies on custom function definitions passed at call time, the standard reasoning or non-reasoning variants are the only options.
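Given that constraint, a pipeline that sometimes passes custom tools needs a routing guard. A minimal sketch, assuming the model IDs from this article and a Chat-Completions-style payload with an optional `tools` list:

```python
# Hedged sketch: send a request to the multi-agent variant only when the
# payload carries no client-side tool definitions, per the constraint that
# grok-4.20-multi-agent-0309 doesn't accept custom tools. Model IDs are
# taken from this article; the payload shape is an assumption.

def route_model(payload: dict) -> str:
    """Pick a model ID based on whether the request defines custom tools."""
    if payload.get("tools"):
        # Multi-agent rejects client-side tools; fall back to the
        # standard reasoning variant.
        return "grok-4.20-0309-reasoning"
    return "grok-4.20-multi-agent-0309"

print(route_model({"messages": []}))                    # grok-4.20-multi-agent-0309
print(route_model({"messages": [], "tools": [{}]}))     # grok-4.20-0309-reasoning
```

Server-side tools like web search are unaffected by this guard; only call-time function definitions force the fallback.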
Context and Multimodal
The 2M-token context is genuine - Artificial Analysis confirmed it in their evaluation run. This opens up use cases that were previously unreachable on Grok 4: loading full software repositories, multi-document legal review, or extended conversation histories that would otherwise require retrieval. The model also accepts image inputs, which is useful for document analysis tasks where diagrams and screenshots need to be processed with text.
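Before loading a full repository or document set, a rough pre-flight size check helps avoid over-length requests. The sketch below uses the common ~4-characters-per-token heuristic for English text; actual counts depend on xAI's tokenizer, so treat it as an estimate only.

```python
# Rough pre-flight check: will a document set fit in the 2M-token window?
# CHARS_PER_TOKEN is a heuristic, not xAI's actual tokenizer.

CONTEXT_WINDOW = 2_000_000
CHARS_PER_TOKEN = 4  # rough English-text heuristic

def fits_in_context(documents: list[str], reserve_for_output: int = 16_000) -> bool:
    """Estimate whether the documents plus an output budget fit in context."""
    estimated_tokens = sum(len(d) for d in documents) // CHARS_PER_TOKEN
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOW

# ~6M characters estimates to ~1.5M tokens: fits, with room for the response.
print(fits_in_context(["x" * 6_000_000]))  # True
```

For borderline cases, tokenize a sample of the corpus properly rather than trusting the heuristic.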
Built-in web search and X (formerly Twitter) search are available as server-side tools, same as Grok 4. The X search integration carries the same caveat as before: X isn't a reliable factual source, and verifying claims sourced from it is the caller's responsibility.
Pricing and Availability
| Tier | Input $/M | Output $/M |
|---|---|---|
| Grok 4.20 | $2.00 ($0.20 cached) | $6.00 |
| Grok 4.1 Fast | $0.20 ($0.05 cached) | $0.50 |
| Claude Opus 4.6 | $15.00 | $75.00 |
| GPT-5.4 | $2.50 | $15.00 |
| Gemini 3.1 Pro | $2.00 | $10.00 |
Grok 4.20 is the cheapest of the current flagship-tier models on output pricing. On input it ties Gemini 3.1 Pro and undercuts GPT-5.4 by 20%. The 2M context window surcharge (pricing increases for requests above 200K tokens, per OpenRouter) is worth checking if your use case pushes into long-context territory. Grok 4.1 Fast remains available for budget-sensitive workloads at one-tenth the cost, with the same 2M context but reduced capability.
Web search calls add a $5.00 per 1,000 calls surcharge on top of token costs. For pipelines running many short queries with integrated search, that can add up.
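The "can add up" claim is easy to quantify. The sketch below uses the $2/$6 per-million-token prices from the table above and reads the search surcharge as $5.00 per 1,000 search calls (the article's "$5.00/1K"); both the workload sizes and that reading of the surcharge are assumptions to verify against xAI's pricing page.

```python
# Estimate total pipeline cost: token charges at Grok 4.20's $2/$6 rates
# plus a web search surcharge assumed to be $5.00 per 1,000 search calls.

def pipeline_cost(n_queries, in_tok, out_tok, searches_per_query=1):
    """Dollar cost for n_queries requests with integrated web search."""
    token_cost = n_queries * (in_tok / 1e6 * 2.00 + out_tok / 1e6 * 6.00)
    search_cost = n_queries * searches_per_query * (5.00 / 1000)
    return token_cost + search_cost

# 10,000 short queries (1K input / 500 output tokens) with one search each:
# token cost ~$50, search surcharge ~$50 - search roughly doubles the bill.
print(f"${pipeline_cost(10_000, 1_000, 500):.2f}")  # $100.00
```

The shorter the queries, the larger the surcharge's share of the total, which is exactly the "many short queries" case the text warns about.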
API access is via api.x.ai, OpenAI-SDK-compatible with a base URL swap. Available on OpenRouter as x-ai/grok-4.20-beta. Batch processing discounts apply.
Consumer access is via SuperGrok subscriptions.
Strengths
- Largest confirmed context window among current flagship models (2M tokens)
- Fastest output throughput at the flagship tier (234.9 t/s, Artificial Analysis)
- Lower pricing than GPT-5.4 and Claude Opus 4.6 on both input and output
- Native multi-agent mode built into the API (no external orchestration required)
- Reasoning toggle avoids the cost of always-on extended thinking
- OpenAI SDK compatibility with a base URL change
Weaknesses
- No published MMLU, GPQA Diamond, or SWE-bench scores to compare against competitors
- Multi-agent variant doesn't support client-side custom tools
- Sub-agent reasoning is encrypted by default (limits debuggability in multi-agent mode)
- Knowledge cutoff is September 1, 2025 - several months older than Claude Opus 4.6 and GPT-5.4
- Multi-agent billing is opaque: all internal agent tokens count toward cost
Related Coverage
- News: Grok 4.20 Multi-Agent Launch | Grok 4.20 Takes the Search Arena Lead | xAI Grok 4.20 Preview
- Predecessor: Grok 4 model page
- Leaderboards: Reasoning Benchmarks | Agentic AI Benchmarks | Cost Efficiency
- Compare: Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro
FAQ
What is Grok 4.20 best at?
Long-context tasks and high-throughput generation. Its 2M token window is the largest among flagship models and its 234.9 tokens/second output speed leads the field. Best for document analysis, extended reasoning chains, and agentic pipelines where cost matters.
How does the multi-agent mode work?
Send a request to grok-4.20-multi-agent-0309. xAI spawns 4 or 16 sub-agents server-side that research, reason, and cross-check in parallel. The leader agent consolidates the output. All internal tokens are billed. You cannot pass custom client-side tools to the multi-agent variant.
Is Grok 4.20 better than Grok 4?
On context window (2M vs 256K), pricing ($2/$6 vs $3/$15), and output speed, yes. On reasoning benchmarks, no direct comparison exists yet because xAI hasn't published MMLU or GPQA scores for Grok 4.20. The Chatbot Arena Elo is preliminary.
Does Grok 4.20 support image inputs?
Yes. It accepts text and image inputs, with text-only output. Video input is not supported on the base model variants.
How do I access Grok 4.20 via API?
Use api.x.ai with an OpenAI-compatible SDK, changing only the base URL and API key. Model IDs are grok-4.20-0309-reasoning, grok-4.20-0309-non-reasoning, and grok-4.20-multi-agent-0309. Also available via OpenRouter as x-ai/grok-4.20-beta.
Last verified April 2, 2026
