Grok 4.20 - xAI's Multi-Agent Reasoning Flagship
Grok 4.20 is xAI's current flagship LLM with a 2M-token context window, native multi-agent mode, and reasoning toggle at $2.00/M input tokens.

Grok 4.20 is xAI's current flagship language model, superseding the original Grok 4 released in July 2025. It launched in public beta on February 17, 2026, with API access following on March 10. The headline improvements over Grok 4 are a 2-million-token context window (up from 256K), a native multi-agent mode that spawns up to 16 coordinating sub-agents, and a toggleable reasoning mode across three API variants.
TL;DR
- 2M token context window - the largest among current flagship models - at $2.00/M input, $6.00/M output
- Three API variants: reasoning, non-reasoning, and multi-agent (up to 16 coordinating sub-agents)
- Leads on output throughput (234.9 tokens/second) and sits at roughly #4 on the Chatbot Arena leaderboard with a preliminary Elo of ~1493
The pricing shift is notable. Grok 4 cost $3.00/M input and $15.00/M output. Grok 4.20 drops that to $2.00/$6.00 while expanding context by 8x. For teams building long-document workflows or agentic pipelines that were priced out of the Heavy variant, that's a meaningful change. Claude Opus 4.6 still leads on the Arena leaderboard and on coding benchmarks, and GPT-5.4 holds the edge on computer-use automation, but neither matches the context window size or output speed Grok 4.20 offers at this price point.
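The pricing shift above can be made concrete with a bit of arithmetic. The sketch below uses only the per-million-token prices quoted in this article ($3/$15 for Grok 4, $2/$6 for Grok 4.20); the sample workload sizes are illustrative assumptions.

```python
# Cost comparison for a sample workload, using the per-token prices
# quoted in this article (illustrative arithmetic only).

def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one request at per-million-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example workload: 500K input tokens, 20K output tokens per run.
grok_4 = request_cost(500_000, 20_000, 3.00, 15.00)   # Grok 4: $3/$15
grok_420 = request_cost(500_000, 20_000, 2.00, 6.00)  # Grok 4.20: $2/$6

print(f"Grok 4:    ${grok_4:.2f}")    # $1.80
print(f"Grok 4.20: ${grok_420:.2f}")  # $1.12
```

For input-heavy runs like this one, the saving comes mostly from the input price cut; output-heavy runs benefit more from the $15 to $6 drop.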
Key Specifications
| Specification | Details |
|---|---|
| Provider | xAI |
| Model Family | Grok |
| Parameters | Not disclosed |
| Context Window | 2,000,000 tokens |
| Input Price | $2.00/M tokens ($0.20/M cached) |
| Output Price | $6.00/M tokens |
| Release Date | February 17, 2026 (beta); March 10, 2026 (API) |
| Knowledge Cutoff | September 1, 2025 |
| License | Proprietary |
| Modalities | Text + Image input, Text output |
| Open Source | No |
xAI hasn't disclosed parameter counts for any model in the Grok 4 family. Third-party estimates suggesting a Mixture-of-Experts backbone have circulated, but none are confirmed. What's verified: the model exposes a reasoning parameter in the API that enables or disables extended chain-of-thought, which is how the -reasoning and -non-reasoning API variants differ at the interface level.
Benchmark Performance
xAI hasn't published MMLU, GPQA Diamond, or SWE-bench numbers specifically for Grok 4.20. The verified scores below come from Artificial Analysis and the Chatbot Arena, both of which use blind third-party evaluation.
| Benchmark | Grok 4.20 | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| AI Index (Artificial Analysis) | 48 / 100 | Not listed | Not listed | Not listed |
| Chatbot Arena Elo (preliminary) | ~1493 (#4) | ~1500 (#3) | Not confirmed | ~1485 |
| Omniscience (non-hallucination %) | 78% | Not available | Not available | Not available |
| Output speed (tokens/sec) | 234.9 (#1) | ~80 | ~100 | ~120 |
A few things worth flagging: the Arena Elo is preliminary, based on ~5,071 votes as of early April 2026, so the ranking could shift. The Omniscience benchmark is Artificial Analysis's own test, not an industry standard - it measures factual accuracy across a held-out question set, and while 78% is the highest score they've recorded, that claim should be weighed against standard benchmarks once xAI publishes them. The output throughput figure of 234.9 tokens/second is independently verified - Grok 4.20 is clearly faster than any competitor at the flagship tier.
Key Capabilities
Reasoning Toggle
Unlike models that ship separate reasoning and non-reasoning endpoints as distinct products, Grok 4.20 exposes this as an API parameter. You can set reasoning: { enabled: true } on any request, or use the pre-configured model IDs grok-4.20-0309-reasoning or grok-4.20-0309-non-reasoning. This lets developers control compute cost per request without maintaining separate integration paths. Extended thinking tokens are billed, so the cost difference between modes is real.
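The toggle described above can be driven from a single integration path. The sketch below builds a Chat-Completions-style request body using the model IDs from this article; the exact payload shape beyond the `reasoning.enabled` field is an assumption, so check xAI's API reference before relying on it.

```python
# Sketch: toggle reasoning per request via the pre-configured model IDs
# (grok-4.20-0309-reasoning / grok-4.20-0309-non-reasoning). The article
# also describes a `reasoning: {"enabled": ...}` request parameter as an
# equivalent alternative; the surrounding payload shape is an assumption.

def build_request(prompt: str, reasoning: bool) -> dict:
    """Build a chat-completion-style payload with reasoning toggled per request."""
    model = (
        "grok-4.20-0309-reasoning"
        if reasoning
        else "grok-4.20-0309-non-reasoning"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Summarize this contract.", reasoning=True)
```

Because extended thinking tokens are billed, a router like this lets a pipeline reserve the reasoning variant for requests that actually need it.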
Multi-Agent Mode
The grok-4.20-multi-agent-0309 model is the most distinctive addition. It spawns a network of sub-agents internally - either 4 agents (quick research tasks) or 16 agents (complex analysis) - coordinated by a leader agent that consolidates the final response. Each sub-agent reasons and fact-checks in parallel; the leader arbitrates conflicts rather than voting on them.
From an API standpoint, the interface looks like a single-turn call: you send one request and get one response. xAI handles the orchestration server-side. The tradeoff is that all tokens consumed by all agents count toward your bill, including reasoning tokens and server-side tool calls. A 16-agent run on a hard research question can cost substantially more than a standard call. The agentic AI benchmarks leaderboard is tracking real-world multi-agent performance as more data comes in.
One practical constraint: the multi-agent variant doesn't support client-side custom tools or the OpenAI Chat Completions-compatible API. If your pipeline relies on custom function definitions passed at call time, the standard reasoning or non-reasoning variants are the only options.
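Given that constraint, a pipeline that sometimes passes custom tools needs a routing guard. A minimal sketch, assuming the model IDs from this article and a Chat-Completions-style payload with an optional `tools` list:

```python
# Hedged sketch: send a request to the multi-agent variant only when the
# payload carries no client-side tool definitions, per the constraint that
# grok-4.20-multi-agent-0309 doesn't accept custom tools. Model IDs are
# taken from this article; the payload shape is an assumption.

def route_model(payload: dict) -> str:
    """Pick a model ID based on whether the request defines custom tools."""
    if payload.get("tools"):
        # Multi-agent rejects client-side tools; fall back to the
        # standard reasoning variant.
        return "grok-4.20-0309-reasoning"
    return "grok-4.20-multi-agent-0309"

print(route_model({"messages": []}))                    # grok-4.20-multi-agent-0309
print(route_model({"messages": [], "tools": [{}]}))     # grok-4.20-0309-reasoning
```

Server-side tools like web search are unaffected by this guard; only call-time function definitions force the fallback.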
Context and Multimodal
The 2M-token context is genuine - Artificial Analysis confirmed it in their evaluation run. This opens up use cases that were previously unreachable on Grok 4: loading full software repositories, multi-document legal review, or extended conversation histories that would otherwise require retrieval. The model also accepts image inputs, which is useful for document analysis tasks where diagrams and screenshots need to be processed with text.
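Before loading a full repository or document set, a rough pre-flight size check helps avoid over-length requests. The sketch below uses the common ~4-characters-per-token heuristic for English text; actual counts depend on xAI's tokenizer, so treat it as an estimate only.

```python
# Rough pre-flight check: will a document set fit in the 2M-token window?
# CHARS_PER_TOKEN is a heuristic, not xAI's actual tokenizer.

CONTEXT_WINDOW = 2_000_000
CHARS_PER_TOKEN = 4  # rough English-text heuristic

def fits_in_context(documents: list[str], reserve_for_output: int = 16_000) -> bool:
    """Estimate whether the documents plus an output budget fit in context."""
    estimated_tokens = sum(len(d) for d in documents) // CHARS_PER_TOKEN
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOW

# ~6M characters estimates to ~1.5M tokens: fits, with room for the response.
print(fits_in_context(["x" * 6_000_000]))  # True
```

For borderline cases, tokenize a sample of the corpus properly rather than trusting the heuristic.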
Built-in web search and X (formerly Twitter) search are available as server-side tools, same as Grok 4. The X search integration carries the same caveat as before: X isn't a reliable factual source, and verifying claims sourced from it is the caller's responsibility.
Pricing and Availability
| Tier | Input $/M | Output $/M |
|---|---|---|
| Grok 4.20 | $2.00 ($0.20 cached) | $6.00 |
| Grok 4.1 Fast | $0.20 ($0.05 cached) | $0.50 |
| Claude Opus 4.6 | $15.00 | $75.00 |
| GPT-5.4 | $2.50 | $15.00 |
| Gemini 3.1 Pro | $2.00 | $10.00 |
Grok 4.20 is the cheapest of the current flagship-tier models on output pricing. On input it ties Gemini 3.1 Pro and undercuts GPT-5.4 by 20%. The 2M context window surcharge (pricing increases for requests above 200K tokens, per OpenRouter) is worth checking if your use case pushes into long-context territory. Grok 4.1 Fast remains available for budget-sensitive workloads at one-tenth the cost, with the same 2M context but reduced capability.
Web search calls add a $5.00 per 1,000 calls surcharge on top of token costs. For pipelines running many short queries with integrated search, that can add up.
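The "can add up" claim is easy to quantify. The sketch below uses the $2/$6 per-million-token prices from the table above and reads the search surcharge as $5.00 per 1,000 search calls (the article's "$5.00/1K"); both the workload sizes and that reading of the surcharge are assumptions to verify against xAI's pricing page.

```python
# Estimate total pipeline cost: token charges at Grok 4.20's $2/$6 rates
# plus a web search surcharge assumed to be $5.00 per 1,000 search calls.

def pipeline_cost(n_queries, in_tok, out_tok, searches_per_query=1):
    """Dollar cost for n_queries requests with integrated web search."""
    token_cost = n_queries * (in_tok / 1e6 * 2.00 + out_tok / 1e6 * 6.00)
    search_cost = n_queries * searches_per_query * (5.00 / 1000)
    return token_cost + search_cost

# 10,000 short queries (1K input / 500 output tokens) with one search each:
# token cost ~$50, search surcharge ~$50 - search roughly doubles the bill.
print(f"${pipeline_cost(10_000, 1_000, 500):.2f}")  # $100.00
```

The shorter the queries, the larger the surcharge's share of the total, which is exactly the "many short queries" case the text warns about.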
API access is via api.x.ai, OpenAI-SDK-compatible with a base URL swap. Available on OpenRouter as x-ai/grok-4.20-beta. Batch processing discounts apply.
Consumer access is via SuperGrok subscriptions.
Strengths
- Largest confirmed context window among current flagship models (2M tokens)
- Fastest output throughput at the flagship tier (234.9 t/s, Artificial Analysis)
- Lower pricing than GPT-5.4 and Claude Opus 4.6 on both input and output
- Native multi-agent mode built into the API (no external orchestration required)
- Reasoning toggle avoids the cost of always-on extended thinking
- OpenAI SDK compatibility with a base URL change
Weaknesses
- No published MMLU, GPQA Diamond, or SWE-bench scores to compare against competitors
- Multi-agent variant doesn't support client-side custom tools
- Sub-agent reasoning is encrypted by default (limits debuggability in multi-agent mode)
- Knowledge cutoff is September 1, 2025 - several months older than Claude Opus 4.6 and GPT-5.4
- Multi-agent billing is opaque: all internal agent tokens count toward cost
Related Coverage
- News: Grok 4.20 Multi-Agent Launch | Grok 4.20 Takes the Search Arena Lead | xAI Grok 4.20 Preview
- Predecessor: Grok 4 model page
- Leaderboards: Reasoning Benchmarks | Agentic AI Benchmarks | Cost Efficiency
- Compare: Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro
FAQ
What is Grok 4.20 best at?
Long-context tasks and high-throughput generation. Its 2M token window is the largest among flagship models and its 234.9 tokens/second output speed leads the field. Best for document analysis, extended reasoning chains, and agentic pipelines where cost matters.
How does the multi-agent mode work?
Send a request to grok-4.20-multi-agent-0309. xAI spawns 4 or 16 sub-agents server-side that research, reason, and cross-check in parallel. The leader agent consolidates the output. All internal tokens are billed. You cannot pass custom client-side tools to the multi-agent variant.
Is Grok 4.20 better than Grok 4?
On context window (2M vs 256K), pricing ($2/$6 vs $3/$15), and output speed, yes. On reasoning benchmarks, no direct comparison exists yet because xAI hasn't published MMLU or GPQA scores for Grok 4.20. The Chatbot Arena Elo is preliminary.
Does Grok 4.20 support image inputs?
Yes. It accepts text and image inputs, with text-only output. Video input is not supported on the base model variants.
How do I access Grok 4.20 via API?
Use api.x.ai with an OpenAI-compatible SDK, changing only the base URL and API key. Model IDs are grok-4.20-0309-reasoning, grok-4.20-0309-non-reasoning, and grok-4.20-multi-agent-0309. Also available via OpenRouter as x-ai/grok-4.20-beta.
Last verified April 2, 2026
