Grok 4 - xAI's Flagship Reasoning Model

Grok 4 is xAI's frontier reasoning model, the first to break 50% on Humanity's Last Exam, with a 256K context window, $3/M input pricing, and a Heavy multi-agent variant built on 200,000 GPUs.


Grok 4 is the flagship reasoning model from xAI, Elon Musk's AI lab. Released on July 9, 2025, it became the first AI model to break 50% on Humanity's Last Exam (HLE) - a benchmark built from expert-contributed questions designed to resist AI progress. That result, hit by the Heavy variant, established Grok 4 as the strongest pure reasoning system available at launch and drew serious attention from researchers who had watched every prior model stall well below the 40% threshold.

TL;DR

  • First model to exceed 50% on Humanity's Last Exam, trained on xAI's 200,000-GPU Colossus cluster
  • 256K context window, $3.00/M input and $15.00/M output tokens (grok-4-0709 via API)
  • Leads on math and reasoning benchmarks but trails Gemini 3.1 Pro on abstract reasoning (ARC-AGI-2) and the wider frontier on context window size

xAI ships Grok 4 in two variants. The standard model (grok-4-0709) is a capable all-around system with built-in tool use, code execution, and real-time X search. The Heavy variant is a multi-agent architecture that spawns parallel Grok 4 instances comparing solutions - not majority voting, but active collaboration between agents - consuming roughly 10x the compute of the standard model per query. xAI trained both on its Colossus cluster, which used 100x more training compute than Grok 2 and 10x more reinforcement learning compute than Grok 3.
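xAI has not published Heavy's internals, but the pattern described above — parallel instances that compare candidate solutions rather than vote — can be sketched conceptually. In this illustration, `solve` and `review` are hypothetical stand-ins for calls to independent Grok 4 instances:

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch only - xAI has not disclosed Heavy's actual architecture.
# solve() and review() are placeholders for independent Grok 4 instances.

def solve(problem: str, seed: int) -> str:
    """One agent's independent attempt at the problem (placeholder)."""
    return f"candidate-{seed}"

def review(candidate: str, alternatives: list[str]) -> float:
    """Score a candidate after comparing it with the alternatives
    (placeholder) - comparison between agents, not majority voting."""
    return len(candidate) + candidate.count("3")  # arbitrary stand-in score

def heavy_style_answer(problem: str, n_agents: int = 4) -> str:
    # 1. Spawn parallel instances, each attempting the problem independently.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(lambda s: solve(problem, s), range(n_agents)))
    # 2. Each candidate is reviewed against the others; the best one wins.
    return max(candidates,
               key=lambda c: review(c, [o for o in candidates if o != c]))
```

The roughly 10x compute multiplier falls out naturally from this shape: every query fans out to multiple full model instances plus a comparison pass.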

As of March 2026, grok-4-0709 has been deprecated on the API in favor of Grok 4.1 (released November 17, 2025). The underlying model architecture and benchmarks discussed here apply to the original July 2025 release. For full hands-on testing across coding, reasoning, and research tasks, see our Grok 4 review.

Key Specifications

| Specification | Details |
|---|---|
| Provider | xAI |
| Model Family | Grok |
| Parameters | Not disclosed (MoE architecture, scale undisclosed) |
| Context Window | 256,000 tokens (8,000 max output) |
| Input Price | $3.00/M tokens ($0.75/M cached) |
| Output Price | $15.00/M tokens |
| Release Date | July 9, 2025 |
| Knowledge Cutoff | December 31, 2024 |
| License | Proprietary |
| Modalities | Text + Image input, Text output |
| Open Source | No |

xAI hasn't disclosed parameter counts for Grok 4. Third-party estimates suggesting a Mixture-of-Experts backbone have circulated online but remain unverified. What xAI has confirmed: training ran on 200,000 H100 GPUs (the Colossus cluster in Memphis), and the model uses integrated tool calling baked into training - not post-hoc function call injection. The 256K context window supports up to 8,000 output tokens, which is on the smaller side compared to Claude Opus 4.6's 128K output limit or the 2M-token context available on Grok 4 Fast variants.

The Grok 4 Fast line (reasoning and non-reasoning) offers a dramatically larger 2M-token context window at a fraction of the cost: $0.20/M input and $0.50/M output tokens. That tradeoff is worth understanding before selecting a model tier.

[Image: Independent evaluation of Grok 4 across coding, writing, visualization, and image analysis tasks - Grok 4 beats earlier frontier models but shows variability across task types. Source: ultraaiguide.com]

Benchmark Performance

Grok 4 benchmark scores are from the July 2025 launch period. Competitor scores are from each model's own release documentation.

| Benchmark | Grok 4 | Grok 4 Heavy | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| GPQA Diamond | 87.5% | 88.4% | 92.4% | 91.3% | 94.3% |
| HLE (no tools) | ~25% | >50% | 34.5% | 41.2% | 44.4% |
| AIME 2025 | 91.7% | 100% | 100% | - | - |
| ARC-AGI V2 | 15.9% | - | 52.9% | 37.6% | 77.1% |
| LiveCodeBench | 79% | 79.4% | 80% | - | - |

HLE scores for GPT-5.2, Claude Opus 4.6, and Gemini 3.1 Pro reflect those models' launch benchmarks (released 5-8 months after Grok 4). ARC-AGI scores use V2 for all models. Dash = not publicly reported.

The HLE result is the headline number and it's real. Grok 4 Heavy's >50% score was confirmed at launch with no other model above 40% at that time. By March 2026, later models closed the gap substantially - Gemini 3.1 Pro sits at 44.4% and Claude Opus 4.6 at 41.2% - but none have crossed the 50% threshold yet.

The ARC-AGI V2 score is the clearest weakness. A 15.9% result against GPT-5.2's 52.9% and Gemini 3.1 Pro's 77.1% on the abstract pattern recognition benchmark means Grok 4 relies on knowledge-based reasoning rather than novel pattern abstraction. The reasoning benchmarks leaderboard tracks these numbers as they update.

AIME 2025 math performance is strong: the Heavy variant hits 100% and the standard model hits 91.7%, competitive with the best math-specialized models. See the math olympiad AI leaderboard for context.

Key Capabilities

Grok 4's strongest use case is hard multi-step reasoning on problems that stump other frontier models. The Heavy variant's multi-agent architecture - parallel reasoning instances comparing solutions rather than voting - produces results that users describe as qualitatively different from single-pass extended thinking. It's specifically designed for tasks where even a capable model gets stuck: competition mathematics, complex scientific analysis, and multi-domain reasoning chains.

Built-in X (formerly Twitter) search integration gives Grok 4 real-time access to public discussion on the platform. This is genuinely useful for current events, public sentiment tracking, and trending topics where its December 2024 knowledge cutoff falls short. The integration has a catch: X isn't a reliable factual source, and Grok 4 sometimes incorporates inaccurate claims from the platform without flagging the uncertainty. Production pipelines that depend on factual accuracy should verify Grok 4's X-sourced claims independently.

Tool use covers the basics - code execution, web search, X search, and function calling. The ecosystem is narrower than OpenAI's or Anthropic's. Custom tool extensibility is less mature. For contained reasoning and math problems, that gap rarely matters. For complex agentic pipelines with many specialized tools, the Claude Opus 4.6 or GPT-5.2 ecosystems remain more capable.

[Image: xAI's Colossus cluster in Memphis, where Grok 4 was trained on 200,000 NVIDIA H100 GPUs - one of the largest single training runs on record at the time. Source: tomshardware.com]

Pricing and Availability

Grok 4 is available through two channels: the xAI API and SuperGrok consumer subscriptions.

| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| grok-4-0709 (standard) | $3.00 ($0.75 cached) | $15.00 |
| Grok 4 Fast (reasoning) | $0.20 ($0.05 cached) | $0.50 |
| Claude Opus 4.6 | $15.00 | $75.00 |
| GPT-5.2 Thinking | $1.75 | $14.00 |
| Gemini 3.1 Pro | $2.00 | $10.00 |

At $3.00/M input and $15.00/M output, standard Grok 4 costs one-fifth as much as Claude Opus 4.6 on both input and output - a meaningful gap for high-volume applications. It's priced above GPT-5.2 Thinking and Gemini 3.1 Pro, which should prompt a benchmark comparison before committing. The cost efficiency leaderboard tracks price-performance ratios across the current frontier.

API access is via api.x.ai, compatible with the OpenAI SDK format (drop-in replacement with a base URL change). Available on OpenRouter as x-ai/grok-4 and on Oracle OCI Generative AI for Grok 4 Fast variants.
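Because the API follows the OpenAI format, migrating an existing pipeline comes down to swapping the base URL. A minimal sketch (the endpoint and model name are from this page; the API key is a placeholder):

```python
# Minimal sketch of pointing an OpenAI-SDK pipeline at the xAI endpoint.
# Endpoint and model name are as documented; the key is a placeholder.

XAI_BASE_URL = "https://api.x.ai/v1"

def grok_request(prompt: str, model: str = "grok-4-0709") -> dict:
    """Build an OpenAI-format chat payload; the xAI API accepts the same shape."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# With the OpenAI SDK installed, the only change vs. an OpenAI setup is the URL:
#   from openai import OpenAI
#   client = OpenAI(base_url=XAI_BASE_URL, api_key="YOUR_XAI_API_KEY")
#   reply = client.chat.completions.create(**grok_request("Hello, Grok"))
#   print(reply.choices[0].message.content)
```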

Batch API: 50% off standard rates for async processing, bringing standard Grok 4 to $1.50/M input and $7.50/M output.
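As a sanity check on the batch discount, per-request cost is a straight multiply-and-add over the prices quoted above:

```python
# Cost estimate using the standard Grok 4 prices quoted above ($ per M tokens).
GROK4_INPUT, GROK4_OUTPUT = 3.00, 15.00
BATCH_DISCOUNT = 0.5  # 50% off standard rates for async batch processing

def request_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Dollar cost of a single request at standard or batch rates."""
    cost = (input_tokens * GROK4_INPUT + output_tokens * GROK4_OUTPUT) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost

# A 10K-token-in / 2K-token-out request:
standard = request_cost(10_000, 2_000)        # $0.03 in + $0.03 out = $0.06
batched = request_cost(10_000, 2_000, True)   # $0.03
```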

Rate limits for grok-4-0709: 20M tokens per minute, 500 requests per minute.

Consumer access via SuperGrok at approximately $30/month includes Grok 4, DeepSearch, and Big Brain Mode. SuperGrok Heavy at $300/month offers Grok 4 Heavy preview access. At launch in July 2025, $300/month made it the most expensive consumer AI subscription among major providers.

Strengths

  • HLE benchmark leader: First model past 50% on Humanity's Last Exam (Heavy variant)
  • Strong math performance: 100% on AIME 2025 (Heavy), 91.7% (standard)
  • Competitive API pricing: 5x cheaper input than Claude Opus 4.6
  • Real-time X search: Useful for current events and public sentiment tracking
  • Multi-agent Heavy variant: Collaborative parallel reasoning for the hardest problems
  • OpenAI SDK compatibility: Minimal migration cost from OpenAI-based pipelines
  • Scale: Trained on 100x more compute than Grok 2, reflecting genuine frontier investment

Weaknesses

  • Weak on abstract reasoning: 15.9% ARC-AGI V2 trails GPT-5.2 (52.9%) and Gemini 3.1 Pro (77.1%)
  • Smaller context window: 256K tokens (8K output max) is below Claude Opus 4.6 and GPT-5.2
  • Knowledge cutoff: December 31, 2024 - older than competitors released in early 2026
  • Less mature ecosystem: Fewer integrations, less extensible tool infrastructure
  • X search reliability: Real-time X data improves coverage but introduces factual noise
  • Heavy variant cost: the $300/month consumer subscription and roughly 10x API compute cost limit broad use
  • Deprecated endpoint: grok-4-0709 is deprecated as of early 2026; Grok 4.1 is the API successor


✓ Last verified March 10, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.