Grok 4 Review: xAI's Bold Bid for the Reasoning Crown
A comprehensive review of xAI's Grok 4, the first model to score 50% on Humanity's Last Exam, featuring Heavy and Coding variants with built-in tool use.

Elon Musk's xAI has been the scrappy upstart of the frontier AI scene, and Grok 4 is the moment the company forces everyone to take it seriously. By becoming the first model to score 50% on Humanity's Last Exam, a benchmark explicitly designed to resist AI progress, Grok 4 stakes a credible claim to the strongest pure reasoning capabilities in existence. But raw reasoning power and practical utility are different things, and Grok 4's story is as much about what it cannot do as what it can.
Standard and Heavy Variants
Grok 4 comes in two main flavors. The standard variant is a capable all-around model with a 130K context window, built-in tool use, and strong performance across language, coding, and reasoning tasks. It is competitive with GPT-5.2's Thinking mode and represents a solid general-purpose offering.
The Heavy variant is where things get extraordinary. This is a massively scaled model trained on xAI's Colossus cluster of 200,000 GPUs, one of the largest training runs ever conducted. Heavy is not designed for casual chat or quick tasks. It is an extended reasoning engine that spends substantial compute on each query, working through problems with a depth and persistence that is genuinely new.
The distinction matters. Standard Grok 4 is a good daily driver. Heavy Grok 4 is a specialized instrument for the hardest problems you can find. Choosing the right variant for your task is essential; using Heavy for simple queries wastes money and time, while using Standard for genuinely difficult reasoning problems leaves performance on the table.
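The choice between variants can be made programmatically. The sketch below routes prompts to the cheaper standard model unless they look like hard, multi-step reasoning tasks; the model identifiers ("grok-4", "grok-4-heavy") and the keyword heuristic are illustrative assumptions, not xAI's official routing logic.

```python
# Hypothetical router: send only hard-looking prompts to the Heavy variant.
# Model names and the difficulty heuristic are assumptions for illustration.
HARD_MARKERS = ("prove", "derive", "optimal", "np-hard", "counterexample")

def pick_variant(prompt: str, expect_long_reasoning: bool = False) -> str:
    """Return the Grok 4 variant best suited to the prompt."""
    text = prompt.lower()
    looks_hard = expect_long_reasoning or any(m in text for m in HARD_MARKERS)
    return "grok-4-heavy" if looks_hard else "grok-4"
```

In practice a router like this keeps routine queries off the expensive extended-reasoning path while still escalating the problems Heavy was built for.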
Humanity's Last Exam and ARC-AGI
The headline number is 50% on Humanity's Last Exam (HLE), a benchmark consisting of questions contributed by domain experts specifically chosen to be beyond current AI capabilities. These are not standard test questions. They require deep domain expertise, creative problem-solving, and multi-step reasoning that chains across different fields of knowledge. No other model has broken 40% on this benchmark.
Grok 4 also achieves 15.9% on ARC-AGI V2, the Abstraction and Reasoning Corpus benchmark designed to test general intelligence through novel visual pattern recognition. While 15.9% may sound modest, ARC-AGI V2 was designed to be unsolvable by systems that rely on pattern matching rather than genuine abstraction. Every percentage point here represents a meaningful advance in general reasoning capability.
These results are not flukes. We tested Grok 4 Heavy on a range of hard problems across mathematics, physics, philosophy, and computer science. On problems that stump other frontier models, Heavy frequently finds solutions through unexpected reasoning paths. It has a distinctive ability to try multiple approaches, recognize when one is failing, and pivot to alternatives, a quality that feels closer to human problem-solving than the linear reasoning chains of other models.
Built-in Tool Use and X Search
Grok 4 includes native tool use capabilities, allowing it to call external functions, execute code, and search the web as part of its reasoning process. The integration with X (formerly Twitter) search gives it access to real-time information from the platform, which is useful for current events, public sentiment, and trending topics.
The tool use is competent but not exceptional. It correctly identifies when a problem would benefit from computation or information retrieval, and the tool calls are generally well-formed. However, the tool ecosystem is limited compared to what OpenAI and Anthropic offer. You get code execution, web search, and X search, but the extensibility for custom tools is less mature.
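For readers wiring up custom tools, a definition in the widely used OpenAI-style function-calling format looks like the sketch below. Whether xAI's API accepts exactly this schema, and the `x_search` tool name itself, are assumptions for illustration; consult the official API reference before relying on any field.

```python
# Sketch of a tool definition in the common function-calling JSON shape.
# The tool name, fields, and compatibility with xAI's API are assumptions.
def make_search_tool() -> dict:
    """Describe a hypothetical X-search tool the model may choose to call."""
    return {
        "type": "function",
        "function": {
            "name": "x_search",
            "description": "Search recent posts on X for a query.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms"},
                    "max_results": {"type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        },
    }
```

The model sees this schema, decides when to emit a call, and the client executes it and returns the result, which is the standard loop regardless of vendor.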
The X search integration is a double-edged sword. It provides access to real-time public discourse, which is genuinely useful for certain queries. But X is not a reliable source for factual information, and Grok 4 sometimes incorporates unreliable claims from the platform into its responses without sufficient skepticism. Users need to be aware of this limitation.
Coding Edition
xAI released a Coding edition of Grok 4 that is fine-tuned specifically for software development tasks. In our testing, it performed well on algorithmic challenges, generating clean and efficient solutions. It handles system design discussions thoughtfully and produces well-structured code across Python, JavaScript, Go, and Rust.
However, the Coding edition does not quite match the real-world software engineering capabilities of GPT-5.2 or Claude Opus 4.6. Where those models excel at understanding large codebases and performing multi-file refactors, Grok 4 Coding is strongest on contained problems: single-file algorithms, specific function implementations, and well-scoped debugging tasks. The 130K context window is a limitation here, as it cannot hold entire large projects the way 400K or 1M context models can.
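A quick way to feel the 130K-token ceiling is to estimate whether a codebase even fits. The sketch below uses the common chars-per-token rule of thumb (roughly 4 characters per token for code); this is a heuristic assumption, not an exact tokenizer, so real budgeting should use the API's tokenizer or reported usage.

```python
# Rough check of whether source files fit Grok 4's 130K-token window.
# The chars-per-token ratio is a rule-of-thumb assumption, not exact.
GROK4_CONTEXT_TOKENS = 130_000
CHARS_PER_TOKEN = 4  # rough heuristic for source code

def fits_in_context(file_contents: list[str], reserve_for_output: int = 8_000) -> bool:
    """Estimate whether the files plus an output reserve fit the window."""
    est_tokens = sum(len(s) for s in file_contents) // CHARS_PER_TOKEN
    return est_tokens + reserve_for_output <= GROK4_CONTEXT_TOKENS
```

By this estimate a project much past roughly half a megabyte of source has to be chunked or summarized before Grok 4 can see it, which is exactly where the 400K and 1M context models pull ahead on multi-file work.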
Strengths and Weaknesses
Strengths:
- Strongest pure reasoning capabilities, first to 50% on Humanity's Last Exam
- Heavy variant tackles problems that stump all other models
- Impressive ARC-AGI V2 performance suggests genuine abstract reasoning
- Built-in tool use with code execution and web search
- Coding edition is strong on algorithmic and contained problems
- Trained on one of the largest GPU clusters ever assembled
Weaknesses:
- Limited ecosystem compared to OpenAI, Google, and Anthropic
- 130K context window is the smallest among frontier models
- X search integration can introduce unreliable information
- Heavy variant is expensive and slow for routine tasks
- Safety and alignment tooling is less mature than competitors
- API availability and rate limits can be restrictive
- Creative writing and conversational quality trail the leaders
Verdict: 8.5/10
Grok 4 is the strongest pure reasoning model available, full stop. The Humanity's Last Exam and ARC-AGI results are not marketing fluff; they reflect genuinely superior performance on the hardest problems in AI evaluation. The Heavy variant is a remarkable achievement that pushes the boundary of what AI systems can solve.
But reasoning power alone does not make a complete product. The limited ecosystem, smaller context window, and less mature safety tooling mean Grok 4 is not the best choice for everyday use. It is the model you turn to when other models fail, when the problem is genuinely hard and you need every ounce of reasoning capability you can get. For that specific use case, nothing else comes close. For everything else, the more complete offerings from OpenAI, Anthropic, and Google remain more practical choices.