Grok 4.5 Enters Beta with Unverified Opus Claims

On June 28, Elon Musk posted to X that Grok 4.5 is in private beta at SpaceX and Tesla, along with a claim that early internal evaluations put it close to - or potentially above - Anthropic's Claude Opus.

"Grok 4.5, based on our 1.5T V9 foundation model, with Cursor data added in supplemental training, is now in private beta at SpaceX & Tesla. Early evals show performance close to, perhaps exceeding Opus. RL is continuing to significantly improve the model, and the Grok Build harness gets better every day...Completely trained from scratch new models will be released by @SpaceX every month this year." - Elon Musk, X, June 28, 2026

TL;DR

What they said: Grok 4.5 is a 1.5T-parameter model trained partly on Cursor coding data, with internal evals showing it rivals Claude Opus
What we found: The evals cited are internal only, conducted at organizations Musk controls; no independent benchmarks exist yet
The gap: On Artificial Analysis's Intelligence Index, Claude Opus 4.8 scores 56 and Claude Fable 5 tops the board at 60; the last publicly benchmarked Grok (4.3) scored 38

The Claim

Musk's announcement packs three distinct assertions that deserve separate treatment.

The scale claim is straightforward: Grok 4.5 is built on xAI's V9 foundation model with 1.5 trillion parameters, up from the 1-trillion-parameter base used for Grok 4.4. Training completed May 26, 2026, according to multiple reports. A 50 percent parameter increase in roughly one month is a significant resource commitment.

The capability claim is where scrutiny is warranted. "Early evals show performance close to, perhaps exceeding Opus" is hedged language - Musk isn't claiming Grok 4.5 beats Opus, only that it might. And these are internal evaluations from a private beta, not submissions to any third-party evaluation suite.

The roadmap claim is the most ambitious: SpaceX will release entirely new models, trained from scratch, on a monthly cadence through year-end 2026. That implies completing a frontier-scale training run roughly every four weeks on a single compute infrastructure, something no other AI lab has done at this scale. It is also the one assertion that is directly verifiable: models either ship or they don't.

The Cursor Angle

The Cursor connection is the truly new element in this announcement. SpaceX agreed to acquire Cursor for $60 billion in June, and that transaction gave xAI access to Cursor's large trove of real developer coding sessions - debugging patterns, architectural decisions, iterative refactoring across complex codebases. Including that data as supplemental training targets exactly the agentic coding use cases where Grok has historically trailed Claude.

xAI has also updated its Grok Build harness, the scaffolding that coordinates multi-step agent tasks. The parameter increase alone doesn't fully explain the "perhaps exceeding Opus" language - the harness improvements may be responsible for meaningful gains in benchmark scenarios that test multi-step reasoning.

[Image: Grok chatbot interface showing a multi-turn conversation with code generation]
The Grok interface, which Grok 4.5 will power once it exits private beta. Source: wikimedia.org

The Evidence

What the Leaderboards Actually Show

The most recent Grok with public benchmark data is Grok 4.20. For the generation before it, Grok 4.3 scored 38 on the Artificial Analysis Intelligence Index, a composite drawn from multiple independent third-party evaluations. Claude Opus 4.8 currently sits at 56 on that same index. Claude Fable 5 leads the board at 60, ahead of GPT-5.5 at 55 and Gemini 3.1 Pro Preview at 46.

Grok 4.5 may have closed that gap. A 50 percent parameter increase combined with targeted Cursor training data is exactly the kind of move that can produce meaningful benchmark jumps. But going from 38 to 56 in one generation would be the largest single-step improvement xAI has logged - and it would require independent verification to confirm.
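The size of that gap can be made concrete with a little arithmetic. The sketch below uses only the index scores and parameter counts quoted in this article; the names `SCORES` and `gap` are illustrative, nothing is fetched from a live leaderboard, and it assumes nothing about how the index itself is computed.

```python
# Scores quoted in this article from the Artificial Analysis Intelligence Index.
# The dictionary and function names are illustrative, not part of any real API.
SCORES = {
    "Grok 4.3": 38,
    "Claude Opus 4.8": 56,
    "Claude Fable 5": 60,
}


def gap(from_model: str, to_model: str) -> tuple[int, float]:
    """Return (absolute point gap, relative gain needed in percent)."""
    start, target = SCORES[from_model], SCORES[to_model]
    points = target - start
    return points, 100 * points / start


points, pct = gap("Grok 4.3", "Claude Opus 4.8")
print(f"Gap to Opus 4.8: {points} points ({pct:.1f}% above Grok 4.3's score)")

# Parameter growth claimed for the base model: 1T -> 1.5T.
param_growth_pct = 100 * (1.5 - 1.0) / 1.0
print(f"Parameter increase: {param_growth_pct:.0f}%")
```

On those figures, matching Opus 4.8 means an 18-point gain, roughly 47 percent above Grok 4.3's score, set against a 50 percent increase in parameters. Index points need not scale linearly with parameter count or capability, so this is a rough yardstick rather than a prediction.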

Who Ran the Evaluations

Private beta deployment is currently restricted to SpaceX and Tesla. Both organizations are under Musk's direct control. The "early evals" he references haven't been shared publicly, haven't been submitted to Chatbot Arena or the Artificial Analysis suite, and haven't been reproduced by third-party researchers. This isn't a technical objection to the model's capability - it's a structural one. Internal evaluations at the developer's own companies don't meet the standard for a verified performance claim.

The "Supplemental Training" Caveat

Musk described Cursor data as "supplemental training" without quantifying its scope. The question matters: did Cursor data constitute 2 percent of training or 25 percent? Was it used for pre-training, supervised fine-tuning, or reinforcement learning from human feedback? Each of those choices produces a different model with different strengths and failure modes. Without that disclosure, the Cursor data claim tells us roughly nothing about which tasks improved and by how much.

[Image: Grok DeepSearch interface example demonstrating multi-step reasoning tasks]
Grok's DeepSearch capability - one of the task domains where the Cursor-trained Grok 4.5 is likely to be tested first. Source: wikimedia.org

Claim vs Reality

Claim	What the Record Shows
"Performance close to, perhaps exceeding Opus"	No external benchmarks exist; Grok 4.3 scored 38 on AA Intelligence Index vs Opus 4.8's 56
"1.5T V9 foundation model"	Parameter count confirmed across multiple reports; architecture details not disclosed
"Cursor data in supplemental training"	Scope undisclosed; "supplemental" doesn't specify tasks improved or training stage used
"Early evals show performance..."	Evals conducted internally at SpaceX and Tesla - both Musk-controlled organizations
"RL continuing to significantly improve"	No quantified baseline, no improvement metric provided
"New models released monthly by SpaceX"	Forward-looking commitment; verifiable only as models ship

What They Left Out

Grok 4.5 has no announced pricing, no disclosed context window, and no public API. The private beta is limited to two organizations with direct financial interest in xAI's success. The SpaceX-xAI merger completed in February 2026 gave xAI access to SpaceX's Colossus compute cluster, which most likely enabled the parameter jump from 1T to 1.5T in a single generation - a resource advantage unavailable to labs without equivalent infrastructure.

Musk also didn't specify which Claude Opus version Grok 4.5 is being compared against. Anthropic currently has Claude Fable 5 and Claude Opus 4.8 in active deployment. A claim to rival "Opus" needs a specific model number to be meaningful.

None of this means Grok 4.5 isn't a significant improvement. A 50 percent parameter increase on a production architecture, combined with targeted real-world coding data from Cursor, is a plausible strategy for closing the gap with Claude. The direction is credible. The magnitude isn't yet verified.

Public benchmark submissions from private betas typically take two to four weeks to appear on independent evaluation suites. At that point, the claim either survives contact with external data or it doesn't.

Sources: