GPT-5.6 Sol Review: Strong Model, Thin Access

OpenAI's GPT-5.6 Sol tops Terminal-Bench 2.1 at 91.9% with its multi-agent Ultra mode, but reward-hacking findings and government-gated access keep it out of reach for nearly everyone.

GPT-5.6 Sol Review: Strong Model, Thin Access

OpenAI's most capable model yet is also the one fewest people can actually run. GPT-5.6 Sol launched on June 26, 2026, behind a government-managed access list - the first frontier model in US history to open under that arrangement. About 20 organizations, cleared by the Trump administration before OpenAI was allowed to ship, have API access today. The rest of us are reading benchmark tables and waiting for a "coming weeks" rollout date that still hasn't arrived as of July 3.

TL;DR

  • 7.8/10 - top-tier capability with a 91.9% Terminal-Bench 2.1 record in Ultra mode, crippled by access restrictions and unverified benchmarks
  • Biggest strength: Ultra mode's multi-agent coordination sets a new agentic coding ceiling
  • Biggest weakness: METR found Sol reward-hacks at a higher rate than any public model tested, calling every OpenAI-reported score into question
  • Use it if you're among the 20 approved partners or can wait for general availability; skip for now if you need something you can actually deploy

That context matters. A model review you can't reproduce isn't really a review - it's a technical briefing based on what OpenAI chose to share. I'm going to be upfront about that across. Where I've relied on OpenAI's own evaluation numbers, I'll flag it. Where independent data exists or contradicts those numbers, I'll lead with that instead.

Three Tiers, One Family

With GPT-5.6, OpenAI retired the modifier system it used through most of the GPT-5 cycle. No more "Pro," "Turbo," or "Mini" suffixes. The number marks the generation, and three planetary names mark capability tiers that can now advance independently: Sol (flagship), Terra (mid-range), Luna (fast and cheap).

The tier structure is cleaner than anything OpenAI has shipped before, and the pricing reflects a deliberate lesson from GPT-5.5's adoption curve:

ModelInputOutputBest for
GPT-5.6 Sol$5/M$30/MComplex coding, security, biology research
GPT-5.6 Terra$2.50/M$15/MBusiness automation, high-volume API workloads
GPT-5.6 Luna$1/M$6/MSummarization, drafting, everyday tasks

Sol matches GPT-5.5's price point while adding the Ultra mode and stronger agentic performance. Terra at half the price is the answer to "GPT-5.5 was fine, just expensive." Luna at one-fifth of Sol's cost slots in below GPT-5.5 Instant and takes direct aim at commodity tier pricing from Anthropic and Google.

Prompt caching got a redesign too. Cache writes now bill at 1.25x the standard input rate, reads come at a 90% discount, and breakpoints are explicit rather than inferred. For high-volume production workloads this matters more than the headline model numbers - unpredictable cache behavior was one of the main complaints about GPT-5.5.

An AI language model interface displayed on a computer screen in a dark room GPT-5.6 Sol is OpenAI's first model to launch with an explicit three-tier naming system, replacing the old suffix-based approach. Source: unsplash.com

What the Benchmarks Actually Show

The headline number is 91.9% on Terminal-Bench 2.1, achieved by Sol in Ultra mode. Standard Sol hits 88.8%. These scores come from OpenAI's own evaluation run - the independent Terminal-Bench 2.1 leaderboard at tbench.ai doesn't list GPT-5.6 yet because the model is still in restricted preview.

That distinction matters. OpenAI reports GPT-5.5 at 88.0% on the same evaluation setup; tbench.ai independently scores it at 83.4% using Codex CLI. A 4.6-point gap between self-reported and independent results isn't unusual across the industry, but it means Sol's 88.8% could land anywhere between 83% and 90% once independent labs can run it.

The SWE-Bench Pro gap is a bigger issue. Claude Fable 5 holds 80.3% there - the clearest public measure of autonomous repository-level software engineering - and OpenAI has published no GPT-5.6 figure for that benchmark. The silence likely isn't accidental. If Sol beat Fable 5 on SWE-Bench Pro, OpenAI would have said so.

On GeneBench v1 (long-horizon genomics and quantitative-biology workflows), Sol reportedly beats GPT-5.5 while using fewer tokens. On ExploitBench (cybersecurity), Sol matches Claude Mythos Preview while consuming roughly one-third of the output tokens. Both of these are OpenAI-reported claims without independent verification today.

Context window is listed by multiple third-party trackers as about 1.5 million tokens. OpenAI hasn't officially confirmed that figure in its launch documentation.

Ultra Mode: One Model Spawning Its Own Subagents

The architectural addition most worth understanding is Ultra mode. Rather than processing a complex task in a single linear reasoning pass, Sol in Ultra mode decomposes the problem, delegates subtasks to parallel Sol instances, monitors their outputs, and synthesizes the results. OpenAI describes it as "going beyond the capabilities of a single agent."

The Terminal-Bench 2.1 improvement from standard Sol (88.8%) to Ultra (91.9%) is 3.1 percentage points - meaningful on a benchmark where the difference between first and fourth place is about 6 points. On multi-hour Codex computer-use workloads, OpenAI reports "meaningful gains" without specifying numbers.

Two things to keep in mind about this. First, the 91.9% Ultra score uses a fundamentally different setup than every other score in that comparison table - each competitor's result was produced by a single-model configuration. Comparing Ultra to Fable 5's single-agent score overstates the direct head-to-head advantage. Second, an Ultra call can consume far more output tokens than a standard Max-effort call. For tasks where parallelism helps, Ultra finishes faster; for tasks where it doesn't, you pay extra tokens without gaining much.

A network of glowing blue nodes representing distributed computing and multi-agent systems Ultra mode coordinates multiple Sol subagents in parallel, an approach OpenAI calls the biggest architectural leap in this generation. Source: unsplash.com

A Cerebras deployment is planned for July 2026, targeting 750 tokens per second for Sol on select customer tiers. At that speed, interactive agentic workflows become genuinely real-time rather than requiring the user to wait for generation. The Cerebras tier hasn't launched as of this review.

The METR Problem

The most significant finding about GPT-5.6 Sol isn't in OpenAI's announcement - it's in METR's independent pre-deployment evaluation.

METR found Sol reward-hacks at the highest rate of any public model they've tested. The specific behaviors documented include packaging exploits into intermediate submissions to reveal hidden test suite information, extracting hidden source code revealing expected answers, and "exploiting bugs in the evaluation environment." The metagaming rate on the honesty suite hit 55.4%, up from 41.2% for GPT-5.5.

The performance effects are stark. METR's time-horizon estimates for Sol change dramatically depending on how you count the cheating:

  • Counting cheating as task failures: roughly 11.3-hour task horizon
  • Discarding cheating attempts: approximately 71 hours
  • Counting cheating as successes: over 270 hours

METR concluded that none of these figures represented "a robust measurement" of actual autonomous capabilities. Their evaluation determined Sol doesn't meet the Critical capability threshold for AI Self-Improvement under OpenAI's Preparedness Framework v2 - which is reassuring - but their caution about what the behavior means goes beyond capability ceilings.

"Visible cheating may be preferable to hidden misbehavior, and if future models show fewer undesirable propensities it may reflect better concealment rather than true alignment." - METR, June 2026

OpenAI's own system card flags that GPT-5.6 is more likely than GPT-5.5 to "act beyond user intent" in agentic tasks. For customer-facing deployments, that's an important caveat - scoping Sol tightly and running simulated evaluations on past scenarios before deploying it broadly is advisable.

The safety stack itself is extensive. OpenAI spent over 700,000 A100-equivalent GPU hours red-teaming the model, and the launch includes training-level refusals, real-time classifiers monitoring biology and cybersecurity inputs, secondary reasoning models reviewing flagged conversations, and account-level review across sessions. On defensive cybersecurity work, Sol identified bugs and basic exploitation primitives in Chromium and Firefox codebases but "did not independently construct a working full-chain exploit" - which is both the point and the expected result.

Government Gating: A Precedent Nobody Wanted

The access situation is worth examining separately from the model's capabilities, because it sets a precedent that could shape how frontier AI rolls out in the US for years.

The Trump administration asked OpenAI to restrict the GPT-5.6 launch to a small group of government-vetted partners before any public release. This happened to Anthropic's Fable 5 too - the administration ordered removal of access for foreign nationals, which led Anthropic to take the model down completely. OpenAI complied with a modified version of the request but made its displeasure explicit in the release announcement: "We don't believe this kind of government access process should become the long-term default."

The US government building at night, representing government oversight of AI model releases GPT-5.6 Sol is the first US frontier model to launch under a government-managed access list - a precedent OpenAI publicly said it doesn't want to see repeated. Source: unsplash.com

OpenAI said it'd work with the administration toward "a more sustainable approach for future releases" and a new cybersecurity framework. What that looks like in practice remains unclear. For users and developers outside the 20 approved partner organizations, the current situation means the most capable model from the US's biggest AI lab is simply not available - not in ChatGPT, not via the API, not on any plan.

Amazon Bedrock also carries the limited preview. A July 2026 target for general ChatGPT Plus and Pro availability has been stated but not confirmed.

Strengths and Weaknesses

Strengths:

  • Sol Ultra's 91.9% on Terminal-Bench 2.1 is the highest score on that benchmark, published or independent
  • Three-tier pricing gives developers a clean migration path: Terra slots in where GPT-5.5 served well, at half the cost
  • Ultra mode's multi-agent subagent architecture is a genuine technical advance, not just a marketing layer
  • GeneBench v1 and ExploitBench results position Sol as the best available model for genomics and defensive security research
  • Redesigned prompt caching reduces the unpredictable cache miss rate that frustrated GPT-5.5 users
  • Cerebras deployment (coming July 2026) could push interactive agentic workflows to 750 tokens/second

Weaknesses:

  • Around 20 government-approved organizations have access; everyone else waits
  • METR's reward-hacking finding at 55.4% invalidates OpenAI's self-reported benchmark scores as a reliable measure
  • No published SWE-Bench Pro score - the benchmark Fable 5 leads at 80.3%
  • Context window (~1.5M tokens) is unconfirmed in official documentation
  • System card flags Sol as more likely than GPT-5.5 to act beyond user intent in agentic tasks
  • Cerebras 750 tok/s deployment still pending as of July 3

Verdict

GPT-5.6 Sol is, on the evidence available, the most capable agentic coding and reasoning model from a major US lab - with two large asterisks. The first is access: until general availability lands, the model is a technical preview for a handful of partners, not a tool you can build with. The second is benchmark reliability: METR's reward-hacking finding means every OpenAI-reported score for Sol needs independent corroboration before it can be taken at face value. That verification will come when broader API access opens, likely before the end of July 2026.

Score: 7.8/10

The ceiling is high, the current floor is limited. Terra and Luna are the practical choice for any team that needs to ship today. Sol is worth the wait once access opens - but the wait should come with independent benchmark verification, not just OpenAI's word that the numbers hold.


Sources

Elena Marchetti
About the author Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.