GPT-5.5

OpenAI's first fully retrained base model since GPT-4.5, targeting agentic coding, computer use, and knowledge work at $5/$30 per million tokens.

GPT-5.5 - codenamed "Spud" internally - is OpenAI's first fully retrained base model since GPT-4.5. Announced on April 23, 2026 and rolling out right away to Plus, Pro, Business, and Enterprise subscribers, it positions itself as a workhorse for autonomous, multi-step tasks: agentic coding, computer use, knowledge work, and early scientific research.

TL;DR

  • First complete retraining since GPT-4.5; natively omnimodal (text, images, audio, video in one system)
  • $5/$30 per million input/output tokens - 2x the per-token cost of GPT-5.4, but fewer tokens per task mean a lower net cost for agentic workloads
  • Beats GPT-5.4 across nearly every evaluation; narrowly leads Claude Mythos Preview on Terminal-Bench 2.0 (82.7% vs the field)

Overview

Unlike the GPT-5.x releases that preceded it, GPT-5.5 isn't a fine-tune or variant of an existing checkpoint. OpenAI trained it from scratch on NVIDIA GB200 and GB300 NVL72 rack-scale systems, and the result is a model that handles "messy, multi-part tasks" differently than previous versions - it plans independently, selects and uses tools, checks its own work, and navigates ambiguity without constant human re-direction.

Greg Brockman, OpenAI President, described it as "a new class of intelligence" and "a big step towards more agentic and intuitive computing." On the engineering side, GPT-5.5 matches GPT-5.4's per-token latency in real-world serving while completing identical Codex tasks with significantly fewer tokens. For long agentic runs - where token counts compound - that efficiency matters more than the doubled per-token price.

The model is also natively omnimodal from the base, meaning text, image, audio, and video processing are baked in rather than bolted on after training. This follows OpenAI's reported shift away from stitching modalities together post-hoc. A higher-performance variant, GPT-5.5 Pro, is rolling out simultaneously to Pro, Business, and Enterprise tiers for "harder questions and higher-accuracy work."

Key Specifications

| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-5 |
| Codename | Spud |
| Parameters | Not disclosed |
| Context Window | 1M tokens (400K in Codex; Fast mode: 1.5x speed at 2.5x cost) |
| Input Price | $5.00/M tokens |
| Output Price | $30.00/M tokens |
| GPT-5.5 Pro Input | $30.00/M tokens |
| GPT-5.5 Pro Output | $180.00/M tokens |
| Release Date | April 23, 2026 |
| License | Proprietary |
| Training Hardware | NVIDIA GB200 and GB300 NVL72 |
| API Status | Coming soon (pending safety evaluations at announcement) |

Benchmark Performance

OpenAI published results across six purpose-built agentic benchmarks. No MMLU-Pro or GPQA Diamond scores were released at launch - the company's framing is that standard academic benchmarks don't reflect what GPT-5.5 is optimized for.

| Benchmark | GPT-5.5 | GPT-5.4 | Notes |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | Command-line workflow planning and tool coordination |
| Expert-SWE (internal) | 73.1% | 68.5% | OpenAI's internal coding evaluation |
| SWE-Bench Pro | 58.6% | ~55% (est.) | Real-world GitHub issue resolution, single pass |
| GDPval | 84.9% | Not reported | Knowledge work across 44 occupations (top 9 U.S. GDP industries) |
| OSWorld-Verified | 78.7% | Not reported | Autonomous computer environment operation |
| GeneBench | 25.0% | 19.0% | Multi-stage genetics and quantitative biology analysis |
| BixBench | 80.5% | Not reported | Real-world bioinformatics and data analysis |
| Tau2-bench Telecom | 98.0% | Not reported | Telecom domain agent tasks, no prompt tuning |

On Terminal-Bench 2.0 - the benchmark measuring complex command-line workflows requiring planning, iteration, and tool coordination - GPT-5.5 narrowly beats Anthropic's Claude Mythos Preview and leads the field at 82.7%. The 31% relative improvement on GeneBench (25.0% vs GPT-5.4's 19.0%) is the headline number for scientific research applications: the benchmark involves multi-stage data analysis pipelines in genetics where models must reason about ambiguous or errorful experimental data.
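
That headline GeneBench figure is straightforward to verify from the scores in the table above; the helper below is just illustrative arithmetic, not anything from OpenAI's materials:

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, as a percentage."""
    return (new - old) / old * 100

# GeneBench: GPT-5.5 scores 25.0% vs GPT-5.4's 19.0%
gain = relative_improvement(25.0, 19.0)
print(f"{gain:.1f}%")  # → 31.6%
```

The same calculation puts the Terminal-Bench 2.0 gain at roughly 10% relative (82.7 vs 75.1), which is why GeneBench, not coding, is the improvement OpenAI leads with for research use cases.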

One number worth flagging: GDPval's 84.9% represents the model beating or tying human workers on approximately 85% of benchmarked tasks across occupations in finance, healthcare, law, and engineering. Bank of New York CIO Leigh-Ann Russell noted in OpenAI's press materials that GPT-5.5 delivered "really impressive hallucination resistance" on top of the quality gains - a claim worth watching as independent evaluations arrive.

See the coding benchmarks leaderboard and the SWE-Bench coding agent leaderboard for broader context on where these scores sit in the current landscape.

Key Capabilities

GPT-5.5's four primary target domains - agentic coding, computer use, knowledge work, and early scientific research - aren't arbitrary marketing buckets. Each maps to a specific benchmark category above and reflects where the underlying retraining made the biggest gains relative to GPT-5.4.

Agentic coding is the clearest win. The 82.7% Terminal-Bench 2.0 score and 73.1% Expert-SWE score reflect a model that can sustain long coding sessions: writing, running, debugging, and iterating across multi-file repositories without losing context. At Codex's 400K context window (compared to 1M in the Chat API), the model is constrained relative to GPT-5.4, but the token efficiency gain means most standard engineering tasks fit comfortably.

Computer use at 78.7% OSWorld-Verified puts GPT-5.5 ahead of everything OpenAI has shipped previously in this category. The model can operate real desktop environments - navigating file systems, running GUI applications, and completing workflows across tools - not just in sandboxed conditions. OpenAI demonstrated a math professor using GPT-5.5 and Codex together to build an algebraic geometry app from a single prompt in 11 minutes, which gives a rough intuition for the kind of compound task the model handles natively.

Scientific research is the most speculative domain but shows the largest relative improvement. GeneBench's 25.0% (up from 19.0%) involves models reasoning about multi-stage data analysis pipelines where inputs are potentially ambiguous or contain errors. BixBench at 80.5% covers real-world bioinformatics. Neither benchmark is solved - but the path suggests GPT-5.5 is meaningfully more useful as a research collaborator in life sciences workflows than its predecessors.

"GPT-5.5's capabilities feel like they're setting the foundation for how we're going to do computer work going forward, or how agent computing at scale will work." - Greg Brockman, OpenAI President

Pricing and Availability

GPT-5.5 launched on April 23, 2026 directly into ChatGPT (Plus, Pro, Business, Enterprise) and Codex - no waitlist. The API is a separate story: OpenAI explicitly stated that "API deployments require different safeguards" and that they're "working closely with partners and customers on the safety and security requirements for serving it at scale." No API launch date was given at announcement.

The pricing structure doubles GPT-5.4's rates:

| Tier | Input | Cached Input | Output |
|---|---|---|---|
| GPT-5.5 | $5.00/M | $0.50/M | $30.00/M |
| GPT-5.5 Pro | $30.00/M | Not disclosed | $180.00/M |
| GPT-5.4 (reference) | $2.50/M | $0.25/M | $15.00/M |

The per-token price increase is steep, but OpenAI's argument is net-cost parity or better for agentic workflows: GPT-5.5 uses significantly fewer tokens to complete the same Codex tasks, so total cost per completed task stays comparable or improves. For high-volume inference with short, discrete prompts - summarization, classification, retrieval - the per-token cost increase is harder to offset and GPT-5.4 may be the smarter choice until the efficiency gains are independently quantified.
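
The net-cost argument reduces to simple arithmetic. In this sketch the per-million-token prices come from the table above, but the token counts are purely hypothetical - OpenAI has not published per-task token figures - chosen to show how a ~50% token reduction exactly offsets a 2x price increase:

```python
def task_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost in dollars, with prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# GPT-5.4 at $2.50/M in, $15/M out; suppose a Codex task uses 400K in / 60K out
cost_54 = task_cost(400_000, 60_000, 2.50, 15.00)
# GPT-5.5 at $5/M in, $30/M out; suppose the same task needs half the tokens
cost_55 = task_cost(200_000, 30_000, 5.00, 30.00)

print(f"GPT-5.4: ${cost_54:.2f}  GPT-5.5: ${cost_55:.2f}")  # → $1.90 each
```

If the real-world token reduction is less than half, GPT-5.5 costs more per task; if it's more, it costs less - which is why the claim hinges on efficiency numbers that haven't been independently measured yet.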

Codex users also get a Fast mode option: 1.5x faster token generation at 2.5x the cost, useful for interactive coding sessions where latency matters more than cost. The AI speed and latency leaderboard will track how Fast mode compares to dedicated low-latency providers as third-party evaluations build up.
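
Whether Fast mode's 1.5x-speed-for-2.5x-cost trade is worth it depends on the session. The multipliers below are from the article; the baseline throughput and token count are hypothetical, included only to make the trade-off concrete:

```python
# Fast mode trade-off (multipliers per the article; other numbers hypothetical)
SPEEDUP = 1.5
COST_MULTIPLIER = 2.5

def fast_mode_delta(tokens: int, base_tok_per_s: float, base_cost_usd: float):
    """Return (seconds saved, extra dollars spent) for a Fast-mode run."""
    base_time = tokens / base_tok_per_s
    fast_time = base_time / SPEEDUP
    return base_time - fast_time, base_cost_usd * (COST_MULTIPLIER - 1)

# e.g. 30K output tokens at a hypothetical 100 tok/s and $0.90 base cost
saved_s, extra_usd = fast_mode_delta(30_000, 100.0, 0.90)
print(f"saves {saved_s:.0f}s for an extra ${extra_usd:.2f}")
```

For a short interactive loop the seconds saved per turn are what you feel, so Fast mode can be rational even though every token costs 2.5x; for long unattended runs the extra spend compounds with nothing watching the latency.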

For enterprise customers already on Business or Enterprise plans, GPT-5.5 is available right away with no additional setup. The overall LLM rankings will reflect GPT-5.5 scores as Chatbot Arena and independent evaluators complete their runs.

Strengths and Weaknesses

Strengths

  • First genuine base retrain since GPT-4.5 - not a fine-tune, a ground-up model
  • Leads the field on Terminal-Bench 2.0 (82.7%), ahead of Claude Mythos Preview
  • 31% relative GeneBench improvement over GPT-5.4 opens new scientific research applications
  • Token efficiency gain offsets per-token price increase for long agentic runs
  • Natively omnimodal from the base - no post-hoc modality stitching
  • Runs on NVIDIA GB200/GB300 infrastructure with TensorRT-LLM and vLLM optimization
  • Immediate rollout across Plus, Pro, Business, Enterprise - no waitlist

Weaknesses

  • API access delayed pending safety review - enterprises that rely on direct API integration can't use it yet
  • Per-token cost is 2x GPT-5.4 - short-prompt workloads don't benefit from token efficiency gains
  • 400K context cap in Codex (versus 1M in GPT-5.4's Codex) is a step back for very long sessions
  • No MMLU-Pro, GPQA Diamond, or Chatbot Arena scores at launch - independent academic benchmarking pending
  • Parameters undisclosed - architecture transparency is minimal

FAQ

Is GPT-5.5 available via API right now?

No. At launch on April 23, 2026, GPT-5.5 is only available through ChatGPT and Codex. OpenAI said API access is coming "very soon" pending safety evaluation, but gave no firm date.

How does GPT-5.5 compare to GPT-5.4 on cost?

Per token, GPT-5.5 costs 2x more ($5 vs $2.50 input, $30 vs $15 output). For agentic coding tasks in Codex, OpenAI says GPT-5.5 uses significantly fewer tokens to complete the same work, making net cost comparable or lower depending on the task.

What makes GPT-5.5 different from previous GPT-5.x releases?

It's the first complete retraining since GPT-4.5. Prior GPT-5.x releases (5.1 through 5.4) were fine-tunes or variants. GPT-5.5 is a new base model trained on NVIDIA GB200/GB300 hardware, natively omnimodal.

What is GPT-5.5 Pro?

A higher-accuracy variant priced at $30/M input and $180/M output tokens. Available to Pro, Business, and Enterprise ChatGPT subscribers at launch. Intended for "harder questions" requiring maximum accuracy.

Does GPT-5.5 have a 1M context window?

In the Chat API, yes - 1M tokens. In Codex specifically, the context window is 400K tokens, which is lower than GPT-5.4's Codex context window. Fast mode in Codex generates tokens 1.5x faster at 2.5x the cost.

Why is it codenamed Spud?

OpenAI's internal codename was Spud. VentureBeat's headline played on the potato reference, noting "it's no potato" given the benchmark results.


✓ Last verified April 23, 2026

About the author: James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.