GPT-5.4

OpenAI's most capable frontier model combines native computer use, a 1M-token context window, and three variants, with standard API pricing of $2.50/$15 per million tokens.

Overview

OpenAI released GPT-5.4 on March 5, 2026 - just two days after GPT-5.3 Instant rolled out and barely a week after Codex repo leaks confirmed its existence. The model ships in three variants (standard, Thinking, and Pro) and introduces two capabilities new to the mainline GPT-5 family: native computer use and a 1 million token context window.

The computer use numbers are the headline. GPT-5.4 scores 75.0% on OSWorld-Verified, surpassing the human baseline of 72.4% and nearly doubling GPT-5.2's 47.3%. On Terminal-Bench 2.0 it hits 75.1%, the best result among general-purpose models, though the specialized GPT-5.3 Codex still scores higher (77.3%) in its dedicated mode. GDPval reaches 83.0%, up from GPT-5.2's 70.9%.

The competitive picture is tight. Claude Opus 4.6 still leads on SWE-bench Verified (80.8% vs 77.2%) and has stronger long-context retrieval. Gemini 3.1 Pro holds the edge on pure reasoning benchmarks like GPQA Diamond (94.3% vs 92.8%) and ARC-AGI-2 (77.1% vs 73.3%). GPT-5.4's advantage is breadth - it's competitive across all categories while adding computer use that neither rival matches at this level. See our overall LLM rankings for the full picture.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | OpenAI |
| Model Family | GPT-5 |
| Architecture | GPT-5 Transformer |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens input |
| Input Price | $2.50/M tokens |
| Output Price | $15.00/M tokens |
| Pro Pricing (Input) | $30.00/M tokens |
| Pro Pricing (Output) | $180.00/M tokens |
| Release Date | March 5, 2026 |
| License | Proprietary |
| Input Modalities | Text, images |
| Output Modality | Text |
| Computer Use | Native (code mode + screenshot mode) |
| Compaction | Supported (trajectory pruning for long agent runs) |
| Model ID (API) | gpt-5.4 |

Variants

GPT-5.4 ships in three configurations:

| Variant | Target Use | Availability | Key Feature |
| --- | --- | --- | --- |
| GPT-5.4 | General-purpose | API, Codex | Base model with computer use |
| GPT-5.4 Thinking | Complex reasoning | Plus, Team, Pro, API | Extended chain-of-thought with visible reasoning traces |
| GPT-5.4 Pro | Hardest problems | Pro, Enterprise, API ($30/$180) | Parallel reasoning threads, "extreme" thinking mode |

GPT-5.4 Thinking replaces GPT-5.2 Thinking, which retires in 90 days. GPT-5.4 Pro uses parallel processing - running multiple reasoning threads simultaneously before converging - and includes an extreme thinking mode that allocates notably more compute to difficult problems.
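OpenAI has not documented how Pro's parallel threads converge, but the general pattern - sample several candidates concurrently, then pick one by a scoring rule - can be sketched. In this illustration, `call_model` and `score` are placeholders standing in for a real API call and a real verifier, not OpenAI functions:

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str, seed: int) -> str:
    """Placeholder for one reasoning thread; a real client would hit the API."""
    return f"answer-{seed} to {prompt!r}"

def score(candidate: str) -> float:
    """Placeholder verifier; Pro's actual convergence rule is not public."""
    return float(len(candidate))

def best_of_n(prompt: str, n: int = 4) -> str:
    # Run n reasoning threads simultaneously, then converge on the top-scored one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: call_model(prompt, s), range(n)))
    return max(candidates, key=score)

print(best_of_n("What is 2 + 2?"))
```

The trade-off this pattern buys is latency for quality: all threads run in parallel, so wall-clock time stays close to a single call while compute scales with `n`.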

Benchmark Performance

| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- |
| OSWorld-Verified (computer use) | 75.0% | 72.7% | - | 47.3% |
| GDPval (knowledge work) | 83.0% | - | - | 70.9% |
| Terminal-Bench 2.0 (agentic) | 75.1% | 65.4% | 68.5% | 54.0% |
| GPQA Diamond (science) | 92.8% | 91.3% | 94.3% | 93.2% |
| ARC-AGI-2 (abstract reasoning) | 73.3% | 68.8% | 77.1% | 54.2% |
| SWE-bench Verified (coding) | 77.2% | 80.8% | 80.6% | 80.0% |
| MMMU Pro (visual reasoning) | 81.2% | 77.3% | 81.0% | 80.4% |
| WebArena-Verified (browser) | 67.3% | - | - | 65.4% |
| Spreadsheet modeling | 87.5% | - | - | 68.4% |
| Error rate vs GPT-5.2 (claims) | -33% | - | - | baseline |

GPT-5.4 dominates on computer use and enterprise productivity benchmarks. The 27.7-point jump from GPT-5.2 on OSWorld-Verified represents a category shift - this model consistently beats human performance on desktop navigation tasks. GDPval's 12.1-point gain signals meaningful improvement on knowledge work across 44 occupational categories.

On pure coding, Claude Opus 4.6 retains a 3.6-point lead on SWE-bench Verified. On science reasoning, Gemini 3.1 Pro leads with 94.3% GPQA Diamond. GPT-5.4's strength is that it has no catastrophic weakness - it's competitive on every benchmark while leading the pack on agentic desktop tasks.

For detailed comparisons, see our coding benchmarks leaderboard and agentic AI benchmarks leaderboard.

Key Capabilities

Computer Use. GPT-5.4 is the first mainline OpenAI model with built-in computer use. It supports two interaction modes: code mode (writing Python with Playwright to click, type, navigate) and screenshot mode (issuing raw mouse and keyboard commands from visual input). A build-run-verify-fix loop lets it complete, confirm, and correct tasks autonomously.
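The build-run-verify-fix loop can be sketched generically. Here `propose`, `run`, and `verify` are illustrative placeholders - in a real agent they would ask the model for a script, execute it (e.g. Playwright in code mode), and check the goal state - not OpenAI APIs:

```python
def propose(task, feedback=None):
    # Placeholder: a real agent would ask the model for an action or script here,
    # passing the previous failure back in so it can correct itself.
    return f"script for {task}" if feedback is None else f"fixed script for {task}"

def run(script):
    # Placeholder: a real agent would execute the script and capture its output.
    return "error" if script.startswith("script") else "ok"

def verify(output):
    # Placeholder check that the task's goal state was actually reached.
    return output == "ok"

def build_run_verify_fix(task, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        script = propose(task, feedback)   # build
        output = run(script)               # run
        if verify(output):                 # verify
            return output
        feedback = output                  # fix: feed the failure back in
    raise RuntimeError("task not completed within attempt budget")

print(build_run_verify_fix("rename files"))  # succeeds on the second attempt
```

The point of the structure is that verification is separate from execution: the loop only exits when an independent check passes, not when the model believes it is done.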

1M Token Context. The context window more than doubles GPT-5.3's 400K limit. In practical terms, this is enough for an entire medium-sized codebase, a year of email, or a large document corpus in a single request. Combined with compaction support, long-running agent trajectories stay viable without consuming the full window.
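A quick way to reason about what fits is token budgeting. The sketch below uses the common ~4-characters-per-token heuristic - an assumption, not OpenAI's tokenizer, so real counts will differ:

```python
CONTEXT_WINDOW = 1_000_000  # GPT-5.4's input limit in tokens
CHARS_PER_TOKEN = 4         # rough heuristic; exact counts need the model's tokenizer

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(files: list[str], reserve: int = 50_000) -> bool:
    # Reserve headroom for the prompt itself and the model's working context.
    total = sum(estimate_tokens(f) for f in files)
    return total <= CONTEXT_WINDOW - reserve

corpus = ["x" * 2_000_000, "y" * 1_500_000]  # ~3.5M chars, roughly 875K tokens
print(fits_in_context(corpus))  # True: fits with headroom to spare
```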

Compaction. The model natively supports trajectory pruning - summarizing and discarding intermediate history while preserving key context during multi-step workflows. This matters for agent loops that would otherwise exhaust the context window before completing.
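OpenAI hasn't published how its compaction decides what to prune, but the general shape of trajectory compaction is straightforward: keep the system message and the most recent turns verbatim, and collapse the middle into a summary. A minimal sketch, where `summarize` is a placeholder for a real model-generated summary:

```python
def summarize(messages: list[dict]) -> dict:
    # Placeholder: a real implementation would ask the model to summarize.
    topics = "; ".join(m["content"][:20] for m in messages)
    return {"role": "system", "content": f"[compacted {len(messages)} turns: {topics}]"}

def compact(trajectory: list[dict], keep_recent: int = 4) -> list[dict]:
    """Keep the first (system) message and the last `keep_recent` turns;
    collapse everything in between into one summary message."""
    if len(trajectory) <= keep_recent + 1:
        return trajectory
    head = trajectory[:1]
    middle = trajectory[1:-keep_recent]
    tail = trajectory[-keep_recent:]
    return head + [summarize(middle)] + tail

turns = [{"role": "system", "content": "agent setup"}] + [
    {"role": "assistant", "content": f"step {i}"} for i in range(10)
]
print(len(compact(turns)))  # 1 system + 1 summary + 4 recent = 6 messages
```

The open questions flagged in the Weaknesses section - what gets pruned, whether the summary is auditable - live entirely inside the `summarize` step.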

Office Integrations. Native Excel and Google Sheets plugins let GPT-5.4 read cell ranges, perform multi-step analysis, and write formulas. The internal spreadsheet modeling benchmark shows 87.5% accuracy on tasks approximating junior investment banking analyst work.

Efficiency. OpenAI reports 33% fewer individual claim errors, 18% fewer responses containing any errors, and 47% fewer tokens consumed on tool-heavy workloads compared to GPT-5.2.

Pricing and Availability

| Tier | Input / Price | Output / Access |
| --- | --- | --- |
| GPT-5.4 API | $2.50/M tokens | $15.00/M tokens |
| GPT-5.4 Pro API | $30.00/M tokens | $180.00/M tokens |
| ChatGPT Plus | $20/month | GPT-5.4 Thinking |
| ChatGPT Team | $30/user/month | GPT-5.4 Thinking |
| ChatGPT Pro | $200/month | GPT-5.4 Thinking + Pro |
| Enterprise | Custom pricing | GPT-5.4 Thinking + Pro |

At $2.50/$15.00 per million tokens, GPT-5.4 is cheaper than Claude Opus 4.6 ($5.00/$25.00) and slightly more expensive than Gemini 3.1 Pro ($2.00/$12.00). The Pro variant at $30/$180 is the most expensive per-token option among frontier models, targeting researchers and enterprises with the hardest problems.
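The per-request arithmetic is simple: tokens divided by a million, times the rate. Using the per-million rates quoted above:

```python
# Per-million-token (input, output) rates quoted in this article.
RATES = {
    "gpt-5.4":         (2.50, 15.00),
    "gpt-5.4-pro":     (30.00, 180.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3.1-pro":  (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = RATES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# A 100K-token input with a 5K-token answer:
for model in RATES:
    print(f"{model}: ${request_cost(model, 100_000, 5_000):.4f}")
```

At that request size, GPT-5.4 comes to about $0.33 while Pro comes to $3.90 - the 12x multiplier shows up directly.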

The model is available through the OpenAI API, Codex (web app, CLI, and IDE extension), and ChatGPT subscriptions. It's also available through Microsoft Foundry on Azure. Free-tier ChatGPT users don't get access to GPT-5.4 Thinking or Pro.

For cost comparisons across the field, see our cost efficiency leaderboard.

Strengths

  • Best-in-class computer use. 75.0% OSWorld-Verified beats human baseline (72.4%) and nearly doubles GPT-5.2's score - the largest single-generation jump on this benchmark
  • Broad competitive coverage. Top 3 on virtually every major benchmark without catastrophic weaknesses in any category
  • 1M context window. Matches Claude Opus 4.6 and Gemini 3.1 Pro, more than doubling GPT-5.3's 400K limit
  • Strong enterprise productivity. 83.0% GDPval and 87.5% spreadsheet modeling position it as the strongest model for business workflows
  • Aggressive pricing. $2.50/$15 undercuts Claude Opus 4.6 by 2x on input and nearly 2x on output
  • Three variants. Standard, Thinking, and Pro give developers granular control over compute allocation
  • Compaction support. Native trajectory pruning keeps long agentic workflows viable without manual context management

Weaknesses

  • Coding gap. SWE-bench Verified at 77.2% trails both Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%) - a meaningful gap for coding-heavy workflows
  • Science reasoning trails. GPQA Diamond at 92.8% sits behind Gemini's 94.3% - Gemini remains the better choice for graduate-level science
  • Abstract reasoning. ARC-AGI-2 at 73.3% trails Gemini's 77.1%, though it stays ahead of Claude's 68.8%
  • Pro pricing is extreme. $30/$180 per million tokens makes GPT-5.4 Pro 12x more expensive than the base model and 15x more than Gemini on output
  • Compaction is opaque. No documentation on what gets pruned, whether compacted context is auditable, or how it affects accuracy
  • No agent teams. Claude Opus 4.6's multi-agent coordination through Claude Code has no GPT-5.4 equivalent
  • Parameters not disclosed. Architecture details remain proprietary
