GPT-5.4

OpenAI's most capable frontier model combines native computer use, a 1M-token context window, and three variants, with standard API pricing of $2.50/$15 per million tokens.

Overview

OpenAI released GPT-5.4 on March 5, 2026 - just two days after GPT-5.3 Instant rolled out and barely a week after Codex repo leaks confirmed its existence. The model ships in three variants (standard, Thinking, and Pro) and introduces two capabilities new to the mainline GPT-5 family: native computer use and a 1 million token context window.

The computer use numbers are the headline. GPT-5.4 scores 75.0% on OSWorld-Verified, surpassing the human baseline of 72.4% and nearly doubling GPT-5.2's 47.3%. On Terminal-Bench 2.0 it hits 75.1%, the best result among general-purpose models, though the specialized GPT-5.3 Codex still scores higher (77.3%) in its dedicated mode. GDPval reaches 83.0%, up from GPT-5.2's 70.9%.

The competitive picture is tight. Claude Opus 4.6 still leads on SWE-bench Verified (80.8% vs 77.2%) and has stronger long-context retrieval. Gemini 3.1 Pro holds the edge on pure reasoning benchmarks like GPQA Diamond (94.3% vs 92.8%) and ARC-AGI-2 (77.1% vs 73.3%). GPT-5.4's advantage is breadth - it's competitive across all categories while adding computer use that neither rival matches at this level. See our overall LLM rankings for the full picture.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | OpenAI |
| Model Family | GPT-5 |
| Architecture | GPT-5 Transformer |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens input |
| Input Price | $2.50/M tokens |
| Output Price | $15.00/M tokens |
| Pro Pricing (Input) | $30.00/M tokens |
| Pro Pricing (Output) | $180.00/M tokens |
| Release Date | March 5, 2026 |
| License | Proprietary |
| Input Modalities | Text, images |
| Output Modality | Text |
| Computer Use | Native (code mode + screenshot mode) |
| Compaction | Supported (trajectory pruning for long agent runs) |
| Model ID (API) | gpt-5.4 |

Variants

GPT-5.4 ships in three configurations:

| Variant | Target Use | Availability | Key Feature |
| --- | --- | --- | --- |
| GPT-5.4 | General-purpose | API, Codex | Base model with computer use |
| GPT-5.4 Thinking | Complex reasoning | Plus, Team, Pro, API | Extended chain-of-thought with visible reasoning traces |
| GPT-5.4 Pro | Hardest problems | Pro, Enterprise, API ($30/$180) | Parallel reasoning threads, "extreme" thinking mode |

GPT-5.4 Thinking replaces GPT-5.2 Thinking, which retires in 90 days. GPT-5.4 Pro uses parallel processing - running multiple reasoning threads simultaneously before converging - and includes an extreme thinking mode that allocates notably more compute to difficult problems.
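OpenAI has not documented how Pro's parallel threads converge, but the general pattern - sample several candidates concurrently, then pick one by a scoring rule - can be sketched. In this illustration, `call_model` and `score` are placeholders standing in for a real API call and a real verifier, not OpenAI functions:

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str, seed: int) -> str:
    """Placeholder for one reasoning thread; a real client would hit the API."""
    return f"answer-{seed} to {prompt!r}"

def score(candidate: str) -> float:
    """Placeholder verifier; Pro's actual convergence rule is not public."""
    return float(len(candidate))

def best_of_n(prompt: str, n: int = 4) -> str:
    # Run n reasoning threads simultaneously, then converge on the top-scored one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: call_model(prompt, s), range(n)))
    return max(candidates, key=score)

print(best_of_n("What is 2 + 2?"))
```

The trade-off this pattern buys is latency for quality: all threads run in parallel, so wall-clock time stays close to a single call while compute scales with `n`.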

Benchmark Performance

| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- |
| OSWorld-Verified (computer use) | 75.0% | 72.7% | - | 47.3% |
| GDPval (knowledge work) | 83.0% | - | - | 70.9% |
| Terminal-Bench 2.0 (agentic) | 75.1% | 65.4% | 68.5% | 54.0% |
| GPQA Diamond (science) | 92.8% | 91.3% | 94.3% | 93.2% |
| ARC-AGI-2 (abstract reasoning) | 73.3% | 68.8% | 77.1% | 54.2% |
| SWE-bench Verified (coding) | 77.2% | 80.8% | 80.6% | 80.0% |
| MMMU Pro (visual reasoning) | 81.2% | 77.3% | 81.0% | 80.4% |
| WebArena-Verified (browser) | 67.3% | - | - | 65.4% |
| Spreadsheet modeling | 87.5% | - | - | 68.4% |
| Error rate vs GPT-5.2 (claims) | -33% | - | - | baseline |

GPT-5.4 dominates on computer use and enterprise productivity benchmarks. The 27.7-point jump from GPT-5.2 on OSWorld-Verified represents a category shift - this model consistently beats human performance on desktop navigation tasks. GDPval's 12.1-point gain signals meaningful improvement on knowledge work across 44 occupational categories.

On pure coding, Claude Opus 4.6 retains a 3.6-point lead on SWE-bench Verified. On science reasoning, Gemini 3.1 Pro leads with 94.3% GPQA Diamond. GPT-5.4's strength is that it has no catastrophic weakness - it's competitive on every benchmark while leading the pack on agentic desktop tasks.

For detailed comparisons, see our coding benchmarks leaderboard and agentic AI benchmarks leaderboard.

Key Capabilities

Computer Use. GPT-5.4 is the first mainline OpenAI model with built-in computer use. It supports two interaction modes: code mode (writing Python with Playwright to click, type, navigate) and screenshot mode (issuing raw mouse and keyboard commands from visual input). A build-run-verify-fix loop lets it complete, confirm, and correct tasks autonomously.
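The build-run-verify-fix loop can be sketched generically. Here `propose`, `run`, and `verify` are illustrative placeholders - in a real agent they would ask the model for a script, execute it (e.g. Playwright in code mode), and check the goal state - not OpenAI APIs:

```python
def propose(task, feedback=None):
    # Placeholder: a real agent would ask the model for an action or script here,
    # passing the previous failure back in so it can correct itself.
    return f"script for {task}" if feedback is None else f"fixed script for {task}"

def run(script):
    # Placeholder: a real agent would execute the script and capture its output.
    return "error" if script.startswith("script") else "ok"

def verify(output):
    # Placeholder check that the task's goal state was actually reached.
    return output == "ok"

def build_run_verify_fix(task, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        script = propose(task, feedback)   # build
        output = run(script)               # run
        if verify(output):                 # verify
            return output
        feedback = output                  # fix: feed the failure back in
    raise RuntimeError("task not completed within attempt budget")

print(build_run_verify_fix("rename files"))  # succeeds on the second attempt
```

The point of the structure is that verification is separate from execution: the loop only exits when an independent check passes, not when the model believes it is done.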

1M Token Context. The context window more than doubles GPT-5.3's 400K limit. In practical terms, this is enough for an entire medium-sized codebase, a year of email, or a large document corpus in a single request. Combined with compaction support, long-running agent trajectories stay viable without consuming the full window.
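A quick way to reason about what fits is token budgeting. The sketch below uses the common ~4-characters-per-token heuristic - an assumption, not OpenAI's tokenizer, so real counts will differ:

```python
CONTEXT_WINDOW = 1_000_000  # GPT-5.4's input limit in tokens
CHARS_PER_TOKEN = 4         # rough heuristic; exact counts need the model's tokenizer

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(files: list[str], reserve: int = 50_000) -> bool:
    # Reserve headroom for the prompt itself and the model's working context.
    total = sum(estimate_tokens(f) for f in files)
    return total <= CONTEXT_WINDOW - reserve

corpus = ["x" * 2_000_000, "y" * 1_500_000]  # ~3.5M chars, roughly 875K tokens
print(fits_in_context(corpus))  # True: fits with headroom to spare
```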

Compaction. The model natively supports trajectory pruning - summarizing and discarding intermediate history while preserving key context during multi-step workflows. This matters for agent loops that would otherwise exhaust the context window before completing.
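OpenAI hasn't published how its compaction decides what to prune, but the general shape of trajectory compaction is straightforward: keep the system message and the most recent turns verbatim, and collapse the middle into a summary. A minimal sketch, where `summarize` is a placeholder for a real model-generated summary:

```python
def summarize(messages: list[dict]) -> dict:
    # Placeholder: a real implementation would ask the model to summarize.
    topics = "; ".join(m["content"][:20] for m in messages)
    return {"role": "system", "content": f"[compacted {len(messages)} turns: {topics}]"}

def compact(trajectory: list[dict], keep_recent: int = 4) -> list[dict]:
    """Keep the first (system) message and the last `keep_recent` turns;
    collapse everything in between into one summary message."""
    if len(trajectory) <= keep_recent + 1:
        return trajectory
    head = trajectory[:1]
    middle = trajectory[1:-keep_recent]
    tail = trajectory[-keep_recent:]
    return head + [summarize(middle)] + tail

turns = [{"role": "system", "content": "agent setup"}] + [
    {"role": "assistant", "content": f"step {i}"} for i in range(10)
]
print(len(compact(turns)))  # 1 system + 1 summary + 4 recent = 6 messages
```

The open questions flagged in the Weaknesses section - what gets pruned, whether the summary is auditable - live entirely inside the `summarize` step.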

Office Integrations. Native Excel and Google Sheets plugins let GPT-5.4 read cell ranges, perform multi-step analysis, and write formulas. The internal spreadsheet modeling benchmark shows 87.5% accuracy on tasks approximating junior investment banking analyst work.

Efficiency. OpenAI reports 33% fewer individual claim errors, 18% fewer responses containing any errors, and 47% fewer tokens consumed on tool-heavy workloads compared to GPT-5.2.

Pricing and Availability

| Tier | Input / Price | Output / Access |
| --- | --- | --- |
| GPT-5.4 API | $2.50/M tokens | $15.00/M tokens |
| GPT-5.4 Pro API | $30.00/M tokens | $180.00/M tokens |
| ChatGPT Plus | $20/month | GPT-5.4 Thinking |
| ChatGPT Team | $30/user/month | GPT-5.4 Thinking |
| ChatGPT Pro | $200/month | GPT-5.4 Thinking + Pro |
| Enterprise | Custom pricing | GPT-5.4 Thinking + Pro |

At $2.50/$15.00 per million tokens, GPT-5.4 is cheaper than Claude Opus 4.6 ($5.00/$25.00) and slightly more expensive than Gemini 3.1 Pro ($2.00/$12.00). The Pro variant at $30/$180 is the most expensive per-token option among frontier models, targeting researchers and enterprises with the hardest problems.
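The per-request arithmetic is simple: tokens divided by a million, times the rate. Using the per-million rates quoted above:

```python
# Per-million-token (input, output) rates quoted in this article.
RATES = {
    "gpt-5.4":         (2.50, 15.00),
    "gpt-5.4-pro":     (30.00, 180.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3.1-pro":  (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = RATES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# A 100K-token input with a 5K-token answer:
for model in RATES:
    print(f"{model}: ${request_cost(model, 100_000, 5_000):.4f}")
```

At that request size, GPT-5.4 comes to about $0.33 while Pro comes to $3.90 - the 12x multiplier shows up directly.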

The model is available through the OpenAI API, Codex (web app, CLI, and IDE extension), and ChatGPT subscriptions. It's also available through Microsoft Foundry on Azure. Free-tier ChatGPT users don't get access to GPT-5.4 Thinking or Pro.

For cost comparisons across the field, see our cost efficiency leaderboard.

Strengths

  • Best-in-class computer use. 75.0% OSWorld-Verified beats human baseline (72.4%) and nearly doubles GPT-5.2's score - the largest single-generation jump on this benchmark
  • Broad competitive coverage. Top 3 on virtually every major benchmark without catastrophic weaknesses in any category
  • 1M context window. Matches Claude Opus 4.6 and Gemini 3.1 Pro, more than doubling GPT-5.3's 400K limit
  • Strong enterprise productivity. 83.0% GDPval and 87.5% spreadsheet modeling position it as the strongest model for business workflows
  • Aggressive pricing. $2.50/$15 undercuts Claude Opus 4.6 by 2x on input and nearly 2x on output
  • Three variants. Standard, Thinking, and Pro give developers granular control over compute allocation
  • Compaction support. Native trajectory pruning keeps long agentic workflows viable without manual context management

Weaknesses

  • Coding gap. SWE-bench Verified at 77.2% trails both Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%) - a meaningful gap for coding-heavy workflows
  • Science reasoning trails. GPQA Diamond at 92.8% sits behind Gemini's 94.3% - Gemini remains the better choice for graduate-level science
  • Abstract reasoning. ARC-AGI-2 at 73.3% trails Gemini's 77.1%, though it stays ahead of Claude's 68.8%
  • Pro pricing is extreme. $30/$180 per million tokens makes GPT-5.4 Pro 12x more expensive than the base model and 15x more than Gemini on output
  • Compaction is opaque. No documentation on what gets pruned, whether compacted context is auditable, or how it affects accuracy
  • No agent teams. Claude Opus 4.6's multi-agent coordination through Claude Code has no GPT-5.4 equivalent
  • Parameters not disclosed. Architecture details remain proprietary
