GPT-5.4 Lands with Computer Use and 1M Token Context
OpenAI ships GPT-5.4 with built-in computer use that outscores the human baseline on desktop tasks, a 1 million token context window, and native Excel and Google Sheets integrations.

OpenAI launched GPT-5.4 today, a model the company describes as "our most capable and efficient frontier model for professional work." The release includes three variants - Standard, Pro, and Thinking - and introduces two capabilities that previous GPT-5 models have not had in mainline form: native computer use and a 1 million token context window.
This arrives two days after GPT-5.3 Instant rolled out to all ChatGPT tiers, and roughly a week after the Codex repo leaks confirmed GPT-5.4 was imminent.
Key Specs
| Spec | Value |
|---|---|
| Context window | 1 million tokens (API) |
| Computer use | Native, first mainline GPT-5 model |
| OSWorld-Verified | 75.0% (human baseline: 72.4%) |
| GDPval | 83.0% (vs GPT-5.2's 70.9%) |
| Variants | Standard, Pro, Thinking |
| Availability | ChatGPT Plus / Team / Pro, Codex |
| Replaces | GPT-5.2 Thinking (retiring in 3 months) |
Computer Use Baked Into the Base Model
Previous OpenAI offerings pushed computer use into specialized products. GPT-5.4 is the first mainline OpenAI model trained with it as a core capability, available directly through the API and inside Codex.
How It Works
The model supports two interaction modes. In code mode, it writes Python using Playwright or similar libraries to click, type, and navigate. In screenshot mode, it issues raw mouse and keyboard commands in response to what it sees on screen - no code intermediary. A typical agentic loop runs as build - run - verify - fix, with GPT-5.4 able to confirm its own outcomes before declaring a task complete.
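The build - run - verify - fix loop described above can be sketched as plain Python. This is a minimal illustration, not OpenAI's implementation: `plan_step`, `execute`, and `verify` are hypothetical stand-ins for what would really be model calls and Playwright (or raw keyboard/mouse) actions.

```python
# Hypothetical stand-ins for model planning, environment execution, and
# self-verification; in the real system these would be GPT-5.4 calls and
# Playwright or raw input-event actions.
def plan_step(goal, history):
    return f"attempt {len(history) + 1} toward {goal}"

def execute(action):
    # Contrived: this demo "environment" succeeds on the third attempt.
    return {"action": action, "ok": action.endswith("3 toward demo")}

def verify(goal, result):
    return result["ok"]

def run_agent_task(goal, max_attempts=5):
    """build - run - verify - fix: re-plan after each failed self-check."""
    history = []
    for attempt in range(1, max_attempts + 1):
        action = plan_step(goal, history)   # build: choose the next action
        result = execute(action)            # run: perform it in the environment
        history.append((action, result))
        if verify(goal, result):            # verify: confirm the outcome
            return {"status": "done", "attempts": attempt}
        # fix: the failure stays in history and informs the next plan
    return {"status": "gave_up", "attempts": max_attempts}
```

The point of the shape is the self-verification step: the loop only terminates when the model's own check passes, which is what lets it "confirm its own outcomes before declaring a task complete."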
OpenAI is also billing GPT-5.4 as the first mainline model trained to support compaction: a mechanism that summarizes and prunes long agent trajectories while preserving the key context needed to continue. This matters for multi-step workflows where keeping the full history in a 1 million token window would be expensive even at GPT-5.4's improved token efficiency.
OSWorld Numbers
On OSWorld-Verified - a benchmark that measures autonomous desktop navigation using screenshots plus keyboard and mouse - GPT-5.4 reaches 75.0%. The human baseline on the same tasks is 72.4%. GPT-5.2 scored 47.3%. That 27.7-point jump from GPT-5.2 isn't incremental; it is a category shift in what an API call can accomplish without a human watching.
GPT-5.4 targets complex, multi-step professional workflows that previously required human oversight at every step.
Context Window: 1 Million Tokens
GPT-5.4 supports a 1 million token context window through the API, more than double the 400,000 tokens available in GPT-5.3. In practical terms, this is enough to load an entire medium-sized codebase, a year of corporate email, or a large document corpus in a single request.
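A back-of-the-envelope check shows why "an entire medium-sized codebase" is plausible. The sketch below uses the rough ~4 characters per token heuristic for English text (an assumption; exact counts require the model's tokenizer) and a hypothetical output reservation:

```python
# Rough sizing check against the 1M-token window. The 4-chars-per-token
# ratio is a common heuristic for English text, not a tokenizer guarantee.
CONTEXT_WINDOW = 1_000_000

def estimate_tokens(text):
    return len(text) // 4

def fits_in_context(documents, reserve_for_output=16_000):
    """Leave headroom for the model's response when budgeting input."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve_for_output <= CONTEXT_WINDOW

# A "medium-sized codebase": say 500 files averaging ~6,000 characters,
# which lands around 750k estimated tokens - inside the window with room
# to spare, where GPT-5.3's 400k window would have overflowed.
codebase = ["x" * 6_000 for _ in range(500)]
```

Under these assumptions the same corpus would not have fit in GPT-5.3's 400,000-token window, which is the practical difference the larger limit buys.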
Compaction and Long Trajectories
For agentic workloads, the 1 million token limit pairs with compaction support to keep long-running tasks viable. Without compaction, a 100-step workflow that passes its full history into each call can consume the context window well before the task finishes. Compaction lets GPT-5.4 prune intermediate history while retaining the key facts - similar in spirit to what coding agents like GPT-5.3 Codex do with task memory, but now in the base model rather than a specialized variant.
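OpenAI hasn't documented the actual compaction mechanism, but one plausible scheme - keep the most recent turns verbatim and collapse everything older into a single summary message - can be sketched as follows. The `summarize` hook here is a hypothetical placeholder for a model-generated summary:

```python
# One plausible compaction scheme (the real mechanism is undocumented):
# keep the last few turns verbatim, collapse older turns into one summary.
def compact(history, keep_recent=4, summarize=None):
    """Replace all but the most recent turns with a single summary entry."""
    if len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # Placeholder summarizer; a real agent would ask the model to
    # distill the key facts from the pruned steps.
    summarize = summarize or (
        lambda turns: f"[summary of {len(turns)} earlier steps]")
    return [{"role": "system", "content": summarize(old)}] + recent

# A 100-step trajectory shrinks to 5 messages while the most recent
# context survives verbatim.
trajectory = [{"role": "assistant", "content": f"step {i}"} for i in range(100)]
compacted = compact(trajectory)
```

The trade-off this illustrates is exactly the auditability concern raised later: whatever the summarizer drops is gone, so tasks requiring exact recall of early steps depend entirely on what the pruning heuristic chooses to keep.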
Office and Finance Integrations
With the core model launch, OpenAI announced a new suite of ChatGPT integrations that let GPT-5.4 operate directly inside Microsoft Excel and Google Sheets. In Excel, the model can read cell ranges, perform multi-step analysis, and write formulas. In Sheets, it does the same through Google Workspace APIs.
OpenAI published an internal benchmark for spreadsheet modeling tasks designed to approximate what a junior investment banking analyst would do. GPT-5.4 hits 87.5% on that benchmark, compared to 68.4% for GPT-5.2. These benchmarks are self-reported, which is worth keeping in mind, but the gap is large enough to suggest real improvement on structured document tasks.
GPT-5.4's Excel and Google Sheets plugins target financial modeling and analytics workflows.
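The integration internals aren't public, but the category of operation described - read a cell range, run an analysis, write formulas back - can be illustrated on an in-memory sheet rather than through the actual Excel or Google Workspace APIs. Everything below is a hypothetical stand-in for what the model would do through those integrations:

```python
# Illustrative only: an in-memory stand-in for the read-range /
# analyze / write-formula workflow the integrations expose.
def col_letter(n):
    """1-based column index to spreadsheet letters (1 -> 'A', 27 -> 'AA')."""
    letters = ""
    while n:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def append_total_row(sheet):
    """Read a numeric range, then write a row of SUM formulas per column."""
    rows, cols = len(sheet), len(sheet[0])
    formulas = [f"=SUM({col_letter(c + 1)}1:{col_letter(c + 1)}{rows})"
                for c in range(cols)]
    return sheet + [formulas]

# Two revenue columns over three periods; the appended row holds the
# formulas a spreadsheet engine would evaluate.
revenue = [[100, 200], [150, 250], [175, 300]]
model = append_total_row(revenue)
```

In the real integrations, the same write would go through the Excel or Google Sheets APIs, and the interesting part is the multi-step analysis in between, not the formula emission itself.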
Six Capability Improvements
OpenAI lists six areas of improvement in the official announcement:
- Coding and document comprehension - Better performance on long-document QA, tool use, and instruction following
- Multimodal perception - Improved image understanding and mixed-modal task performance
- Agent workflows - More reliable execution on long-running, multi-step tasks
- Efficiency - Increased token efficiency and lower latency on tool-heavy workloads
- Web search and synthesis - Better multi-source synthesis, especially for hard-to-find information
- Business applications - Stronger performance on document-heavy and spreadsheet-heavy enterprise use cases
Benchmark Comparison
| Benchmark | GPT-5.2 | GPT-5.4 | Human Baseline |
|---|---|---|---|
| OSWorld-Verified (desktop use) | 47.3% | 75.0% | 72.4% |
| WebArena-Verified (browser use) | 65.4% | 67.3% | - |
| GDPval (knowledge work) | 70.9% | 83.0% | - |
| Spreadsheet modeling | 68.4% | 87.5% | - |
| Error rate vs GPT-5.2 (individual claims) | baseline | -33% | - |
WebArena-Verified shows a far more modest gain (+1.9 points), which is notable in itself: browser-based task completion appears closer to saturation under current evaluation methods. The bigger swings come on desktop navigation and open-ended knowledge work - exactly where native computer use would make the most difference.
You can see how these scores stack up against other current frontier models in the agentic AI benchmarks leaderboard.
What To Watch
Pricing Is Not Yet Published
OpenAI hasn't announced API pricing for GPT-5.4 or GPT-5.4 Thinking. Filling the 1 million token context window will presumably be expensive, especially for Thinking. Cost efficiency per token is described as improved over GPT-5.2, but until rates are public, the spreadsheet modeling and context window numbers can't be weighed against deployment economics.
Compaction Behavior Is Opaque
The compaction mechanism isn't fully documented. It's unclear what heuristics determine what gets pruned, whether compacted context is auditable, or how it affects accuracy on tasks that require exact recall of earlier steps. This matters for enterprise use cases where auditability of agent reasoning is a compliance requirement.
GPT-5.3 Codex Is Still the Coding Specialist
OpenAI explicitly positions GPT-5.4 as combining GPT-5.3 Codex's coding strengths with computer use and tool capabilities. In practice, workflows that are purely text-to-code with no environmental interaction should still default to GPT-5.3 Codex until head-to-head coding benchmarks confirm GPT-5.4 matches or surpasses it.
GPT-5.2 Thinking Retires in Three Months
GPT-5.4 Thinking replaces GPT-5.2 Thinking and will be available to Plus, Team, and Pro subscribers. GPT-5.2 Thinking disappears in 90 days. Organizations running production workflows on GPT-5.2 Thinking need to plan a migration path.
GPT-5.4 is a meaningful step for computer use - beating human benchmark performance on desktop navigation is a remarkable threshold, not a marketing claim. The Excel and Sheets integrations reflect OpenAI's shift toward direct enterprise productivity workflows rather than API-only deployments. Whether the context window and compaction combination actually holds up on long-horizon agentic tasks is the question worth watching as developers start stress-testing it in production.
Sources:
- OpenAI launches GPT-5.4 with Pro and Thinking versions - TechCrunch
- OpenAI upgrades ChatGPT with GPT-5.4 Thinking, offering six key improvements - 9to5Mac
- OpenAI launches GPT-5.4 with native computer use mode, financial plugins - VentureBeat
- OpenAI Set to Launch GPT-5.4 With 1M-Token Context Window - Trending Topics
- OpenAI launches GPT-5.4 - The New Stack
