GPT-5.4 Lands with Computer Use and 1M Token Context
OpenAI ships GPT-5.4 with built-in computer use that outscores the human baseline on desktop tasks, a 1 million token context window, and native Excel and Google Sheets integrations.

OpenAI launched GPT-5.4 today, a model the company describes as "our most capable and efficient frontier model for professional work." The release includes three variants - Standard, Pro, and Thinking - and introduces two capabilities that previous GPT-5 models have not had in mainline form: native computer use and a 1 million token context window.
This arrives two days after GPT-5.3 Instant rolled out to all ChatGPT tiers, and roughly a week after the Codex repo leaks confirmed GPT-5.4 was imminent.
Key Specs
| Spec | Value |
|---|---|
| Context window | 1 million tokens (API) |
| Computer use | Native, first mainline GPT-5 model |
| OSWorld-Verified | 75.0% (human baseline: 72.4%) |
| GDPval | 83.0% (vs GPT-5.2's 70.9%) |
| Variants | Standard, Pro, Thinking |
| Availability | ChatGPT Plus / Team / Pro, Codex |
| Replaces | GPT-5.2 Thinking (retiring in 3 months) |
Computer Use Baked Into the Base Model
Previous OpenAI offerings pushed computer use into specialized products. GPT-5.4 is the first mainline OpenAI model trained with it as a core capability, available directly through the API and inside Codex.
How It Works
The model supports two interaction modes. In code mode, it writes Python using Playwright or similar libraries to click, type, and navigate. In screenshot mode, it issues raw mouse and keyboard commands in response to what it sees on screen - no code intermediary. A typical agentic loop runs as build - run - verify - fix, with GPT-5.4 able to confirm its own outcomes before declaring a task complete.
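The build - run - verify - fix loop described above can be sketched as plain Python. This is a minimal illustration, not OpenAI's implementation: `plan_step`, `execute`, and `verify` are hypothetical stand-ins for what would really be model calls and Playwright (or raw keyboard/mouse) actions.

```python
# Hypothetical stand-ins for model planning, environment execution, and
# self-verification; in the real system these would be GPT-5.4 calls and
# Playwright or raw input-event actions.
def plan_step(goal, history):
    return f"attempt {len(history) + 1} toward {goal}"

def execute(action):
    # Contrived: this demo "environment" succeeds on the third attempt.
    return {"action": action, "ok": action.endswith("3 toward demo")}

def verify(goal, result):
    return result["ok"]

def run_agent_task(goal, max_attempts=5):
    """build - run - verify - fix: re-plan after each failed self-check."""
    history = []
    for attempt in range(1, max_attempts + 1):
        action = plan_step(goal, history)   # build: choose the next action
        result = execute(action)            # run: perform it in the environment
        history.append((action, result))
        if verify(goal, result):            # verify: confirm the outcome
            return {"status": "done", "attempts": attempt}
        # fix: the failure stays in history and informs the next plan
    return {"status": "gave_up", "attempts": max_attempts}
```

The point of the shape is the self-verification step: the loop only terminates when the model's own check passes, which is what lets it "confirm its own outcomes before declaring a task complete."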
OpenAI is also billing GPT-5.4 as the first mainline model trained to support compaction: a mechanism that summarizes and prunes long agent trajectories while preserving the key context needed to continue. This matters for multi-step workflows where keeping the full history in a 1 million token window would be expensive even at GPT-5.4's improved token efficiency.
OSWorld Numbers
On OSWorld-Verified - a benchmark that measures autonomous desktop navigation using screenshots plus keyboard and mouse - GPT-5.4 reaches 75.0%. The human baseline on the same tasks is 72.4%. GPT-5.2 scored 47.3%. That 27.7-point jump from GPT-5.2 isn't incremental; it is a category shift in what an API call can accomplish without a human watching.
GPT-5.4 targets complex, multi-step professional workflows that previously required human oversight at every step.
Context Window: 1 Million Tokens
GPT-5.4 supports a 1 million token context window through the API, more than double the 400,000 tokens available in GPT-5.3. In practical terms, this is enough to load an entire medium-sized codebase, a year of corporate email, or a large document corpus in a single request.
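A back-of-the-envelope check shows why "an entire medium-sized codebase" is plausible. The sketch below uses the rough ~4 characters per token heuristic for English text (an assumption; exact counts require the model's tokenizer) and a hypothetical output reservation:

```python
# Rough sizing check against the 1M-token window. The 4-chars-per-token
# ratio is a common heuristic for English text, not a tokenizer guarantee.
CONTEXT_WINDOW = 1_000_000

def estimate_tokens(text):
    return len(text) // 4

def fits_in_context(documents, reserve_for_output=16_000):
    """Leave headroom for the model's response when budgeting input."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve_for_output <= CONTEXT_WINDOW

# A "medium-sized codebase": say 500 files averaging ~6,000 characters,
# which lands around 750k estimated tokens - inside the window with room
# to spare, where GPT-5.3's 400k window would have overflowed.
codebase = ["x" * 6_000 for _ in range(500)]
```

Under these assumptions the same corpus would not have fit in GPT-5.3's 400,000-token window, which is the practical difference the larger limit buys.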
Compaction and Long Trajectories
For agentic workloads, the 1 million token limit pairs with compaction support to keep long-running tasks viable. Without compaction, a 100-step workflow that passes its full history into each call can consume the context window well before the task finishes. Compaction lets GPT-5.4 prune intermediate history while retaining the key facts - similar in spirit to what coding agents like GPT-5.3 Codex do with task memory, but now in the base model rather than a specialized variant.
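OpenAI hasn't documented the actual compaction mechanism, but one plausible scheme - keep the most recent turns verbatim and collapse everything older into a single summary message - can be sketched as follows. The `summarize` hook here is a hypothetical placeholder for a model-generated summary:

```python
# One plausible compaction scheme (the real mechanism is undocumented):
# keep the last few turns verbatim, collapse older turns into one summary.
def compact(history, keep_recent=4, summarize=None):
    """Replace all but the most recent turns with a single summary entry."""
    if len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # Placeholder summarizer; a real agent would ask the model to
    # distill the key facts from the pruned steps.
    summarize = summarize or (
        lambda turns: f"[summary of {len(turns)} earlier steps]")
    return [{"role": "system", "content": summarize(old)}] + recent

# A 100-step trajectory shrinks to 5 messages while the most recent
# context survives verbatim.
trajectory = [{"role": "assistant", "content": f"step {i}"} for i in range(100)]
compacted = compact(trajectory)
```

The trade-off this illustrates is exactly the auditability concern raised later: whatever the summarizer drops is gone, so tasks requiring exact recall of early steps depend entirely on what the pruning heuristic chooses to keep.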
Office and Finance Integrations
With the core model launch, OpenAI announced a new suite of ChatGPT integrations that let GPT-5.4 operate directly inside Microsoft Excel and Google Sheets. In Excel, the model can read cell ranges, perform multi-step analysis, and write formulas. In Sheets, it does the same through Google Workspace APIs.
OpenAI published an internal benchmark for spreadsheet modeling tasks designed to approximate what a junior investment banking analyst would do. GPT-5.4 hits 87.5% on that benchmark, compared to 68.4% for GPT-5.2. These benchmarks are self-reported, which is worth keeping in mind, but the gap is large enough to suggest real improvement on structured document tasks.
GPT-5.4's Excel and Google Sheets plugins target financial modeling and analytics workflows.
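The integration internals aren't public, but the category of operation described - read a cell range, run an analysis, write formulas back - can be illustrated on an in-memory sheet rather than through the actual Excel or Google Workspace APIs. Everything below is a hypothetical stand-in for what the model would do through those integrations:

```python
# Illustrative only: an in-memory stand-in for the read-range /
# analyze / write-formula workflow the integrations expose.
def col_letter(n):
    """1-based column index to spreadsheet letters (1 -> 'A', 27 -> 'AA')."""
    letters = ""
    while n:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def append_total_row(sheet):
    """Read a numeric range, then write a row of SUM formulas per column."""
    rows, cols = len(sheet), len(sheet[0])
    formulas = [f"=SUM({col_letter(c + 1)}1:{col_letter(c + 1)}{rows})"
                for c in range(cols)]
    return sheet + [formulas]

# Two revenue columns over three periods; the appended row holds the
# formulas a spreadsheet engine would evaluate.
revenue = [[100, 200], [150, 250], [175, 300]]
model = append_total_row(revenue)
```

In the real integrations, the same write would go through the Excel or Google Sheets APIs, and the interesting part is the multi-step analysis in between, not the formula emission itself.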
Six Capability Improvements
OpenAI lists six areas of improvement in the official announcement:
- Coding and document comprehension - Better performance on long-document QA, tool use, and instruction following
- Multimodal perception - Improved image understanding and mixed-modal task performance
- Agent workflows - More reliable execution on long-running, multi-step tasks
- Efficiency - Increased token efficiency and lower latency on tool-heavy workloads
- Web search and synthesis - Better multi-source synthesis, especially for hard-to-find information
- Business applications - Stronger performance on document-heavy and spreadsheet-heavy enterprise use cases
Benchmark Comparison
| Benchmark | GPT-5.2 | GPT-5.4 | Human Baseline |
|---|---|---|---|
| OSWorld-Verified (desktop use) | 47.3% | 75.0% | 72.4% |
| WebArena-Verified (browser use) | 65.4% | 67.3% | - |
| GDPval (knowledge work) | 70.9% | 83.0% | - |
| Spreadsheet modeling | 68.4% | 87.5% | - |
| Error rate vs GPT-5.2 (individual claims) | baseline | -33% | - |
WebArena-Verified shows a far more modest gain (+1.9 points), which is notable in itself: browser-based task completion appears closer to saturation under current evaluation methods. The bigger swings come on desktop navigation and open-ended knowledge work - exactly where native computer use would make the most difference.
You can see how these scores stack up against other current frontier models in the agentic AI benchmarks leaderboard.
What To Watch
Pricing Is Not Yet Published
OpenAI hasn't announced API pricing for GPT-5.4 or GPT-5.4 Thinking. Filling the 1 million token context window will presumably be expensive, especially for Thinking. Cost efficiency per token is described as improved over GPT-5.2, but until rates are public, the spreadsheet modeling and context window numbers can't be weighed against deployment economics.
Compaction Behavior Is Opaque
The compaction mechanism isn't fully documented. It's unclear what heuristics determine what gets pruned, whether compacted context is auditable, or how it affects accuracy on tasks that require exact recall of earlier steps. This matters for enterprise use cases where auditability of agent reasoning is a compliance requirement.
GPT-5.3 Codex Is Still the Coding Specialist
OpenAI explicitly positions GPT-5.4 as combining GPT-5.3 Codex's coding strengths with computer use and tool capabilities. In practice, workflows that are purely text-to-code with no environmental interaction should still default to GPT-5.3 Codex until head-to-head coding benchmarks confirm GPT-5.4 matches or surpasses it.
GPT-5.2 Thinking Retires in Three Months
GPT-5.4 Thinking replaces GPT-5.2 Thinking and will be available to Plus, Team, and Pro subscribers. GPT-5.2 Thinking disappears in 90 days. Organizations running production workflows on GPT-5.2 Thinking need to plan a migration path.
GPT-5.4 is a meaningful step for computer use - beating human benchmark performance on desktop navigation is a remarkable threshold, not a marketing claim. The Excel and Sheets integrations reflect OpenAI's shift toward direct enterprise productivity workflows rather than API-only deployments. Whether the context window and compaction combination actually holds up on long-horizon agentic tasks is the question worth watching as developers start stress-testing it in production.
Sources:
- OpenAI launches GPT-5.4 with Pro and Thinking versions - TechCrunch
- OpenAI upgrades ChatGPT with GPT-5.4 Thinking, offering six key improvements - 9to5Mac
- OpenAI launches GPT-5.4 with native computer use mode, financial plugins - VentureBeat
- OpenAI Set to Launch GPT-5.4 With 1M-Token Context Window - Trending Topics
- OpenAI launches GPT-5.4 - The New Stack
