GPT-5.4 Review: The Computer-Use Frontier
GPT-5.4 brings native computer use, a 1M token context window, and serious coding muscle to OpenAI's mainline model - but at a premium price.

OpenAI released GPT-5.4 yesterday, and the headline is simple and remarkable: this is the first general-purpose model from the company that can sit down at a computer and actually use it. Not through a bolted-on sidecar product, not via a specialized computer-use API that only enterprise customers can access - but as a native capability in the same model you call for text, code, and reasoning. The computer has always been the thing AI was supposed to be able to use. With GPT-5.4, OpenAI is making a credible claim that it now can.
TL;DR
- 8.2/10 - the most capable all-in-one frontier model right now, with native computer use that actually beats human desktop benchmark performance
- Key strength: OSWorld-Verified score of 75.0% (human baseline: 72.4%), combined with 1M token context and best-in-class tool search efficiency
- Key weakness: creative and open-ended writing still trails Claude Opus 4.6, and pricing at the Pro tier is genuinely eye-watering
- Use it if you need agentic workflows, spreadsheet automation, or enterprise knowledge work at scale; skip it if your primary use case is creative writing or you're on a tight API budget
What GPT-5.4 Actually Is
Before getting into specifics, it is worth being clear about what this model is and what it replaces. GPT-5.4 isn't just an incremental step from GPT-5.2 or GPT-5.3 - it's a consolidation. OpenAI has folded the coding capabilities of GPT-5.3 Codex into the general model, added native computer use, and extended the context window to 1 million tokens. The result is a single model that can do what previously required routing between two or three specialized variants.
Three variants ship with the release: standard GPT-5.4, GPT-5.4 Thinking (the reasoning-optimized version replacing GPT-5.2 Thinking, which retires in June), and GPT-5.4 Pro (a high-performance tier aimed at enterprise). GPT-5.4 Thinking is available to ChatGPT Plus, Team, and Pro subscribers. GPT-5.4 Pro is restricted to Pro and Enterprise plans.
Computer Use: The Headline Capability
This is where GPT-5.4 makes the strongest claim on attention. Previous OpenAI products offered computer use as a specialized feature - Atlas agent mode, or the Codex platform's environment runner. GPT-5.4 bakes it into the base model.
The implementation supports two modes. In code mode, the model writes Python using Playwright or similar libraries to issue browser and desktop commands - this is more reproducible and auditable. In screenshot mode, it interprets raw screen images and issues mouse and keyboard commands directly, similar to how a human would navigate a UI without underlying DOM access. The agentic loop runs as build - run - verify - fix, with the model able to confirm its own task outcomes before declaring completion.
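The build - run - verify - fix loop described above can be sketched in plain Python. Everything here is illustrative: the `build`, `run`, `verify`, and `fix` callables are hypothetical stand-ins for whatever the model actually does in code mode (generate a Playwright script, execute it, check the observed outcome), and none of these names come from OpenAI's API.

```python
def agentic_loop(task, build, run, verify, fix, max_iters=5):
    """Illustrative build - run - verify - fix loop for a computer-use agent.

    build/run/verify/fix are placeholders for model-driven steps:
    e.g. build generates an automation script, run executes it against
    the environment, verify checks the result against the task goal,
    and fix revises the script using the verifier's feedback.
    """
    artifact = build(task)                   # e.g. write a Playwright script
    for _ in range(max_iters):
        result = run(artifact)               # execute against the environment
        ok, feedback = verify(task, result)  # did the task actually succeed?
        if ok:
            return result
        artifact = fix(artifact, feedback)   # revise and retry
    raise RuntimeError("task not completed within iteration budget")
```

The self-verification step is the part that distinguishes this loop from fire-and-forget automation: the model confirms its own task outcome before declaring completion, rather than assuming the script worked.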
The benchmark numbers are hard to argue with. On OSWorld-Verified - a test of autonomous desktop navigation using screenshots and keyboard and mouse commands - GPT-5.4 scores 75.0%. The human baseline on those same tasks is 72.4%. GPT-5.2, for comparison, scored 47.3%. That's not an incremental improvement; it is a qualitative jump. On WebArena-Verified (browser-based tasks), GPT-5.4 reaches 67.3%, and on Online-Mind2Web it hits 92.8% - the latter measuring screenshot-driven browser interaction specifically.
As our agentic AI benchmarks leaderboard shows, computer use has been a genuinely hard problem for frontier models. The gap between claimed capability and real-world reliability on multi-step desktop tasks has been wide. GPT-5.4's OSWorld score is the first time I have seen a general-purpose model credibly claim to match - and narrowly exceed - human baseline performance on a standardized desktop navigation benchmark. That matters even if the specific benchmark tasks don't map perfectly to your production workloads.
GPT-5.4 introduces tool search, which reduces token usage by 47% on tool-heavy agentic workflows - a significant cost and speed improvement for enterprise deployments.
The companion feature here is tool search. In previous models, every API call with tool definitions required passing the full schema of every available tool upfront - potentially adding tens of thousands of tokens to every request. GPT-5.4 receives a lightweight tool index and fetches full definitions only when it decides to use a specific tool. On a 36-server MCP benchmark (Scale's MCP Atlas suite), tool search reduced total token usage by 47% while maintaining the same accuracy. For anyone building agents on top of large MCP ecosystems, this is a real and immediate cost saving.
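The token-saving idea behind tool search can be sketched as a two-stage lookup: a compact index (name plus one-line description) travels with every request, and a full JSON schema is materialized only for the tool the model commits to. This is a hypothetical client-side illustration of the pattern, not OpenAI's API; the tool names and schemas are made up for the example.

```python
# Hypothetical tool registry illustrating the index-then-fetch pattern.
FULL_SCHEMAS = {
    "search_files": {
        "name": "search_files",
        "description": "Search a codebase for a pattern",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string"},
                "path": {"type": "string"},
                "max_results": {"type": "integer", "default": 50},
            },
            "required": ["pattern"],
        },
    },
    "read_file": {
        "name": "read_file",
        "description": "Read a file's contents",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

def tool_index():
    """Lightweight index sent with every request: names and descriptions only."""
    return [
        {"name": name, "description": schema["description"]}
        for name, schema in FULL_SCHEMAS.items()
    ]

def resolve_tool(name):
    """Full schema fetched only once the model selects a specific tool."""
    return FULL_SCHEMAS[name]
```

With two tools the saving is trivial, but across a 36-server MCP deployment the index stays tiny while the full schemas - easily tens of thousands of tokens - are only ever paid for on demand.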
Context Window and Office Integrations
The API context window is 1.05 million tokens - roughly double what GPT-5.3 offered, and more than five times GPT-5.2's ceiling. Combined with a native compaction mechanism (which summarizes and prunes long agent trajectories while preserving key context), GPT-5.4 aims to sustain long-horizon agentic runs without burning through context in the first 30 steps.
The compaction mechanism is the part I would watch most carefully before deploying in production. OpenAI hasn't fully documented what heuristics determine what gets pruned, or how it affects accuracy on tasks requiring precise recall of earlier steps. For regulated industries where agent reasoning needs to be auditable, this opacity is a genuine gap.
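OpenAI hasn't published how compaction works, but the general pattern - keep the most recent steps verbatim and replace older ones with a summary - can be sketched. This is a generic approximation of the idea, not the documented mechanism; `summarize` is a placeholder for a model call.

```python
def compact_trajectory(steps, keep_recent=10, summarize=None):
    """Generic trajectory compaction: summarize old steps, keep recent ones.

    This approximates the *idea* of compaction only; OpenAI has not
    documented its actual heuristics. `summarize` stands in for a
    model-generated summary of the pruned steps.
    """
    if len(steps) <= keep_recent:
        return list(steps)
    old, recent = steps[:-keep_recent], steps[-keep_recent:]
    summary = summarize(old) if summarize else f"[{len(old)} earlier steps compacted]"
    return [summary] + recent
```

One practical mitigation for the auditability gap: persist the pruned steps to an external log before compacting, so the full trajectory survives even if the in-context summary can't support precise recall.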
The office integrations are a different kind of news. OpenAI is shipping ChatGPT plugins for Microsoft Excel and Google Sheets that embed GPT-5.4 directly into spreadsheet cells - the model can read ranges, run multi-step analysis, and write formulas within the tools financial teams already use. On OpenAI's internal spreadsheet modeling benchmark (designed to approximate what a junior investment banking analyst would do), GPT-5.4 scores 87.5% versus 68.4% for GPT-5.2. This is a self-reported benchmark, which deserves the standard caveat, but the gap is large enough to suggest a meaningful real-world improvement on structured document work.
Reports from before the launch suggested Microsoft had preferred Claude for some spreadsheet and presentation tasks, which adds strategic context to why this integration shipped alongside the model itself.
GPT-5.4 targets the professional knowledge work market directly, with Excel and Sheets integrations aimed at financial modeling and analytics workflows.
How It Codes
Coding performance is where the GPT-5.4 picture gets more nuanced. OpenAI positioned the model as absorbing GPT-5.3 Codex's programming strengths. On SWE-Bench Pro (a harder private-codebase variant of the standard SWE-bench), GPT-5.4 scores 57.7% - higher than Claude Opus 4.6's estimated 45.9% on the same suite. But on the publicly auditable SWE-Bench Verified, Claude Opus 4.6 leads at 80.8% versus GPT-5.4's 77.2%.
The pattern that emerges from independent testing is consistent with what the benchmarks suggest: GPT-5.4 is better at structured, well-specified automation tasks - the kind where you give it a clear spec and it executes with minimal deviation. Claude Opus 4.6 tends to win on multi-file reasoning, architectural understanding, and tasks that require the model to infer intent across a large codebase from sparse instructions. As we noted in our Claude Opus 4.6 review, Anthropic's model excels at the kind of work that requires reading between the lines.
CodeRabbit's evaluation across 300 pull requests found GPT-5.4 identified 254 of 300 bugs (84.7%), compared to 200-207 for other frontier models. That's a real advantage for code review and quality assurance workflows.
The token efficiency claim is also real. Augment Code, which made GPT-5.4 its default model at launch, reported 18-20% fewer tokens on complex tasks like refactors and architectural planning. That's not just a cost story - it means agent loops stay coherent longer before the context fills up.
Pricing: Capable but Not Cheap
GPT-5.4 standard is priced at $2.50 per million input tokens (under 272K context) and $15.00 per million output tokens. Cached input drops to $0.25 per million - a 90% discount that matters a lot for agentic loops with repetitive system prompts. When context exceeds the 272K threshold, input pricing doubles to $5.00 per million and output increases to $22.50 per million.
GPT-5.4 Pro is an entirely different conversation: $30.00/$180.00 per million input/output tokens at the base tier, rising to $60.00/$270.00 above 272K context. This is OpenAI's most expensive model yet, aimed at organizations that need maximum performance and are willing to pay for it.
For comparison, Claude Opus 4.6 is priced at $5.00/$25.00 per million input/output tokens. GPT-5.4 standard is meaningfully cheaper than Opus on both dimensions. Gemini 3.1 Pro at $2.00/$12.00 is the most cost-competitive of the three, though it lacks native computer use and has a smaller max output window (65K versus 128K for GPT-5.4).
The price jump between tiers is steep. GPT-5.4 Pro costs 12 times as much per input token as the standard model. For most use cases, the standard tier is the one to assess first.
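Using the published numbers above, per-request cost at the standard tier can be estimated with a small helper. One caveat: I'm assuming the 272K threshold reprices the whole request at the higher rate, which is how such thresholds typically work but isn't spelled out here, and I'm assuming the cached-input rate stays flat at $0.25.

```python
def gpt54_standard_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate GPT-5.4 standard-tier cost in USD from the published rates.

    Rates per million tokens: $2.50 in / $15.00 out under 272K context,
    $5.00 in / $22.50 out above it, $0.25 for cached input.
    Assumption: crossing 272K reprices the entire request.
    """
    long_context = input_tokens + cached_tokens > 272_000
    in_rate = 5.00 if long_context else 2.50
    out_rate = 22.50 if long_context else 15.00
    cost = (
        input_tokens / 1e6 * in_rate
        + cached_tokens / 1e6 * 0.25
        + output_tokens / 1e6 * out_rate
    )
    return round(cost, 6)
```

Running the numbers makes the discontinuity concrete: a 100K-in / 10K-out request costs about $0.40, while the same request padded past 272K of context pays double the input rate on every token, not just the overage - which is why long-context document work gets expensive fast.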
Strengths and Weaknesses
Strengths
- Native computer use with a 75.0% OSWorld-Verified score - truly above human baseline
- Tool search reduces token usage by 47% on multi-tool agentic workloads
- 1.05M token context window, the largest from OpenAI by a wide margin
- GDPval benchmark: 83.0% across 44 professional occupations, up from 70.9% for GPT-5.2
- MMMU-Pro visual reasoning: 81.2%, strongest multimodal score of any current OpenAI model
- Spreadsheet modeling at 87.5%, with direct Excel and Google Sheets integration
- 33% fewer hallucinated facts and 18% fewer errors compared to GPT-5.2 (per OpenAI's internal testing)
- BrowseComp web search accuracy: 82.7% (Pro reaches 89.3%)
Weaknesses
- Creative and narrative writing still trails Claude Opus 4.6, which produces more layered, contextually aware prose
- Compaction mechanism is underdocumented - opacity around what gets pruned is a risk for production agentic pipelines requiring auditability
- ARC-AGI-2 reasoning: 52.9% versus Claude Opus 4.6's 68.8% - a striking gap on abstract reasoning tasks
- GPT-5.4 Pro pricing is extreme ($30/$180 per M tokens) and appropriate only for high-stakes enterprise deployments
- Access restrictions: Pro tier requires a ChatGPT Pro or Enterprise subscription; not available on free or standard Plus tiers
- The 272K context threshold introduces a pricing discontinuity - long-context document work gets expensive fast
- WebArena-Verified improvement (+1.9 points over GPT-5.2) is modest, suggesting browser task saturation at current benchmark difficulty
Verdict
GPT-5.4 is the most complete general-purpose frontier model available right now. It is the only model in the class that combines native computer use above human baseline performance, a 1 million token context window, Codex-level coding, and a truly efficient tool-use system into a single API call. If you are building agentic systems - workflows where the model needs to interact with software environments, browse the web, manipulate documents, or coordinate multiple tools - this is the best available starting point.
Where it does not win is at the creative and open-ended end of the spectrum. For writing that needs to feel genuinely intelligent rather than competently structured, Claude Opus 4.6 remains the stronger choice. For pure long-context reasoning that requires holding contradictions and inferring intent, Opus 4.6's ARC-AGI-2 lead (68.8% versus 52.9%) reflects a real difference in how the models handle ambiguity.
For most developers and enterprises, the decision between GPT-5.4 and Claude Opus 4.6 will come down to workload type. Automation, computer use, structured document work, financial modeling: GPT-5.4. Creative generation, complex multi-file codebase reasoning, abstract problem solving: Claude Opus 4.6 or Gemini 3.1 Pro.
Score: 8.2 / 10
The computer use story alone justifies serious evaluation. Whether it justifies the price depends on what you're actually building.
Sources
- Introducing GPT-5.4 - OpenAI
- OpenAI launches GPT-5.4 with Pro and Thinking versions - TechCrunch
- GPT-5.4 Pricing, Benchmarks and API Costs - TokenCost
- GPT-5.4 vs Claude Opus 4.6: Which One Is Better for Coding? - Bind AI
- Why GPT-5.4 Is Our New Default - Augment Code
- GPT-5.4: Computer Use, Tool Search, Benchmarks, Pricing - Digital Applied
- OpenAI launches GPT-5.4 with native computer use mode - VentureBeat
- I hope you like spreadsheets, because GPT-5.4 loves them - Engadget
- GPT-5.4 Targets Anthropic's Claude With Premium Pricing - Trending Topics
- OpenAI launches GPT-5.4 with Pro and Thinking - The New Stack
