GPT-5.3 Codex
OpenAI's most capable agentic coding model combines frontier code generation with GPT-5-class reasoning, 400K context, and a 77.3% Terminal-Bench 2.0 score.

Overview
OpenAI released GPT-5.3-Codex on February 5, 2026, and it represents something genuinely new in the model landscape: the first frontier model purpose-built for agentic software development that also helped create itself during training. Rather than being a general-purpose model with coding bolted on, GPT-5.3-Codex merges the coding specialization of the Codex line with the reasoning and professional knowledge capabilities of GPT-5.2, producing a model that can plan, write, test, and debug entire software projects autonomously.
The headline numbers are strong. Scores of 77.3% on Terminal-Bench 2.0, 64.7% on OSWorld-Verified, and 77.6% on cybersecurity CTF benchmarks place it decisively ahead of all competitors on tool-use and terminal-based coding tasks. It is 25% faster than GPT-5.2-Codex, and OpenAI claims it uses fewer tokens than any prior model to complete equivalent tasks. The self-bootstrapping story - earlier Codex versions were used to debug the training run, diagnose test results, and optimize deployment infrastructure - is the kind of detail that makes you pay attention.
But the competitive picture is nuanced. Gemini 3.1 Pro and Claude Opus 4.6 match or beat GPT-5.3-Codex on general reasoning benchmarks like GPQA Diamond and MMLU Pro. Where this model carves out its niche is sustained agentic execution: dropping into a terminal, navigating a codebase, running tests, and iterating until the task is done. If your workflow is agentic coding, this is the model to benchmark against. For everything else, the answer is less clear-cut.
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-5 / Codex |
| Architecture | GPT-5 Transformer (trained on NVIDIA GB200 NVL72) |
| Parameters | Not disclosed |
| Context Window | 400,000 tokens input / 128,000 tokens output |
| Input Price | $3.50/M tokens (standard), $1.75/M tokens (batch) |
| Output Price | $28.00/M tokens (standard), $14.00/M tokens (batch) |
| Release Date | February 5, 2026 |
| License | Proprietary |
| Input Modalities | Text, images |
| Output Modality | Text |
| Cybersecurity Rating | High (Preparedness Framework) |
Benchmark Performance
| Benchmark | GPT-5.3 Codex | Gemini 3.1 Pro | Claude Opus 4.6 |
|---|---|---|---|
| Terminal-Bench 2.0 (agentic) | 77.3% | 68.5% | 65.4% |
| OSWorld-Verified (desktop automation) | 64.7% | - | - |
| Cybersecurity CTF | 77.6% | - | - |
| SWE-Lancer IC Diamond | 81.4% | - | - |
| SWE-bench Verified (coding) | 80.0% | 80.6% | 80.8% |
| SWE-bench Pro Public | 56.8% | 54.2% | - |
| GPQA Diamond (science) | 73.8% | 94.3% | 91.3% |
| MMLU Pro (knowledge) | 82.9% | 92.6% | 85.1% |
| MMLU (knowledge) | 93% | - | - |
| AIME 2025 (math) | 94% | 100% | - |
| MATH | 96% | - | - |
| HumanEval (code generation) | 93% | - | - |
The numbers tell a clear story of specialization. GPT-5.3-Codex dominates every benchmark that involves sustained tool use: Terminal-Bench 2.0 (77.3% vs. 68.5% for the nearest competitor), OSWorld-Verified (64.7%, a 26.5-point jump over GPT-5.2-Codex), and cybersecurity CTFs (77.6%). These are the benchmarks that matter most for agentic coding workflows where the model must navigate a real development environment.
On general reasoning, the story flips. Gemini 3.1 Pro's 94.3% on GPQA Diamond and 100% on AIME 2025 put it well ahead of GPT-5.3-Codex's 73.8% and 94% respectively. Claude Opus 4.6 also leads on GPQA Diamond (91.3%) and MMLU Pro (85.1% vs 82.9%). On SWE-bench Verified, which tests real-world bug fixing, all three frontier models cluster within a point of each other - effectively a three-way tie at ~80%. For a full breakdown of how these models stack up, see our coding benchmarks leaderboard and overall LLM rankings.
Key Capabilities
The defining capability of GPT-5.3-Codex is end-to-end agentic task execution. Give it a feature spec, and it will plan an approach, write code across multiple files, run the test suite, diagnose failures, and iterate. It handles context switching between files naturally, understands testing frameworks and build systems, and can work with complex legacy codebases that tripped up earlier models. The 400K-token context window means it can ingest entire repositories in a single pass, maintaining coherent understanding across sprawling projects.
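The 400K figure is easier to reason about with a quick back-of-envelope check. The sketch below uses the common ~4-characters-per-token heuristic (our assumption, not a published tokenizer ratio for this model) to estimate whether a repository fits in a single pass:

```python
# Rough feasibility check: does a codebase fit in a 400K-token context window?
# Assumes ~4 characters per token, a common heuristic for code and English text;
# actual tokenizer ratios vary by language and content.
CONTEXT_WINDOW = 400_000   # input tokens, per the spec table above
CHARS_PER_TOKEN = 4        # heuristic, not a published figure for this model

def estimated_tokens(total_chars: int) -> int:
    """Estimate token count from raw character count."""
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(total_chars: int) -> bool:
    """True if the estimated token count fits within the input window."""
    return estimated_tokens(total_chars) <= CONTEXT_WINDOW

# A 1.2 MB repository (~1,200,000 chars) is roughly 300K tokens: fits.
print(fits_in_context(1_200_000))  # True
# A 2.0 MB repository is roughly 500K tokens: does not fit.
print(fits_in_context(2_000_000))  # False
```

For real workloads, counting with an actual tokenizer will be more accurate than any character heuristic.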
The self-improving training story is not marketing fluff. OpenAI's team used earlier Codex versions to monitor and debug the training run itself, track patterns across training, analyze interaction quality, and propose fixes. When engineering encountered edge cases in production, they used Codex to identify context rendering bugs and root-cause low cache hit rates. The model even built regex classifiers to analyze its own performance across session logs during alpha testing. This recursive development approach is a first for a publicly released model and signals a tightening feedback loop between AI capability and AI development speed.
OpenAI also introduced interactive mid-task steering, allowing developers to redirect the model while it works. Previous Codex models sometimes completed tasks prematurely or got stuck in linting loops - GPT-5.3-Codex shows marked improvements in both areas. The model ships with deep diffs for reasoning transparency, giving developers visibility into why it made specific changes rather than just presenting the final output.
Pricing and Availability
GPT-5.3-Codex is available through ChatGPT subscriptions and, as of late February 2026, the OpenAI API. Pricing follows the standard token-based model:
| Tier | Input | Output |
|---|---|---|
| Standard API | $3.50/M tokens | $28.00/M tokens |
| Batch API (async) | $1.75/M tokens | $14.00/M tokens |

| Subscription | Price | Usage Allowance |
|---|---|---|
| ChatGPT Plus | $20/month | 45-225 local messages/5hr |
| ChatGPT Pro | $200/month | 300-1500 local messages/5hr |
| ChatGPT Business | $30/user/month | - |
| Enterprise/Edu | Custom pricing | - |
At $3.50/$28.00 per M tokens, GPT-5.3-Codex is cheaper per token than Claude Opus 4.6 ($15.00/$75.00) but more expensive than Gemini 3.1 Pro ($2.00/$12.00). The batch API at $1.75/$14.00 brings costs closer to parity with Gemini for asynchronous workloads. For subscription users, a credit system governs usage: each GPT-5.3-Codex message costs roughly 5 credits locally and 25 credits for cloud tasks.
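To make the trade-off concrete, here is a minimal cost comparison using the per-token prices quoted above (prices as listed in this review; verify against each provider's current pricing page before budgeting):

```python
# Cost of one agentic session at the per-million-token prices quoted above.
# Prices are taken from this review and may change.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "GPT-5.3-Codex (standard)": (3.50, 28.00),
    "GPT-5.3-Codex (batch)": (1.75, 14.00),
    "Claude Opus 4.6": (15.00, 75.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a single session, given token counts."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: a long agentic run with 2M input tokens and 200K output tokens.
for model in PRICES:
    print(f"{model}: ${session_cost(model, 2_000_000, 200_000):.2f}")
```

At these assumed token counts, the standard tier comes out around $12.60 per session versus $45.00 for Opus 4.6 and $6.40 for Gemini 3.1 Pro, and the batch tier ($6.30) undercuts Gemini slightly, consistent with the parity claim above.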
The model is accessible through the Codex web app, CLI, and IDE extension. Pro subscribers also get access to GPT-5.3-Codex-Spark, a smaller research preview variant optimized for real-time coding at ~1000 tokens per second. For a broader view of how these costs compare across the field, check our cost efficiency leaderboard.
Strengths
- Dominant agentic performance. Terminal-Bench 2.0 (77.3%), OSWorld-Verified (64.7%), and cybersecurity CTFs (77.6%) are the highest scores by any model, and the gaps are significant
- Speed. 25% faster than GPT-5.2-Codex with fewer tokens consumed per task, which compounds into meaningful cost and time savings on long-running agentic sessions
- 400K context window. Processes entire enterprise codebases in a single pass, enabling multi-file reasoning and cross-repository refactoring
- Self-improving development. The recursive training approach, where Codex debugged its own training run, points to a compounding development advantage
- Cybersecurity depth. First model rated "High" under OpenAI's Preparedness Framework, with $10M in API credits for cyber defense research
Weaknesses
- General reasoning gap. GPQA Diamond (73.8%) trails Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%) by wide margins - this is not the model for graduate-level science or complex analytical reasoning
- MMLU Pro trails. 82.9% versus 92.6% for Gemini and 85.1% for Claude suggests weaker broad knowledge when not in a coding context
- Delayed API availability. Full API access lagged weeks behind the product launch, frustrating developers building production integrations
- Subscription usage caps. Even Pro users at $200/month face rate limits (300-1500 local messages per 5 hours), which can bottleneck heavy agentic workflows
- Cybersecurity risk. The "High" Preparedness Framework rating cuts both ways - OpenAI acknowledged the model could "meaningfully enable real-world cyber harm" and routes some elevated-risk requests to GPT-5.2 instead
Related Coverage
- OpenAI Launches GPT-5.3-Codex: The Most Capable Agentic Coding Model Yet - Our launch coverage
- GPT-5.2 Review - Review of the predecessor general-purpose model
- OpenAI Frontier Review - Review of OpenAI's Frontier platform
- Coding Benchmarks Leaderboard - Full coding model rankings
- Overall LLM Rankings: February 2026 - Where GPT-5.3 Codex fits in the competitive landscape
- Getting Started with AI Coding Assistants - Guide for developers adopting AI coding tools
- Cost Efficiency Leaderboard - Token cost comparisons across models
Sources
- Introducing GPT-5.3-Codex - OpenAI
- GPT-5.3-Codex System Card - OpenAI
- Codex Pricing - OpenAI Developer Docs
- OpenAI API Pricing
- GPT-5.3 Codex Specs and Benchmarks - LLM Stats
- GPT-5.3 Codex Specs - Automatio
- Claude Opus 4.6 vs GPT-5.3 Codex Comparison - Digital Applied
- GPT-5.3-Codex Raises Cybersecurity Risks - Fortune
- GPT-5.3 Codex Pricing and Features - Eesel
- GPT-5.3-Codex: From Coding Assistant to General Work Agent - DataCamp
