GPT-5.3 Codex
OpenAI's most capable agentic coding model combines frontier code generation with GPT-5-class reasoning, 400K context, and a 77.3% Terminal-Bench 2.0 score.

Overview
OpenAI released GPT-5.3-Codex on February 5, 2026, and it represents something genuinely new in the model landscape: the first frontier model purpose-built for agentic software development that also helped create itself during training. Rather than being a general-purpose model with coding bolted on, GPT-5.3-Codex merges the coding specialization of the Codex line with the reasoning and professional knowledge capabilities of GPT-5.2, producing a model that can plan, write, test, and debug entire software projects autonomously.
The headline numbers are strong. Scores of 77.3% on Terminal-Bench 2.0, 64.7% on OSWorld-Verified, and 77.6% on cybersecurity CTF benchmarks place it decisively ahead of all competitors on tool-use and terminal-based coding tasks. It is 25% faster than GPT-5.2-Codex, and OpenAI claims it uses fewer tokens than any prior model to complete equivalent tasks. The self-bootstrapping story - earlier Codex versions were used to debug the training run, diagnose test results, and optimize deployment infrastructure - is the kind of detail that makes you pay attention.
But the competitive picture is nuanced. Gemini 3.1 Pro and Claude Opus 4.6 match or beat GPT-5.3-Codex on general reasoning benchmarks like GPQA Diamond and MMLU Pro. Where this model carves out its niche is sustained agentic execution: dropping into a terminal, navigating a codebase, running tests, and iterating until the task is done. If your workflow is agentic coding, this is the model to benchmark against. For everything else, the answer is less clear-cut.
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-5 / Codex |
| Architecture | GPT-5 Transformer (trained on NVIDIA GB200 NVL72) |
| Parameters | Not disclosed |
| Context Window | 400,000 tokens input / 128,000 tokens output |
| Input Price | $3.50/M tokens (standard), $1.75/M tokens (batch) |
| Output Price | $28.00/M tokens (standard), $14.00/M tokens (batch) |
| Release Date | February 5, 2026 |
| License | Proprietary |
| Input Modalities | Text, images |
| Output Modality | Text |
| Cybersecurity Rating | High (Preparedness Framework) |
Benchmark Performance
| Benchmark | GPT-5.3 Codex | Gemini 3.1 Pro | Claude Opus 4.6 |
|---|---|---|---|
| Terminal-Bench 2.0 (agentic) | 77.3% | 68.5% | 65.4% |
| OSWorld-Verified (desktop automation) | 64.7% | - | - |
| Cybersecurity CTF | 77.6% | - | - |
| SWE-Lancer IC Diamond | 81.4% | - | - |
| SWE-bench Verified (coding) | 80.0% | 80.6% | 80.8% |
| SWE-bench Pro Public | 56.8% | 54.2% | - |
| GPQA Diamond (science) | 73.8% | 94.3% | 91.3% |
| MMLU Pro (knowledge) | 82.9% | 92.6% | 85.1% |
| MMLU (knowledge) | 93% | - | - |
| AIME 2025 (math) | 94% | 100% | - |
| MATH | 96% | - | - |
| HumanEval (code generation) | 93% | - | - |
The numbers tell a clear story of specialization. GPT-5.3-Codex dominates every benchmark that involves sustained tool use: Terminal-Bench 2.0 (77.3% vs. 68.5% for the nearest competitor), OSWorld-Verified (64.7%, a 26.5-point jump over GPT-5.2-Codex), and cybersecurity CTFs (77.6%). These are the benchmarks that matter most for agentic coding workflows where the model must navigate a real development environment.
On general reasoning, the story flips. Gemini 3.1 Pro's 94.3% on GPQA Diamond and 100% on AIME 2025 put it well ahead of GPT-5.3-Codex's 73.8% and 94% respectively. Claude Opus 4.6 also leads on GPQA Diamond (91.3%) and MMLU Pro (85.1% vs 82.9%). On SWE-bench Verified, which tests real-world bug fixing, all three frontier models cluster within a point of each other - effectively a three-way tie at ~80%. For a full breakdown of how these models stack up, see our coding benchmarks leaderboard and overall LLM rankings.
Key Capabilities
The defining capability of GPT-5.3-Codex is end-to-end agentic task execution. Give it a feature spec, and it will plan an approach, write code across multiple files, run the test suite, diagnose failures, and iterate. It handles context switching between files naturally, understands testing frameworks and build systems, and can work with complex legacy codebases that tripped up earlier models. The 400K-token context window means it can ingest entire repositories in a single pass, maintaining coherent understanding across sprawling projects.
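The 400K figure is easier to reason about with a quick back-of-envelope check. The sketch below uses the common ~4-characters-per-token heuristic (our assumption, not a published tokenizer ratio for this model) to estimate whether a repository fits in a single pass:

```python
# Rough feasibility check: does a codebase fit in a 400K-token context window?
# Assumes ~4 characters per token, a common heuristic for code and English text;
# actual tokenizer ratios vary by language and content.
CONTEXT_WINDOW = 400_000   # input tokens, per the spec table above
CHARS_PER_TOKEN = 4        # heuristic, not a published figure for this model

def estimated_tokens(total_chars: int) -> int:
    """Estimate token count from raw character count."""
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(total_chars: int) -> bool:
    """True if the estimated token count fits within the input window."""
    return estimated_tokens(total_chars) <= CONTEXT_WINDOW

# A 1.2 MB repository (~1,200,000 chars) is roughly 300K tokens: fits.
print(fits_in_context(1_200_000))  # True
# A 2.0 MB repository is roughly 500K tokens: does not fit.
print(fits_in_context(2_000_000))  # False
```

For real workloads, counting with an actual tokenizer will be more accurate than any character heuristic.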
The self-improving training story is not marketing fluff. OpenAI's team used earlier Codex versions to monitor and debug the training run itself, track patterns across training, analyze interaction quality, and propose fixes. When engineering encountered edge cases in production, they used Codex to identify context rendering bugs and root-cause low cache hit rates. The model even built regex classifiers to analyze its own performance across session logs during alpha testing. This recursive development approach is a first for a publicly released model and signals a tightening feedback loop between AI capability and AI development speed.
OpenAI also introduced interactive mid-task steering, allowing developers to redirect the model while it works. Previous Codex models sometimes completed tasks prematurely or got stuck in linting loops - GPT-5.3-Codex shows marked improvements in both areas. The model ships with deep diffs for reasoning transparency, giving developers visibility into why it made specific changes rather than just presenting the final output.
Pricing and Availability
GPT-5.3-Codex is available through ChatGPT subscriptions and, as of late February 2026, the OpenAI API. Pricing follows the standard token-based model:
| Tier | Input | Output |
|---|---|---|
| Standard API | $3.50/M tokens | $28.00/M tokens |
| Batch API (async) | $1.75/M tokens | $14.00/M tokens |

| Subscription | Price | Usage Allowance |
|---|---|---|
| ChatGPT Plus | $20/month | 45-225 local messages/5hr |
| ChatGPT Pro | $200/month | 300-1500 local messages/5hr |
| ChatGPT Business | $30/user/month | - |
| Enterprise/Edu | Custom pricing | - |
At $3.50/$28.00 per M tokens, GPT-5.3-Codex is cheaper per token than Claude Opus 4.6 ($15.00/$75.00) but more expensive than Gemini 3.1 Pro ($2.00/$12.00). The batch API at $1.75/$14.00 brings costs closer to parity with Gemini for asynchronous workloads. For subscription users, a credit system governs usage: each GPT-5.3-Codex message costs roughly 5 credits locally and 25 credits for cloud tasks.
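To make the trade-off concrete, here is a minimal cost comparison using the per-token prices quoted above (prices as listed in this review; verify against each provider's current pricing page before budgeting):

```python
# Cost of one agentic session at the per-million-token prices quoted above.
# Prices are taken from this review and may change.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "GPT-5.3-Codex (standard)": (3.50, 28.00),
    "GPT-5.3-Codex (batch)": (1.75, 14.00),
    "Claude Opus 4.6": (15.00, 75.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a single session, given token counts."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: a long agentic run with 2M input tokens and 200K output tokens.
for model in PRICES:
    print(f"{model}: ${session_cost(model, 2_000_000, 200_000):.2f}")
```

At these assumed token counts, the standard tier comes out around $12.60 per session versus $45.00 for Opus 4.6 and $6.40 for Gemini 3.1 Pro, and the batch tier ($6.30) undercuts Gemini slightly, consistent with the parity claim above.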
The model is accessible through the Codex web app, CLI, and IDE extension. Pro subscribers also get access to GPT-5.3-Codex-Spark, a smaller research preview variant optimized for real-time coding at ~1000 tokens per second. For a broader view of how these costs compare across the field, check our cost efficiency leaderboard.
Strengths
- Dominant agentic performance. Terminal-Bench 2.0 (77.3%), OSWorld-Verified (64.7%), and cybersecurity CTFs (77.6%) are the highest scores by any model, and the gaps are significant
- Speed. 25% faster than GPT-5.2-Codex with fewer tokens consumed per task, which compounds into meaningful cost and time savings on long-running agentic sessions
- 400K context window. Processes entire enterprise codebases in a single pass, enabling multi-file reasoning and cross-repository refactoring
- Self-improving development. The recursive training approach, where Codex debugged its own training run, points to a compounding development advantage
- Cybersecurity depth. First model rated "High" under OpenAI's Preparedness Framework, with $10M in API credits for cyber defense research
Weaknesses
- General reasoning gap. GPQA Diamond (73.8%) trails Gemini 3.1 Pro (94.3%) and Claude Opus 4.6 (91.3%) by wide margins - this is not the model for graduate-level science or complex analytical reasoning
- MMLU Pro trails. 82.9% versus 92.6% for Gemini and 85.1% for Claude suggests weaker broad knowledge when not in a coding context
- Delayed API availability. Full API access lagged weeks behind the product launch, frustrating developers building production integrations
- Subscription usage caps. Even Pro users at $200/month face rate limits (300-1500 local messages per 5 hours), which can bottleneck heavy agentic workflows
- Cybersecurity risk. The "High" Preparedness Framework rating cuts both ways - OpenAI acknowledged the model could "meaningfully enable real-world cyber harm" and routes some elevated-risk requests to GPT-5.2 instead
Related Coverage
- OpenAI Launches GPT-5.3-Codex: The Most Capable Agentic Coding Model Yet - Our launch coverage
- GPT-5.2 Review - Review of the predecessor general-purpose model
- OpenAI Frontier Review - Review of OpenAI's Frontier platform
- Coding Benchmarks Leaderboard - Full coding model rankings
- Overall LLM Rankings: February 2026 - Where GPT-5.3 Codex fits in the competitive landscape
- Getting Started with AI Coding Assistants - Guide for developers adopting AI coding tools
- Cost Efficiency Leaderboard - Token cost comparisons across models
Sources
- Introducing GPT-5.3-Codex - OpenAI
- GPT-5.3-Codex System Card - OpenAI
- Codex Pricing - OpenAI Developer Docs
- OpenAI API Pricing
- GPT-5.3 Codex Specs and Benchmarks - LLM Stats
- GPT-5.3 Codex Specs - Automatio
- Claude Opus 4.6 vs GPT-5.3 Codex Comparison - Digital Applied
- GPT-5.3-Codex Raises Cybersecurity Risks - Fortune
- GPT-5.3 Codex Pricing and Features - Eesel
- GPT-5.3-Codex: From Coding Assistant to General Work Agent - DataCamp
