OpenAI o3

OpenAI's most advanced reasoning model, built for math, science, coding, and visual tasks, with 200K context and adaptive chain-of-thought at $2/$8 per million tokens.

OpenAI o3 is the company's most capable reasoning model as of its release in April 2025 - a successor to o1 that applies extended chain-of-thought processing to problems in math, science, software engineering, and visual reasoning. It doesn't just produce a response; it thinks through a problem before answering, spending compute on intermediate reasoning steps that aren't exposed to the user but are billed as output tokens.

TL;DR

  • Best-in-class on AIME 2024 (96.7%), GPQA Diamond (87.7%), and SWE-bench Verified (71.7%) at launch
  • 200K context window, $2.00/M input, $8.00/M output - cut 80% from original launch pricing in June 2025
  • Beats o1 across every major benchmark; slower than non-reasoning models, and reasoning tokens add hidden cost

What separates o3 from earlier o-series models is breadth. The o1 generation excelled at math and coding but couldn't natively handle images. O3 integrates visual inputs directly into its reasoning chain - you can hand it a whiteboard photo or a diagram and it reasons with the image, not just about it. Combined with tool use (web search, Python code execution, image generation), this makes o3 the first reasoning model that works as a general-purpose agent within ChatGPT without switching modes.

O3 launched with o4-mini on April 16, 2025. O4-mini is the cheaper, faster alternative for STEM tasks; o3 is the larger model targeting complex multi-step problems where depth matters more than speed. The pricing at launch - $10/$40 per million tokens - was steep enough to limit adoption. OpenAI cut it 80% on June 10, 2025, when it also introduced o3-pro at $20/$80 per million tokens as the new premium tier.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | OpenAI |
| Model Family | o-series (reasoning) |
| Parameters | Not disclosed |
| Context Window | 200K tokens |
| Max Output Tokens | 100K tokens |
| Input Price | $2.00/M tokens |
| Cached Input Price | $0.50/M tokens |
| Output Price | $8.00/M tokens |
| Release Date | April 16, 2025 |
| Knowledge Cutoff | June 1, 2024 |
| License | Proprietary |

Reasoning tokens are billed as output tokens at $8.00/M. A complex task can create tens of thousands of reasoning tokens before producing any visible output, so real-world cost often runs 3-10x the naive estimate based on input length alone. The Batch API gives a flat 50% discount for non-time-sensitive workloads. Prompt caching cuts input costs to $0.50/M for repeated prefixes.
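That overhead is easy to quantify. Here's a minimal cost sketch using the prices above; the token counts in the example are hypothetical, and it assumes the Batch API discount applies as a flat 50% to the whole bill:

```python
def o3_query_cost(input_tokens, reasoning_tokens, visible_output_tokens,
                  cached_input_tokens=0, batch=False):
    """Estimate the dollar cost of one o3 API call.

    Reasoning tokens are billed at the output rate, so they are added
    to the visible output before pricing.
    """
    INPUT, CACHED, OUTPUT = 2.00, 0.50, 8.00  # $ per million tokens
    uncached = input_tokens - cached_input_tokens
    cost = (uncached * INPUT
            + cached_input_tokens * CACHED
            + (reasoning_tokens + visible_output_tokens) * OUTPUT) / 1_000_000
    return cost / 2 if batch else cost  # Batch API: flat 50% discount

# A 2K-token prompt that triggers 10K reasoning tokens and a 1K answer:
naive = o3_query_cost(2_000, 0, 1_000)        # $0.012, ignoring reasoning
actual = o3_query_cost(2_000, 10_000, 1_000)  # $0.092 - roughly 7.7x naive
```

The gap between `naive` and `actual` is exactly the 3-10x multiplier described above: the reasoning tokens dominate the bill even though the user never sees them.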

Benchmark Performance

O3's launch benchmarks showed the largest single-generation improvement in OpenAI's o-series history. The jump from o1 to o3 on AIME and SWE-bench was more dramatic than the jump from GPT-4o to o1.

| Benchmark | o3 | o1 | Claude Opus 4 | Gemini 2.5 Pro |
| --- | --- | --- | --- | --- |
| AIME 2024 | 96.7% | 74.3% | - | - |
| AIME 2025 | 88.9% | 79.2% | 90.0% | 83.0% |
| GPQA Diamond | 87.7% | 78.0% | - | 86.4% |
| SWE-bench Verified | 71.7% | 48.9% | ~70.3% | - |
| ARC-AGI (high compute) | 87.5% | - | - | - |
| Codeforces Elo | 2,727 | 1,891 | - | - |
| EpochAI FrontierMath | 25.2% | <2% | - | - |

The EpochAI FrontierMath number deserves attention. This benchmark targets unpublished research-level math problems where prior models had never broken 2%. O3's 25.2% - still far from human expert performance - represents a qualitative shift in what these models can do on genuinely novel mathematical reasoning, not just well-represented competition problems.

On ARC-AGI, o3 at high compute (87.5%) edges past average human performance (85%). ARC Prize confirmed there was no performance change after the June 2025 price cut - same model, just cheaper inference.

The caveat: on ARC-AGI-2, released to specifically probe generalization beyond training data, o3 scores below 3% while humans score around 60%. The gap between competitive-benchmark performance and transferable reasoning is real and measurable.

[Image: a pen resting on mathematical calculations on paper] O3's headline numbers come from mathematics competitions like AIME - the model scores 96.7% on AIME 2024, well above o1's 74.3% on the same test. Source: unsplash.com

Key Capabilities

Reasoning with images

O3 is the first reasoning model to integrate visual inputs into its chain of thought rather than treating images as static context. When given a diagram, it can rotate, crop, and reason over it as part of solving a problem. This makes it useful for tasks that previously required either a separate vision model or a human intermediary - reading engineering schematics, analyzing data visualizations, or working through handwritten math.
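Feeding an image into that chain of thought amounts to building a multimodal message. A minimal sketch in the Chat Completions message shape; the question text and image bytes here are placeholders, not a working payload:

```python
import base64

def image_message(question, image_bytes, mime="image/png"):
    """Pair a text question with an inline base64-encoded image (data URL)
    in the Chat Completions multimodal message shape."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }

# Hypothetical payload - real code would read actual PNG bytes from disk:
msg = image_message("What does this circuit diagram compute?", b"\x89PNG...")
# client.chat.completions.create(model="o3", messages=[msg])
```

The model receives the image alongside the question in a single turn, so cropping, rotating, and reasoning over the diagram all happen inside the same reasoning trace.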

Adaptive compute

The reasoning_effort parameter accepts three settings: low, medium, and high. At low, the model spends fewer tokens thinking and responds faster. At high, it allocates more reasoning tokens and tends to self-correct more aggressively. For production use, medium offers a reasonable balance. The tradeoff is explicit: higher effort means more reasoning tokens billed at $8.00/M. OpenAI recommends reserving at least 25K tokens for reasoning plus outputs when sizing your requests.
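That 25K reservation can be folded into a small request-building helper. A sketch in the Responses API shape; `o3_request` and its budget defaults are illustrative, not an official interface:

```python
def o3_request(prompt, effort="medium",
               reasoning_budget=25_000, answer_budget=4_000):
    """Build kwargs for a Responses API call, sizing max_output_tokens
    to cover both the reasoning budget and the visible answer."""
    return {
        "model": "o3",
        "input": prompt,
        "reasoning": {"effort": effort},
        "max_output_tokens": reasoning_budget + answer_budget,
    }

params = o3_request("Prove that sqrt(2) is irrational.", effort="high")
# client.responses.create(**params)
```

Under-sizing max_output_tokens is the common failure mode: if the reasoning budget consumes the whole allowance, the call can terminate before any visible answer is produced, while the reasoning tokens are still billed.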

Agentic tool use

Within ChatGPT, o3 can combine tools in a single reasoning trace - it can search the web for data, write and run Python to analyze it, and produce a visualization, all within one conversation turn. This is what powers the Deep Research mode, which launches an o3-backed agent that autonomously runs multi-step research tasks. Via the API, developers can reproduce this by providing tool definitions and letting the model decide when and how to call them.

[Image: developer using multiple monitors with code editors and terminal windows open] O3's 71.7% on SWE-bench Verified made it the strongest publicly available model for autonomous software engineering tasks at launch. Source: unsplash.com

Deliberative alignment

OpenAI introduced a safety technique called deliberative alignment with o3, where the model reasons about whether a request violates safety specifications before responding. This differs from rule-based filters: the model actively applies its reasoning capability to the safety question. In practice, this means fewer false refusals on edge cases while maintaining rejection rates on genuinely harmful requests. It's the same compute budget applied to safety as to task completion.

Pricing and Availability

API access

O3 is available via the OpenAI API at $2.00/M input, $8.00/M output. The Batch API drops this to $1.00/M input, $4.00/M output for tasks that tolerate a 24-hour completion window. Prompt caching applies automatically to eligible long prompts and cuts input costs to $0.50/M on repeated prefixes. Note that reasoning tokens don't benefit from caching - only the input prefix does.
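For Batch API workloads, each request is one line of JSONL. A minimal sketch of building those lines; the prompts and custom IDs are placeholder examples:

```python
import json

def batch_line(custom_id, prompt):
    """One JSONL line in the Batch API request format: a custom ID plus
    the endpoint and request body to execute."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "o3",
            "messages": [{"role": "user", "content": prompt}],
        },
    })

lines = [batch_line(f"task-{i}", q) for i, q in enumerate(["2+2?", "3+3?"])]
# Write "\n".join(lines) to a .jsonl file, upload it via the Files API,
# then create the batch with a 24h completion window.
```

The custom_id is how you match results back to requests, since the Batch API returns outputs in its own order.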

ChatGPT tiers

ChatGPT Plus and Team subscribers get access to o3 with usage caps (50 messages per week as of the April 2025 launch). Pro subscribers get higher limits. Free-tier users don't have access to o3; they can use o4-mini via the reasoning toggle.

O3-pro tier

O3-pro uses the same weights as o3 but applies more compute per request, targeting tasks where reliability matters more than throughput. Pricing is $20/M input, $80/M output - 10x the standard o3 rate. Response times run notably slower; early tests showed multi-minute waits for complex requests. This tier makes sense for professional workflows where occasional very high-stakes queries justify the cost, not for interactive or high-volume use.

O3-mini

O3-mini, released January 31, 2025, predates the full o3 model and is separately positioned as a low-latency option for STEM tasks. Its pricing is $1.10/M input, $4.40/M output. It doesn't support image inputs and has lower overall benchmark scores, but it's faster and cheaper for coding or math tasks where visual reasoning isn't needed. For a detailed comparison, see our reasoning benchmarks leaderboard.

Strengths and Weaknesses

Strengths

  • Highest published math scores among OpenAI models at launch, including 96.7% on AIME 2024
  • Visual reasoning integrated into chain of thought, not bolted on as a separate step
  • Agentic tool use enables multi-step autonomous workflows within a single reasoning trace
  • Deliberative alignment reduces false refusals compared to filter-based safety systems
  • 80% price cut in June 2025 brought cost to $2/$8/M - competitive with several non-reasoning models
  • 200K context window handles large codebases and long documents
  • Batch API and prompt caching support for production workloads

Weaknesses

  • Reasoning tokens are billed as output; actual cost per query can be 3-10x the listed token price
  • Knowledge cutoff is June 2024 - outdated for fast-moving domains without web search enabled
  • ARC-AGI-2 score (below 3%) shows the gap between benchmark-optimized reasoning and general transfer
  • High latency compared to non-reasoning models; unsuitable for latency-sensitive applications
  • Occasional "laziness" in coding tasks, inserting placeholder comments rather than completing implementations
  • Context window bottleneck: users have reported hitting effective limits of 57-64K tokens despite the 200K advertised window when reasoning token usage is factored in

FAQ

What is o3 best suited for?

Complex multi-step reasoning in math, science, and software engineering. Tasks that require working through intermediate steps, analyzing images with text, or running extended agentic workflows. Not the right choice for fast, simple queries.

How does reasoning token billing work?

Reasoning tokens are produced internally before the visible response. They're billed at the output token rate ($8.00/M). A single complex query can generate 10,000-50,000 reasoning tokens before producing any user-visible output, making real per-query cost hard to predict from input length alone.

What's the difference between o3 and o3-pro?

Same model weights, different compute budget. O3-pro runs more reasoning iterations before responding, which improves reliability on ambiguous or hard tasks but increases cost 10x and notably increases latency. O3 is the better default for most use cases.

Does o3 support function calling?

Yes. O3 supports function calling, structured outputs, and all standard API features. It can also use hosted tools (web search, Python execution, image generation) when those are enabled.
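A minimal tool definition in the Chat Completions function-calling schema; `get_weather` is a hypothetical function used only to show the shape:

```python
def get_weather_tool():
    """Tool definition in the Chat Completions function-calling schema.
    The parameters field is a JSON Schema describing the arguments."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

tool = get_weather_tool()
# client.chat.completions.create(model="o3", messages=..., tools=[tool])
```

When a tool is supplied, o3 decides during its reasoning whether and when to call it, returning the function name and arguments for your code to execute.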

Is o3 available on Azure?

Yes. O3 is available through Azure OpenAI Service, though pricing and rate limits on Azure may differ from the direct OpenAI API.

Last verified May 11, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.