OpenAI o3-pro

OpenAI's maximum-compute reasoning model targets the hardest problems where o3 falls short, at $20/$80 per million tokens.

OpenAI o3-pro launched on June 10, 2025, alongside a simultaneous 80% price cut to the base o3 model. It is a version of o3 that allocates more compute at inference time to produce more reliable, consistent answers on the problems where standard o3 struggles. It replaced o1-pro as the default high-end option for Pro and Team users in ChatGPT.

TL;DR

  • Maximum-compute o3 variant; designed for tasks where standard o3 fails or gives inconsistent answers
  • 200K context, $20/M input and $80/M output - 10x the per-token cost of o3 at current pricing
  • Expert reviewers preferred o3-pro over o3 in every tested category; on AIME 2024 it scores ~93% vs Gemini 2.5 Pro's ~92%

Overview

The o-series from OpenAI uses reinforcement learning to reason through problems before responding. o3-pro sits at the top of that hierarchy - not by changing the base architecture, but by allocating substantially more compute at inference time. OpenAI describes the approach as giving the model room to "think harder," and internal evaluations show it beats standard o3 on every tested category: science, education, programming, business, and writing.

The practical tradeoff is steep. Responses regularly take 5-15 minutes. Users on the OpenAI Developer Community reported queries timing out on mobile clients, and some simple requests consumed more than 13 minutes of processing. That latency is by design: o3-pro works best as a background task for problems that would otherwise take a human expert several hours. For interactive chat or latency-sensitive production pipelines, standard o3 (or the newer GPT-5 line) is a better fit.
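
For API callers, that means treating o3-pro requests as jobs rather than chat turns. A minimal sketch assuming the OpenAI Python SDK's Responses API background mode; the prompt is made up for illustration:

```python
import time

from openai import OpenAI

client = OpenAI()

# Kick off a long-running o3-pro request as a background job so the
# HTTP connection doesn't have to stay open for 5-15 minutes.
job = client.responses.create(
    model="o3-pro",
    input="Audit this session-teardown path for use-after-free bugs: ...",
    background=True,
)

# Poll until the job leaves the queued/in-progress states.
while job.status in ("queued", "in_progress"):
    time.sleep(30)  # o3-pro latency makes tight polling loops pointless
    job = client.responses.retrieve(job.id)

print(job.status)
print(job.output_text)  # convenience accessor for the text output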

What most distinguishes o3-pro is its reliability profile. OpenAI's internal "4/4" benchmark requires the model to answer the same question correctly four times in a row. o3-pro beat both o1-pro and base o3 on that test - a measure of consistency that matters when deploying AI in high-stakes settings where a single wrong answer is unacceptable.
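
The metric itself is simple to reproduce in a local eval harness. A minimal sketch, where `ask` is a hypothetical callable wrapping your own model call and answer extraction (not an OpenAI API):

```python
def passes_4_of_4(ask, question: str, expected: str, attempts: int = 4) -> bool:
    """True only if the model answers correctly on every one of `attempts` runs.

    Note how consistency compounds: a model that is right 70% of the time
    on independent attempts passes 4/4 only 0.7**4 ≈ 24% of the time,
    which is why this metric separates o3-pro from base o3.
    """
    return all(ask(question) == expected for _ in range(attempts))
```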

Key Specifications

| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | o-series (reasoning) |
| Parameters | Not disclosed |
| Context Window | 200,000 tokens |
| Max Output Tokens | 100,000 tokens |
| Input Price | $20.00 per million tokens |
| Output Price | $80.00 per million tokens |
| Cached Input Price | $5.00 per million tokens |
| Release Date | June 10, 2025 |
| Knowledge Cutoff | June 1, 2024 |
| License | Proprietary |

[Figure: Benchmark comparisons for frontier reasoning models in mid-2025 - o3-pro and Gemini 2.5 Pro tied at 84% on GPQA Diamond in third-party testing.]

Benchmark Performance

OpenAI's release materials cite consistent preference for o3-pro over o3 in expert human evaluations, but the company didn't publish a detailed numeric benchmark table at launch. Third-party testing fills in some gaps:

| Benchmark | o3-pro | o3 | Gemini 2.5 Pro |
|---|---|---|---|
| AIME 2024 (math) | ~93% | 91.6% | ~92% |
| GPQA Diamond (PhD science) | ~84% | ~83.3% | ~84% |
| SWE-bench Verified (coding) | ~71.7% | 71.7% | 63.2% |
| Codeforces Elo | ~2,727 | 2,727 | Not published |

A few things stand out. On GPQA Diamond, o3-pro and Gemini 2.5 Pro scored identically in Analytics Vidhya's testing, despite o3-pro costing roughly 16x more. On math, o3-pro edged ahead slightly. On SWE-bench, the advantage over Gemini 2.5 Pro (71.7% vs 63.2%) is the clearest gap - consistent with OpenAI's claim that coding and scientific reasoning are where the extra compute pays off most.

The more meaningful differentiator is the 4/4 reliability score. For tasks where passing 75% of the time isn't good enough - peer-review-level math, security audits, formal proofs - that consistency advantage matters more than a raw accuracy number. The reasoning benchmarks leaderboard tracks the full picture across models.

Key Capabilities

Formal reasoning and mathematics. o3-pro's extended reasoning budget makes it capable of multi-step proofs and competition-math problems where standard models either give up or produce confident errors. The AIME score above reflects contest problems designed to defeat pattern-matching, and the 96.7% AIME figure cited in o3's initial release charts suggests o3-pro's consistency improvements compound on an already strong base.

[Figure: Security research use case - researchers have used the o3 family to identify previously unknown vulnerabilities in complex codebases.]

Security research and vulnerability auditing. The o3 family gained credibility in this space when a researcher used the model to identify CVE-2025-37899, a use-after-free bug in the Linux kernel's SMB implementation. The model spotted the vulnerability as a secondary finding while the researcher was benchmarking it against a different known CVE. o3-pro's reliability profile - not just finding bugs but finding them consistently - is the feature that matters for this workflow. Independent tests showed o3 finding a Kerberos authentication vulnerability in 8 of 100 attempts; Claude 3.7 Sonnet managed 3 of 100.

Tool use in ChatGPT. In the ChatGPT interface, o3-pro has access to web search, Python execution, file analysis, and memory. It doesn't support image generation, Canvas, or temporary chats (the last two were listed as pending fixes at launch).

The API version supports text and image input, function calling, and structured outputs. Streaming isn't supported, which is consistent with the multi-minute response times. The Responses API is the primary recommended interface.
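
To make that concrete, here is a rough sketch of a text-plus-image call, assuming the Responses API's `input_text`/`input_image` content parts; the image URL is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

# Text + image input in one request; no streaming, so expect a long wait.
response = client.responses.create(
    model="o3-pro",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text",
                 "text": "What failure mode does this stack-trace screenshot suggest?"},
                {"type": "input_image",
                 "image_url": "https://example.com/stack-trace.png"},
            ],
        }
    ],
)

print(response.output_text)
```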

Pricing and Availability

At $20/$80 per million input/output tokens, o3-pro is 10x the per-token price of the current o3 API rate ($2/$8 after the simultaneous price cut on launch day). Against o1-pro ($150/$600), its predecessor for Pro users, it's about 87% cheaper.
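
A back-of-the-envelope estimator built from the list prices above makes the tradeoff concrete; the token counts in the example are illustrative:

```python
# o3-pro list prices (USD per million tokens) from the table above.
INPUT, OUTPUT, CACHED = 20.00, 80.00, 5.00

def cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate one request's cost; cached tokens bill at the cached-input rate."""
    billable_input = input_tokens - cached_tokens
    return (billable_input * INPUT + cached_tokens * CACHED
            + output_tokens * OUTPUT) / 1_000_000

# A single deep-audit run: 150K tokens of code in, 20K tokens of analysis out.
print(cost_usd(150_000, 20_000))           # 4.60
# Same run with a 100K-token cached prefix (e.g. a stable codebase snapshot).
print(cost_usd(150_000, 20_000, 100_000))  # 3.10
```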

Prompt caching cuts input costs to $5 per million cached tokens. The Batch API is listed as supported in the model card, though some developers reported errors when trying to use it at launch - worth testing if you plan to run bulk evaluations.

Rate limits scale with usage tier: from 500 requests/minute and 30K tokens/minute at entry level up to 10,000 requests/minute and 30M tokens/minute for high-volume tiers.
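
If bulk jobs bump into those limits, the standard defense is retry with capped exponential backoff; a minimal sketch assuming the official Python SDK's `RateLimitError`:

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def create_with_backoff(max_retries: int = 5, **kwargs):
    """Retry a Responses API call with capped exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return client.responses.create(**kwargs)
        except RateLimitError:
            time.sleep(min(60, 2 ** attempt + random.random()))
    raise RuntimeError(f"still rate-limited after {max_retries} attempts")
```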

Access is available to:

  • ChatGPT Pro and Team users (default model, replaces o1-pro)
  • Enterprise and Edu accounts (rolled out the week after launch)
  • API (all tiers, check rate limits per tier)

The model isn't available in the Realtime API or for fine-tuning. It doesn't support audio input or output.

Strengths and Weaknesses

Strengths

  • Highest consistency of any o-series model - the 4/4 reliability metric matters for formal work
  • Strong AIME and GPQA performance for frontier math and science
  • 200K context handles long documents, large codebases, and multi-document reasoning
  • Prompt caching reduces costs on repeated long-context requests
  • Multimodal input (text + images) without compromising reasoning depth

Weaknesses

  • 5-15 minute response times; mobile clients and short timeouts will break
  • Streaming not supported - you wait for the full response
  • $80/M output is hard to justify for tasks where o3 or Claude Opus 4 perform comparably
  • Knowledge cutoff of June 2024 is behind newer models
  • No image generation, Canvas, or temporary chats in ChatGPT
  • Continued hallucination reports from users despite improved consistency claims

FAQ

What is o3-pro best at?

Tasks requiring consistent correctness across multiple attempts: formal math proofs, security audits, PhD-level science problems, and complex code analysis. It's the right choice when o3 passes 70% of the time and you need 95%+.

How much slower is o3-pro than o3?

Notably slower. Users report 5-15 minutes per response for complex queries, compared to seconds or low minutes for standard o3. The Responses API with background mode is the recommended approach for production use.

Can I use o3-pro in the API?

Yes. It's available via the Chat Completions and Responses APIs. The Responses API is the primary recommended interface. Streaming isn't supported, so your application needs to handle long-polling or webhooks.

Is prompt caching supported?

Yes. Cached input tokens are priced at $5 per million, reducing the cost of long repeated context. This matters most for workflows that re-send large system prompts or document contexts across multiple requests.
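
OpenAI's caching applies automatically to repeated prompt prefixes, so the main design choice is keeping the stable content at the front of every request. A sketch, with a hypothetical guidelines file standing in for the large shared context:

```python
from openai import OpenAI

client = OpenAI()

# Stable, large context goes first so consecutive requests share a cacheable
# prefix; only the short per-item question varies.
AUDIT_GUIDE = open("audit_guidelines.md").read()  # hypothetical large document

def review(snippet: str):
    return client.responses.create(
        model="o3-pro",
        instructions=AUDIT_GUIDE,  # identical across calls -> cache hits
        input=f"Review this change for memory-safety issues:\n{snippet}",
    )
```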

How does o3-pro compare to Gemini 2.5 Pro?

On GPQA Diamond they're basically tied (~84%). o3-pro has a slight edge on AIME math (~93% vs ~92%) and a more significant lead on SWE-bench (~71.7% vs ~63.2%). Gemini 2.5 Pro has a 1M-token context window and costs a fraction of the price for most workloads.

Does o3-pro support vision/multimodal input?

Yes, image input is supported with text. Audio and video are not supported. It can't produce images.


About the author
James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.