Reasoning Model API Pricing Compared 2026
Real cost of AI reasoning APIs in 2026 - o3, o4-mini, Claude thinking, Gemini Deep Think, DeepSeek-R2, QwQ. Includes reasoning-token overhead and effective cost per query.

Reasoning model pricing is deceptive - the headline per-token rate understates real cost by 5x to 30x.
TL;DR
- Reasoning models bill hidden thinking tokens at the same rate as output tokens - the advertised price understates your real bill by 5x to 30x depending on model and task
- o3 at $8/MTok output sounds expensive; at 15,000 average reasoning tokens per query on hard math, the effective output cost per query is around $0.12 - before you see a single character of the answer
- Claude extended thinking bills thinking tokens and text tokens separately at the same $15-25/MTok output rate; the budget_tokens parameter is a soft ceiling, not a hard cost cap
- Gemini 2.5 Pro thinking tokens cost the same as regular output ($10/MTok); Flash thinking is $3.50/MTok for thinking, the best value on hard tasks
- o4-mini at $4.40/MTok output is cheaper than o3 and competitive on most benchmarks - it should be your default reasoning model unless you need o3's ceiling
- For simple tasks like chatbot responses, RAG lookups, and creative writing, reasoning mode adds cost and latency with no measurable accuracy gain - turn it off
If you've been reading my general LLM API pricing comparison, you already know the headline rates for the major models. This article is about a specific trap that burns engineering teams: reasoning models bill substantially more per actual response than the listed output token price suggests.
The problem is straightforward. Every "thinking" or "reasoning" model internally generates a chain of thought before producing a final answer. Those internal tokens - OpenAI calls them reasoning tokens, Anthropic calls them thinking tokens, Google calls them thought tokens - are billed at the same rate as your visible output. But they are not visible. You get the answer; you don't get the scratchpad. You pay for both.
On a hard math problem, o3 might spend 20,000 tokens reasoning before it writes one word of the response you see. At $8/MTok for output tokens, that's $0.16 in invisible compute before your actual answer starts. If the visible answer is 500 tokens, your $0.004 answer cost you $0.164 total. The effective price per "query worth of output" is 41x the advertised output token rate.
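The arithmetic above can be sketched in a few lines (token counts are illustrative, not guarantees):

```python
MTOK = 1_000_000
o3_output_rate = 8.00  # $/MTok for o3 output; reasoning tokens bill at the same rate

reasoning_tokens = 20_000  # hidden chain of thought you never see
visible_tokens = 500       # the answer you actually receive

hidden_cost = reasoning_tokens / MTOK * o3_output_rate    # $0.16
visible_cost = visible_tokens / MTOK * o3_output_rate     # $0.004
multiplier = (hidden_cost + visible_cost) / visible_cost  # 41x

print(f"hidden ${hidden_cost:.3f}, visible ${visible_cost:.3f}, {multiplier:.0f}x")
```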
This matters more as reasoning models become the default tier for production workloads. Let's work through the real numbers.
The Full Pricing Table
All prices in USD per million tokens (MTok). Prices verified against official provider documentation as of April 2026. "Reasoning output" refers to hidden thinking tokens billed separately. Where a model has a single output price for both thinking and visible tokens, that price appears once in "Final output."
| Model | Provider | Input /MTok | Cached input /MTok | Reasoning output /MTok | Final output /MTok | Context | Notes |
|---|---|---|---|---|---|---|---|
| o4-mini | OpenAI | $1.10 | $0.275 | $4.40 | $4.40 | 200K | Best value reasoning; same price for thinking+output |
| o3 | OpenAI | $2.00 | $0.50 | $8.00 | $8.00 | 200K | Heavy tasks; same price for thinking+output |
| Gemini 2.5 Flash thinking | Google | $0.30 | $0.075 | $3.50 | $3.50 | 1M | Cheapest capable thinking model |
| DeepSeek-R2 | DeepSeek | $0.55 | $0.14 | $2.19 | $2.19 | 128K | CoT separate token type; same rate |
| QwQ-Max | Qwen/Alibaba | $0.40 | $0.10 | $1.60 | $1.60 | 131K | Via OpenRouter; strong STEM |
| Qwen3-235B-A22B | Qwen/Alibaba | $0.50 | $0.125 | $1.40 | $1.40 | 131K | MoE architecture; cheapest frontier-class reasoning |
| Claude Sonnet 4 (thinking) | Anthropic | $3.00 | $0.30 | $15.00 | $15.00 | 200K | thinking + text blocks at same rate |
| Claude Opus 4 (thinking) | Anthropic | $5.00 | $0.50 | $25.00 | $25.00 | 200K | Highest ceiling; extended thinking optional |
| Gemini 2.5 Pro thinking | Google | $1.25 | $0.31 | $10.00 | $10.00 | 2M | Input tiers: $1.25 up to 200K, $2.50 above |
| Gemini 2.5 Flash Deep Think | Google | $0.30 | $0.075 | $3.50 | $3.50 | 1M | Same rate as Flash thinking; higher budget mode |
| Grok 4 (heavy reasoning) | xAI | $3.00 | $0.75 | $15.00 | $15.00 | 256K | Reasoning mode toggleable |
| Kimi-R (MoonshotAI) | OpenRouter | $1.00 | N/A | $3.00 | $3.00 | 256K | R-series; open-weights |
| Skywork-OR1-32B | Open-weights | self-host | self-host | self-host | self-host | 32K | Open-weights ORPO; strong reasoning |
| Hermes-R (NousResearch) | Open-weights | self-host | self-host | self-host | self-host | 128K | Fine-tune on Llama; structured reasoning |
Notes on this table:
The "same rate" pattern - where reasoning tokens and final output tokens cost the same - applies to every model here. OpenAI was the first to publish this clearly; reasoning tokens are billed as standard output tokens at identical rates. Anthropic's extended thinking works the same way: thinking content blocks and text content blocks in the response are billed together as output tokens.
For open-weights models (Skywork-OR1, Hermes-R), you pay GPU runtime instead of per-token fees. On cloud GPU rental at $2-4/hr for a single A100, a 32B model processes roughly 1,500 tokens/second, which works out to $0.37-$0.74 per million tokens - competitive with commercial mid-tier options when you factor in thinking tokens that you're not being charged for separately.
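The GPU-rental conversion above can be reproduced directly; the 1,500 tok/s throughput figure is the text's assumption, so substitute your own measured number:

```python
def gpu_cost_per_mtok(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Convert GPU rental cost into an effective $/MTok figure."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Assumed: 32B model on one A100 at 1,500 tok/s (from the text above)
low = gpu_cost_per_mtok(2.0, 1500)   # ≈ $0.37/MTok
high = gpu_cost_per_mtok(4.0, 1500)  # ≈ $0.74/MTok
print(f"${low:.2f} - ${high:.2f} per MTok")
```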
Why the Advertised Price Misleads
Standard LLM pricing assumes you pay for what you prompt (input) and what you receive (output). Reasoning models break that model because there is a third category: computation you trigger but never see.
Consider this request: "Prove that the square root of 2 is irrational."
A standard GPT-4-class model writes a proof in roughly 300-500 output tokens. Direct cost at $8/MTok: about $0.003.
An o3-class model internally works through multiple proof strategies, checks edge cases, and evaluates clarity before committing - roughly 8,000-12,000 thinking tokens. Then it writes the same 300-500 token proof. Direct cost at $8/MTok for thinking: $0.064-$0.096. Plus $0.003 for the visible output. Total: $0.067-$0.099.
That's 22x-33x more expensive per response, for the same visible output length. The quality difference is real - the proof is more rigorous and handles edge cases - but you need to decide whether that quality is worth 30x the cost for your use case.
The ratio varies by task type and model. Here are typical ranges based on observed token counts across model documentation and community benchmarks.
Reasoning Token Overhead by Task Type
The following estimates are based on typical reasoning token counts reported in model documentation, API responses (where reasoning token counts are exposed), and published benchmarks. They are representative, not guarantees - the same model on the same prompt can vary by 2-3x.
Task Type 1: Hard Math (Competition-Level)
Math problems at the difficulty of AIME or AMC 12 require multi-step chained reasoning with backtracking.
| Model | Typical reasoning tokens | Visible output tokens | Total output billed | Effective output cost |
|---|---|---|---|---|
| o3 | 18,000 - 25,000 | 400 - 600 | 18,400 - 25,600 | $0.147 - $0.205 |
| o4-mini | 10,000 - 16,000 | 400 - 600 | 10,400 - 16,600 | $0.046 - $0.073 |
| Gemini 2.5 Pro thinking | 8,000 - 15,000 | 300 - 500 | 8,300 - 15,500 | $0.083 - $0.155 |
| Gemini 2.5 Flash thinking | 6,000 - 12,000 | 300 - 500 | 6,300 - 12,500 | $0.022 - $0.044 |
| Claude Opus 4 (thinking) | 5,000 - 12,000 | 400 - 700 | 5,400 - 12,700 | $0.135 - $0.318 |
| DeepSeek-R2 | 6,000 - 14,000 | 350 - 600 | 6,350 - 14,600 | $0.014 - $0.032 |
Hard math observation: o4-mini gives you roughly 80% of o3's AIME score at one-third the effective cost per query. For AIME-level problems, Claude Opus 4 is the most expensive option per query despite competitive accuracy, because its output token rate ($25/MTok) is 3x o3's.
Task Type 2: Complex Code Fix (Multi-File Bug)
A bug that spans multiple files and requires understanding call graph context. Prompt size: ~8,000 tokens.
| Model | Typical reasoning tokens | Visible output tokens | Total output billed | Effective output cost |
|---|---|---|---|---|
| o3 | 5,000 - 10,000 | 800 - 1,500 | 5,800 - 11,500 | $0.046 - $0.092 |
| o4-mini | 3,000 - 7,000 | 800 - 1,500 | 3,800 - 8,500 | $0.017 - $0.037 |
| Gemini 2.5 Pro thinking | 3,000 - 8,000 | 700 - 1,200 | 3,700 - 9,200 | $0.037 - $0.092 |
| Gemini 2.5 Flash thinking | 2,500 - 6,000 | 700 - 1,200 | 3,200 - 7,200 | $0.011 - $0.025 |
| Claude Sonnet 4 (thinking) | 2,000 - 6,000 | 800 - 1,500 | 2,800 - 7,500 | $0.042 - $0.113 |
| DeepSeek-R2 | 2,000 - 5,000 | 700 - 1,200 | 2,700 - 6,200 | $0.006 - $0.014 |
Code fix observation: DeepSeek-R2 is the clear cost winner here. At $0.006-$0.014 per complex code query, it's roughly 2x cheaper than Gemini 2.5 Flash thinking and 7-8x cheaper than Claude Sonnet 4 in thinking mode. The benchmark accuracy gap versus frontier models is ~8-12% on SWE-bench, which may or may not matter depending on your CI setup.
Task Type 3: Agentic Task (Multi-Step Planning)
A multi-step agentic workflow: user asks the agent to "research competitors, summarize findings, and draft a report." This involves multiple sub-tasks, tool calls, and iterative planning. Prompt grows across turns.
| Model | Typical reasoning tokens per turn | Turns in session | Total reasoning tokens | Session output cost |
|---|---|---|---|---|
| o3 | 4,000 - 8,000 | 8 - 15 | 32,000 - 120,000 | $0.26 - $0.96 |
| o4-mini | 2,000 - 5,000 | 8 - 15 | 16,000 - 75,000 | $0.07 - $0.33 |
| Claude Sonnet 4 (thinking) | 1,500 - 4,000 | 8 - 15 | 12,000 - 60,000 | $0.18 - $0.90 |
| Gemini 2.5 Flash thinking | 1,500 - 4,000 | 8 - 15 | 12,000 - 60,000 | $0.04 - $0.21 |
| DeepSeek-R2 | 1,500 - 4,000 | 8 - 15 | 12,000 - 60,000 | $0.026 - $0.13 |
Agentic observation: Multi-session reasoning costs compound. An o3 agent running 15 turns on a complex task can cost close to $1.00 in reasoning tokens alone, before counting input tokens across a growing context. In high-volume agentic workloads, this is the line item that kills your budget. For agentic pipelines, Gemini 2.5 Flash thinking or DeepSeek-R2 are the pragmatic default unless you need o3-tier ceiling on individual steps.
See our agentic AI benchmarks leaderboard for accuracy comparisons on agent-specific tasks.
Effective Cost Per Query - The Real Number
To estimate what you'll actually pay per meaningful response, use this formula:
effective_cost = (input_tokens × input_price)
+ (reasoning_tokens × output_price)
+ (visible_output_tokens × output_price)
Or as a single number you can benchmark your workload against:
Effective output multiplier = (reasoning_tokens + visible_output_tokens) / visible_output_tokens
If a model generates 10,000 reasoning tokens to write 500 visible output tokens, the multiplier is 21x. You are paying for 21x the tokens you receive.
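The formula and multiplier above translate directly into a helper you can drop into a cost dashboard (the o3 token counts in the example are illustrative):

```python
MTOK = 1_000_000

def effective_cost(input_tokens, reasoning_tokens, visible_tokens,
                   input_price, output_price):
    """Total USD per query; prices in $/MTok. Hidden reasoning bills at the output rate."""
    return (input_tokens * input_price
            + (reasoning_tokens + visible_tokens) * output_price) / MTOK

def output_multiplier(reasoning_tokens, visible_tokens):
    """How many output tokens you pay for per visible output token."""
    return (reasoning_tokens + visible_tokens) / visible_tokens

# o3 on hard math: 1K-token prompt, 21K reasoning tokens, 500 visible tokens
cost = effective_cost(1_000, 21_000, 500, input_price=2.00, output_price=8.00)
mult = output_multiplier(10_000, 500)
print(f"${cost:.3f} per query; 10K reasoning / 500 visible = {mult:.0f}x")
```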
Here is that multiplier for each task type and model, using median token counts:
| Model | Math (21K reasoning) | Code fix (6K reasoning) | Agent turn (3K reasoning) |
|---|---|---|---|
| o3 | 45x | 8x | 5x |
| o4-mini | 28x | 6x | 4x |
| Gemini 2.5 Pro thinking | 27x | 6x | 4x |
| Gemini 2.5 Flash thinking | 21x | 5x | 4x |
| Claude Opus 4 (thinking) | 19x | 5x | 4x |
| Claude Sonnet 4 (thinking) | 16x | 5x | 3x |
| DeepSeek-R2 | 23x | 6x | 4x |
The multipliers are similar across models. The big differentiator is the per-token rate once you apply the multiplier. DeepSeek-R2 at $2.19/MTok output with a 23x multiplier still beats Claude Sonnet 4 at $15/MTok output with a 16x multiplier.
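That comparison in two lines: multiply each model's output rate by its hard-math multiplier to get an effective price per MTok of visible output.

```python
# Effective $/MTok of visible output = output rate x multiplier (hard-math column)
deepseek_r2 = 2.19 * 23     # ≈ $50 per MTok of visible output
claude_sonnet = 15.00 * 16  # ≈ $240 per MTok of visible output
print(deepseek_r2 < claude_sonnet)  # True
```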
When Reasoning Mode Is Worth the Cost
Reasoning mode earns its cost premium when correctness on hard problems measurably matters and errors carry real cost.
High-value use cases:
Competition math and formal proofs: o3 and o4-mini score 93-96% on AIME 2024/2025; standard GPT-5 scores ~60%. If you're building a math tutoring product, the accuracy delta is the product. See the math olympiad AI leaderboard for current rankings.
Security-critical code review: Reasoning models catch logic bugs and edge cases that autocompletion-style models miss. For code review pipelines where a missed vulnerability costs $10K+ in incident response, spending $0.05-$0.15 per review is trivially justified.
Multi-step legal/financial analysis: Tasks requiring citation chaining, cross-reference checking, and logical consistency across long documents. Reasoning models outperform standard models by 15-25% on structured analytical tasks in professional domains.
Agentic planning (single-pass): When an agent needs to plan a complex workflow in one shot before executing, reasoning models reduce replanning failures. The cost is justified when replanning means re-running expensive downstream steps.
Hard coding tasks: SWE-bench performance. o3 and Sonnet 4 (thinking) both clear 60%+; standard non-reasoning frontier models are in the 40-50% range. For auto-fix pipelines where each incorrect fix triggers a new CI run, the cost delta buys fewer failed runs.
When to Skip Reasoning Mode
Not every task benefits from reasoning tokens. Turning them off - or using a non-reasoning model - is often the right engineering decision.
Skip reasoning mode for:
Chatbot and conversational responses: Reasoning models add latency with negligible quality improvement on open-ended conversational tasks. Standard GPT-5, Claude Sonnet, or Gemini 2.5 Pro without thinking mode handles 95%+ of chatbot traffic more cheaply and faster.
Simple RAG pipelines: If the task is "read these chunks and summarize them," there's no multi-step reasoning happening. The model just needs context length and instruction following - both available without thinking tokens.
Creative writing: Reasoning tokens show almost no benchmark improvement on creative tasks. The model's creativity comes from its base weights and sampling parameters, not from thinking through a math proof.
Classification and routing: Classifying intent, routing to tools, extracting structured data from text - all of these are best served by smaller, faster, cheaper models. Running o3 for intent classification is like using a torque wrench to drive a thumbtack.
Streaming latency-critical applications: Reasoning models have a mandatory latency penalty - see the next section - that makes them inappropriate for applications where first-token latency below 1 second is required.
For benchmark comparisons that show where reasoning models help and where they plateau, see the reasoning benchmarks leaderboard.
The Latency Cost
Dollar cost is only half the problem. Reasoning models have real latency implications that affect architecture decisions.
Time to first visible token (TTFVT) - the time from sending your request to receiving the first character of the actual answer - varies widely by model and task difficulty:
| Model | Typical TTFVT (easy task) | Typical TTFVT (hard task) | Notes |
|---|---|---|---|
| o4-mini | 5 - 12 sec | 15 - 45 sec | Faster than o3 on most tasks |
| o3 | 8 - 20 sec | 30 - 90 sec | Long reasoning budgets |
| Gemini 2.5 Flash thinking | 4 - 10 sec | 12 - 35 sec | Flash is snappier |
| Gemini 2.5 Pro thinking | 6 - 15 sec | 20 - 60 sec | Can stream thoughts if enabled |
| Claude Sonnet 4 (thinking) | 5 - 12 sec | 15 - 50 sec | Extended thinking adds time |
| Claude Opus 4 (thinking) | 8 - 20 sec | 25 - 80 sec | Large budget_tokens values slow this further |
| DeepSeek-R2 | 5 - 14 sec | 18 - 55 sec | API queue times vary |
For a user waiting on a response, 30-90 seconds is the difference between an acceptable workflow and a product that feels broken. Build your UX accordingly: show a thinking indicator, stream reasoning summaries if the provider supports it, or route to a non-reasoning fallback for sub-tasks where latency matters more than correctness.
For streaming options: Google exposes thought summaries in Gemini 2.5 Pro's streaming API. Anthropic exposes extended thinking blocks in streaming mode. OpenAI does not stream reasoning tokens - you see the answer when thinking is complete.
Hidden Costs You Need to Know
Reasoning tokens are billed even when truncated
If you set a maximum token limit and the model hits it mid-reasoning, you pay for all reasoning tokens generated before the cutoff, even if no final answer was produced. This can result in paying for a full reasoning session and receiving an empty or partial response.
Always set your token limits with enough headroom for both the expected reasoning budget and the visible answer. A reasonable rule of thumb: set max_tokens to (expected_reasoning_tokens × 1.5) + expected_output_tokens.
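The headroom rule of thumb above, as a one-liner:

```python
def max_tokens_headroom(expected_reasoning_tokens: int, expected_output_tokens: int) -> int:
    """Rule of thumb from the text: 1.5x the expected reasoning budget plus the visible answer."""
    return int(expected_reasoning_tokens * 1.5) + expected_output_tokens

print(max_tokens_headroom(10_000, 800))  # 15800
```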
Claude's budget_tokens is a soft ceiling
Anthropic's extended thinking API accepts a budget_tokens parameter that sets the maximum number of thinking tokens. This is a soft ceiling, not a hard cost cap. The model may occasionally exceed the budget by up to 20% on complex tasks. Also, budget_tokens is not the minimum - Claude may use far fewer thinking tokens on simple tasks when the parameter is set high.
In practice: set budget_tokens to a reasonable ceiling for your task type, monitor actual usage in the response metadata, and set your max_tokens accordingly.
Gemini Deep Think requires specific configuration
Gemini 2.5 Flash's "Deep Think" mode is not just a parameter flag - it requires setting thinking_config with a specific budget threshold in the API call. Without explicit configuration, Flash thinking uses a lighter reasoning pass that generates fewer thinking tokens and produces worse results on hard tasks. The pricing is the same ($3.50/MTok for thinking tokens), but the token count and quality differ significantly.
OpenAI reasoning effort parameter
o3 and o4-mini accept a reasoning_effort parameter with values low, medium, and high. This is not a cost parameter - it doesn't change per-token pricing - but it controls how many reasoning tokens the model generates. Setting reasoning_effort: low on o4-mini reduces typical reasoning tokens by 40-60% and reduces quality proportionally. For tasks where a fast approximate answer is acceptable, this is a legitimate cost reduction lever.
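To see whether the effort lever is worth it for your workload, price out the token reduction. A quick sketch using the 40-60% reduction cited above (the 12K baseline is an assumed figure):

```python
def effort_savings_usd(baseline_reasoning_tokens: int, reduction: float,
                       output_rate_per_mtok: float) -> float:
    """Per-query dollar savings from trimming reasoning tokens at a given output rate."""
    return baseline_reasoning_tokens * reduction * output_rate_per_mtok / 1_000_000

# o4-mini at $4.40/MTok, assumed 12K baseline reasoning, low effort cutting 40-60%
low_cut = effort_savings_usd(12_000, 0.40, 4.40)   # ≈ $0.021 per query
high_cut = effort_savings_usd(12_000, 0.60, 4.40)  # ≈ $0.032 per query
print(f"${low_cut:.3f} - ${high_cut:.3f} saved per query")
```

Small per query, but it compounds at volume: at a million queries a month, that range is $21K-$32K.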
Caching is less effective for reasoning models
Standard prompt caching works on the input side regardless of whether reasoning is enabled. But reasoning tokens themselves are not cacheable across requests - they are generated fresh each time. This means that for reasoning-heavy workloads with a shared system prompt, you can cache the prompt portion but not the thinking portion. Your effective caching savings are lower than on standard models.
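You can quantify how much caching actually buys you on a reasoning-heavy workload. A sketch using o3's rates and an assumed 75% cache-hit rate on an 8K prompt:

```python
MTOK = 1_000_000

def bill(input_tokens, cached_fraction, reasoning_tokens, visible_tokens,
         input_price, cached_price, output_price):
    """Per-query bill in USD. Only the input prefix can hit the cache;
    reasoning tokens are regenerated (and billed) on every request."""
    fresh = input_tokens * (1 - cached_fraction) * input_price
    cached = input_tokens * cached_fraction * cached_price
    output = (reasoning_tokens + visible_tokens) * output_price
    return (fresh + cached + output) / MTOK

# o3: 8K prompt with 75% cache hits, hard-math reasoning load
with_cache = bill(8_000, 0.75, 21_000, 500, 2.00, 0.50, 8.00)
no_cache = bill(8_000, 0.00, 21_000, 500, 2.00, 0.50, 8.00)
savings = 1 - with_cache / no_cache
print(f"caching saves {savings:.1%}")  # only ~5%, because reasoning dominates the bill
```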
Free Tier Comparison
| Provider | Model | Free tier availability | Notes |
|---|---|---|---|
| Google | Gemini 2.5 Flash thinking | Yes - free tier | Rate-limited but usable for dev |
| Google | Gemini 2.5 Pro thinking | Limited - 50 req/day | Part of Google AI Studio free tier |
| OpenAI | o4-mini | ChatGPT Plus quota | ~50 messages/3 hours; API requires payment |
| OpenAI | o3 | ChatGPT Plus quota | Lower quota than o4-mini |
| Anthropic | Claude (extended thinking) | Limited Claude.ai Pro | Available in Claude.ai Pro; API requires payment |
| DeepSeek | DeepSeek-R2 | 5M tokens signup credit | Covers roughly 350-800 hard reasoning queries |
| xAI | Grok 4 (reasoning) | Grok SuperGrok subscribers | Rate-limited; API requires payment |
Google's free tier remains the best option for development and prototyping with reasoning models. Gemini 2.5 Flash thinking at zero cost with 15-50 RPD is enough to test your integration and develop your prompts. The rate limits prevent production use, but that's what the free tier is for.
The Claude.ai Pro plan gives you access to extended thinking mode in the chat interface, but the API always requires billable credentials. There's no free API tier for Claude extended thinking.
Choosing the Right Reasoning Model
If you're new to reasoning models or starting a new project, here's a decision framework:
Budget is primary concern: Start with Gemini 2.5 Flash thinking or DeepSeek-R2. Flash thinking at $3.50/MTok thinking tokens is the best dollar/benchmark ratio for most hard tasks.
Accuracy ceiling matters most: o3 for math and formal reasoning. For coding tasks, the gap between o3 and o4-mini has narrowed substantially - run your own evals before paying the o3 premium.
Long context required: Gemini 2.5 Pro thinking has a 2M token context window. No other reasoning model comes close. If your task involves reasoning over very long documents, Gemini is the only option.
Agent pipelines: o4-mini is the current sweet spot - strong enough for planning tasks, cheap enough for multi-turn sessions, and well-supported in agent frameworks. Claude Sonnet 4 in thinking mode is a strong alternative with Anthropic's tool use ecosystem.
Production cost sensitivity: DeepSeek-R2 at $0.55/$2.19 per MTok is the cost leader for teams that need commercial API support. If you can tolerate open-weights hosting, Qwen3-235B-A22B is cheaper still and competitive on STEM benchmarks.
For broader model capability rankings, see the overall LLM rankings.
FAQ
Do reasoning tokens count toward my context window?
Not across turns. Reasoning tokens occupy context and count against your max token limit while a response is being generated (which is why mid-reasoning truncation is possible), but they are discarded once the response completes. Your conversation history contains only the visible input and output tokens, so reasoning tokens affect billing and per-request limits, not subsequent context length.
Can I see the reasoning tokens?
OpenAI does not expose reasoning tokens in o3 or o4-mini responses. You can see the count in usage metadata but not the content. Anthropic's extended thinking exposes thinking content blocks in the response (you can optionally surface them to users). Google's Gemini 2.5 Pro and Flash expose thought summaries in streaming mode.
What's the cheapest way to use o3?
Set reasoning_effort: low or reasoning_effort: medium and keep prompts focused. Vague or broad prompts trigger more reasoning. If you structure your prompt to reduce ambiguity, you can cut reasoning token counts by 30-50% without a significant accuracy drop on most tasks.
Does prompt caching work with reasoning models?
Prompt caching applies to the input side and works normally. You pay cached input rates for repeated prefixes. The reasoning portion is not cached - it is generated fresh for each request. Your actual savings are proportional to what fraction of your bill is input tokens vs. reasoning tokens; on hard math tasks where reasoning tokens dominate, caching helps less.
Is DeepSeek-R2 available on a US-based API?
DeepSeek's own API endpoint (api.deepseek.com) is based in China. For US data residency requirements, DeepSeek-R2 is available via OpenRouter, Together AI, and Fireworks AI at similar pricing with US infrastructure. Routing through these providers adds a small markup (typically 10-20%) but keeps data in US/EU infrastructure.
Does reasoning mode work in batch API requests?
OpenAI supports reasoning models via the Batch API - you get the 50% batch discount on both input and reasoning output tokens. Anthropic's Batch API supports extended thinking. Google's Batch API supports Gemini thinking mode. The quality of reasoning output in batch mode is identical to synchronous mode.
How do I benchmark reasoning token usage before committing to a model?
Run 50-100 representative prompts from your actual workload, capture the usage metadata (input_tokens, reasoning_tokens, output_tokens), and compute your effective cost multiplier. Most providers expose this in the response body. Don't estimate from blog posts - token counts vary significantly based on prompt structure, which you control.
Pricing sources:
