Reasoning Model API Pricing Compared 2026


Reasoning model pricing is deceptive - the headline per-token rate understates real cost by 5x to 30x.

TL;DR

  • Reasoning models bill hidden thinking tokens at the same rate as output tokens - the advertised price understates your real bill by 5x to 30x depending on model and task
  • o3 at $8/MTok output sounds expensive; at 15,000 average reasoning tokens per query on hard math, the effective output cost per query is around $0.12 - before you see a single character of the answer
  • Claude extended thinking bills thinking tokens and text tokens together as output at the same rate ($15/MTok for Sonnet, $25/MTok for Opus); the budget_tokens parameter is a soft ceiling, not a hard cost cap
  • Gemini 2.5 Pro thinking tokens cost the same as regular output ($10/MTok); Flash thinking is $3.50/MTok for thinking, the best value on hard tasks
  • o4-mini at $4.40/MTok output is cheaper than o3 and competitive on most benchmarks - it should be your default reasoning model unless you need o3's ceiling
  • For simple tasks like chatbot responses, RAG lookups, and creative writing, reasoning mode adds cost and latency with no measurable accuracy gain - turn it off

If you've been reading my general LLM API pricing comparison, you already know the headline rates for the major models. This article is about a specific trap that burns engineering teams: reasoning models bill substantially more per actual response than the listed output token price suggests.

The problem is straightforward. Every "thinking" or "reasoning" model internally generates a chain of thought before producing a final answer. Those internal tokens - OpenAI calls them reasoning tokens, Anthropic calls them thinking tokens, Google calls them thought tokens - are billed at the same rate as your visible output. But they are not visible. You get the answer; you don't get the scratchpad. You pay for both.

On a hard math problem, o3 might spend 20,000 tokens reasoning before it writes one word of the response you see. At $8/MTok for output tokens, that's $0.16 in invisible compute before your actual answer starts. If the visible answer is 500 tokens, your $0.004 answer cost you $0.164 total. The effective price per "query worth of output" is 41x the advertised output token rate.

This matters more as reasoning models become the default tier for production workloads. Let's work through the real numbers.
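The arithmetic above is easy to sanity-check yourself. A minimal sketch, using the illustrative figures from the o3 example (20,000 reasoning tokens, 500 visible tokens, $8/MTok output):

```python
# Back-of-envelope cost for one hard-math query on an o3-class model.
# Token counts are illustrative, not guarantees.
OUTPUT_PRICE_PER_MTOK = 8.00  # USD per million output tokens

reasoning_tokens = 20_000  # hidden chain of thought
visible_tokens = 500       # the answer you actually see

reasoning_cost = reasoning_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK  # ~$0.16
visible_cost = visible_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK      # ~$0.004
total_cost = reasoning_cost + visible_cost                             # ~$0.164

# Effective multiplier over the advertised output rate
multiplier = (reasoning_tokens + visible_tokens) / visible_tokens      # 41.0
```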

The Full Pricing Table

All prices in USD per million tokens (MTok). Prices verified against official provider documentation as of April 2026. "Reasoning output" refers to hidden thinking tokens billed separately. Where a model has a single output price for both thinking and visible tokens, that price appears once in "Final output."

| Model | Provider | Input /MTok | Cached input /MTok | Reasoning output /MTok | Final output /MTok | Context | Notes |
|---|---|---|---|---|---|---|---|
| o4-mini | OpenAI | $1.10 | $0.275 | $4.40 | $4.40 | 200K | Best value reasoning; same price for thinking+output |
| o3 | OpenAI | $2.00 | $0.50 | $8.00 | $8.00 | 200K | Heavy tasks; same price for thinking+output |
| Gemini 2.5 Flash thinking | Google | $0.30 | $0.075 | $3.50 | $3.50 | 1M | Cheapest capable thinking model |
| DeepSeek-R2 | DeepSeek | $0.55 | $0.14 | $2.19 | $2.19 | 128K | CoT separate token type; same rate |
| QwQ-Max | Qwen/Alibaba | $0.40 | $0.10 | $1.60 | $1.60 | 131K | Via OpenRouter; strong STEM |
| Qwen3-235B-A22B | Qwen/Alibaba | $0.50 | $0.125 | $1.40 | $1.40 | 131K | MoE architecture; cheapest frontier-class reasoning |
| Claude Sonnet 4 (thinking) | Anthropic | $3.00 | $0.30 | $15.00 | $15.00 | 200K | Thinking + text blocks at same rate |
| Claude Opus 4 (thinking) | Anthropic | $5.00 | $0.50 | $25.00 | $25.00 | 200K | Highest ceiling; extended thinking optional |
| Gemini 2.5 Pro thinking | Google | $1.25 | $0.31 | $10.00 | $10.00 | 2M | Input tiers: $1.25 up to 200K, $2.50 above |
| Gemini 2.5 Flash Deep Think | Google | $0.30 | $0.075 | $3.50 | $3.50 | 1M | Same rate as Flash thinking; higher budget mode |
| Grok 4 (heavy reasoning) | xAI | $3.00 | $0.75 | $15.00 | $15.00 | 256K | Reasoning mode toggleable |
| Kimi-R (MoonshotAI) | OpenRouter | $1.00 | N/A | $3.00 | $3.00 | 256K | R-series; open-weights |
| Skywork-OR1-32B | Open-weights | self-host | self-host | self-host | self-host | 32K | Open-weights ORPO; strong reasoning |
| Hermes-R (NousResearch) | Open-weights | self-host | self-host | self-host | self-host | 128K | Fine-tune on Llama; structured reasoning |

Notes on this table:

The "same rate" pattern - where reasoning tokens and final output tokens cost the same - applies to every model here. OpenAI was the first to publish this clearly; reasoning tokens are billed as standard output tokens at identical rates. Anthropic's extended thinking works the same way: thinking content blocks and text content blocks in the response are billed together as output tokens.

For open-weights models (Skywork-OR1, Hermes-R), you pay GPU runtime instead of per-token fees. On cloud GPU rental at $2-4/hr for a single A100, a 32B model under batched serving can sustain roughly 1,500 tokens/second of aggregate throughput, which works out to $0.37-$0.74 per million tokens - competitive with commercial mid-tier options once you account for the fact that you're not billed separately for thinking tokens.
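The self-hosting arithmetic works out as follows. The GPU rate and throughput are the rough figures above (batched serving assumed), not measured benchmarks:

```python
# Rough cost per million tokens when self-hosting a 32B reasoning model.
# Inputs: $2-4/hr A100 rental, ~1,500 tok/s aggregate (batched serving).
def cost_per_mtok(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

low = cost_per_mtok(2.0, 1500)   # ~$0.37/MTok
high = cost_per_mtok(4.0, 1500)  # ~$0.74/MTok
```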

Why the Advertised Price Misleads

Standard LLM pricing assumes you pay for what you prompt (input) and what you receive (output). Reasoning models break that model because there is a third category: computation you trigger but never see.

Consider this request: "Prove that the square root of 2 is irrational."

A standard GPT-4-class model writes a proof in roughly 300-500 output tokens. Direct cost at $8/MTok: about $0.003.

An o3-class model internally works through multiple proof strategies, checks edge cases, and evaluates clarity before committing - roughly 8,000-12,000 thinking tokens. Then it writes the same 300-500 token proof. Direct cost at $8/MTok for thinking: $0.064-$0.096. Plus $0.003 for the visible output. Total: $0.067-$0.099.

That's 22x-33x more expensive per response, for the same visible output length. The quality difference is real - the proof is more rigorous and handles edge cases - but you need to decide whether that quality is worth 30x the cost for your use case.

The ratio varies by task type and model. Here are typical ranges based on observed token counts across model documentation and community benchmarks.

Reasoning Token Overhead by Task Type

The following estimates are based on typical reasoning token counts reported in model documentation, API responses (where reasoning token counts are exposed), and published benchmarks. They are representative, not guarantees - the same model on the same prompt can vary by 2-3x.

Task Type 1: Hard Math (Competition-Level)

Math problems at the difficulty of AIME or AMC 12 require multi-step chained reasoning with backtracking.

| Model | Typical reasoning tokens | Visible output tokens | Total output billed | Effective output cost |
|---|---|---|---|---|
| o3 | 18,000 - 25,000 | 400 - 600 | 18,400 - 25,600 | $0.147 - $0.205 |
| o4-mini | 10,000 - 16,000 | 400 - 600 | 10,400 - 16,600 | $0.046 - $0.073 |
| Gemini 2.5 Pro thinking | 8,000 - 15,000 | 300 - 500 | 8,300 - 15,500 | $0.083 - $0.155 |
| Gemini 2.5 Flash thinking | 6,000 - 12,000 | 300 - 500 | 6,300 - 12,500 | $0.022 - $0.044 |
| Claude Opus 4 (thinking) | 5,000 - 12,000 | 400 - 700 | 5,400 - 12,700 | $0.135 - $0.318 |
| DeepSeek-R2 | 6,000 - 14,000 | 350 - 600 | 6,350 - 14,600 | $0.014 - $0.032 |

Hard math observation: o4-mini gives you roughly 80% of o3's AIME score at one-third the effective cost per query. For AIME-level problems, Claude Opus 4 is the most expensive option per query despite competitive accuracy, because its output token rate ($25/MTok) is 3x o3's.

Task Type 2: Complex Code Fix (Multi-File Bug)

A bug that spans multiple files and requires understanding call graph context. Prompt size: ~8,000 tokens.

| Model | Typical reasoning tokens | Visible output tokens | Total output billed | Effective output cost |
|---|---|---|---|---|
| o3 | 5,000 - 10,000 | 800 - 1,500 | 5,800 - 11,500 | $0.046 - $0.092 |
| o4-mini | 3,000 - 7,000 | 800 - 1,500 | 3,800 - 8,500 | $0.017 - $0.037 |
| Gemini 2.5 Pro thinking | 3,000 - 8,000 | 700 - 1,200 | 3,700 - 9,200 | $0.037 - $0.092 |
| Gemini 2.5 Flash thinking | 2,500 - 6,000 | 700 - 1,200 | 3,200 - 7,200 | $0.011 - $0.025 |
| Claude Sonnet 4 (thinking) | 2,000 - 6,000 | 800 - 1,500 | 2,800 - 7,500 | $0.042 - $0.113 |
| DeepSeek-R2 | 2,000 - 5,000 | 700 - 1,200 | 2,700 - 6,200 | $0.006 - $0.014 |

Code fix observation: DeepSeek-R2 is the clear cost winner here. At $0.006-$0.014 per complex code query, it's roughly 2x cheaper than Gemini 2.5 Flash thinking and 7-8x cheaper than Claude Sonnet 4 in thinking mode. The benchmark accuracy gap versus frontier models is ~8-12% on SWE-bench, which may or may not matter depending on your CI setup.

Task Type 3: Agentic Task (Multi-Step Planning)

A multi-step agentic workflow: user asks the agent to "research competitors, summarize findings, and draft a report." This involves multiple sub-tasks, tool calls, and iterative planning. Prompt grows across turns.

| Model | Typical reasoning tokens per turn | Turns in session | Total reasoning tokens | Session output cost |
|---|---|---|---|---|
| o3 | 4,000 - 8,000 | 8 - 15 | 32,000 - 120,000 | $0.26 - $0.96 |
| o4-mini | 2,000 - 5,000 | 8 - 15 | 16,000 - 75,000 | $0.07 - $0.33 |
| Claude Sonnet 4 (thinking) | 1,500 - 4,000 | 8 - 15 | 12,000 - 60,000 | $0.18 - $0.90 |
| Gemini 2.5 Flash thinking | 1,500 - 4,000 | 8 - 15 | 12,000 - 60,000 | $0.04 - $0.21 |
| DeepSeek-R2 | 1,500 - 4,000 | 8 - 15 | 12,000 - 60,000 | $0.026 - $0.13 |

Agentic observation: Multi-turn reasoning costs compound. An o3 agent running 15 turns on a complex task can cost close to $1.00 in reasoning tokens alone, before counting input tokens across a growing context. In high-volume agentic workloads, this is the line item that kills your budget. For agentic pipelines, Gemini 2.5 Flash thinking or DeepSeek-R2 is the pragmatic default unless you need o3-tier ceiling on individual steps.
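To see how per-turn reasoning compounds, here is the worst-case o3 row from the table as a quick calculation (15 turns at 8,000 reasoning tokens per turn, $8/MTok):

```python
# Accumulated reasoning-token spend across one agent session.
# Figures are the o3 worst case from the table above, before input tokens.
def session_reasoning_cost(turns: int, reasoning_tokens_per_turn: int,
                           output_price_per_mtok: float) -> float:
    total_tokens = turns * reasoning_tokens_per_turn
    return total_tokens / 1_000_000 * output_price_per_mtok

cost = session_reasoning_cost(15, 8_000, 8.00)  # ~$0.96 in reasoning alone
```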

See our agentic AI benchmarks leaderboard for accuracy comparisons on agent-specific tasks.

Effective Cost Per Query - The Real Number

To estimate what you'll actually pay per meaningful response, use this formula:

effective_cost = (input_tokens × input_price)
               + (reasoning_tokens × output_price)
               + (visible_output_tokens × output_price)

Or as a single number you can benchmark your workload against:

Effective output multiplier = (reasoning_tokens + visible_output_tokens) / visible_output_tokens

If a model generates 10,000 reasoning tokens to write 500 visible output tokens, the multiplier is 21x. You are paying for 21x the tokens you receive.
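The formula translates directly into code. Prices are USD per MTok; token counts come from your provider's usage metadata:

```python
def effective_cost(input_tokens: int, reasoning_tokens: int,
                   visible_output_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Full per-query cost: reasoning tokens bill at the output rate."""
    return (input_tokens / 1e6 * input_price
            + (reasoning_tokens + visible_output_tokens) / 1e6 * output_price)

def output_multiplier(reasoning_tokens: int, visible_output_tokens: int) -> float:
    """How many output tokens you pay for per token you actually see."""
    return (reasoning_tokens + visible_output_tokens) / visible_output_tokens

output_multiplier(10_000, 500)  # 21.0 - the example above
```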

Here is that multiplier for each task type and model, using median token counts:

| Model | Math (21K reasoning) | Code fix (6K reasoning) | Agent turn (3K reasoning) |
|---|---|---|---|
| o3 | 45x | 8x | 5x |
| o4-mini | 28x | 6x | 4x |
| Gemini 2.5 Pro thinking | 27x | 6x | 4x |
| Gemini 2.5 Flash thinking | 21x | 5x | 4x |
| Claude Opus 4 (thinking) | 19x | 5x | 4x |
| Claude Sonnet 4 (thinking) | 16x | 5x | 3x |
| DeepSeek-R2 | 23x | 6x | 4x |

The multipliers are similar across models. The big differentiator is the per-token rate once you apply the multiplier. DeepSeek-R2 at $2.19/MTok output with a 23x multiplier still beats Claude Sonnet 4 at $15/MTok output with a 16x multiplier.

When Reasoning Mode Is Worth the Cost

Reasoning mode earns its cost premium when correctness on hard problems measurably matters and errors carry real cost.

High-value use cases:

  • Competition math and formal proofs: o3 and o4-mini score 93-96% on AIME 2024/2025; standard GPT-5 scores ~60%. If you're building a math tutoring product, the accuracy delta is the product. See the math olympiad AI leaderboard for current rankings.

  • Security-critical code review: Reasoning models catch logic bugs and edge cases that autocompletion-style models miss. For code review pipelines where a missed vulnerability costs $10K+ in incident response, spending $0.05-$0.15 per review is trivially justified.

  • Multi-step legal/financial analysis: Tasks requiring citation chaining, cross-reference checking, and logical consistency across long documents. Reasoning models outperform standard models by 15-25% on structured analytical tasks in professional domains.

  • Agentic planning (single-pass): When an agent needs to plan a complex workflow in one shot before executing, reasoning models reduce replanning failures. The cost is justified when replanning means re-running expensive downstream steps.

  • Hard coding tasks: SWE-bench performance. o3 and Sonnet 4 (thinking) both clear 60%+; standard non-reasoning frontier models are in the 40-50% range. For auto-fix pipelines where each incorrect fix triggers a new CI run, the cost delta buys fewer failed runs.

When to Skip Reasoning Mode

Not every task benefits from reasoning tokens. Turning them off - or using a non-reasoning model - is often the right engineering decision.

Skip reasoning mode for:

  • Chatbot and conversational responses: Reasoning models add latency with negligible quality improvement on open-ended conversational tasks. Standard GPT-5, Claude Sonnet, or Gemini 2.5 Pro without thinking mode handles 95%+ of chatbot traffic more cheaply and faster.

  • Simple RAG pipelines: If the task is "read these chunks and summarize them," there's no multi-step reasoning happening. The model just needs context length and instruction following - both available without thinking tokens.

  • Creative writing: Reasoning tokens show almost no benchmark improvement on creative tasks. The model's creativity comes from its base weights and sampling parameters, not from thinking through a math proof.

  • Classification and routing: Classifying intent, routing to tools, extracting structured data from text - all of these are best served by smaller, faster, cheaper models. Running o3 for intent classification is like using a torque wrench to drive a thumbtack.

  • Streaming latency-critical applications: Reasoning models have a mandatory latency penalty - see the next section - that makes them inappropriate for applications where first-token latency below 1 second is required.

For benchmark comparisons that show where reasoning models help and where they plateau, see the reasoning benchmarks leaderboard.

The Latency Cost

Dollar cost is only half the problem. Reasoning models have real latency implications that affect architecture decisions.

Time to first visible token (TTFVT) - the time from sending your request to receiving the first character of the actual answer - varies by model and task difficulty:

| Model | Typical TTFVT (easy task) | Typical TTFVT (hard task) | Notes |
|---|---|---|---|
| o4-mini | 5 - 12 sec | 15 - 45 sec | Faster than o3 on most tasks |
| o3 | 8 - 20 sec | 30 - 90 sec | Long reasoning budgets |
| Gemini 2.5 Flash thinking | 4 - 10 sec | 12 - 35 sec | Flash is snappier |
| Gemini 2.5 Pro thinking | 6 - 15 sec | 20 - 60 sec | Can stream thoughts if enabled |
| Claude Sonnet 4 (thinking) | 5 - 12 sec | 15 - 50 sec | Extended thinking adds time |
| Claude Opus 4 (thinking) | 8 - 20 sec | 25 - 80 sec | Large budget_tokens values slow this further |
| DeepSeek-R2 | 5 - 14 sec | 18 - 55 sec | API queue times vary |

For a user waiting on a response, 30-90 seconds is the difference between an acceptable workflow and a product that feels broken. Build your UX accordingly: show a thinking indicator, stream reasoning summaries if the provider supports it, or route to a non-reasoning fallback for sub-tasks where latency matters more than correctness.

For streaming options: Google exposes thought summaries in Gemini 2.5 Pro's streaming API. Anthropic exposes extended thinking blocks in streaming mode. OpenAI does not stream reasoning tokens - you see the answer when thinking is complete.

Hidden Costs You Need to Know

Reasoning tokens are billed even when truncated

If you set a maximum token limit and the model hits it mid-reasoning, you pay for all reasoning tokens generated before the cutoff, even if no final answer was produced. This can result in paying for a full reasoning session and receiving an empty or partial response.

Always set your token limits with enough headroom for both the expected reasoning budget and the visible answer. A reasonable rule of thumb: set max_tokens to (expected_reasoning_tokens × 1.5) + expected_output_tokens.
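That rule of thumb as a helper function. The 1.5x factor is this article's heuristic, not a provider recommendation:

```python
def safe_max_tokens(expected_reasoning_tokens: int,
                    expected_output_tokens: int) -> int:
    """Headroom so the model isn't cut off mid-reasoning:
    1.5x the expected reasoning budget plus the visible answer."""
    return int(expected_reasoning_tokens * 1.5) + expected_output_tokens

safe_max_tokens(10_000, 800)  # 15800
```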

Claude's budget_tokens is a soft ceiling

Anthropic's extended thinking API accepts a budget_tokens parameter that sets the maximum number of thinking tokens. This is a soft ceiling, not a hard cost cap. The model may occasionally exceed the budget by up to 20% on complex tasks. Also, budget_tokens is not the minimum - Claude may use far fewer thinking tokens on simple tasks when the parameter is set high.

In practice: set budget_tokens to a reasonable ceiling for your task type, monitor actual usage in the response metadata, and set your max_tokens accordingly.
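As a sketch, an extended-thinking request body might look like the following. Parameter names follow Anthropic's extended thinking API as of this writing (verify against current docs); the model id is a placeholder, and the payload is only constructed here, not sent:

```python
# Sketch of a Claude extended-thinking request body (not sent anywhere).
# budget_tokens is a soft ceiling; max_tokens must exceed it so the
# visible answer still fits after thinking.
budget = 8_000           # soft ceiling on thinking tokens
expected_answer = 1_500  # headroom for the visible answer

payload = {
    "model": "claude-sonnet-4",                  # placeholder model id
    "max_tokens": budget + expected_answer,      # must exceed budget_tokens
    "thinking": {"type": "enabled", "budget_tokens": budget},
    "messages": [{"role": "user", "content": "Prove sqrt(2) is irrational."}],
}
```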

Gemini Deep Think requires specific configuration

Gemini 2.5 Flash's "Deep Think" mode is not just a parameter flag - it requires setting thinking_config with a specific budget threshold in the API call. Without explicit configuration, Flash thinking uses a lighter reasoning pass that generates fewer thinking tokens and produces worse results on hard tasks. The pricing is the same ($3.50/MTok for thinking tokens), but the token count and quality differ significantly.

OpenAI reasoning effort parameter

o3 and o4-mini accept a reasoning_effort parameter with values low, medium, and high. This is not a cost parameter - it doesn't change per-token pricing - but it controls how many reasoning tokens the model generates. Setting reasoning_effort: low on o4-mini reduces typical reasoning tokens by 40-60% and reduces quality proportionally. For tasks where a fast approximate answer is acceptable, this is a legitimate cost reduction lever.
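A quick estimate of what reasoning_effort: low buys you per query, using the 40-60% reduction figure above. The baseline token count is illustrative:

```python
# Estimated per-query savings from reasoning_effort: low on o4-mini.
baseline_reasoning = 12_000  # typical 'medium' reasoning tokens (illustrative)
reduction = 0.5              # midpoint of the 40-60% reduction range
price = 4.40                 # o4-mini output, USD/MTok

saved = baseline_reasoning * reduction / 1_000_000 * price  # ~$0.026/query
```

At high volume the lever matters: the same arithmetic at a million queries per month is roughly $26K saved, against whatever accuracy loss your evals show.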

Caching is less effective for reasoning models

Standard prompt caching works on the input side regardless of whether reasoning is enabled. But reasoning tokens themselves are not cacheable across requests - they are generated fresh each time. This means that for reasoning-heavy workloads with a shared system prompt, you can cache the prompt portion but not the thinking portion. Your effective caching savings are lower than on standard models.

Free Tier Comparison

| Provider | Model | Free tier availability | Notes |
|---|---|---|---|
| Google | Gemini 2.5 Flash thinking | Yes - free tier | Rate-limited but usable for dev |
| Google | Gemini 2.5 Pro thinking | Limited - 50 req/day | Part of Google AI Studio free tier |
| OpenAI | o4-mini | ChatGPT Plus quota | ~50 messages/3 hours; API requires payment |
| OpenAI | o3 | ChatGPT Plus quota | Lower quota than o4-mini |
| Anthropic | Claude (extended thinking) | Limited Claude.ai Pro | Available in Claude.ai Pro; API requires payment |
| DeepSeek | DeepSeek-R2 | 5M tokens signup credit | Covers ~500 hard reasoning queries |
| xAI | Grok 4 (reasoning) | Grok SuperGrok subscribers | Rate-limited; API requires payment |

Google's free tier remains the best option for development and prototyping with reasoning models. Gemini 2.5 Flash thinking at zero cost with 15-50 RPD is enough to test your integration and develop your prompts. The rate limits prevent production use, but that's what the free tier is for.

The Claude.ai Pro plan gives you access to extended thinking mode in the chat interface, but the API always requires billable credentials. There's no free API tier for Claude extended thinking.

Choosing the Right Reasoning Model

If you're new to reasoning models or starting a new project, here's a decision framework:

Budget is primary concern: Start with Gemini 2.5 Flash thinking or DeepSeek-R2. Flash thinking at $3.50/MTok thinking tokens is the best dollar/benchmark ratio for most hard tasks.

Accuracy ceiling matters most: o3 for math and formal reasoning. For coding tasks, the gap between o3 and o4-mini has narrowed substantially - run your own evals before paying the o3 premium.

Long context required: Gemini 2.5 Pro thinking has a 2M token context window. No other reasoning model comes close. If your task involves reasoning over very long documents, Gemini is the only option.

Agent pipelines: o4-mini is the current sweet spot - strong enough for planning tasks, cheap enough for multi-turn sessions, and well-supported in agent frameworks. Claude Sonnet 4 in thinking mode is a strong alternative with Anthropic's tool use ecosystem.

Production cost sensitivity: DeepSeek-R2 at $0.55/$2.19 per MTok is the cost leader for teams that need commercial API support. If you can tolerate open-weights hosting, Qwen3-235B-A22B is cheaper still and competitive on STEM benchmarks.

For broader model capability rankings, see the overall LLM rankings.

FAQ

Do reasoning tokens count toward my context window?

No. Reasoning tokens are generated internally and do not appear in the context window for subsequent turns. They are ephemeral compute that affects billing but not context length. Your conversation history contains only the visible input and output tokens.

Can I see the reasoning tokens?

OpenAI does not expose reasoning tokens in o3 or o4-mini responses. You can see the count in usage metadata but not the content. Anthropic's extended thinking exposes thinking content blocks in the response (you can optionally surface them to users). Google's Gemini 2.5 Pro and Flash expose thought summaries in streaming mode.

What's the cheapest way to use o3?

Set reasoning_effort: low or reasoning_effort: medium and keep prompts focused. Vague or broad prompts trigger more reasoning. If you structure your prompt to reduce ambiguity, you can cut reasoning token counts by 30-50% without a significant accuracy drop on most tasks.

Does prompt caching work with reasoning models?

Prompt caching applies to the input side and works normally. You pay cached input rates for repeated prefixes. The reasoning portion is not cached - it is generated fresh for each request. Your actual savings are proportional to what fraction of your bill is input tokens vs. reasoning tokens; on hard math tasks where reasoning tokens dominate, caching helps less.
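To see why caching helps less, compute what fraction of a query's bill is even cacheable. Illustrative figures: a hard-math o3 query with a 5K-token prompt that is fully cache-hit:

```python
# Fraction of an o3 hard-math query that prompt caching can touch.
input_tokens, reasoning_tokens, visible_tokens = 5_000, 20_000, 500
input_price, cached_price, output_price = 2.00, 0.50, 8.00  # o3 rates

input_cost = input_tokens / 1e6 * input_price                           # $0.010
output_cost = (reasoning_tokens + visible_tokens) / 1e6 * output_price  # $0.164
full_cost = input_cost + output_cost                                    # $0.174

# Even a 100% cache hit only discounts the input side:
cached_cost = input_tokens / 1e6 * cached_price + output_cost
savings_pct = (full_cost - cached_cost) / full_cost * 100               # ~4.3%
```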

Is DeepSeek-R2 available on a US-based API?

DeepSeek's own API endpoint (api.deepseek.com) is based in China. For US data residency requirements, DeepSeek-R2 is available via OpenRouter, Together AI, and Fireworks AI at similar pricing with US infrastructure. Routing through these providers adds a small markup (typically 10-20%) but keeps data in US/EU infrastructure.

Does reasoning mode work in batch API requests?

OpenAI supports reasoning models via the Batch API - you get the 50% batch discount on both input and reasoning output tokens. Anthropic's Batch API supports extended thinking. Google's Batch API supports Gemini thinking mode. The quality of reasoning output in batch mode is identical to synchronous mode.

How do I benchmark reasoning token usage before committing to a model?

Run 50-100 representative prompts from your actual workload, capture the usage metadata (input_tokens, reasoning_tokens, output_tokens), and compute your effective cost multiplier. Most providers expose this in the response body. Don't estimate from blog posts - token counts vary significantly based on prompt structure, which you control.
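A minimal harness for that measurement, assuming you've collected usage records from your responses. Field names vary by provider, so the keys below are illustrative:

```python
from statistics import median

def effective_multiplier(usage_records: list[dict]) -> float:
    """Median (reasoning + visible) / visible across captured usage metadata."""
    multipliers = [
        (u["reasoning_tokens"] + u["output_tokens"]) / u["output_tokens"]
        for u in usage_records
    ]
    return median(multipliers)

# Three captured queries (illustrative numbers):
sample = [
    {"reasoning_tokens": 9_000, "output_tokens": 450},
    {"reasoning_tokens": 12_000, "output_tokens": 500},
    {"reasoning_tokens": 6_000, "output_tokens": 600},
]
effective_multiplier(sample)  # median of [21.0, 25.0, 11.0] -> 21.0
```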


About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.