Reasoning Model API Pricing Compared - May 2026

Updated May 2026: DeepSeek V4-Flash reasoning now $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, Grok 4 retires May 15 - verified pricing across 11 models.

Cheapest: DeepSeek V4-Flash (thinking) Best Value: o4-mini for most tasks, Gemini 2.5 Flash for budget Updated weekly
Reasoning Model API Pricing Compared - May 2026

Reasoning model pricing is deceptive - the headline per-token rate understates real cost by 5x to 30x depending on model and task.

TL;DR

  • DeepSeek V4-Flash thinking mode is now $0.14/$0.28 per MTok - nearly 8x cheaper on output than R1 was six weeks ago, and the new deepseek-reasoner alias routes here
  • o3-pro has been available in the API since June 2025 at $20/$80 per MTok (10x o3's rate); it wasn't in the April edition - it's added now alongside updated coverage across all models
  • Grok 4 retires May 15, 2026 - migrate to Grok 4.3 at $1.25/$2.50, which is cheaper and supports 1M context
  • Google dropped Gemini 2.5 Flash thinking tokens from $3.50 to $2.50/MTok and removed the 2.5 Pro free tier on April 1
  • The reasoning token overhead problem hasn't changed: you pay for 5x to 45x more tokens than you receive, depending on model and task difficulty

If you've been following my general LLM API pricing comparison, you already know the headline rates. This page is specifically about the hidden cost structure of reasoning models - the pattern that catches every engineering team off guard when their first production invoice lands.

The core problem is simple. Reasoning models create an internal chain of thought before producing a final answer. Those "thinking tokens" are billed at the same rate as visible output tokens, but you never see them. You get the answer; you pay for the scratchpad. On a hard math problem, a model might burn 20,000 reasoning tokens before writing a 400-token response. The visible output suggests a cheap call. The actual bill reflects 20,400 output tokens.

Prices have moved clearly since April's update. The biggest change is at the low end: DeepSeek V4-Flash's reasoning mode is now nearly 8x cheaper than the R1 it replaced. At the high end, o3-pro (available since June 2025 but absent from the April edition) is now included as the maximum-compute option for problems where o3 falls short. And Grok 4's retirement on May 15 forces a migration to Grok 4.3, which is actually better priced.

The Full Pricing Table

All prices in USD per million tokens (MTok). Verified against official provider pricing pages as of May 11, 2026. Sorted by effective output cost.

ModelProviderInput /MTokCached InputThinking /MTokOutput /MTokContextNotes
DeepSeek V4-Flash (thinking)DeepSeek$0.14$0.003$0.28$0.281Mdeepseek-reasoner alias; thinking and output same rate
Gemini 2.5 Flash (thinking)Google$0.30$0.03$2.50$2.501MDown from $3.50; moved to paid-only April 1
Qwen3.6 Plus (thinking)Alibaba$0.33N/A$1.95$1.951MVia OpenRouter; 1M context; thinking toggleable
o4-miniOpenAI$1.10$0.28$4.40$4.40200KBest value for most production reasoning workloads
Grok 4.3 (reasoning)xAI$1.25N/A$2.50$2.501MReplaces Grok 4 (retiring May 15); higher rate above 200K
Gemini 2.5 Pro (thinking)Google$1.25$0.13$10.00$10.002M$2.50 input / $15 output above 200K context
o3OpenAI$2.00$0.50$8.00$8.00200KDown 80% from launch price
Gemini 3.1 Pro (thinking)Google$2.00$0.20$12.00$12.002M$4 input / $18 output above 200K; newest Google frontier
Claude Sonnet 4.6 (thinking)Anthropic$3.00$0.30$15.00$15.001Mthinking and text blocks billed at same output rate
Claude Opus 4.7 (thinking)Anthropic$5.00$0.50$25.00$25.001MLatest Anthropic flagship; new tokenizer (up to 35% more tokens vs Opus 4.6)
o3-proOpenAI$20.00$5.00$80.00$80.00200KMaximum compute; 10x o3's rate; for tasks where o3 fails

On Anthropic's tokenizer note: Claude Opus 4.7 ships with a new tokenizer that can consume up to 35% more tokens for identical input text. Benchmark the change before migrating from Opus 4.6, as your effective input costs may increase even though the per-token rate stayed the same.

On Grok's context tiers: Grok 4.3 bills requests above 200K total tokens at a higher rate. If your use case involves long documents, model the context-tier pricing before committing.

On DeepSeek's V4-Pro: A V4-Pro tier also exists at $1.74/$3.48 per MTok (standard price), but is currently discounted 75% to $0.44/$0.88 until May 31, 2026. After that date, V4-Flash is almost certainly the better value option unless V4-Pro shows meaningfully better reasoning quality on your workload.

What Changed Since April 2026

Several entries in this table are substantially different from six weeks ago.

  • April 1, 2026 - Google removed the free tier for Gemini 2.5 Pro. Previously available under rate limits, it now requires a paid API key. Free access to thinking models now means Gemini 2.5 Flash only.

  • April 16, 2026 - Anthropic released Claude Opus 4.7 at the same $5/$25 price as Opus 4.6, but with a new tokenizer. The performance uplift is real (87.6% on SWE-bench Verified), but so is the token count increase.

  • April 26, 2026 - DeepSeek cut its cache-hit input price to one-tenth of launch pricing. And more clearly, the deepseek-reasoner alias now routes to V4-Flash thinking mode rather than the old R1 model. The effective output price dropped from $2.19 to $0.28 per MTok.

  • June 10, 2025 - OpenAI cut o3's price 80% (from ~$10/$40 to $2/$8) and launched o3-pro at $20/$80. Both were not included in the April edition of this comparison; they're added in this update.

  • May 15, 2026 - Grok 4, Grok 4.1 Fast, and several related xAI models reach end of life. Grok 4.3 at $1.25/$2.50 is the direct replacement and offers a 1M context window.

Why the Advertised Price Misleads

The headline output rate is the price per visible output token. Reasoning models create a second class of tokens before that output - the internal chain of thought - and those tokens are invisible to you but billed identically.

Consider a request: "Find the bug in this code and fix it." The prompt is 6,000 tokens. A reasoning model internally traces call graphs, checks variable scoping, considers three candidate fixes, and evaluates edge cases before writing the repair. That might produce 8,000 reasoning tokens and 500 visible output tokens. You pay for 8,500 output-priced tokens, not 500.

The ratio varies by task. On simple questions, reasoning models sometimes skip extended thinking and the overhead is low. On competition math, they can hit 25,000 reasoning tokens for a 500-token proof. The same model, the same output rate, but a 50x difference in effective cost per response.

OpenAI o3 and o3-pro pricing page showing the reasoning model lineup OpenAI Developer Community announcement from June 2025 confirming o3's 80% price cut and the launch of o3-pro at $20/$80. Source: openai.com

Reasoning Token Overhead by Task Type

These estimates use median token counts from model documentation, community benchmarks, and API usage metadata. Actual counts vary by 2-3x depending on prompt structure, which you can influence.

Hard Math (Competition Level)

Problems at AIME / AMC 12 difficulty. These require multi-step chained reasoning with backtracking.

ModelTypical thinking tokensVisible outputTotal billed outputEffective cost/query
o3-pro30,000 - 60,000400 - 60030,400 - 60,600$2.43 - $4.85
o318,000 - 25,000400 - 60018,400 - 25,600$0.15 - $0.20
Claude Opus 4.78,000 - 15,000400 - 7008,400 - 15,700$0.21 - $0.39
Gemini 3.1 Pro8,000 - 14,000300 - 5008,300 - 14,500$0.10 - $0.17
Gemini 2.5 Pro8,000 - 14,000300 - 5008,300 - 14,500$0.083 - $0.145
o4-mini10,000 - 16,000400 - 60010,400 - 16,600$0.046 - $0.073
Gemini 2.5 Flash6,000 - 12,000300 - 5006,300 - 12,500$0.016 - $0.031
DeepSeek V4-Flash6,000 - 14,000350 - 6006,350 - 14,600$0.0018 - $0.0041

The DeepSeek number is striking. At $0.002-$0.004 per competition math query, it's more than 50x cheaper than o4-mini and roughly 100x cheaper than o3. The quality delta is real - o3 scores 96%+ on AIME 2025 while DeepSeek V4-Flash benchmarks around 82% - but for non-critical education or research pipelines, that accuracy difference may not be worth $0.07 per query.

Complex Code Fix (Multi-File Bug)

A bug spanning multiple files with 8,000-token context prompt.

ModelTypical thinking tokensVisible outputTotal billedEffective cost/query
o35,000 - 10,000800 - 1,5005,800 - 11,500$0.046 - $0.092
o4-mini3,000 - 7,000800 - 1,5003,800 - 8,500$0.017 - $0.037
Claude Sonnet 4.62,000 - 6,000800 - 1,5002,800 - 7,500$0.042 - $0.113
Gemini 2.5 Flash2,500 - 6,000700 - 1,2003,200 - 7,200$0.008 - $0.018
DeepSeek V4-Flash2,000 - 5,000700 - 1,2002,700 - 6,200$0.00076 - $0.0017

For coding workloads at scale, DeepSeek V4-Flash is roughly 20-25x cheaper than Claude Sonnet 4.6 in thinking mode. The SWE-bench gap is around 8-12 percentage points. Measure that gap on your actual CI failure rate before deciding whether it's worth the cost premium. See our coding benchmarks leaderboard for current scores.

Agentic Task (Multi-Turn Planning)

Agent session: "research competitors, summarize findings, draft a report." Eight to fifteen turns with growing context.

ModelThinking tokens/turnTurnsTotal thinkingSession output cost
o34,000 - 8,0008 - 1532,000 - 120,000$0.26 - $0.96
o4-mini2,000 - 5,0008 - 1516,000 - 75,000$0.07 - $0.33
Claude Sonnet 4.61,500 - 4,0008 - 1512,000 - 60,000$0.18 - $0.90
Grok 4.31,500 - 3,5008 - 1512,000 - 52,500$0.030 - $0.13
Gemini 2.5 Flash1,500 - 4,0008 - 1512,000 - 60,000$0.030 - $0.15
DeepSeek V4-Flash1,500 - 4,0008 - 1512,000 - 60,000$0.0034 - $0.017

Grok 4.3 is a new entry worth considering for agentic pipelines. At $1.25/$2.50 per MTok with a 1M context window, it's in the same price tier as Gemini 2.5 Flash for multi-turn sessions but offers much longer context without tier pricing penalties. Check the agentic AI benchmarks leaderboard for current task scores.

Effective Cost Per Query

The effective output multiplier captures how much you're actually paying relative to what you receive:

multiplier = (thinking_tokens + visible_output_tokens) / visible_output_tokens

If the model produces 10,000 thinking tokens for 500 visible output tokens, the multiplier is 21x. You pay for 21x the tokens you receive.

ModelHard mathCode fixAgent turn
o3-pro70x10x6x
o345x8x5x
Gemini 3.1 Pro30x7x5x
Gemini 2.5 Pro28x6x4x
o4-mini28x6x4x
Gemini 2.5 Flash22x5x4x
Claude Opus 4.722x5x4x
DeepSeek V4-Flash23x6x4x

The multipliers are similar across models. DeepSeek V4-Flash at a 23x multiplier still beats Claude Sonnet 4.6 at 16x because $0.28/MTok × 23 = $6.44 effective per billion visible tokens, versus $15/MTok × 16 = $240. The token rate dominates.

When Reasoning Mode Earns Its Cost

Reasoning tokens add cost and latency to every call. They only pay off when correctness on hard problems carries real consequences.

High-value use cases:

  • Competition math and formal proofs: o3 and o4-mini score 93-96% on AIME 2025 benchmarks. Standard GPT-5-class models score around 60%. See the math olympiad AI leaderboard for current rankings. If accuracy difference translates directly to user value, the cost is trivial.

  • Security-critical code review: Reasoning models find logic bugs and edge cases that pattern-completion models miss. For a code review pipeline where a missed vulnerability costs $10K+ in incident response, $0.05-$0.15 per review is noise.

  • Hard coding tasks: o3 and Claude Sonnet 4.6 thinking both clear 60%+ on SWE-bench Verified; non-reasoning frontier models land in the 40-50% range. If each incorrect fix triggers a CI run, fewer failures offset the per-query cost.

  • Agentic planning (single-pass): When an agent plans a complex workflow before executing, reasoning models cut replanning failures. The math works when replanning means re-running expensive downstream steps.

When to Skip It

Not every task benefits from extended thinking. Routing to a non-reasoning model is often the right call.

Skip reasoning mode for chatbot conversations, simple RAG lookups, classification and routing, creative writing, and any latency-critical path where sub-second first-token response is required. Reasoning models add 5-90 seconds of mandatory TTFVT depending on task difficulty - that's an UX problem in interactive applications, not just a cost problem.

Also skip reasoning for structured data extraction. Pulling JSON from a known schema doesn't benefit from chain-of-thought. Use a smaller, faster model.

Latency Comparison

Time to first visible token (TTFVT) - from request submission to the first character of the actual answer:

ModelEasy taskHard taskNotes
o4-mini5 - 12 sec15 - 45 secFastest o-series
Grok 4.34 - 10 sec12 - 35 secreasoning=low is default; fast for moderate difficulty
Gemini 2.5 Flash4 - 10 sec12 - 35 secLightweight thinking pass
Gemini 2.5 Pro6 - 15 sec20 - 55 secCan stream thought summaries
o38 - 20 sec30 - 90 secreasoning_effort=low cuts this clearly
Claude Sonnet 4.65 - 12 sec15 - 50 secExtended thinking adds mandatory latency
Gemini 3.1 Pro7 - 18 sec25 - 70 secthinking_level=low reduces latency
Claude Opus 4.79 - 22 sec30 - 85 secLarge budget_tokens values add time
o3-pro20 - 60 sec90 - 300 secExtended compute; not suitable for interactive use
DeepSeek V4-Flash5 - 14 sec18 - 55 secAPI queue times vary by region

O3-pro's latency range is the standout. On hard tasks, it can take 5 minutes. Build your UX with this in mind - o3-pro is a batch workload, not a conversational one.

DeepSeek V4 pricing page showing V4-Flash and V4-Pro tiers DeepSeek's current pricing page showing V4-Flash thinking mode at $0.28/MTok output - nearly 8x cheaper than R1 was in March 2026. Source: api-docs.deepseek.com

Hidden Costs You Need to Know

Truncated reasoning still bills

If your token limit is hit during reasoning, you pay for all reasoning tokens generated before the cutoff, even if no final answer was produced. Set max_tokens to at least (expected_reasoning × 1.5) + expected_output. Budget conservatively.

Anthropic's new tokenizer changes your math

Claude Opus 4.7 ships with a new tokenizer that can use up to 35% more tokens for identical input. If you're migrating from Opus 4.6, benchmark your actual prompts before assuming cost parity. The per-token price is the same; the token count may not be.

Claude's budget_tokens is a soft ceiling

Anthropic's budget_tokens parameter sets a ceiling on thinking tokens but is not a hard cap. The model may exceed it by up to 20% on complex tasks. Monitor actual usage in the response metadata rather than relying on the parameter as a cost control.

Grok's 200K context tier

Grok 4.3 prices requests with more than 200K total tokens at a higher rate. If your agentic pipeline builds up a long conversation context, the cost can jump unexpectedly mid-session. Track total_tokens in each response and plan context compression before you hit the tier boundary.

DeepSeek V4-Pro discount expiration

The 75% discount on DeepSeek V4-Pro expires May 31, 2026. After that, V4-Pro reverts to $1.74/$3.48 per MTok. Any pipeline relying on V4-Pro at discounted rates needs to be re-assessed or migrated to V4-Flash before then.

Reasoning token caching

Standard prompt caching applies to input tokens regardless of whether reasoning is enabled. But reasoning tokens are generated fresh for each request - they're not cacheable across calls. For reasoning-heavy workloads with shared system prompts, your caching savings are proportional to what fraction of your bill is input tokens. On hard math tasks where reasoning tokens dominate, caching helps less.

Free Tier Comparison

ProviderModelFree accessRate limitsNotes
GoogleGemini 2.5 Flash thinkingYes15 RPM, 1M TPDStill free; rate limited but usable for dev
GoogleGemini 2.5 Pro thinkingNoPaid API requiredFree tier removed April 1, 2026
GoogleGemini 3.1 ProNoPaid API requiredPreview pricing; paid only
OpenAIo4-miniChatGPT Plus quota~50 messages/3hChat UI only; API always paid
OpenAIo3ChatGPT Plus quotaLower quota than o4-miniChat UI only; API always paid
OpenAIo3-proChatGPT Pro subscribersLimitedHighest ChatGPT tier required
AnthropicExtended thinkingClaude.ai ProChat UI onlyAPI always paid
DeepSeekV4-Flash5M token signup creditCovers ~17M reasoning tokensEnough for integration testing
xAIGrok 4.3SuperGrok subscribersRate limitedAPI requires payment

Gemini 2.5 Flash remains the best free option for reasoning model development and integration testing. The elimination of the 2.5 Pro free tier was significant - teams that were prototyping with Pro now need to evaluate whether to pay for Pro or accept Flash's lower ceiling.

Choosing the Right Model

The right reasoning model is the cheapest one that passes your accuracy bar. Run evals on your actual data before committing to a tier.

Budget first: Start with DeepSeek V4-Flash at $0.14/$0.28. On tasks where 82-85% accuracy is acceptable, it's the overwhelming cost leader. For free-tier development, Gemini 2.5 Flash.

Balance cost and accuracy: o4-mini ($1.10/$4.40) is the production sweet spot for most teams. It delivers within a few percentage points of o3 on most benchmarks at roughly one-third the effective query cost. Claude Sonnet 4.6 in thinking mode is a strong alternative with Anthropic's tool use ecosystem.

Need the highest ceiling: o3 ($2/$8) for math and formal reasoning. The gap over o4-mini is meaningful on AIME-level problems (96%+ vs 88%) but narrower on code and agent tasks. Run your own evals before paying the premium.

Extremely hard problems: o3-pro ($20/$80) for the subset of tasks where o3 fails. Expected use cases: cutting-edge formal verification, research-grade math proofs, complex security auditing. For most engineering teams this isn't a regular production call.

Long context: Gemini 3.1 Pro (2M context) or Grok 4.3 (1M context) if you need to reason over very long documents. No other reasoning model comes close to Gemini 3.1 Pro's context ceiling at a reasonable per-token rate.

Agent pipelines: o4-mini is the current default - cost-effective for multi-turn, well-supported in agent frameworks. Grok 4.3 is worth evaluating for long-context agent sessions now that Grok 4 is retiring.

For model capability comparisons beyond pricing, see the reasoning benchmarks leaderboard and cost efficiency leaderboard.

FAQ

Which reasoning model is cheapest right now?

DeepSeek V4-Flash thinking mode at $0.14/$0.28 per MTok. For thinking-heavy workloads, its effective per-query cost can run 50-100x lower than premium models like o3 or Claude Opus 4.7.

Is o3 still worth using now that o3-pro exists?

Yes. O3 at $2/$8 handles most hard reasoning tasks. O3-pro is for the specific subset where o3 itself fails - formal proofs, research math, deep security audits. Most teams will never need o3-pro.

What happened to the Grok 4 models?

Grok 4, Grok 4.1 Fast, and related xAI models retire on May 15, 2026. Migrate to Grok 4.3 at $1.25/$2.50, which is cheaper and adds a 1M context window. xAI's documentation has migration guidance for affected endpoints.

Do reasoning tokens expire from the context window?

No. Reasoning tokens are ephemeral - they affect billing but don't appear in the conversation history for subsequent turns. Your context window contains only visible input and output tokens.

Can I see the reasoning tokens?

OpenAI doesn't expose reasoning content for o3, o4-mini, or o3-pro - only the count in usage metadata. Anthropic's extended thinking exposes thinking content blocks (optional). Google's Gemini 2.5 Pro and 3.1 Pro expose thought summaries in streaming mode. Grok 4.3 exposes the full reasoning in reasoning_details.

Does batch API work with reasoning models?

Yes for all major providers. OpenAI's Batch API gives a 50% discount on both input and reasoning output tokens for o3, o4-mini, and o3-pro. Anthropic's Batch API supports extended thinking at 50% off. Google's Batch API supports Gemini thinking mode. If your workload is async-compatible, batch processing is the easiest cost reduction lever.


Sources:

✓ Last verified May 11, 2026

LinkedIn
Reddit
Hacker News
Telegram
Reasoning Model API Pricing Compared - May 2026
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.