OpenAI o4-mini
OpenAI o4-mini is a fast, cost-efficient reasoning model in the o-series, delivering near-o3 performance on math and coding benchmarks at roughly 10x lower cost.

OpenAI o4-mini launched on April 16, 2025, alongside o3. It's the fourth generation of OpenAI's o-series reasoning models - a family that creates internal "reasoning tokens" before producing a final response, trading token budget for accuracy on hard problems. What makes o4-mini different from its predecessors is the combination of multimodal reasoning, native tool use, and a price point that makes it viable to run at production scale.
TL;DR
- Best-in-class math reasoning (93.4% AIME 2024, 92.7% AIME 2025) at $1.10/M input tokens
- 200K context window, native image reasoning, full tool access via the API
- Matches o3 on coding (68.1% vs 69.1% SWE-bench Verified) at roughly 10x lower cost
Released to ChatGPT Plus, Pro, and Team users on April 16, 2025, o4-mini replaced o3-mini across OpenAI's tier system. Free users gained access on April 24 through a "Think" toggle in the ChatGPT composer. The model sits at $1.10 per million input tokens and $4.40 per million output tokens - the same price as the predecessor o3-mini, but with substantially stronger benchmark numbers across math, coding, and vision tasks.
The headline capability is reasoning with images. Unlike earlier o-series models that could describe visual inputs, o4-mini can add images directly into its chain of thought - rotating, zooming, and manipulating them as part of its reasoning process. OpenAI calls this "thinking with images." Combined with full agentic tool access (web search, Python execution, file analysis, image generation), this positions o4-mini as the default choice for production deployments where o3's cost is prohibitive.
The o4-mini model is available to ChatGPT users across all tiers, with free users accessing it via the "Think" toggle.
Source: techcrunch.com
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | o-series (reasoning) |
| Parameters | Not disclosed |
| Context Window | 200,000 tokens |
| Max Output | 100,000 tokens |
| Input Price | $1.10/M tokens |
| Cached Input Price | $0.275/M tokens |
| Output Price | $4.40/M tokens |
| Release Date | April 16, 2025 |
| Knowledge Cutoff | June 1, 2024 |
| License | Proprietary |
| Modalities | Text + image input, text output |
Batch API is supported at a 50% discount, bringing input to $0.55/M and output to $2.20/M tokens. Prompt caching (75% discount on repeated prefixes) drops cached input to $0.275/M. For high-volume workloads where requests aren't time-sensitive, the batch + cache combination can reduce effective costs substantially.
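As a back-of-the-envelope sketch of that math (prices as listed above; whether the batch and cache discounts stack on a single request should be confirmed against OpenAI's pricing page):

```python
# Effective o4-mini pricing under the published discounts ($/M tokens).
INPUT, OUTPUT = 1.10, 4.40

batch_input = INPUT * 0.50    # Batch API: 50% off -> $0.55/M
batch_output = OUTPUT * 0.50  # Batch API: 50% off -> $2.20/M
cached_input = INPUT * 0.25   # Prompt caching: 75% off repeated prefixes -> $0.275/M

# Example job: 10M input tokens, 80% of them hitting a cached prefix,
# plus 2M output tokens.
cost = 8 * cached_input + 2 * INPUT + 2 * OUTPUT
full = 10 * INPUT + 2 * OUTPUT
print(f"${cost:.2f} vs ${full:.2f} uncached")  # $13.20 vs $19.80
```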
Benchmark Performance
The numbers below come from OpenAI's official evals and reflect the "standard" (not high-effort) configuration unless noted.
| Benchmark | o4-mini | o3 | o3-mini |
|---|---|---|---|
| AIME 2024 | 93.4% | 91.6% | 63.6% |
| AIME 2025 | 92.7% | 88.9% | - |
| GPQA Diamond | 81.4% | 83.3% | 77.0% |
| SWE-bench Verified | 68.1% | 69.1% | 49.3% |
| MMMU (vision) | 81.6% | 82.9% | - |
| MathVista | 84.3% | 86.8% | - |
| HumanEval | 97.5% | - | - |
| Codeforces Elo | 2719 | 2706 | - |
On math, o4-mini is the stronger model - it outperforms o3 by 1.8 percentage points on AIME 2024 and 3.8 points on AIME 2025. On competitive programming (Codeforces Elo), it edges out o3 as well. The only areas where o3 holds a lead are GPQA Diamond (expert-level science questions), SWE-bench Verified (real-world coding tasks), and the vision benchmarks. Those gaps are small - 1 to 3 percentage points across the board - which matters because o3 costs roughly 10x more.
SWE-bench deserves a closer look. At 68.1%, o4-mini trails o3 by just one percentage point on what's probably the most representative coding benchmark available. For context, o3-mini scored 49.3% and o1 scored 48.9%. The jump from the o3-mini generation to o4-mini is far more significant than the gap between o4-mini and o3. See our coding benchmarks leaderboard for current rankings across all major models.
For reasoning benchmark context, GPQA Diamond at 81.4% places o4-mini well above most models outside the o3/Claude Opus class. Our reasoning benchmarks leaderboard tracks the full field.
Key Capabilities
Agentic tool use
O4-mini is the first o-series model to support native agentic tool use within ChatGPT and the API. It can browse the web, run Python, analyze uploaded files, call custom functions, and create images via DALL-E - all within a single reasoning chain. The model decides when and how to invoke tools rather than needing explicit prompting. This is significant for production agent pipelines where a reasoning model previously had to be wrapped with an orchestration layer to access tools.
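As a rough illustration of what that looks like through the API's function-calling interface, here is a minimal sketch using the official `openai` Python SDK (the `get_weather` tool and its schema are hypothetical placeholders, not one of OpenAI's built-in tools):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool schema; the model decides on its own whether to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Do I need an umbrella in Oslo today?"}],
    tools=tools,
)

# If the model chose to invoke the tool, the call appears here instead of text.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```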
Visual reasoning
The model doesn't just accept image inputs - it reasons with them. Images enter the chain of thought directly. OpenAI demonstrated this with whiteboard analysis, diagram interpretation, and tasks where the model crops and rotates images during reasoning. MMMU at 81.6% (vs 82.9% for o3) and MathVista at 84.3% confirm this isn't marketing. For teams building document intelligence, scientific data extraction, or visual debugging workflows, this is a meaningful capability shift.
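Passing an image into that reasoning chain is a standard multimodal request; a minimal sketch (the image URL is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Image inputs ride alongside text in the same message; the model can then
# reason over the image inside its chain of thought.
resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this circuit diagram compute?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```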
reasoning_effort parameter
The API exposes a reasoning_effort parameter with values low, medium, and high. At low, the model spends fewer tokens on internal reasoning, reducing latency and cost. At high, it reasons more thoroughly - this is what OpenAI calls "o4-mini-high" in the ChatGPT interface. For tasks where speed matters more than maximum accuracy (classification, extraction, code completion), low can deliver response quality that beats non-reasoning models at comparable or lower cost.
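In the Chat Completions API this is a single request parameter; a minimal sketch that also reads back the reasoning-token count from the usage object (field names per the current `openai` Python SDK):

```python
from openai import OpenAI

client = OpenAI()

# Same model, three reasoning budgets. "low" trims internal reasoning for
# latency and cost; "high" is what ChatGPT labels o4-mini-high.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o4-mini",
        reasoning_effort=effort,
        messages=[{"role": "user", "content": "Factor 3x^2 + 10x + 8."}],
    )
    details = resp.usage.completion_tokens_details
    print(effort, details.reasoning_tokens, "reasoning tokens")
```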
o4-mini leads all models on AIME 2024 and 2025 math competition benchmarks, outperforming even o3.
Source: pexels.com
Pricing and Availability
O4-mini is available through the OpenAI API via the Chat Completions and Responses endpoints. The model ID is o4-mini. It also supports the Batch API, fine-tuning, streaming, function calling, and structured outputs. One remarkable addition from the API docs: fine-tuning is supported, which wasn't available on earlier reasoning models.
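The Responses endpoint expresses effort through a nested `reasoning` object rather than the flat `reasoning_effort` parameter used by Chat Completions; a minimal sketch:

```python
from openai import OpenAI

client = OpenAI()

# Responses API call; reasoning effort rides in a nested object here.
resp = client.responses.create(
    model="o4-mini",
    input="Explain why the sky is blue in two sentences.",
    reasoning={"effort": "medium"},
)
print(resp.output_text)
```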
ChatGPT access by tier:
- Free: limited access via "Think" toggle
- Plus/Team: 150 messages/day (o4-mini), 50/day (o4-mini-high)
- Enterprise/Edu: 300 messages/day (o4-mini), 100/day (o4-mini-high)
- API: rate limits scale from 1,000 to 30,000 requests/minute depending on usage tier
Cost comparison against the competitive field:
O4-mini at $1.10/$4.40 competes directly with models like Gemini 2.5 Flash in the cost-optimized reasoning tier. Claude Sonnet sits at $3/$15, making o4-mini roughly 3x cheaper on both input and output despite comparable coding performance on some benchmarks. The cost efficiency leaderboard has a full current comparison.
The batch API discount makes o4-mini particularly attractive for offline workloads - document processing, large-scale evaluation runs, data extraction pipelines - where 24-hour turnaround is acceptable.
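The Batch API flow is upload-then-submit: write one request per line to a JSONL file, upload it, and create the batch (file contents and IDs below are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one Chat Completions request per line, e.g.:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "o4-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the turnaround window the 50% discount assumes
)
print(batch.id, batch.status)  # poll status until "completed", then fetch output
```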
Strengths and Weaknesses
Strengths
- Best-in-class math reasoning: leads all published models on AIME 2024 and 2025
- Near-o3 coding performance at 10x lower cost - the gap on SWE-bench is 1 percentage point
- Native image reasoning integrated into chain of thought, not bolted on
- Full agentic tool use in both ChatGPT and the API
- reasoning_effort parameter lets callers trade latency for accuracy per request
- 200K context handles long documents and large codebases
- Batch API support with 50% discount for async workloads
- Fine-tuning supported - unusual for reasoning models
Weaknesses
- o3 still leads on GPQA Diamond (expert science) and SWE-bench Verified when maximum accuracy is required
- Knowledge cutoff is June 2024, which is aging
- No audio input or output (text and image input only)
- Reasoning tokens are billed at the output token rate ($4.40/M), so heavy use of high effort can be expensive
- Proprietary model with no published architecture or parameter count
- Time to first token is high (median ~24 seconds at high effort) - unsuitable for real-time UX without streaming
Related Coverage
- Coding Benchmarks Leaderboard - SWE-bench and LiveCodeBench rankings across all major models
- Reasoning Benchmarks Leaderboard - GPQA, AIME, and Humanity's Last Exam rankings
- Cost Efficiency Leaderboard - Performance per dollar comparisons
- GPT-4o mini - OpenAI's earlier small model, now superseded for reasoning tasks
FAQ
What is o4-mini best used for?
Math, coding, and visual reasoning tasks where o3-level accuracy is needed but cost or throughput is a constraint. At $1.10/$4.40 per million tokens with batch discounts available, it's the most cost-effective reasoning model in OpenAI's current lineup.
How does o4-mini differ from o4-mini-high?
Both are the same model. "High" refers to the reasoning_effort=high setting, which increases internal reasoning token usage for harder problems. ChatGPT surfaces this as a separate option; API users set it via the reasoning_effort parameter.
Does o4-mini support function calling?
Yes. Function calling, structured outputs, and streaming are all supported. The model also supports fine-tuning, which wasn't available on o3-mini.
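Structured outputs work the same way as on other recent OpenAI models; a sketch using the SDK's `parse` helper with a hypothetical `Triage` schema:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Triage(BaseModel):  # hypothetical output schema for a bug-report router
    severity: str
    component: str

resp = client.beta.chat.completions.parse(
    model="o4-mini",
    messages=[{"role": "user", "content": "Crash on startup after the 2.3 update."}],
    response_format=Triage,
)
print(resp.choices[0].message.parsed)  # a validated Triage instance
```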
What is the context window?
200,000 tokens input with a maximum of 100,000 tokens output. Reasoning tokens count toward output token billing.
Is o4-mini available for free?
Free ChatGPT users can access o4-mini with limited daily usage by selecting the "Think" toggle in the message composer.
Sources
- Introducing OpenAI o3 and o4-mini - OpenAI official announcement
- o4-mini Model Documentation - OpenAI API docs
- OpenAI o4-mini - Wikipedia
- OpenAI launches a pair of AI reasoning models, o3 and o4-mini - TechCrunch
- o4-mini Benchmarks and Analysis - Artificial Analysis
- o4-mini vs o3 Comparison - APIDog
- O4-Mini: Tests, Features, O3 Comparison, Benchmarks - DataCamp
- OpenAI o3 and o4-mini Release Announcement - OpenAI Developer Community
- SWE-bench Leaderboard - SWE-bench official site
Last verified May 11, 2026
