OpenAI o4-mini
OpenAI o4-mini is a fast, cost-efficient reasoning model in the o-series, delivering near-o3 performance on math and coding benchmarks at roughly 10x lower cost.

OpenAI o4-mini launched on April 16, 2025, alongside o3. It's the fourth generation of OpenAI's o-series reasoning models - a family that creates internal "reasoning tokens" before producing a final response, trading token budget for accuracy on hard problems. What makes o4-mini different from its predecessors is the combination of multimodal reasoning, native tool use, and a price point that makes it viable to run at production scale.
TL;DR
- Best-in-class math reasoning (93.4% AIME 2024, 92.7% AIME 2025) at $1.10/M input tokens
- 200K context window, native image reasoning, full tool access via the API
- Matches o3 on coding (68.1% vs 69.1% SWE-bench Verified) at roughly 10x lower cost
Released to ChatGPT Plus, Pro, and Team users on April 16, 2025, o4-mini replaced o3-mini across OpenAI's tier system. Free users gained access on April 24 through a "Think" toggle in the ChatGPT composer. The model sits at $1.10 per million input tokens and $4.40 per million output tokens - the same price as the predecessor o3-mini, but with substantially stronger benchmark numbers across math, coding, and vision tasks.
The headline capability is reasoning with images. Unlike earlier o-series models that could describe visual inputs, o4-mini can add images directly into its chain of thought - rotating, zooming, and manipulating them as part of its reasoning process. OpenAI calls this "thinking with images." Combined with full agentic tool access (web search, Python execution, file analysis, image generation), this positions o4-mini as the default choice for production deployments where o3's cost is prohibitive.
The o4-mini model is available to ChatGPT users across all tiers, with free users accessing it via the "Think" toggle.
Source: techcrunch.com
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | o-series (reasoning) |
| Parameters | Not disclosed |
| Context Window | 200,000 tokens |
| Max Output | 100,000 tokens |
| Input Price | $1.10/M tokens |
| Cached Input Price | $0.275/M tokens |
| Output Price | $4.40/M tokens |
| Release Date | April 16, 2025 |
| Knowledge Cutoff | June 1, 2024 |
| License | Proprietary |
| Modalities | Text + image input, text output |
Batch API is supported at a 50% discount, bringing input to $0.55/M and output to $2.20/M tokens. Prompt caching (75% discount on repeated prefixes) drops cached input to $0.275/M. For high-volume workloads where requests aren't time-sensitive, the batch + cache combination can reduce effective costs substantially.
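As a back-of-the-envelope sketch of that math (prices as listed above; whether the batch and cache discounts stack on a single request should be confirmed against OpenAI's pricing page):

```python
# Effective o4-mini pricing under the published discounts ($/M tokens).
INPUT, OUTPUT = 1.10, 4.40

batch_input = INPUT * 0.50    # Batch API: 50% off -> $0.55/M
batch_output = OUTPUT * 0.50  # Batch API: 50% off -> $2.20/M
cached_input = INPUT * 0.25   # Prompt caching: 75% off repeated prefixes -> $0.275/M

# Example job: 10M input tokens, 80% of them hitting a cached prefix,
# plus 2M output tokens.
cost = 8 * cached_input + 2 * INPUT + 2 * OUTPUT
full = 10 * INPUT + 2 * OUTPUT
print(f"${cost:.2f} vs ${full:.2f} uncached")  # $13.20 vs $19.80
```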
Benchmark Performance
The numbers below come from OpenAI's official evals and reflect the "standard" (not high-effort) configuration unless noted.
| Benchmark | o4-mini | o3 | o3-mini |
|---|---|---|---|
| AIME 2024 | 93.4% | 91.6% | 63.6% |
| AIME 2025 | 92.7% | 88.9% | - |
| GPQA Diamond | 81.4% | 83.3% | 77.0% |
| SWE-bench Verified | 68.1% | 69.1% | 49.3% |
| MMMU (vision) | 81.6% | 82.9% | - |
| MathVista | 84.3% | 86.8% | - |
| HumanEval | 97.5% | - | - |
| Codeforces Elo | 2719 | 2706 | - |
On math, o4-mini is the stronger model - it outperforms o3 by 1.8 percentage points on AIME 2024 and 3.8 points on AIME 2025. On competitive programming (Codeforces Elo), it edges out o3 as well. The only areas where o3 holds a lead are GPQA Diamond (expert-level science questions), SWE-bench Verified (real-world coding tasks), and the vision benchmarks. Those gaps are small - 1 to 3 percentage points across the board - which matters because o3 costs roughly 10x more.
SWE-bench deserves a closer look. At 68.1%, o4-mini trails o3 by just one percentage point on what's probably the most representative coding benchmark available. For context, o3-mini scored 49.3% and o1 scored 48.9%. The jump from the o3-mini generation to o4-mini is far more significant than the gap between o4-mini and o3. See our coding benchmarks leaderboard for current rankings across all major models.
For reasoning benchmark context, GPQA Diamond at 81.4% places o4-mini well above most models outside the o3/Claude Opus class. Our reasoning benchmarks leaderboard tracks the full field.
Key Capabilities
Agentic tool use
O4-mini is the first o-series model to support native agentic tool use within ChatGPT and the API. It can browse the web, run Python, analyze uploaded files, call custom functions, and create images via DALL-E - all within a single reasoning chain. The model decides when and how to invoke tools rather than needing explicit prompting. This is significant for production agent pipelines where a reasoning model previously had to be wrapped with an orchestration layer to access tools.
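As a rough illustration of what that looks like through the API's function-calling interface, here is a minimal sketch using the official `openai` Python SDK (the `get_weather` tool and its schema are hypothetical placeholders, not one of OpenAI's built-in tools):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool schema; the model decides on its own whether to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Do I need an umbrella in Oslo today?"}],
    tools=tools,
)

# If the model chose to invoke the tool, the call appears here instead of text.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```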
Visual reasoning
The model doesn't just accept image inputs - it reasons with them. Images enter the chain of thought directly. OpenAI demonstrated this with whiteboard analysis, diagram interpretation, and tasks where the model crops and rotates images during reasoning. MMMU at 81.6% (vs 82.9% for o3) and MathVista at 84.3% confirm this isn't marketing. For teams building document intelligence, scientific data extraction, or visual debugging workflows, this is a meaningful capability shift.
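Passing an image into that reasoning chain is a standard multimodal request; a minimal sketch (the image URL is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Image inputs ride alongside text in the same message; the model can then
# reason over the image inside its chain of thought.
resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this circuit diagram compute?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```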
reasoning_effort parameter
The API exposes a reasoning_effort parameter with values low, medium, and high. At low, the model spends fewer tokens on internal reasoning, reducing latency and cost. At high, it reasons more thoroughly - this is what OpenAI calls "o4-mini-high" in the ChatGPT interface. For tasks where speed matters more than maximum accuracy (classification, extraction, code completion), low can deliver response quality that beats non-reasoning models at comparable or lower cost.
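In the Chat Completions API this is a single request parameter; a minimal sketch that also reads back the reasoning-token count from the usage object (field names per the current `openai` Python SDK):

```python
from openai import OpenAI

client = OpenAI()

# Same model, three reasoning budgets. "low" trims internal reasoning for
# latency and cost; "high" is what ChatGPT labels o4-mini-high.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o4-mini",
        reasoning_effort=effort,
        messages=[{"role": "user", "content": "Factor 3x^2 + 10x + 8."}],
    )
    details = resp.usage.completion_tokens_details
    print(effort, details.reasoning_tokens, "reasoning tokens")
```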
o4-mini leads all models on AIME 2024 and 2025 math competition benchmarks, outperforming even o3.
Source: pexels.com
Pricing and Availability
O4-mini is available through the OpenAI API via the Chat Completions and Responses endpoints. The model ID is o4-mini. It also supports the Batch API, fine-tuning, streaming, function calling, and structured outputs. One remarkable addition from the API docs: fine-tuning is supported, which wasn't available on earlier reasoning models.
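The Responses endpoint expresses effort through a nested `reasoning` object rather than the flat `reasoning_effort` parameter used by Chat Completions; a minimal sketch:

```python
from openai import OpenAI

client = OpenAI()

# Responses API call; reasoning effort rides in a nested object here.
resp = client.responses.create(
    model="o4-mini",
    input="Explain why the sky is blue in two sentences.",
    reasoning={"effort": "medium"},
)
print(resp.output_text)
```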
ChatGPT access by tier:
- Free: limited access via "Think" toggle
- Plus/Team: 150 messages/day (o4-mini), 50/day (o4-mini-high)
- Enterprise/Edu: 300 messages/day (o4-mini), 100/day (o4-mini-high)
- API: rate limits scale from 1,000 to 30,000 requests/minute depending on usage tier
Cost comparison against the competitive field:
O4-mini at $1.10/$4.40 competes directly with models like Gemini 2.5 Flash in the cost-optimized reasoning tier. Claude Sonnet sits at $3/$15, making o4-mini roughly 3x cheaper on both input and output despite comparable coding performance on some benchmarks. The cost efficiency leaderboard has a full current comparison.
The batch API discount makes o4-mini particularly attractive for offline workloads - document processing, large-scale evaluation runs, data extraction pipelines - where 24-hour turnaround is acceptable.
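The Batch API flow is upload-then-submit: write one request per line to a JSONL file, upload it, and create the batch (file contents and IDs below are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one Chat Completions request per line, e.g.:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "o4-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the turnaround window the 50% discount assumes
)
print(batch.id, batch.status)  # poll status until "completed", then fetch output
```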
Strengths and Weaknesses
Strengths
- Best-in-class math reasoning: leads all published models on AIME 2024 and 2025
- Near-o3 coding performance at 10x lower cost - the gap on SWE-bench is 1 percentage point
- Native image reasoning integrated into chain of thought, not bolted on
- Full agentic tool use in both ChatGPT and the API
- reasoning_effort parameter lets callers trade latency for accuracy per request
- 200K context handles long documents and large codebases
- Batch API support with 50% discount for async workloads
- Fine-tuning supported - unusual for reasoning models
Weaknesses
- o3 still leads on GPQA Diamond (expert science) and SWE-bench Verified when maximum accuracy is required
- Knowledge cutoff is June 2024, which is aging
- No audio input or output (text and image input only)
- Reasoning tokens are billed at the output token rate ($4.40/M), so heavy use of high effort can be expensive
- Proprietary model with no published architecture or parameter count
- Time to first token is high (median ~24 seconds at high effort) - unsuitable for real-time UX without streaming
Related Coverage
- Coding Benchmarks Leaderboard - SWE-bench and LiveCodeBench rankings across all major models
- Reasoning Benchmarks Leaderboard - GPQA, AIME, and Humanity's Last Exam rankings
- Cost Efficiency Leaderboard - Performance per dollar comparisons
- GPT-4o mini - OpenAI's earlier small model, now superseded for reasoning tasks
FAQ
What is o4-mini best used for?
Math, coding, and visual reasoning tasks where o3-level accuracy is needed but cost or throughput is a constraint. At $1.10/$4.40 per million tokens with batch discounts available, it's the most cost-effective reasoning model in OpenAI's current lineup.
How does o4-mini differ from o4-mini-high?
Both are the same model. "High" refers to the reasoning_effort=high setting, which increases internal reasoning token usage for harder problems. ChatGPT surfaces this as a separate option; API users set it via the reasoning_effort parameter.
Does o4-mini support function calling?
Yes. Function calling, structured outputs, and streaming are all supported. The model also supports fine-tuning, which wasn't available on o3-mini.
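Structured outputs work the same way as on other recent OpenAI models; a sketch using the SDK's `parse` helper with a hypothetical `Triage` schema:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Triage(BaseModel):  # hypothetical output schema for a bug-report router
    severity: str
    component: str

resp = client.beta.chat.completions.parse(
    model="o4-mini",
    messages=[{"role": "user", "content": "Crash on startup after the 2.3 update."}],
    response_format=Triage,
)
print(resp.choices[0].message.parsed)  # a validated Triage instance
```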
What is the context window?
200,000 tokens input with a maximum of 100,000 tokens output. Reasoning tokens count toward output token billing.
Is o4-mini available for free?
Free ChatGPT users can access o4-mini with limited daily usage by selecting the "Think" toggle in the message composer.
Sources
- Introducing OpenAI o3 and o4-mini - OpenAI official announcement
- o4-mini Model Documentation - OpenAI API docs
- OpenAI o4-mini - Wikipedia
- OpenAI launches a pair of AI reasoning models, o3 and o4-mini - TechCrunch
- o4-mini Benchmarks and Analysis - Artificial Analysis
- o4-mini vs o3 Comparison - APIDog
- O4-Mini: Tests, Features, O3 Comparison, Benchmarks - DataCamp
- OpenAI o3 and o4-mini Release Announcement - OpenAI Developer Community
- SWE-bench Leaderboard - SWE-bench official site
Last verified May 11, 2026
