GPT-4o mini
OpenAI's budget API workhorse pairs a 128K context window with $0.15/$0.60 per-million-token pricing, solid coding benchmarks, and the broadest third-party ecosystem of any small model.

Overview
OpenAI released GPT-4o mini on July 18, 2024, as the affordable tier of the GPT-4o family. At $0.15 per million input tokens and $0.60 per million output tokens, it was over 60% cheaper than GPT-3.5 Turbo while being meaningfully smarter across every benchmark. The model quickly became the default choice for production APIs, chatbot backends, and lightweight automation tasks - not because it was the best at anything, but because it was good enough at everything while being cheap and fast.
Eighteen months later, GPT-4o mini is still running in millions of production deployments. It scores 82.0% on MMLU, 87.2% on HumanEval, 70.2% on MATH, and 87.0% on MGSM. Those are not frontier numbers by 2026 standards - newer models like Gemini 2.5 Flash-Lite beat it on science reasoning by a wide margin, and the Qwen 3.5 series has moved the open-source baseline well past what GPT-4o mini can deliver. But the model's ecosystem advantage is real. Every major framework, every tool vendor, and every tutorial defaults to OpenAI's API. Switching costs are not just about token prices.
The honest assessment: GPT-4o mini is no longer the smartest option at its price point. But it remains the safest and most well-integrated option for teams that need a reliable, well-documented API with predictable behavior. Whether that tradeoff still makes sense depends on whether your bottleneck is model intelligence or engineering time.
TL;DR
- $0.15/$0.60 per million tokens with 128K context - the incumbent budget API from OpenAI
- 82.0% MMLU, 87.2% HumanEval, 70.2% MATH - solid across the board, dominant at nothing
- Multimodal input (text + images) with structured output support and strong function calling
- Outpaced by newer competitors on reasoning (GPQA: 40.2%) but unmatched in ecosystem breadth
Key Specifications
| Specification | Details |
|---|---|
| Provider | OpenAI |
| Model Family | GPT-4o |
| Architecture | Not disclosed (presumed dense transformer) |
| Parameters | Not disclosed |
| Context Window | 128,000 tokens |
| Max Output | 16,384 tokens |
| Input Modalities | Text, Images |
| Output Modality | Text (with structured output support) |
| Function Calling | Supported (parallel function calls) |
| Knowledge Cutoff | October 2023 |
| Input Price | $0.15/M tokens |
| Output Price | $0.60/M tokens |
| Release Date | July 18, 2024 |
| License | Proprietary (API access) |
| Availability | OpenAI API, ChatGPT, Azure OpenAI Service |
Benchmark Performance
GPT-4o mini was released as a budget model, and the benchmark profile reflects that positioning. It is broadly competent but no longer leads in any individual category. The comparison below places it against two models with similar pricing and use cases:
| Benchmark | GPT-4o mini | Phi-4 (14B) | Gemini 2.5 Flash-Lite |
|---|---|---|---|
| MMLU (general knowledge) | 82.0 | 84.8 | 81.1 |
| GPQA Diamond (PhD-level science) | 40.2 | 56.1 | 64.6 |
| MATH (competition math) | 70.2 | 80.4 | - |
| HumanEval (code generation) | 87.2 | 82.6 | - |
| MGSM (multilingual math) | 87.0 | 80.6 | - |
| LiveCodeBench (coding) | - | - | 33.7 |
| MMMU (visual reasoning) | - | - | 72.9 |
Two patterns stand out. First, GPT-4o mini's GPQA Diamond score of 40.2% is significantly below both Phi-4 (56.1%) and Gemini 2.5 Flash-Lite (64.6%); on graduate-level science reasoning, it is now the weakest option in its price tier. Second, it still holds its own on applied tasks: the 87.2% HumanEval score remains competitive - code generation is where GPT-4o mini still earns its keep - and its 87.0% on MGSM shows multilingual math capability that exceeds Phi-4's 80.6%.
The MMLU numbers are tightly clustered (81.1-84.8 across all three models), which means general knowledge performance is roughly equivalent. The real differentiation is in the specialized benchmarks, and there, GPT-4o mini's age is starting to show.
Key Capabilities
GPT-4o mini's strongest technical capability is function calling. OpenAI invested heavily in making this model reliable at structured tool use - parallel function calls, JSON schema enforcement via structured outputs, and consistent argument formatting. For applications that need the model to call external APIs, query databases, or drive multi-step workflows, GPT-4o mini's function calling is more battle-tested than any competitor's. Production systems that rely on tool use often stay on GPT-4o mini specifically because switching introduces function calling regressions that are expensive to debug.
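To make the tool-use pattern concrete, here is a minimal sketch of the request shape OpenAI's Chat Completions API expects for function calling. The `get_weather` function, its parameters, and the prompt are hypothetical examples, not part of any real deployment; the dict is built locally and would normally be sent via the official SDK's `client.chat.completions.create(...)`.

```python
import json

# Hypothetical tool definition in the JSON-schema shape the
# Chat Completions API expects under the "tools" parameter.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Sketch of the request body; with the official SDK these fields map
# directly onto client.chat.completions.create(model=..., tools=..., ...).
request_body = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [get_weather_tool],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

print(json.dumps(request_body, indent=2))
```

When the model decides to call the tool, the response carries one or more `tool_calls` entries (parallel calls arrive as a list), each with the function name and JSON-encoded arguments your code then executes.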
The 128K context window handles most real-world document sizes. It is not the 1M tokens that Google and Alibaba offer, but for the vast majority of production use cases - summarizing contracts, analyzing reports, processing customer support threads - 128K is more than sufficient. The model supports multimodal input (text and images), which enables basic document understanding and image-based workflows, though its vision capabilities are modest compared to dedicated multimodal models.
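A quick way to sanity-check whether a document fits the window is a character-count heuristic, sketched below. The ~4-characters-per-token ratio is a rough rule of thumb for English prose, not an exact tokenizer; production code should count with a real tokenizer library such as tiktoken.

```python
CONTEXT_WINDOW = 128_000  # GPT-4o mini input window, in tokens
MAX_OUTPUT = 16_384       # maximum completion length, in tokens

def rough_token_count(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English prose.
    return max(1, len(text) // 4)

def fits_in_context(document: str, reserved_output: int = MAX_OUTPUT) -> bool:
    # Reserve room for the completion inside the shared context budget.
    return rough_token_count(document) + reserved_output <= CONTEXT_WINDOW

# ~62,500 estimated tokens of input: comfortably within 128K
print(fits_in_context("word " * 50_000))
```

This kind of pre-flight check is cheap insurance against truncation errors when batch-processing documents of varying length.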
Where GPT-4o mini struggles is on tasks that require deep reasoning. The GPQA Diamond score of 40.2% is the clearest signal: when problems require graduate-level scientific reasoning, the model falls short. The MATH score of 70.2% is decent but 10 points behind Phi-4, which is a free, open-weight model you can run locally. For straightforward generation, classification, and extraction tasks, GPT-4o mini performs well. For anything that requires sustained multi-step reasoning, newer models have moved ahead. See the coding benchmarks leaderboard for a broader view of how budget models compare on technical tasks.
Pricing and Availability
GPT-4o mini is available through the OpenAI API, ChatGPT (free and Plus tiers), and Azure OpenAI Service. It supports batch processing at a 50% discount.
| Model | Input Cost/M | Output Cost/M | Context |
|---|---|---|---|
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
| Qwen3.5-Flash | $0.10 | $0.40 | 1M |
| Phi-4 (self-hosted) | Free | Free | 16K |
GPT-4o mini is no longer the cheapest option. Gemini 2.5 Flash-Lite and Qwen3.5-Flash both undercut it by 33% on input and offer roughly 8x the context window. Phi-4 is free to self-host, with no per-token costs at all. GPT-4o mini set the budget benchmark at launch, but the market has moved past it. At scale - millions of tokens per day - the 50% input price premium over Google and Alibaba compounds into meaningful cost differences.
That said, OpenAI's batch API (50% discount) brings effective pricing down to $0.075/$0.30 per million tokens for async workloads, which closes the gap significantly. And the Azure OpenAI Service integration means enterprises already in Microsoft's cloud can deploy GPT-4o mini without adding a new vendor relationship.
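The cost arithmetic above is easy to verify. The sketch below compares monthly bills at an illustrative traffic volume (100M input / 20M output tokens per month, an assumption for the example), using the list prices from the table and the 50% batch discount.

```python
def monthly_cost(input_m: float, output_m: float,
                 in_price: float, out_price: float,
                 batch_discount: float = 0.0) -> float:
    """Dollar cost for a month of traffic, with token volumes in millions."""
    rate = 1.0 - batch_discount
    return (input_m * in_price + output_m * out_price) * rate

# Illustrative volume: 100M input, 20M output tokens per month.
gpt4o_mini       = monthly_cost(100, 20, 0.15, 0.60)        # sync pricing
gpt4o_mini_batch = monthly_cost(100, 20, 0.15, 0.60, 0.5)   # batch API
flash_lite       = monthly_cost(100, 20, 0.10, 0.40)

print(f"GPT-4o mini:         ${gpt4o_mini:.2f}")   # $27.00
print(f"GPT-4o mini (batch): ${gpt4o_mini_batch:.2f}")  # $13.50
print(f"Flash-Lite:          ${flash_lite:.2f}")   # $18.00
```

At this volume, batch-mode GPT-4o mini undercuts synchronous Flash-Lite, which is why the batch discount matters so much for async workloads like overnight summarization or backfill jobs.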
Strengths
- Broadest ecosystem support - every major framework, SDK, and tutorial supports OpenAI's API first
- Best-in-class function calling reliability for production tool-use applications
- Structured output support with JSON schema enforcement
- 87.2% HumanEval - strong code generation that holds up against newer competitors
- 128K context window handles most real-world document processing needs
- Batch API at 50% discount for async workloads
- Available on Azure OpenAI Service for enterprise compliance requirements
Weaknesses
- GPQA Diamond score of 40.2% is now well below budget competitors (Flash-Lite: 64.6%, Phi-4: 56.1%)
- Knowledge cutoff of October 2023 is increasingly stale - nearly 2.5 years out of date
- $0.15/$0.60 pricing is 50% more expensive than Gemini 2.5 Flash-Lite and Qwen3.5-Flash on input
- 128K context is 8x smaller than Flash-Lite and Qwen3.5-Flash's 1M token windows
- Parameters and architecture undisclosed - no self-hosting, fine-tuning limited to OpenAI's platform
- No audio input support unlike newer multimodal competitors
- Max output of 16K tokens is restrictive for long-form generation tasks
Related Coverage
- Gemini 2.5 Flash-Lite - Google's price-matched competitor with 1M context
- Claude Opus 4.6 - Anthropic's frontier model for when budget constraints are not the primary concern
- Open Source vs Proprietary AI - Understanding when free open-weight models like Phi-4 make more sense than paid APIs
- Coding Benchmarks Leaderboard - Where budget models rank on code generation
- DeepSeek V3.2 Review - Another strong budget-tier API option
