Cost Efficiency Leaderboard: Best AI Performance Per Dollar
Rankings of AI models by cost efficiency, comparing performance per dollar across frontier and budget models. See which models deliver GPT-4 level performance at roughly one-thousandth of the cost.

Two years ago, GPT-4 was the undisputed best model available, and it cost $30 per million input tokens and $60 per million output tokens. Today, you can get equivalent or better performance for less than $0.10 per million tokens. The collapse in AI pricing is one of the most dramatic cost curves in the history of technology, and understanding which models deliver the best performance per dollar is essential for anyone building AI-powered products.
This leaderboard ranks models not by raw capability but by the ratio of performance to cost. The best model is not the one that scores highest on benchmarks; it is the one that scores highest relative to what you pay.
Cost Efficiency Rankings
| Rank | Model | Provider | Input / Output (per 1M tokens) | MMLU-Pro | Chatbot Arena Elo | Efficiency Score |
|---|---|---|---|---|---|---|
| 1 | DeepSeek V3.2 | DeepSeek | $0.028 / $0.11 | 84.1% | 1348 | 98.5 |
| 2 | Qwen 3 32B (self-hosted) | Alibaba | ~$0.02 / ~$0.02 | 73.5% | 1238 | 95.2 |
| 3 | Llama 4 Scout (self-hosted) | Meta | ~$0.01 / ~$0.01 | 78.5% | 1278 | 94.8 |
| 4 | DeepSeek V3.2-Speciale | DeepSeek | $0.28 / $1.10 | 85.9% | 1361 | 91.3 |
| 5 | Gemini 2.5 Flash | Google DeepMind | $0.15 / $0.60 | 80.2% | 1335 | 90.7 |
| 6 | Mistral 3 Small | Mistral AI | $0.10 / $0.30 | 74.8% | 1245 | 88.2 |
| 7 | Qwen 3.5 | Alibaba | $0.50 / $2.00 | 84.6% | 1342 | 85.6 |
| 8 | Gemini 3 Pro | Google DeepMind | $1.25 / $5.00 | 89.8% | 1389 | 78.4 |
| 9 | GPT-5.2 | OpenAI | $2.50 / $10.00 | 86.3% | 1380 | 62.1 |
| 10 | Claude Opus 4.6 | Anthropic | $15.00 / $75.00 | 88.2% | 1398 | 35.8 |
Efficiency Score is a normalized composite of (benchmark performance / cost). Higher is better. Self-hosted costs assume amortized GPU costs at moderate utilization.
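The exact formula behind the Efficiency Score is not published here, but one plausible construction is to take benchmark performance per blended dollar, log-scale it (since costs span three orders of magnitude), and normalize to 0-100. The sketch below does this for three models from the table; the 25% output-token share and the log-scaling are illustrative assumptions, not the leaderboard's actual method.

```python
import math

models = {
    # name: (input $/1M tokens, output $/1M tokens, MMLU-Pro %)
    "DeepSeek V3.2":    (0.028, 0.11, 84.1),
    "Gemini 2.5 Flash": (0.15,  0.60, 80.2),
    "Claude Opus 4.6":  (15.00, 75.00, 88.2),
}

def blended_cost(inp, out, output_share=0.25):
    """Weighted per-token cost, assuming ~25% of tokens are output."""
    return (1 - output_share) * inp + output_share * out

# Raw efficiency: benchmark points per blended dollar.
raw = {name: m[2] / blended_cost(m[0], m[1]) for name, m in models.items()}

# Log-scale before normalizing, since costs span ~3 orders of magnitude.
logs = {name: math.log10(r) for name, r in raw.items()}
lo, hi = min(logs.values()), max(logs.values())
scores = {name: 100 * (v - lo) / (hi - lo) for name, v in logs.items()}

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {s:6.1f}")
```

Note that this toy normalization pins the best and worst models to 100 and 0; the table's scores clearly use a different scaling, but the ordering it produces is the same.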
The Price-Performance Frontier
The most striking fact about AI pricing in 2026 is the sheer range. The cheapest model on our list (Llama 4 Scout self-hosted at roughly $0.01 per million tokens) costs roughly 1/1,500th as much as the most expensive (Claude Opus 4.6 at $15 per million input tokens). Yet the performance gap between them is perhaps 15-20% on most benchmarks. You pay a massive premium for that last increment of quality.
This creates a clear decision framework:
The "Good Enough" Tier ($0.01-$0.10 per million tokens): DeepSeek V3.2, Qwen 3 32B, Llama 4 Scout, and Mistral 3 Small all deliver performance that would have been considered frontier-class in early 2024. For the vast majority of applications, including chatbots, content generation, summarization, basic coding assistance, and data extraction, these models are more than sufficient. DeepSeek V3.2 at $0.028 per million input tokens is the standout, offering 84.1% on MMLU-Pro at a price that makes high-volume applications economically viable.
The "Premium Efficient" Tier ($0.15-$2.00 per million tokens): Gemini 2.5 Flash, Qwen 3.5, and DeepSeek V3.2-Speciale occupy the sweet spot for applications that need near-frontier performance without frontier pricing. Gemini 2.5 Flash is particularly notable: its 1335 Arena Elo at $0.15 per million input tokens makes it perhaps the best value in the entire market for conversational AI.
The "Frontier" Tier ($1.25-$15.00 per million tokens): Gemini 3 Pro, GPT-5.2, and Claude Opus 4.6 are for when you need the absolute best results and cost is secondary: research applications, high-stakes decision support, complex coding tasks, and any scenario where a 5% improvement in accuracy has meaningful business value.
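To make the tiers concrete, the sketch below estimates monthly cost for one representative model per tier at a given volume, using prices from the rankings table. The 3:1 input/output token split is an illustrative assumption; your workload's ratio will shift these numbers.

```python
# One representative model per tier (prices from the rankings table).
REPRESENTATIVES = {
    "good-enough":       ("DeepSeek V3.2",    0.028, 0.11),
    "premium-efficient": ("Gemini 2.5 Flash", 0.15,  0.60),
    "frontier":          ("Gemini 3 Pro",     1.25,  5.00),
}

def monthly_cost(total_tokens_m: float, in_price: float, out_price: float,
                 output_share: float = 0.25) -> float:
    """Dollar cost for total_tokens_m million tokens, assuming
    output_share of all tokens are (more expensive) output tokens."""
    return total_tokens_m * ((1 - output_share) * in_price
                             + output_share * out_price)

for tier, (model, inp, out) in REPRESENTATIVES.items():
    cost = monthly_cost(100, inp, out)  # 100M tokens/month
    print(f"{tier:17s} {model:17s} ${cost:,.2f}/month")
```

At 100M tokens per month, the spread runs from under $5 on the cheapest tier to over $200 at the frontier, which is why tier choice dominates most other cost optimizations.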
The Two-Year Price Collapse
To appreciate how dramatically pricing has shifted, consider this comparison:
| Capability Level | Cost in Feb 2024 | Cost in Feb 2026 | Reduction |
|---|---|---|---|
| GPT-4 equivalent | $30.00 / $60.00 per 1M tokens | $0.028 / $0.11 per 1M tokens | ~99.9% |
| GPT-4 Turbo equivalent | $10.00 / $30.00 per 1M tokens | $0.01 / $0.01 per 1M tokens | ~99.9% |
| Frontier (best available) | $30.00 / $60.00 per 1M tokens | $15.00 / $75.00 per 1M tokens | ~50% (input) |
The most dramatic savings come from matching the performance of older frontier models. Getting GPT-4 level performance now costs roughly one-thousandth of what it did two years ago. Even at the frontier, input prices have roughly halved despite substantially better performance, though output prices have actually risen.
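The reduction figures above are simple percentage drops; the snippet below reproduces them from the table's prices, computing input and output separately (a negative reduction means the price went up, as with frontier output tokens).

```python
def reduction(old: float, new: float) -> float:
    """Percent decrease from old price to new (negative = increase)."""
    return 100 * (1 - new / old)

# GPT-4 equivalent: $30 -> $0.028 input, $60 -> $0.11 output
print(f"{reduction(30, 0.028):.2f}%")   # 99.91%
print(f"{reduction(60, 0.11):.2f}%")    # 99.82%

# Frontier: $30 -> $15 input (halved), $60 -> $75 output (rose 25%)
print(f"{reduction(30, 15):.1f}%")      # 50.0%
print(f"{reduction(60, 75):.1f}%")      # -25.0%
```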
This price collapse is driven by three factors: architectural efficiency (mixture-of-experts models like DeepSeek V3.2 activate only a fraction of their parameters per token), hardware improvements (newer GPUs deliver more inference throughput per dollar), and competition (the entry of DeepSeek, Qwen, and others has forced all providers to cut margins).
Self-Hosting vs. API: When to Make the Switch
For high-volume applications, self-hosting open-weight models can be dramatically cheaper than API access. Here is a rough comparison (API figures assume input-dominated workloads priced at the input rate):
| Monthly Volume | Best API Option | API Cost/Month | Best Self-Hosted | Self-Hosted Cost/Month |
|---|---|---|---|---|
| 10M tokens | DeepSeek V3.2 | $0.28 | Not worth it | Higher (overhead) |
| 100M tokens | DeepSeek V3.2 | $2.80 | Not worth it | Higher (overhead) |
| 1B tokens | DeepSeek V3.2 | $28.00 | Qwen 3 32B | ~$20.00 |
| 10B tokens | DeepSeek V3.2 | $280.00 | Llama 4 Scout | ~$100.00 |
| 100B tokens | DeepSeek V3.2 | $2,800.00 | Llama 4 Scout | ~$800.00 |
The break-even point depends on your engineering capacity and operational requirements, but broadly, self-hosting starts making financial sense around 1-10 billion tokens per month. Below that volume, the convenience and reliability of API access usually outweighs the cost savings.
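One way to model the break-even dynamic: API cost scales linearly with volume, while self-hosting has a monthly floor (idle GPU time, ops overhead) plus a low marginal rate. The floor and marginal rate below are illustrative assumptions fitted loosely to the comparison table, not measured figures.

```python
API_PER_M = 0.028      # DeepSeek V3.2 input price, $/1M tokens
SELF_FLOOR = 20.0      # assumed minimum $/month to keep a node running
SELF_PER_M = 0.008     # assumed marginal self-host cost, $/1M tokens

def api_cost(tokens_m: float) -> float:
    """API spend scales linearly with volume."""
    return tokens_m * API_PER_M

def self_host_cost(tokens_m: float) -> float:
    """Self-hosting pays the floor even at low utilization."""
    return max(SELF_FLOOR, tokens_m * SELF_PER_M)

for volume in (10, 100, 1_000, 10_000, 100_000):  # millions of tokens
    a, s = api_cost(volume), self_host_cost(volume)
    winner = "self-host" if s < a else "API"
    print(f"{volume:>7,}M tokens: API ${a:,.2f} vs self ${s:,.2f} -> {winner}")
```

Under these assumptions the crossover lands just below 1B tokens per month, consistent with the table; a higher floor (more realistic for dedicated GPU clusters) pushes break-even toward the 10B end of the range.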
Recommendations
For startups and prototypes: Start with DeepSeek V3.2 or Gemini 2.5 Flash via API. The cost is negligible, and the performance is excellent.
For production applications at scale: Evaluate DeepSeek V3.2-Speciale or Qwen 3.5 via API, or self-host Llama 4 Scout or Qwen 3 32B.
For quality-critical applications: Use Gemini 3 Pro for the best price-to-performance at the frontier tier, or Claude Opus 4.6 and GPT-5.2 when maximum quality justifies the premium.
The most expensive model is not always the best choice. In fact, for the vast majority of real-world applications, it usually is not.