
Cost Efficiency Leaderboard: Best AI Performance Per Dollar

Rankings of AI models by cost efficiency, comparing performance per dollar across frontier and budget models. See which models deliver GPT-4 level performance at roughly a thousandth of the cost.


Two years ago, GPT-4 was the undisputed best model available, and it cost $30 per million input tokens and $60 per million output tokens. Today, you can get equivalent or better performance for a few cents per million input tokens. The collapse in AI pricing is one of the most dramatic cost curves in the history of technology, and understanding which models deliver the best performance per dollar is essential for anyone building AI-powered products.

This leaderboard ranks models not by raw capability but by the ratio of performance to cost. The best model is not the one that scores highest on benchmarks; it is the one that scores highest relative to what you pay.

Cost Efficiency Rankings

| Rank | Model | Provider | Input / Output (per 1M tokens) | MMLU-Pro | Chatbot Arena Elo | Efficiency Score |
|---|---|---|---|---|---|---|
| 1 | DeepSeek V3.2 | DeepSeek | $0.028 / $0.11 | 84.1% | 1348 | 98.5 |
| 2 | Qwen 3 32B (self-hosted) | Alibaba | ~$0.02 / ~$0.02 | 73.5% | 1238 | 95.2 |
| 3 | Llama 4 Scout (self-hosted) | Meta | ~$0.01 / ~$0.01 | 78.5% | 1278 | 94.8 |
| 4 | DeepSeek V3.2-Speciale | DeepSeek | $0.28 / $1.10 | 85.9% | 1361 | 91.3 |
| 5 | Gemini 2.5 Flash | Google DeepMind | $0.15 / $0.60 | 80.2% | 1335 | 90.7 |
| 6 | Mistral 3 Small | Mistral AI | $0.10 / $0.30 | 74.8% | 1245 | 88.2 |
| 7 | Qwen 3.5 | Alibaba | $0.50 / $2.00 | 84.6% | 1342 | 85.6 |
| 8 | Gemini 3 Pro | Google DeepMind | $1.25 / $5.00 | 89.8% | 1389 | 78.4 |
| 9 | GPT-5.2 | OpenAI | $2.50 / $10.00 | 86.3% | 1380 | 62.1 |
| 10 | Claude Opus 4.6 | Anthropic | $15.00 / $75.00 | 88.2% | 1398 | 35.8 |

Efficiency Score is a normalized composite of (benchmark performance / cost). Higher is better. Self-hosted costs assume amortized GPU costs at moderate utilization.
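The exact scoring formula is not published here, but the shape of the metric is easy to sketch. The snippet below shows one plausible construction: benchmark score divided by a blended per-token cost, normalized so the best model scores 100. The 3:1 input-to-output traffic mix and the linear normalization are assumptions, and the resulting numbers will not match the table's scores, which are evidently compressed onto a narrower scale.

```python
# Illustrative performance-per-dollar composite. The article's actual
# formula is not published; this is a sketch under assumed weightings.

models = {
    # name: (input $/1M tokens, output $/1M tokens, MMLU-Pro %)
    "DeepSeek V3.2":    (0.028, 0.11, 84.1),
    "Gemini 2.5 Flash": (0.15, 0.60, 80.2),
    "Gemini 3 Pro":     (1.25, 5.00, 89.8),
    "Claude Opus 4.6":  (15.00, 75.00, 88.2),
}

def blended_cost(inp, out, output_share=0.25):
    """Weighted per-million-token cost, assuming 3:1 input:output traffic."""
    return (1 - output_share) * inp + output_share * out

raw = {name: mmlu / blended_cost(i, o) for name, (i, o, mmlu) in models.items()}
best = max(raw.values())
scores = {name: round(100 * r / best, 1) for name, r in raw.items()}

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {s}")
```

Even this naive ratio reproduces the ranking's headline finding: the cheapest competitive model dominates on a linear performance-per-dollar scale, which is why published leaderboards typically compress the score range.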

The Price-Performance Frontier

The most striking fact about AI pricing in 2026 is the sheer range. The cheapest model on our list (Llama 4 Scout self-hosted at roughly $0.01 per million tokens) costs roughly 1/1,500th as much as the most expensive (Claude Opus 4.6 at $15 per million input tokens). Yet the performance gap between them is perhaps 15-20% on most benchmarks. You pay a massive premium for that last increment of quality.

This creates a clear decision framework:

The "Good Enough" Tier ($0.01-$0.10 per million tokens): DeepSeek V3.2, Qwen 3 32B, Llama 4 Scout, and Mistral 3 Small all deliver performance that would have been considered frontier-class in early 2024. For the vast majority of applications, including chatbots, content generation, summarization, basic coding assistance, and data extraction, these models are more than sufficient. DeepSeek V3.2 at $0.028 per million input tokens is the standout, offering 84.1% on MMLU-Pro at a price that makes high-volume applications economically viable.

The "Premium Efficient" Tier ($0.15-$2.00 per million tokens): Gemini 2.5 Flash, Qwen 3.5, and DeepSeek V3.2-Speciale occupy the sweet spot for applications that need near-frontier performance without frontier pricing. Gemini 2.5 Flash is particularly notable: its 1335 Arena Elo at $0.15 per million input tokens makes it perhaps the best value in the entire market for conversational AI.

The "Frontier" Tier ($1.25-$15.00 per million tokens): Gemini 3 Pro, GPT-5.2, and Claude Opus 4.6 are for when you need the absolute best results and cost is secondary: research applications, high-stakes decision support, complex coding tasks, and any scenario where a 5% improvement in accuracy has meaningful business value.
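To make the tier gap concrete, here is what a single illustrative workload costs per month at one representative price point from each tier. The 50M-input / 10M-output token volume is an assumed example, not a figure from the rankings.

```python
# Monthly cost of one assumed workload at each tier's representative prices.
IN_TOKENS, OUT_TOKENS = 50e6, 10e6   # assumed monthly volume

prices = {  # $ per 1M tokens (input, output), taken from the rankings table
    "Good enough (DeepSeek V3.2)":          (0.028, 0.11),
    "Premium efficient (Gemini 2.5 Flash)": (0.15, 0.60),
    "Frontier (Claude Opus 4.6)":           (15.00, 75.00),
}

costs = {
    tier: IN_TOKENS / 1e6 * inp + OUT_TOKENS / 1e6 * out
    for tier, (inp, out) in prices.items()
}

for tier, c in costs.items():
    print(f"{tier:40s} ${c:,.2f}/month")
```

The same workload runs at about $2.50, $13.50, or $1,500 per month depending on the tier, which is the whole decision framework in three numbers.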

The Two-Year Price Collapse

To appreciate how dramatically pricing has shifted, consider this comparison:

| Capability Level | Cost in Feb 2024 (per 1M tokens) | Cost in Feb 2026 (per 1M tokens) | Reduction |
|---|---|---|---|
| GPT-4 equivalent | $30.00 / $60.00 | $0.028 / $0.11 | ~99.9% |
| GPT-4 Turbo equivalent | $10.00 / $30.00 | $0.01 / $0.01 | ~99.9% |
| Frontier (best available) | $30.00 / $60.00 | $15.00 / $75.00 | ~50% (input) |

The most dramatic savings come from matching the performance of older frontier models. Getting GPT-4 level performance now costs roughly one-thousandth of what it did two years ago. Even at the frontier, input prices have roughly halved despite substantially better performance, though frontier output prices have actually risen, from $60 to $75 per million tokens.
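The reduction figures follow directly from the input prices in the table, and are easy to sanity-check:

```python
# Sanity-check the reduction column above. Input prices only: the frontier
# output price rose ($60 -> $75), so only the input column halves.

pairs = {  # level: (Feb 2024 input $/1M, Feb 2026 input $/1M)
    "GPT-4 equivalent":       (30.00, 0.028),
    "GPT-4 Turbo equivalent": (10.00, 0.01),
    "Frontier":               (30.00, 15.00),
}

reductions = {level: 100 * (1 - now / then) for level, (then, now) in pairs.items()}

for level, r in reductions.items():
    then, now = pairs[level]
    print(f"{level:24s} {r:.2f}% cheaper ({then / now:,.0f}x)")
```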

This price collapse is driven by three factors: architectural efficiency (mixture-of-experts models like DeepSeek V3.2 activate only a fraction of their parameters per token), hardware improvements (newer GPUs deliver more inference throughput per dollar), and competition (the entry of DeepSeek, Qwen, and others has forced all providers to cut margins).

Self-Hosting vs. API: When to Make the Switch

For high-volume applications, self-hosting open-weight models can be dramatically cheaper than API access. Here is a rough comparison:

| Monthly Volume | Best API Option | API Cost/Month | Best Self-Hosted | Self-Hosted Cost/Month |
|---|---|---|---|---|
| 10M tokens | DeepSeek V3.2 | $0.28 | Not worth it | Higher (overhead) |
| 100M tokens | DeepSeek V3.2 | $2.80 | Not worth it | Higher (overhead) |
| 1B tokens | DeepSeek V3.2 | $28.00 | Qwen 3 32B | ~$20.00 |
| 10B tokens | DeepSeek V3.2 | $280.00 | Llama 4 Scout | ~$100.00 |
| 100B tokens | DeepSeek V3.2 | $2,800.00 | Llama 4 Scout | ~$800.00 |

API costs are computed from input-token pricing; output-heavy workloads will run somewhat higher.

The break-even point depends on your engineering capacity and operational requirements, but broadly, self-hosting starts making financial sense around 1-10 billion tokens per month. Below that volume, the convenience and reliability of API access usually outweighs the cost savings.
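The break-even logic above can be sketched as a simple model: API cost scales linearly with volume, while self-hosting pays a fixed operational overhead plus amortized GPU time. The GPU hourly rate, batched throughput, and overhead figures below are illustrative assumptions, not measurements from the table.

```python
# Rough API-vs-self-hosted break-even sketch. GPU_HOUR, TOKENS_PER_SEC,
# and BASE_OVERHEAD are assumed illustrative values, not measured numbers.

API_PRICE_PER_1M = 0.028   # DeepSeek V3.2 input price, $/1M tokens

GPU_HOUR = 2.00            # assumed amortized cost of one GPU-hour, $
TOKENS_PER_SEC = 50_000    # assumed aggregate throughput, heavily batched
BASE_OVERHEAD = 100.0      # assumed fixed monthly ops overhead, $

def api_cost(tokens_per_month):
    return tokens_per_month / 1e6 * API_PRICE_PER_1M

def self_hosted_cost(tokens_per_month):
    gpu_hours = tokens_per_month / (TOKENS_PER_SEC * 3600)
    return BASE_OVERHEAD + gpu_hours * GPU_HOUR

for volume in (1e8, 1e9, 1e10, 1e11):
    a, s = api_cost(volume), self_hosted_cost(volume)
    winner = "API" if a < s else "self-host"
    print(f"{volume:>15,.0f} tokens: API ${a:,.0f} vs self ${s:,.0f} -> {winner}")
```

Under these assumptions the crossover lands between 1B and 10B tokens per month, consistent with the break-even range above; with cheaper GPUs or higher utilization it shifts earlier, and with more operational overhead it shifts later.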

Recommendations

For startups and prototypes: Start with DeepSeek V3.2 or Gemini 2.5 Flash via API. The cost is negligible, and the performance is excellent.

For production applications at scale: Evaluate DeepSeek V3.2-Speciale or Qwen 3.5 via API, or self-host Llama 4 Scout or Qwen 3 32B.

For quality-critical applications: Use Gemini 3 Pro for the best price-to-performance at the frontier tier, or Claude Opus 4.6 and GPT-5.2 when maximum quality justifies the premium.

The most expensive model is not always the best choice. In fact, for the vast majority of real-world applications, it usually is not.

About the author

Elena, Senior AI Editor & Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.