Overall LLM Rankings: April 2026

Comprehensive ranking of the top large language models in April 2026, combining reasoning, coding, knowledge, human preference, and cost-adjusted value across 12 frontier and open-weight models.

Two months on from our February rankings, the landscape has shifted more than in any comparable period over the past year. Gemini 3.1 Pro dethroned GPT-5.2 Pro on GPQA Diamond. GPT-5.4 arrived and took the overall crown. Claude Opus 4.6 climbed to the top of Chatbot Arena. Gemma 4 crashed the open-weight party with scores that embarrass models with five times its parameter count. And Anthropic quietly published Mythos Preview benchmarks that blow past everything on this list - but you can't use the model, so it doesn't count.

Here's where everything stands.

The April 2026 Overall Rankings

| Rank | Model | Provider | GPQA Diamond | SWE-Bench Verified | Arena Elo | Price (In/Out per 1M) | Open? |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 92.8% | 77.2% | ~1484 | $2.50 / $15.00 | No |
| 2 | Claude Opus 4.6 | Anthropic | 91.3% | 80.8% | ~1504 | $5.00 / $25.00 | No |
| 3 | Gemini 3.1 Pro | Google | 94.3% | 80.6% | ~1500 | $2.00 / $12.00 | No |
| 4 | Grok 4.20 | xAI | ~88% | N/P | ~1493 | $2.00 / $6.00 | No |
| 5 | GPT-5.2 Pro | OpenAI | 93.2% | 80.0% | ~1402 | $10.00 / $30.00 | No |
| 6 | Gemma 4 31B | Google | 84.3% | N/P | ~1452 | Free (Apache 2.0) | Yes |
| 7 | Claude Sonnet 4.6 | Anthropic | 74.1% | 79.6% | ~1438 | $3.00 / $15.00 | No |
| 8 | Qwen 3.5 27B | Alibaba | 85.5% | 72.4% | ~1342 | Free (Apache 2.0) | Yes |
| 9 | Mistral Large 3 | Mistral AI | 43.9% | N/P | ~1418 | $0.50 / $1.50 | Yes |
| 10 | DeepSeek V3.2 | DeepSeek | 82.4% | 73.1% | ~1361 | $0.28 / $0.42 | Yes |
| 11 | Grok 4 Heavy | xAI | 88.4% | N/P | ~1375 | $3.00 / $15.00 | No |
| 12 | Llama 4 Maverick | Meta | 69.8% | 55.8% | ~1320 | Free (Llama License) | Yes |

N/P = Not Published by provider. Arena Elo from lmarena.ai as of April 2026. Pricing reflects standard API tiers.
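
The overall rank blends those columns rather than tracking any single benchmark. The exact weighting isn't published here, so the sketch below is purely illustrative: hypothetical weights and a crude normalization, with the benchmark figures taken from the table above, just to show what a cost-adjusted composite of this kind can look like.

```python
# Hypothetical composite score. The weights, Elo rescaling, and price cap are
# illustrative assumptions, NOT the weighting actually used for this ranking.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    gpqa: float          # GPQA Diamond, percent
    swe: float | None    # SWE-Bench Verified, percent (None if not published)
    elo: float           # Chatbot Arena Elo
    out_price: float     # $ per 1M output tokens (0.0 for free/open weights)

def composite(m: Model, w_gpqa=0.3, w_swe=0.3, w_elo=0.3, w_cost=0.1) -> float:
    """Blend normalized reasoning, coding, human preference, and price."""
    gpqa = m.gpqa / 100.0
    swe = m.swe / 100.0 if m.swe is not None else gpqa    # crude fallback for N/P
    elo = (m.elo - 1300) / 250.0                          # rescale the ~1320-1504 spread
    cost = 1.0 - min(m.out_price, 30.0) / 30.0            # cheaper is better, capped at $30
    return w_gpqa * gpqa + w_swe * swe + w_elo * elo + w_cost * cost

models = [
    Model("GPT-5.4",         92.8, 77.2, 1484, 15.00),
    Model("Claude Opus 4.6", 91.3, 80.8, 1504, 25.00),
    Model("DeepSeek V3.2",   82.4, 73.1, 1361,  0.42),
]
for m in sorted(models, key=composite, reverse=True):
    print(f"{m.name:18s} {composite(m):.3f}")
```

With these made-up weights the ordering happens to agree with the table (GPT-5.4, then Opus 4.6, then DeepSeek V3.2), but shifting the weights toward raw Elo or toward price reshuffles the middle of the list quickly.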


What Changed Since February

The new #1: GPT-5.4

GPT-5.4 landed on March 5 and immediately reshuffled the rankings. Its 77.2% on SWE-Bench Verified is a 19-point jump from GPT-5.2's 58.2%. GPQA Diamond comes in at 92.8% - not the highest (Gemini 3.1 Pro holds that at 94.3%), but paired with 75.0% on OSWorld-Verified (beating the 72.4% human baseline) and a 73.3% ARC-AGI-2 score, the all-around profile is the strongest of any publicly available model.

At $2.50/$15.00 per million tokens, it also costs a quarter of GPT-5.2 Pro's input price and half its output price - OpenAI's clearest signal yet that frontier capability is moving down the price curve.

Gemini 3.1 Pro enters the top three

Gemini 3.1 Pro now holds the highest GPQA Diamond score of any publicly available model at 94.3% and the highest ARC-AGI-2 score at 77.1% (2.5x its predecessor). Its SWE-Bench Verified scores range from 78.8% (independent evaluation) to 80.6% (Google's claimed figure), putting it in the same tier as Opus 4.6. At $2.00/$12.00, it offers the best price-to-reasoning ratio among frontier models by a significant margin.

Claude Opus 4.6 leads Chatbot Arena

Opus 4.6 climbed from an Arena Elo of ~1398 in February to ~1504 in April - the highest of any model in human preference voting. It also still leads SWE-Bench Verified at 80.8% among models with consistent independent verification. The $5/$25 pricing (corrected from the $15/$75 in our February edition, which reflected an older tier structure) makes it more accessible than the numbers suggested two months ago.

Grok 4.20: fast, cheap, and hard to benchmark

Grok 4.20 is the ranking's biggest uncertainty. xAI hasn't published GPQA Diamond, SWE-Bench, or MMLU-Pro scores for this model specifically. What we know: it posts ~1493 on Chatbot Arena (neck and neck with Gemini 3.1 Pro), outputs at 234.9 tokens/second (fastest among flagships), has a 2M token context window (largest among flagships), and costs just $2/$6. Our ~88% GPQA estimate comes from the earlier Grok 4 scores - until xAI publishes benchmarks, the ranking carries an asterisk.

Gemma 4: the open-weight disruptor

Gemma 4 shipped April 2 with four variants under Apache 2.0. The 31B Dense hits 85.2% on MMLU-Pro, 84.3% on GPQA Diamond, 89.2% on AIME 2026, and an Arena Elo of ~1452 - placing it above Sonnet 4.6 in human preference despite being a fraction of the size. At $0.14/$0.40 via OpenRouter (or free self-hosted), it's the strongest argument yet that the open-weight tier has caught the proprietary middle class.

DeepSeek V3.2: still the price king, still waiting for V4

DeepSeek V3.2 holds its position as the best value on the board. At $0.28 input / $0.42 output per million tokens (or $0.028 on cache hits), nothing comes close for the performance delivered. V4 is expected in late April but hasn't officially launched. Leaked benchmarks suggest major jumps, but until they're confirmed, V3.2 remains DeepSeek's production entry.
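
To see how much the cache-hit tier matters in practice, here's a quick back-of-the-envelope sketch. The per-token prices come from the figures above; the 70% hit rate and monthly token volumes are hypothetical.

```python
# Effective DeepSeek V3.2 cost with prompt caching. Prices are the published
# tiers quoted above; the hit rate and token volumes are made-up assumptions.
INPUT_MISS = 0.28 / 1_000_000    # $ per input token (cache miss)
INPUT_HIT  = 0.028 / 1_000_000   # $ per input token (cache hit)
OUTPUT     = 0.42 / 1_000_000    # $ per output token

def monthly_cost(input_tokens: int, output_tokens: int, hit_rate: float) -> float:
    """Blend cache-hit and cache-miss input pricing, then add output cost."""
    blended_input = hit_rate * INPUT_HIT + (1 - hit_rate) * INPUT_MISS
    return input_tokens * blended_input + output_tokens * OUTPUT

# Hypothetical workload: 500M input + 50M output tokens per month.
print(f"0% cache hits:  ${monthly_cost(500_000_000, 50_000_000, 0.0):,.2f}")
print(f"70% cache hits: ${monthly_cost(500_000_000, 50_000_000, 0.7):,.2f}")
```

At a 70% hit rate the effective input price drops from $0.28 to roughly $0.10 per million tokens, which is where much of the value advantage comes from on prompt-heavy workloads.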


The Ghost Rankings: Models You Can't Use

Two models would top this leaderboard if they were publicly available:

Claude Mythos Preview (our coverage): 93.9% SWE-Bench Verified, 94.6% GPQA Diamond, 82.0% Terminal-Bench 2.0, 64.7% Humanity's Last Exam with tools. These numbers are self-reported by Anthropic and independently unverifiable, since the model is restricted to Project Glasswing partners at $25/$125 per million tokens. If these scores hold under independent evaluation, Mythos would be the clear #1 by every metric.

DeepSeek V4 (unreleased): Leaked benchmarks claim ~81% SWE-bench Verified, 92.8% MMLU, 99.4% AIME 2026. None are confirmed. Expected late April.


Benchmark Reliability Notes

Several caveats for this edition:

SWE-Bench Verified scores vary by scaffolding. GPT-5.4 scores 77.2% per OpenAI and 74.9% per independent evaluators. GPT-5.2 Pro ranges from 65.8% to 80.0% depending on the agent setup. We report provider-claimed numbers with independent figures noted where available.

Chatbot Arena Elo has shifted substantially. Our February Elo figures were snapshots from mid-February. The April figures reflect two months of additional voting. Opus 4.6's jump from 1398 to ~1504 reflects genuine preference shifts as more users evaluated it, not a model update.

MMLU-Pro is becoming less useful. Multiple frontier models now cluster within 3 points of each other (85-89%), and several providers have stopped reporting it altogether. We've dropped it from the main ranking table in favor of GPQA Diamond, which still shows meaningful separation.

Grok 4.20 benchmarks are largely absent. xAI relies on Arena Elo and speed metrics over academic benchmarks. This makes direct comparison difficult. We've ranked it based on Arena placement and available Grok 4 scores as a proxy, but this is our lowest-confidence ranking.


The Cost-Performance Map

The pricing spread between the cheapest and most expensive paid models on this list is now roughly two orders of magnitude - before even counting the free open-weight options.

| Tier | Models | Typical Use |
|---|---|---|
| Premium ($10-30 out) | GPT-5.2 Pro, Claude Opus 4.6 | Max-accuracy reasoning, complex agentic tasks |
| Flagship ($6-15 out) | GPT-5.4, Gemini 3.1 Pro, Grok 4.20, Sonnet 4.6 | Production workloads, balanced cost/quality |
| Value ($0.40-1.50 out) | DeepSeek V3.2, Mistral Large 3, Qwen 3.5 | High-volume processing, cost-sensitive apps |
| Free/OSS | Gemma 4, Qwen 3.5, Llama 4, DeepSeek V3.2 | Self-hosted, privacy-sensitive, offline |

For most production applications, the flagship tier delivers 90-95% of premium-tier performance at a third of the cost. The value tier delivers 80-85% at a tenth. The case for premium pricing gets harder to make with each update.
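
To make that concrete, here's a rough sketch pricing the same hypothetical workload at list prices from the ranking table. The token volumes are made up, and this counts raw API list price only - no retries, caching, or volume discounts.

```python
# Monthly cost of one hypothetical workload (100M input / 20M output tokens)
# at the list prices quoted in the ranking table. Workload size is an assumption.
PRICES = {  # (input $, output $) per 1M tokens
    "GPT-5.2 Pro (premium)":     (10.00, 30.00),
    "GPT-5.4 (flagship)":        (2.50, 15.00),
    "Gemini 3.1 Pro (flagship)": (2.00, 12.00),
    "DeepSeek V3.2 (value)":     (0.28, 0.42),
}

IN_M, OUT_M = 100, 20  # millions of tokens per month

for model, (p_in, p_out) in PRICES.items():
    cost = IN_M * p_in + OUT_M * p_out
    print(f"{model:28s} ${cost:>9,.2f}/month")
```

On these list prices the flagship models land at roughly a third of the premium bill, consistent with the trade-off described above.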


Coding has plateaued at the top. The best models (Opus 4.6, Gemini 3.1 Pro, GPT-5.4) all cluster between 77-81% on SWE-Bench Verified. Mythos Preview's 93.9% suggests the ceiling is much higher, but that capability isn't commercially available.

Arena Elo is diverging from benchmarks. Grok 4.20 ranks 4th in Arena Elo despite having the least published benchmark data on this list. Mistral Large 3 scores 43.9% on GPQA Diamond but posts 1418 Elo - above several models that score twice as high. Human preference captures something benchmarks miss.

Open-weight models now hold five of the twelve spots: Gemma 4, Qwen 3.5, DeepSeek V3.2, Mistral Large 3, and Llama 4 Maverick. Two years ago, no open model would have appeared on a frontier ranking. The gap hasn't closed entirely - no open model matches the top three on coding or reasoning - but for production use cases where 85% accuracy suffices, the commercial case for proprietary APIs is eroding.

Google is the value play. Gemini 3.1 Pro at $2/$12 matches or beats models costing 2-5x more on most benchmarks. Gemma 4 at $0.14/$0.40 (or free) rivals Sonnet 4.6 on Arena Elo. Google's strategy of competitive pricing + open-weight releases is applying more pressure to Anthropic and OpenAI than any other lab.


How to Use These Rankings

For detailed guidance, see our guide to choosing an LLM in 2026. The quick version:

  • Best overall: GPT-5.4 - strongest all-around profile at a reasonable price
  • Best coder: Claude Opus 4.6 - still #1 on SWE-Bench, best agentic coding performance
  • Best reasoning per dollar: Gemini 3.1 Pro - highest GPQA Diamond at the lowest flagship price
  • Best speed: Grok 4.20 - 235 tok/s with 2M context at $2/$6
  • Best open-weight: Gemma 4 31B - Apache 2.0, beats proprietary mid-tier models
  • Best value: DeepSeek V3.2 - near-frontier performance at 10-30x less cost

We update these rankings monthly. Check back in May - DeepSeek V4 should be out by then.

FAQ

Which LLM is the best overall in April 2026?

GPT-5.4 holds the top overall ranking, combining strong scores across reasoning (92.8% GPQA Diamond), coding (77.2% SWE-Bench), and real-world usability at $2.50/$15.00 per million tokens.

Is Claude Opus 4.6 still the best coding model?

Yes. Opus 4.6 leads publicly available models on SWE-Bench Verified at 80.8%. Claude Mythos Preview scores higher (93.9%) but is restricted to Project Glasswing partners.

What is the best free LLM in April 2026?

Gemma 4 31B Dense under Apache 2.0. It scores 85.2% MMLU-Pro, 84.3% GPQA Diamond, and achieves ~1452 Arena Elo - competitive with proprietary mid-tier models at zero cost.

How do open-weight models compare to proprietary ones?

Open-weight models now occupy 5 of the top 12 spots. The top open model (Gemma 4) beats the proprietary Claude Sonnet 4.6 on both human preference (Arena Elo) and GPQA Diamond.

When will DeepSeek V4 be released?

Expected late April 2026. Leaked benchmarks suggest major improvements but nothing is confirmed. V3.2 remains the current production model.


Last verified April 14, 2026

About the author

James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.