Overall LLM Rankings: April 2026

Comprehensive ranking of the top large language models in April 2026, combining reasoning, coding, knowledge, human preference, and cost-adjusted value across 12 frontier and open-weight models. Updated with Claude Opus 4.7 and Qwen 3.6.

Two months since our February rankings, the landscape has shifted more than any comparable period in the past year. Gemini 3.1 Pro dethroned GPT-5.2 Pro on GPQA Diamond. GPT-5.4 arrived and took the overall crown. Claude Opus 4.6 climbed to the top of Chatbot Arena - then Anthropic shipped Opus 4.7 on April 16 with 3x resolution vision, the new xhigh effort level, and 70% on CursorBench (vs. 58% for 4.6). Alibaba answered hours later with Qwen 3.6-35B-A3B - a 3B-active MoE that scores 73.4% on SWE-bench Verified under Apache 2.0. Gemma 4 crashed the open-weight party with scores that embarrass models five times its parameter count. And Anthropic quietly published Mythos Preview benchmarks that blow past everything on this list - but you can't use it, so it doesn't count.

Here's where everything stands.

The April 2026 Overall Rankings

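| Rank | Model | Provider | GPQA Diamond | SWE-Bench Verified | Arena Elo | Price (In/Out per 1M) | Open? |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 92.8% | 77.2% | ~1484 | $2.50 / $15.00 | No |
| 2 | Claude Opus 4.7 | Anthropic | TBD | TBD* | TBD | $5.00 / $25.00 | No |
| 3 | Gemini 3.1 Pro | Google | 94.3% | 80.6% | ~1500 | $2.00 / $12.00 | No |
| 4 | Grok 4.20 | xAI | ~88% | N/P | ~1493 | $2.00 / $6.00 | No |
| 5 | GPT-5.2 Pro | OpenAI | 93.2% | 80.0% | ~1402 | $10.00 / $30.00 | No |
| 6 | Gemma 4 31B | Google | 84.3% | N/P | ~1452 | Free (Apache 2.0) | Yes |
| 7 | Claude Sonnet 4.6 | Anthropic | 74.1% | 79.6% | ~1438 | $3.00 / $15.00 | No |
| 8 | Qwen 3.6-35B-A3B | Alibaba | 86.0% | 73.4% | TBD | Free (Apache 2.0) | Yes |
| 9 | Mistral Large 3 | Mistral AI | 43.9% | N/P | ~1418 | $0.50 / $1.50 | Yes |
| 10 | DeepSeek V3.2 | DeepSeek | 82.4% | 73.1% | ~1361 | $0.28 / $0.42 | Yes |
| 11 | Grok 4 Heavy | xAI | 88.4% | N/P | ~1375 | $3.00 / $15.00 | No |
| 12 | Llama 4 Maverick | Meta | 69.8% | 55.8% | ~1320 | Free (Llama License) | Yes |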

Opus 4.7 benchmarks: Anthropic's launch announcement did not publish GPQA Diamond, SWE-Bench Verified, or Arena Elo. Published third-party scores: 70% CursorBench (Cursor), 0.813 Finance Agent (vs. 0.767 for 4.6), 90.9% BigLaw Bench (Harvey), +13% on GitHub AI 93-task coding, 3x more Rakuten-SWE-Bench tasks resolved versus 4.6. Rank 2 reflects 4.7 as the current production successor to 4.6 (which held #2 with 91.3% / 80.8% / ~1504).

N/P = Not Published by provider. TBD = new release, benchmark not yet published at the time of writing. Arena Elo from lmarena.ai as of April 2026. Pricing reflects standard API tiers.


What Changed Since February

The new #1: GPT-5.4

GPT-5.4 landed on March 5 and immediately reshuffled the rankings. Its 77.2% on SWE-Bench Verified is a 19-point jump from GPT-5.2's 58.2%. GPQA Diamond comes in at 92.8% - not the highest (Gemini 3.1 Pro holds that at 94.3%), but paired with 75.0% on OSWorld-Verified (beating the 72.4% human baseline) and a 73.3% ARC-AGI-2 score, the all-around profile is the strongest of any publicly available model.

At $2.50/$15.00 per million tokens, it also costs a quarter of GPT-5.2 Pro's input price and half its output price - OpenAI's clearest signal yet that frontier capability is moving down the price curve.
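How large the saving is in practice depends on the mix of input and output tokens. The following is a minimal sketch, using only the list prices from the ranking table; the function names and the two example workloads are hypothetical and chosen purely for illustration.

```python
# List prices in USD per 1M tokens (input, output), from the ranking table.
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.2 Pro": (10.00, 30.00),
}


def workload_cost(model, input_tokens, output_tokens):
    """Return the USD cost of a workload at the model's list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def cost_ratio(input_tokens, output_tokens):
    """How many times more GPT-5.2 Pro costs than GPT-5.4 for the same workload."""
    pro = workload_cost("GPT-5.2 Pro", input_tokens, output_tokens)
    base = workload_cost("GPT-5.4", input_tokens, output_tokens)
    return pro / base


# Hypothetical workloads: input-heavy (long prompts, short answers)
# versus output-heavy (short prompts, long generations).
print(round(cost_ratio(10_000_000, 100_000), 2))   # about 3.89
print(round(cost_ratio(100_000, 10_000_000), 2))   # about 2.0
```

The ratio always lands between 2x and 4x: input-heavy workloads approach the 4x input-price gap, output-heavy ones approach the 2x output-price gap.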

Gemini 3.1 Pro enters the top three

Gemini 3.1 Pro now holds the highest GPQA Diamond score of any publicly available model at 94.3% and the highest ARC-AGI-2 score at 77.1% (2.5x its predecessor). Its SWE-Bench Verified scores range from 78.8% (independent evaluation) to 80.6% (Google's claims), putting it in the same tier as Opus 4.6. At $2.00/$12.00, it offers the best price-to-reasoning ratio among frontier models by a significant margin.
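One way to make "price-to-reasoning ratio" concrete is GPQA Diamond points per blended dollar. The sketch below is an illustration rather than the method behind these rankings: the metric, the function names, and the 75/25 input-to-output token mix are all assumptions. It covers only the proprietary models with a published GPQA Diamond score in the table, so Claude Opus 4.7 (TBD) and Grok 4.20 (an estimate) are left out.

```python
# (GPQA Diamond %, input price, output price) with prices in USD per 1M tokens,
# taken from the ranking table. Only proprietary models with published GPQA scores.
MODELS = {
    "GPT-5.4": (92.8, 2.50, 15.00),
    "Gemini 3.1 Pro": (94.3, 2.00, 12.00),
    "GPT-5.2 Pro": (93.2, 10.00, 30.00),
    "Claude Sonnet 4.6": (74.1, 3.00, 15.00),
    "Grok 4 Heavy": (88.4, 3.00, 15.00),
}


def blended_price(price_in, price_out, input_share=0.75):
    """Weighted price per 1M tokens; input_share is an assumed token mix."""
    return input_share * price_in + (1.0 - input_share) * price_out


def gpqa_per_dollar(name, input_share=0.75):
    """Hypothetical value metric: GPQA Diamond points per blended dollar."""
    gpqa, price_in, price_out = MODELS[name]
    return gpqa / blended_price(price_in, price_out, input_share)


# Rank models from best to worst value under the assumed mix.
ranked = sorted(MODELS, key=gpqa_per_dollar, reverse=True)
for name in ranked:
    print(f"{name}: {gpqa_per_dollar(name):.2f} GPQA points per blended dollar")
```

Under this assumed mix Gemini 3.1 Pro comes out first and GPT-5.2 Pro last. A different token mix changes the absolute values, so the numbers are best read as a rough ordering rather than a precise score.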

Claude Opus 4.7 replaces 4.6 at the top of Anthropic's lineup

Opus 4.6 climbed from an Arena Elo of ~1398 in February to ~1504 in April - the highest of any model in human preference voting. It also led SWE-Bench Verified at 80.8% among models with consistent independent verification. Then Anthropic shipped Opus 4.7 on April 16, same $5/$25 pricing, positioned as a drop-in successor.

What's different in 4.7 is the kind of numbers Anthropic chose to emphasize. Instead of the usual GPQA Diamond / SWE-Bench Verified refresh, the launch led with third-party results: 70% on CursorBench (versus 58% for 4.6), 3x more tasks resolved on Rakuten-SWE-Bench, 0.813 on Finance Agent (up from 0.767), and 90.9% on Harvey's BigLaw Bench. Vision jumps to 3x the input resolution (up to ~3.75 megapixels), a new xhigh effort level sits between high and max, and task budgets give long-running agent sessions explicit cost control.

The trade-off: standard public benchmarks weren't part of the announcement, so the head-to-head comparison with GPT-5.4 and Gemini 3.1 Pro is still pending independent evaluation. We'll update the rankings as Arena Elo, GPQA, and SWE-Bench numbers come in. In the meantime, 4.7 ranks #2 on the "current production model to use" logic, not on head-to-head benchmark parity.

Qwen 3.6-35B-A3B ships a 3B-active agentic coder under Apache 2.0

Hours after the Opus 4.7 launch, Alibaba's Qwen team shipped Qwen 3.6-35B-A3B - a sparse MoE with 35B total and 3B active parameters, Apache 2.0. It replaces Qwen 3.5 in our rankings with numbers that matter: 73.4% on SWE-bench Verified (+3.4 over 3.5), 51.5% on Terminal-Bench 2.0 (+11 points), 86.0% on GPQA Diamond, and 92.7% on AIME 2026. The Q4_K_XL quantization fits in 22.4 GB - runnable on a single RTX 4090. For agentic coding in the open-weight tier, this is the new high-water mark.
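The memory claim can be sanity-checked with simple arithmetic. This is a rough sketch, assuming decimal gigabytes and the RTX 4090's 24 GB of memory; it ignores the KV cache and runtime overhead, which also need room alongside the weights.

```python
# Figures from the article; the byte convention is an assumption.
TOTAL_PARAMS = 35e9      # 35B total parameters (all experts must be resident)
QUANT_SIZE_GB = 22.4     # reported size of the Q4_K_XL quantization
GPU_VRAM_GB = 24.0       # RTX 4090 memory
BYTES_PER_GB = 1e9       # assumption: decimal gigabytes


def bits_per_weight(size_gb, n_params):
    """Average number of bits stored per parameter in the quantized file."""
    return size_gb * BYTES_PER_GB * 8 / n_params


def headroom_gb(size_gb, vram_gb):
    """Memory left over for KV cache and activations after loading weights."""
    return vram_gb - size_gb


print(f"{bits_per_weight(QUANT_SIZE_GB, TOTAL_PARAMS):.2f} bits per weight")
print(f"{headroom_gb(QUANT_SIZE_GB, GPU_VRAM_GB):.1f} GB of headroom")
```

That works out to roughly 5.1 bits per weight, with about 1.6 GB left on the card. The fit is therefore tight, and long contexts may need a smaller quantization or partial offloading.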

Grok 4.20: fast, cheap, and hard to benchmark

Grok 4.20 is the ranking's biggest uncertainty. xAI hasn't published GPQA Diamond, SWE-Bench, or MMLU-Pro scores for this model specifically. What we know: it posts ~1493 on Chatbot Arena (neck and neck with Gemini 3.1 Pro), outputs at 234.9 tokens/second (fastest among flagships), has a 2M token context window (largest among flagships), and costs just $2/$6. Our ~88% GPQA estimate comes from the earlier Grok 4 scores - until xAI publishes benchmarks, the ranking carries an asterisk.

Gemma 4: the open-weight disruptor

Gemma 4 shipped April 2 with four variants under Apache 2.0. The 31B Dense hits 85.2% on MMLU-Pro, 84.3% on GPQA Diamond, 89.2% on AIME 2026, and an Arena Elo of ~1452 - placing it above Sonnet 4.6 in human preference despite being a fraction of the size. At $0.14/$0.40 via OpenRouter (or free self-hosted), it's the strongest argument yet that the open-weight tier has caught the proprietary middle class.

DeepSeek V3.2: still the price king, still waiting for V4

DeepSeek V3.2 holds its position as the best value on the board. At $0.28 input / $0.42 output per million tokens (or $0.028 on cache hits), nothing comes close for the performance delivered. V4 is expected in late April but hasn't officially launched. Leaked benchmarks suggest major jumps, but until they're confirmed, V3.2 remains DeepSeek's production entry.


The Ghost Rankings: Models You Can't Use

Two models would top this leaderboard if they were publicly available:

Claude Mythos Preview: 93.9% SWE-Bench Verified, 94.6% GPQA Diamond, 82.0% Terminal-Bench 2.0, 64.7% Humanity's Last Exam with tools. These numbers are Anthropic self-reported and independently unverifiable since the model is restricted to Project Glasswing partners at $25/$125 per million tokens. If these scores hold under independent evaluation, Mythos would be the clear #1 by every metric.

DeepSeek V4 (unreleased): Leaked benchmarks claim ~81% SWE-bench Verified, 92.8% MMLU, 99.4% AIME 2026. None are confirmed. Expected late April.


Benchmark Reliability Notes

Several caveats for this edition:

SWE-Bench Verified scores vary by scaffolding. GPT-5.4 scores 77.2% per OpenAI and 74.9% per independent evaluators. GPT-5.2 Pro ranges from 65.8% to 80.0% depending on the agent setup. We report provider-claimed numbers with independent figures noted where available.

Chatbot Arena Elo has shifted substantially. Our February Elo figures were snapshots from mid-February. The April figures reflect two months of additional voting. Opus 4.6's jump from 1398 to ~1504 reflects genuine preference shifts as more users evaluated it, not a model update.
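To read Elo gaps in terms of head-to-head preference, a difference can be converted into an expected win rate. The sketch below assumes the standard Elo logistic curve with a 400-point scale. The leaderboard's exact fitting procedure is not described here, so treat the outputs as approximations; the function name is hypothetical.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Expected share of votes for model A over model B under the
    standard Elo logistic curve (an assumption about the leaderboard)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Approximate April ratings from the ranking table.
gemini_31_pro = 1500
grok_420 = 1493
gpt_54 = 1484
llama_4_maverick = 1320

# A 7-point gap is close to a coin flip.
print(f"{expected_win_rate(gemini_31_pro, grok_420):.3f}")

# A 164-point gap is a clear preference.
print(f"{expected_win_rate(gpt_54, llama_4_maverick):.3f}")
```

Under that assumption, the few-point differences among the top models in the table amount to roughly a 51/49 split in votes, while gaps above 100 points correspond to a clear majority.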

MMLU-Pro is becoming less useful. Multiple frontier models now cluster within about 4 points of each other (85-89%), and several providers have stopped reporting it altogether. We've dropped it from the main ranking table in favor of GPQA Diamond, which still shows meaningful separation.

Grok 4.20 benchmarks are largely absent. xAI relies on Arena Elo and speed metrics over academic benchmarks. This makes direct comparison difficult. We've ranked it based on Arena placement and available Grok 4 scores as a proxy, but this is our lowest-confidence ranking.


The Cost-Performance Map

The output-price spread between the cheapest and most expensive paid APIs on this list is now roughly 70x: $0.42 per million tokens for DeepSeek V3.2 versus $30.00 for GPT-5.2 Pro.

| Tier | Models | Typical Use |
|---|---|---|
| Premium ($10-30 out) | GPT-5.2 Pro, Claude Opus 4.7 | Max-accuracy reasoning, complex agentic tasks |
| Flagship ($6-15 out) | GPT-5.4, Gemini 3.1 Pro, Grok 4.20, Sonnet 4.6 | Production workloads, balanced cost/quality |
| Value ($0.40-1.50 out) | DeepSeek V3.2, Mistral Large 3, Qwen 3.6 | High-volume processing, cost-sensitive apps |
| Free/OSS | Gemma 4, Qwen 3.6, Llama 4, DeepSeek V3.2 | Self-hosted, privacy-sensitive, offline |

For most production applications, the flagship tier delivers 90-95% of premium-tier performance at a third of the cost. The value tier delivers 80-85% at a tenth. The case for premium pricing gets harder to make with each update.


Coding has plateaued at the top - and labs are shifting to different benchmarks. The best publicly measured models (Opus 4.6, Gemini 3.1 Pro, GPT-5.4) all cluster between 77-81% on SWE-Bench Verified. Opus 4.7's launch sidestepped SWE-Bench Verified in favor of CursorBench (70%) and Rakuten-SWE-Bench (3x tasks resolved vs 4.6), a signal that providers increasingly see the standard harness as saturated or gameable. Mythos Preview's self-reported 93.9% suggests the ceiling is higher still, but that capability isn't commercially available.

Arena Elo is diverging from benchmarks. Grok 4.20 posts the second-highest Arena Elo in the table (~1493) despite having the least published benchmark data on this list. Mistral Large 3 scores 43.9% on GPQA Diamond but posts ~1418 Elo - above several models that score roughly twice as high. Human preference captures something benchmarks miss.

Open-weight models now hold five of twelve spots: Gemma 4, Qwen 3.6, DeepSeek V3.2, Mistral Large 3, and Llama 4 Maverick. Two years ago, no open model would have appeared on a frontier ranking. The gap hasn't closed entirely - no open model matches the top three on coding or reasoning - but for production use cases where 85% accuracy suffices, the commercial case for proprietary APIs is eroding. Qwen 3.6-35B-A3B in particular puts 73.4% SWE-bench Verified behind 3B of active parameters on a single consumer GPU.

Google is the value play. Gemini 3.1 Pro at $2/$12 matches or beats models costing 2-5x more on most benchmarks. Gemma 4 at $0.14/$0.40 (or free) rivals Sonnet 4.6 on Arena Elo. Google's strategy of competitive pricing + open-weight releases is applying more pressure to Anthropic and OpenAI than any other lab.


How to Use These Rankings

For detailed guidance, see our guide to choosing an LLM in 2026. The quick version:

  • Best overall: GPT-5.4 - strongest all-around profile at a reasonable price
  • Best coder: Claude Opus 4.7 - 70% CursorBench and 3x Rakuten-SWE-Bench vs 4.6, same $5/$25 pricing
  • Best reasoning per dollar: Gemini 3.1 Pro - highest GPQA Diamond at the lowest flagship price
  • Best speed: Grok 4.20 - 235 tok/s with 2M context at $2/$6
  • Best open-weight (dense): Gemma 4 31B - Apache 2.0, beats proprietary mid-tier models
  • Best open-weight (agentic coding): Qwen 3.6-35B-A3B - 73.4% SWE-bench Verified on a single RTX 4090
  • Best value: DeepSeek V3.2 - near-frontier performance at 10-30x less cost

We update these rankings monthly. Check back in May - DeepSeek V4 should be out by then, and independent benchmarks for Opus 4.7 should have landed.

FAQ

Which LLM is the best overall in April 2026?

GPT-5.4 holds the top overall ranking, combining strong scores across reasoning (92.8% GPQA Diamond), coding (77.2% SWE-Bench Verified), and real-world usability at $2.50/$15.00 per million tokens. Claude Opus 4.7 is the strongest Anthropic model you can use today, but its GPQA Diamond and SWE-Bench Verified numbers weren't published at launch, so the head-to-head against GPT-5.4 on standard benchmarks is still pending.

Is Claude Opus 4.7 the best coding model?

By the metrics Anthropic published at launch, yes: 70% on CursorBench (versus 58% for Opus 4.6), 3x more Rakuten-SWE-Bench tasks resolved than 4.6, and 90.9% on Harvey's BigLaw Bench. Opus 4.6 still held the top publicly verified SWE-Bench Verified score at 80.8%, and until independent evaluators post numbers for 4.7, Gemini 3.1 Pro (80.6% SWE-Bench Verified at $2/$12) remains the strongest cost-adjusted pick. Claude Mythos Preview self-reports 93.9% but is restricted to Project Glasswing partners.

What is the best free LLM in April 2026?

Two different answers for two different use cases. For raw benchmark scores from a dense model, Gemma 4 31B Dense under Apache 2.0 still leads - 85.2% MMLU-Pro, 84.3% GPQA Diamond, ~1452 Arena Elo. For agentic coding specifically, Qwen 3.6-35B-A3B (also Apache 2.0) posts 73.4% SWE-bench Verified with only 3B active parameters, making it runnable on a single RTX 4090.

How do open-weight models compare to proprietary ones?

Open-weight models now occupy 5 of the top 12 spots. Gemma 4 outperforms the proprietary Claude Sonnet 4.6 on human preference. Qwen 3.6-35B-A3B lands inside a few points of GPT-5.4 on SWE-bench Verified at 3B active parameters. The gap to the top three on coding and reasoning is still real, but shrinking each quarter.

When will DeepSeek V4 be released?

Expected late April 2026. Leaked benchmarks suggest major improvements but nothing is confirmed as of this update. V3.2 remains the current production model.


✓ Last verified April 19, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.