Open-Source LLM Leaderboard: April 2026

Rankings of the best open-weight and open-source large language models in April 2026, led by DeepSeek V4, Qwen 3.6-35B-A3B, GLM-5.1, and Llama 4 Maverick.

The open-source AI revolution is no longer a promise. It is a reality. In April 2026, open-weight models routinely match or exceed the performance of proprietary models from twelve months ago, and in specialized benchmarks they compete with the very best closed models available today. For developers, researchers, and organizations that need control over their AI infrastructure, the options have never been better.

This leaderboard ranks open-weight and open-source models exclusively, covering models whose weights are publicly available for download, fine-tuning, and self-hosting. It is refreshed on a monthly cadence - the current snapshot reflects releases through 22 April 2026.

Open-Source LLM Rankings

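| Rank | Model | Organization | Parameters | License | MMLU-Pro | GPQA Diamond | SWE-Bench Verified | Chatbot Arena Elo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | DeepSeek V4 | DeepSeek | ~1T (dense) | MIT | 88.2% | 87.6% | 83.7% | 1394 |
| 2 | GLM-5.1 | Zhipu AI | 355B (MoE) | Apache 2.0 | 83.1% | 80.2% | 79.5% | 1336 |
| 3 | DeepSeek V3.2-Speciale | DeepSeek | 685B (MoE) | MIT | 85.9% | 85.3% | 77.8% | 1361 |
| 4 | Qwen 3.6-35B-A3B | Alibaba | 35B / 3B active (MoE) | Apache 2.0 | 85.2% | 86.0% | 73.4% | 1339 |
| 5 | Qwen 3.5 | Alibaba | 405B (MoE) | Apache 2.0 | 84.6% | 82.1% | 62.5% | 1342 |
| 6 | Mistral 3 | Mistral AI | 240B (MoE) | Apache 2.0 | 82.8% | 79.3% | 54.1% | 1315 |
| 7 | Llama 4 Maverick | Meta | 402B (MoE) | Llama 4 License | 83.2% | 78.5% | 55.8% | 1320 |
| 8 | Qwen 3 235B | Alibaba | 235B (MoE) | Apache 2.0 | 81.2% | 78.4% | 55.2% | 1305 |
| 9 | Gemma 4 31B | Google | 31B (dense) | Gemma License | 79.1% | 73.8% | 49.3% | 1282 |
| 10 | Llama 4 Scout | Meta | 109B (MoE) | Llama 4 License | 78.5% | 72.1% | 42.3% | 1278 |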
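Because the overall rank blends several signals, it can help to re-sort the table by the one column that matters for a given workload. The sketch below does that in Python with the scores copied from the table; the `Entry` class and the `top_by` helper are illustrative names, and sorting on a single column is not a description of how this leaderboard's ranks are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    rank: int
    model: str
    mmlu_pro: float
    gpqa_diamond: float
    swe_bench_verified: float
    arena_elo: int


# Scores copied from the April 2026 table above.
LEADERBOARD = [
    Entry(1, "DeepSeek V4", 88.2, 87.6, 83.7, 1394),
    Entry(2, "GLM-5.1", 83.1, 80.2, 79.5, 1336),
    Entry(3, "DeepSeek V3.2-Speciale", 85.9, 85.3, 77.8, 1361),
    Entry(4, "Qwen 3.6-35B-A3B", 85.2, 86.0, 73.4, 1339),
    Entry(5, "Qwen 3.5", 84.6, 82.1, 62.5, 1342),
    Entry(6, "Mistral 3", 82.8, 79.3, 54.1, 1315),
    Entry(7, "Llama 4 Maverick", 83.2, 78.5, 55.8, 1320),
    Entry(8, "Qwen 3 235B", 81.2, 78.4, 55.2, 1305),
    Entry(9, "Gemma 4 31B", 79.1, 73.8, 49.3, 1282),
    Entry(10, "Llama 4 Scout", 78.5, 72.1, 42.3, 1278),
]


def top_by(metric: str, n: int = 3) -> list[str]:
    """Return the names of the n highest-scoring models on one column."""
    ordered = sorted(LEADERBOARD, key=lambda e: getattr(e, metric), reverse=True)
    return [e.model for e in ordered[:n]]


if __name__ == "__main__":
    for metric in ("swe_bench_verified", "gpqa_diamond", "arena_elo"):
        print(f"{metric}: {top_by(metric)}")
```

Sorted by SWE-Bench Verified, GLM-5.1 comes second; sorted by Chatbot Arena Elo, DeepSeek V3.2-Speciale and Qwen 3.5 take second and third instead, so the right shortlist depends on the workload.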

The Leaders

DeepSeek V4: The New Open-Source King

DeepSeek V4 took the top spot from V3.2-Speciale in March 2026 and has held it through the April refresh. The ~1T-parameter dense model posts the highest score in all four benchmark columns above, and its 83.7% on SWE-Bench Verified leads the open-source coding field by 4.2 points over second-placed GLM-5.1. The MIT license means you can use it for anything, commercial or otherwise, without restriction.

The catch is deployment. V4 is a dense model, not MoE - every parameter activates on every token. Self-hosting realistically requires multi-node tensor parallelism across a cluster of H100s or equivalents, which puts it out of reach for anyone who isn't already operating production infrastructure. For most engineering teams, the practical alternative is one of the smaller MoE entries below.
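A rough way to see why a ~1T dense model ends up multi-node is to count the GPUs needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic only. It assumes 80 GB per H100, 8 GPUs per node, and BF16 or FP8 weights; none of these are published V4 deployment figures. It also ignores KV cache and activations, so real requirements would be higher.

```python
import math


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus(params: float, bytes_per_param: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_memory_gb)


V4_PARAMS = 1e12    # "~1T" from the table, treated as exactly 1e12 (assumption)
GPUS_PER_NODE = 8   # assumption: a typical 8-GPU server

for label, bytes_per_param in (("BF16", 2.0), ("FP8", 1.0)):
    gpus = min_gpus(V4_PARAMS, bytes_per_param)
    nodes = math.ceil(gpus / GPUS_PER_NODE)
    print(f"{label}: at least {gpus} GPUs, i.e. {nodes} node(s) of {GPUS_PER_NODE}")
```

Under these assumptions even FP8 weights need 13 GPUs, which is already more than a single 8-GPU node.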

GLM-5.1: The Coding Specialist, Refreshed

GLM-5.1 from Zhipu AI is a targeted update to February's GLM-5. Its 79.5% on SWE-Bench Verified keeps it inside the top tier of open-source coding models, and the Apache 2.0 license - combined with a parameter count that fits on a single high-end inference node - makes it the coding-first model that actually lands in production stacks. For organizations whose primary use case is code generation and agentic software engineering, GLM-5.1 is frequently the best value in this table.

DeepSeek V3.2-Speciale: Still Elite, Now Cheaper

The V3.2-Speciale release that led our February 2026 leaderboard hasn't gotten worse - the models above it have simply raised the ceiling. It remains an excellent choice for teams that already built infrastructure around its MoE architecture, and with V4's release, V3.2-Speciale is now available at a noticeable discount from hosted-inference providers.

Qwen 3.6-35B-A3B: The Small-Footprint Winner

This is the release that reset expectations in its size class. At 35 billion total parameters routed through a 256-expert MoE with 3 billion active per token, Qwen 3.6-35B-A3B delivers 73.4% on SWE-Bench Verified - a score that would have required a 400B+ dense model twelve months ago. The 4-bit quantised checkpoint fits on a single RTX 4090 with working room for context. It is the clearest example yet that architectural efficiency, not raw parameter count, is where the open-source frontier is now moving.

The 262,144-token native context (extensible to 1M via YaRN) also makes it the practical choice for repo-scale agentic coding workflows.
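The single-GPU claim above can be sanity-checked with simple arithmetic. This sketch assumes exactly 4 bits per weight with no quantisation overhead (scales and zero-points add a little in practice) and the RTX 4090's 24 GB of VRAM. Note that all 35B parameters must be resident in memory even though only 3B are active per token.

```python
def quantized_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9     # total parameters, from the table
GPU_MEMORY_GB = 24.0    # RTX 4090 VRAM

weights_gb = quantized_weight_gb(TOTAL_PARAMS, bits_per_weight=4)
headroom_gb = GPU_MEMORY_GB - weights_gb

print(f"4-bit weights: {weights_gb:.1f} GB")
print(f"left for KV cache and activations: {headroom_gb:.1f} GB")
```

Roughly 17.5 GB goes to weights, leaving about 6.5 GB for the KV cache and runtime buffers. How much context that buys depends on the attention configuration, which this sketch does not model.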

Qwen 3.5: The Multilingual Workhorse

Alibaba's 405B MoE, an earlier release in the Qwen 3 series, continues to deliver strong all-around performance under a permissive Apache 2.0 license. Qwen 3.5 is especially strong in multilingual tasks, outperforming most competitors in Chinese, Japanese, Korean, and Arabic. For English-only coding workloads, Qwen 3.6-35B-A3B is now the better pick in the Alibaba lineup - but for multilingual production use, 3.5 remains the reference.

Llama 4 Maverick: Meta's Flagship Holds Steady

Llama 4 Maverick's benchmark line has drifted in exactly the way you'd expect from a 2025 release: no regressions, no leapfrogs. The 402B MoE is still a capable general-purpose model, and the Llama 4 License (with its 700M-MAU restriction) is permissive enough to be a non-issue for most organizations. For teams already standardised on the Llama tooling ecosystem, Maverick remains a solid choice. For new greenfield deployments picking an open-source stack today, the Chinese labs (DeepSeek, Alibaba, Zhipu) are shipping faster and reaching higher on the benchmarks.

Gemma 4 31B: Google Joins the Open-Weights Table

Google's Gemma 4 31B is a new entry to this leaderboard. A dense 31B model with the Gemma License (permissive, but with a use-policy carve-out rather than a pure Apache/MIT grant), it is designed for the single-GPU deployment target. Performance on general benchmarks sits a tier below the MoE heavyweights, but the dense architecture gives it more predictable inference behaviour and makes it easier to fine-tune on modest hardware.

Understanding Open-Source Licenses

Not all "open" models are equally open. Here is what the licenses in this leaderboard actually mean:

| License | Commercial Use | Modification | Redistribution | Notable Restrictions |
| --- | --- | --- | --- | --- |
| MIT | Yes | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Yes | Patent grant included |
| Llama 4 License | Yes | Yes | Yes | Restrictions above 700M MAU |
| Gemma License | Yes | Yes | Yes | Prohibited-use policy applies |

MIT (used by DeepSeek) is the most permissive. You can do anything with the model weights. Apache 2.0 (used by Qwen, GLM, Mistral) is similarly permissive but includes an explicit patent grant, which some legal teams prefer. Llama 4 License is permissive for most use cases but includes a usage threshold that large social media platforms would need to negotiate separately. Gemma License is permissive for commercial use but ties compliance to Google's prohibited-use policy, which is updated unilaterally and therefore adds a small ongoing review cost.

The Self-Hosting Economics

One of the strongest arguments for open-weight models is cost. Running DeepSeek V3.2-Speciale on your own infrastructure costs roughly $0.028 per million input tokens when amortized over reasonable use. The equivalent API call to a frontier proprietary model costs $2 to $15 per million input tokens. That's a cost difference of roughly 70x to 535x.
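The cost multiple follows directly from the two per-token prices quoted above. A minimal check, using only those figures (the `cost_ratio` helper is an illustrative name):

```python
SELF_HOSTED_PER_M = 0.028            # USD per million input tokens (figure quoted above)
API_LOW_PER_M, API_HIGH_PER_M = 2.0, 15.0  # USD per million input tokens (range quoted above)


def cost_ratio(api_price: float, self_hosted_price: float) -> float:
    """How many times more the API costs per token than self-hosting."""
    return api_price / self_hosted_price


low = cost_ratio(API_LOW_PER_M, SELF_HOSTED_PER_M)
high = cost_ratio(API_HIGH_PER_M, SELF_HOSTED_PER_M)
print(f"API is about {low:.0f}x to {high:.0f}x more expensive per input token")
```

This works out to about 71x at the low end and about 536x at the high end. It compares input-token prices only and says nothing about output tokens or utilisation.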

The calculus shifted in April 2026 with the arrival of Qwen 3.6-35B-A3B. The 3B-active MoE architecture runs at interactive latency on a single consumer GPU, and the Apache 2.0 license removes all commercial-use concerns. For organizations processing millions of tokens daily on coding or retrieval workloads, this is the first open-weight model where the self-hosting payback period is measured in days rather than months.

Self-hosting still requires upfront investment in GPU hardware, engineering expertise, and operational overhead. But the floor has moved - you no longer need a multi-node cluster to run a frontier-grade open model on your own stack.
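Whether payback lands in days or months depends entirely on volume and hardware cost, so it is worth running the numbers for your own workload. The sketch below is a deliberately simple model. Every input in the example call is a hypothetical placeholder: a 2,500 USD GPU and 50M tokens per day are assumptions, and the per-token prices are reused from the cost figures above only for illustration. It ignores engineering time, power, and operational overhead.

```python
def payback_days(
    hardware_cost_usd: float,
    tokens_per_day_millions: float,
    api_price_per_m: float,
    self_host_price_per_m: float,
) -> float:
    """Days until per-token savings cover the upfront hardware cost."""
    daily_saving = tokens_per_day_millions * (api_price_per_m - self_host_price_per_m)
    if daily_saving <= 0:
        raise ValueError("self-hosting must be cheaper per token for payback to exist")
    return hardware_cost_usd / daily_saving


# Hypothetical inputs, not measurements: adjust to your own workload.
days = payback_days(
    hardware_cost_usd=2500.0,       # assumed price of one consumer GPU
    tokens_per_day_millions=50.0,   # assumed daily volume
    api_price_per_m=2.0,            # low end of the API range quoted above
    self_host_price_per_m=0.028,    # placeholder reused from the figure above
)
print(f"payback in roughly {days:.1f} days under these assumptions")
```

Halving the daily volume doubles the payback period, which is why the result is so sensitive to real utilisation.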

Smaller Models Worth Watching

Not every deployment needs a 400B+ parameter model. Two models under 40B total parameters deserve specific callouts:

  • Qwen 3.6-35B-A3B (ranked #4 above) is the first open-weight model to score above 70% on SWE-Bench Verified on a single-GPU deployment target. For agentic coding on consumer hardware, this is the new baseline.
  • Gemma 4 31B (ranked #9) is a dense model that fine-tunes cleanly on a single H100 and produces predictable inference latency - important for applications where p99 tail latency matters more than peak throughput.

Mistral 3 Small (24B) and Qwen 3 32B from the February snapshot remain capable for latency-sensitive applications or edge deployment, but the April releases above have raised the baseline meaningfully.

What Changed Since February 2026

Three shifts matter for readers returning to this leaderboard:

  • DeepSeek V4 displaced V3.2-Speciale at the top of the table - the first time in six months the open-source leader has changed hands
  • Qwen 3.6-35B-A3B introduced a new shape of frontier model - sparse enough to run on consumer hardware, capable enough to compete with mid-tier proprietary offerings on coding
  • GLM-5.1 surpassed DeepSeek V3.2-Speciale on coding benchmarks while sitting in a more deployable parameter class

The open-source LLM ecosystem is advancing at a pace that consistently surprises even optimistic observers. Models that would have been state-of-the-art twelve months ago are now freely downloadable. The question is no longer whether open-source can compete with proprietary models, but which axis of the frontier each new release targets - and which deployment tier finally opens up as a result.

Compare against our overall LLM rankings to see how these open-source leaders sit against the frontier of proprietary models.

Last verified April 23, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.