Open-Source LLM Leaderboard: February 2026

Rankings of the best open-weight and open-source large language models in February 2026, including DeepSeek V3.2, Qwen 3.5, Llama 4 Maverick, GLM-5, and Mistral 3.

The open-source AI revolution is no longer a promise. It is a reality. In February 2026, open-weight models routinely match or exceed the performance of proprietary models from just twelve months ago, and in some specialized benchmarks, they compete with the very best closed models available today. For developers, researchers, and organizations that need control over their AI infrastructure, the options have never been better.

This leaderboard ranks open-weight and open-source models exclusively, covering models whose weights are publicly available for download, fine-tuning, and self-hosting.

Open-Source LLM Rankings

| Rank | Model | Organization | Parameters | License | MMLU-Pro | GPQA Diamond | SWE-Bench Verified | Chatbot Arena Elo |
|------|-------|--------------|------------|---------|----------|--------------|--------------------|-------------------|
| 1 | DeepSeek V3.2-Speciale | DeepSeek | 685B (MoE) | MIT | 85.9% | 85.3% | 77.8% | 1361 |
| 2 | Qwen 3.5 | Alibaba | 405B (MoE) | Apache 2.0 | 84.6% | 82.1% | 62.5% | 1342 |
| 3 | Llama 4 Maverick | Meta | 402B (MoE) | Llama 4 License | 83.2% | 78.5% | 55.8% | 1320 |
| 4 | GLM-5 | Zhipu AI | 320B (MoE) | Apache 2.0 | 81.5% | 76.8% | 77.8% | 1298 |
| 5 | Mistral 3 | Mistral AI | 240B (MoE) | Apache 2.0 | 82.8% | 79.3% | 54.1% | 1315 |
| 6 | DeepSeek V3.2 | DeepSeek | 685B (MoE) | MIT | 84.1% | 83.8% | 72.4% | 1348 |
| 7 | Qwen 3 235B | Alibaba | 235B (MoE) | Apache 2.0 | 81.2% | 78.4% | 55.2% | 1305 |
| 8 | Llama 4 Scout | Meta | 109B (MoE) | Llama 4 License | 78.5% | 72.1% | 42.3% | 1278 |
| 9 | Mistral 3 Small | Mistral AI | 24B | Apache 2.0 | 74.8% | 65.3% | 38.1% | 1245 |
| 10 | Qwen 3 32B | Alibaba | 32B | Apache 2.0 | 73.5% | 64.8% | 36.5% | 1238 |
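If you want to slice these rankings yourself, the rows translate directly into plain records. A minimal sketch using the top-five entries (scores are from the table above; the record layout and helper names are illustrative, not from any published dataset):

```python
# Hypothetical sketch: leaderboard rows as plain records, filtered by
# license and re-ranked by a single benchmark. Scores mirror the table.
from dataclasses import dataclass

@dataclass
class ModelEntry:
    name: str
    license: str
    swe_bench: float  # SWE-Bench Verified, %

ENTRIES = [
    ModelEntry("DeepSeek V3.2-Speciale", "MIT", 77.8),
    ModelEntry("Qwen 3.5", "Apache 2.0", 62.5),
    ModelEntry("Llama 4 Maverick", "Llama 4 License", 55.8),
    ModelEntry("GLM-5", "Apache 2.0", 77.8),
    ModelEntry("Mistral 3", "Apache 2.0", 54.1),
]

# Keep only fully permissive licenses, then rank by coding benchmark.
permissive = [e for e in ENTRIES if e.license in ("MIT", "Apache 2.0")]
best_for_code = sorted(permissive, key=lambda e: e.swe_bench, reverse=True)
print([e.name for e in best_for_code])
# -> ['DeepSeek V3.2-Speciale', 'GLM-5', 'Qwen 3.5', 'Mistral 3']
```

DeepSeek V3.2-Speciale and GLM-5 tie at 77.8%; Python's `sorted` is stable, so the original table order breaks the tie.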

The Leaders

DeepSeek V3.2-Speciale: The Open-Source King

DeepSeek V3.2-Speciale is, by a comfortable margin, the most capable open-weight model available. Its performance on MMLU-Pro (85.9%) and GPQA Diamond (85.3%) puts it within striking distance of GPT-5.2 Pro and Claude Opus 4.6, and its 77.8% on SWE-Bench Verified ties GLM-5 for the lead across the entire field, including closed models. The MIT license means you can use it for anything, commercial or otherwise, without restriction.

The catch is size. At 685 billion parameters in a mixture-of-experts architecture, running DeepSeek V3.2-Speciale requires serious hardware. You will need multiple high-end GPUs for inference, though the MoE architecture means that only a fraction of parameters are active for any given token, keeping actual compute costs manageable.
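The MoE accounting above can be made concrete. In a top-k mixture-of-experts model, each token runs through the always-on layers plus only k routed experts. The split of 685B into shared versus expert parameters below is an assumption for illustration, not a published figure for DeepSeek V3.2-Speciale:

```python
# Minimal sketch of top-k MoE parameter accounting: only shared layers
# plus k routed experts are exercised per token. Expert counts here are
# illustrative assumptions, not DeepSeek's published architecture.

def active_params(expert_params: float, num_experts: int,
                  experts_per_token: int, shared_params: float) -> float:
    """Parameters actually used for a single token."""
    per_expert = expert_params / num_experts
    return shared_params + per_expert * experts_per_token

# Assume 645B spread across 256 routed experts, plus 40B always-on
# (attention, embeddings, shared experts), with 8 experts per token.
active = active_params(645e9, num_experts=256, experts_per_token=8,
                       shared_params=40e9)
print(f"{active / 1e9:.1f}B of 685B parameters active per token")
# -> 60.2B of 685B parameters active per token
```

So while you still need enough GPU memory to hold all 685B weights, the per-token compute is closer to that of a dense model an order of magnitude smaller.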

Qwen 3.5: The Most Permissive Frontier Model

Alibaba's Qwen 3.5 earns the second spot with a strong all-around performance profile. Its Apache 2.0 license is as permissive as it gets, and at 405 billion parameters (MoE), it is more practical to deploy than DeepSeek V3.2-Speciale while still delivering competitive performance. Qwen 3.5 is particularly strong in multilingual tasks, outperforming most competitors in Chinese, Japanese, Korean, and Arabic.

Llama 4 Maverick: Meta's Best

Llama 4 Maverick represents a significant step up from the Llama 3 generation. At 402 billion parameters with a MoE architecture, it delivers strong general-purpose performance. The Llama 4 License is more permissive than its predecessor but still includes some restrictions for very large commercial deployments (above 700 million monthly active users). For the vast majority of organizations, this is a non-issue.

GLM-5: The Coding Specialist

GLM-5 from Zhipu AI deserves special mention for its extraordinary coding performance. Its 77.8% on SWE-Bench Verified ties with DeepSeek V3.2-Speciale for the top spot on that benchmark across all models, open or closed. If your primary use case is code generation and software engineering, GLM-5 under its Apache 2.0 license is a compelling choice.

Mistral 3: European Excellence

Mistral AI continues to punch above its weight. Mistral 3 delivers performance comparable to models with nearly twice its parameter count, reflecting strong training data curation and architectural decisions. As the leading European AI lab, Mistral also offers advantages for organizations with EU data residency requirements.

Understanding Open-Source Licenses

Not all "open" models are equally open. Here is what the licenses in this leaderboard actually mean:

| License | Commercial Use | Modification | Redistribution | Notable Restrictions |
|---------|----------------|--------------|----------------|----------------------|
| MIT | Yes | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Yes | Patent grant included |
| Llama 4 License | Yes | Yes | Yes | Restrictions above 700M MAU |

MIT (used by DeepSeek) is the most permissive. You can do anything with the model weights. Apache 2.0 (used by Qwen, GLM, Mistral) is similarly permissive but includes an explicit patent grant, which some legal teams prefer. Llama 4 License is permissive for most use cases but includes a usage threshold that large social media platforms would need to negotiate separately.

The Self-Hosting Economics

One of the strongest arguments for open-weight models is cost. Running DeepSeek V3.2-Speciale on your own infrastructure costs roughly $0.028 per million input tokens, assuming hardware costs are amortized at sustained high utilization. The equivalent API call to a frontier proprietary model costs $2-15 per million input tokens. That is roughly a 70x to 500x cost difference.

Of course, self-hosting requires upfront investment in GPU hardware, engineering expertise, and operational overhead. But for organizations processing millions of tokens daily, the payback period is measured in weeks, not years.
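The arithmetic behind those claims is simple enough to sketch. The per-million-token figures come from the text above; the daily volume and hardware cost below are hypothetical assumptions chosen only to show the shape of the calculation:

```python
# Back-of-the-envelope version of the self-hosting economics. The
# $0.028 and $2-15 per-million-token figures are from the text; the
# 500M tokens/day volume and $250k cluster cost are hypothetical.

SELF_HOSTED = 0.028            # $/M input tokens, amortized
API_LOW, API_HIGH = 2.0, 15.0  # $/M input tokens, frontier APIs

ratio_low = API_LOW / SELF_HOSTED
ratio_high = API_HIGH / SELF_HOSTED
print(f"cost ratio: {ratio_low:.0f}x to {ratio_high:.0f}x")
# -> cost ratio: 71x to 536x

# Payback on a hypothetical $250k GPU cluster at 500M tokens/day:
daily_tokens_m = 500
hardware_cost = 250_000
for api_price in (API_LOW, API_HIGH):
    daily_savings = daily_tokens_m * (api_price - SELF_HOSTED)
    print(f"vs ${api_price}/M API: payback in "
          f"{hardware_cost / daily_savings:.0f} days")
```

Under these assumed numbers, payback lands in weeks when compared against high-end API pricing and stretches to months against the cheapest APIs, which is why the calculus depends heavily on your actual token volume and which service you would otherwise use.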

Smaller Models Worth Watching

Not every deployment needs a 400B+ parameter model. Mistral 3 Small (24B) and Qwen 3 32B deliver impressive performance for their size class, running comfortably on a single high-end GPU. For latency-sensitive applications or edge deployment, these smaller models offer the best balance of capability and efficiency.
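The "single high-end GPU" claim follows from simple weight-size arithmetic: parameter count times bytes per parameter. A rough sketch (model sizes are from the table above; the helper name is illustrative, and KV cache plus activations need additional headroom on top of this):

```python
# Rough weight-only VRAM estimate. Byte widths are standard precision
# levels (FP16 = 2 bytes, 4-bit = 0.5 bytes); KV cache and activation
# memory come on top of these figures.

def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """GB needed just to hold the weights: 1B params at 1 byte = 1 GB."""
    return params_billions * bytes_per_param

for name, size_b in [("Mistral 3 Small", 24), ("Qwen 3 32B", 32)]:
    print(f"{name}: {weight_vram_gb(size_b, 2.0):.0f} GB at FP16, "
          f"{weight_vram_gb(size_b, 0.5):.0f} GB at 4-bit")
```

At FP16, both models fit on one 80 GB datacenter GPU; at 4-bit quantization, they drop into the range of a single 24 GB consumer card.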

The open-source LLM ecosystem is advancing at a pace that consistently surprises even optimistic observers. Models that would have been state-of-the-art twelve months ago are now freely downloadable. The question is no longer whether open-source can compete with proprietary models, but how long before the gap closes entirely.

About the author

James is a software engineer turned AI benchmarks and tools analyst. He spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.