Best Open-Weights AI Models 2026
The definitive guide to open-weights AI models in 2026 - top picks by size tier, use case, benchmark scores, and deployment hardware. From 400B+ MoE giants to 1B edge models.

Open-weights models now compete with closed APIs across nearly every task class.
The open-weights model landscape in 2026 looks nothing like 2024. Two years ago, "open source" meant "good enough for experimentation." Today, open-weights models are powering production deployments at Fortune 500 companies, topping public leaderboards, and in several cases outperforming their closed-API counterparts at a fraction of the inference cost.
The problem is choice paralysis. There are hundreds of models on HuggingFace, a dozen active families, and an entire ecosystem of quantized variants, fine-tunes, and merged checkpoints. Which one actually runs well on your hardware? Which one should you fine-tune for your domain? Which one is genuinely better at code versus which one is just marketed that way?
This guide cuts through it. I have run benchmarks, read every model card, and cross-referenced release papers to give you the honest picture - organized by size tier so you can find what fits your constraints without wading through specs for models you cannot run.
TL;DR
- Ultra-frontier (400B+ MoE): DeepSeek V3 is the open-weights king for general tasks; Kimi K2.5 takes the agentic and coding crown
- Mid-tier (70-200B): Llama 4 Scout 109B hits the best quality/cost balance; Qwen 2.5 72B is the fine-tuning-friendly alternative
- Small (7-32B): Qwen 3.5 32B is the best single-GPU all-rounder; Gemma 3 27B is the strongest open-weights multimodal in this class
- Tiny (1-7B): Qwen 2.5 7B remains the quality-per-parameter benchmark leader for the 1-8B class; Gemma 3 4B for battery-constrained edge
- Coding: Qwen 2.5 Coder 32B is the go-to; Codestral 22B has the best fill-in-middle performance
- Reasoning/Math: QwQ-32B is the thinking-model standard for open weights; DeepSeek-R1 for maximum depth
- Multimodal: Qwen 2.5-VL 72B is the dominant open VLM; Gemma 3 27B for combined text+vision at smaller scale
Size Class Overview
Before diving into picks, here is where each tier sits in the capability and hardware landscape:
| Tier | Parameter Range | Typical Hardware | Context | Inference Cost |
|---|---|---|---|---|
| Ultra-frontier | 400B+ MoE | 4-8x H100 or A100 80GB | 128K-1M | High |
| Mid-tier | 70-200B | 1-2x H100 80GB or 4x 3090/4090 | 32K-128K | Medium |
| Small | 7-32B | Single RTX 4090/3090 or Mac Studio M3 | 32K-128K | Low |
| Tiny | 1-7B | RTX 3060 or Apple M-series laptop | 8K-32K | Very low |
For hardware-specific guidance on which models run locally, see our Home GPU LLM Leaderboard and the Small Language Model Leaderboard.
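The VRAM requirements in the table above can be sanity-checked with simple arithmetic: weight memory scales linearly with parameter count and bits per weight, plus headroom for KV cache and activations. A back-of-envelope sketch - the 20% overhead factor and effective bit-widths are rough assumptions, not guarantees:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold the weights, with ~20% headroom
    for KV cache and activations (a crude heuristic)."""
    weight_gb = params_b * bits_per_weight / 8  # params in billions -> GB
    return weight_gb * overhead

# Weight size scales linearly with precision:
#   BF16 = 16 bits, FP8 = 8 bits, Q4_K_M ~ 4.5 bits effective
for name, params, bits in [("Qwen 2.5 7B", 7.6, 4.5),
                           ("Qwen 3.5 32B", 32, 4.5),
                           ("DeepSeek V3 (FP8)", 671, 8)]:
    print(f"{name}: ~{estimate_vram_gb(params, bits):.0f} GB")
# -> ~5 GB, ~22 GB, and ~805 GB respectively
```

The estimates line up with the tiers: a Q4 32B model lands on a 24GB card, while a 671B MoE needs a multi-GPU cluster even at FP8.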
Ultra-Frontier Class (400B+ MoE / Dense)
These are the models competing directly with GPT-5, Claude Opus, and Gemini Ultra for frontier capabilities. Most are Mixture-of-Experts architectures that activate only a fraction of their parameters per token, making them more feasible to serve than their total parameter count suggests.
DeepSeek V3 - Best Overall Open-Weights Frontier Model
HuggingFace: deepseek-ai/DeepSeek-V3 | Parameters: 671B total (37B active MoE) | Context: 128K | License: DeepSeek License (permissive commercial)
DeepSeek V3 is the clearest evidence that the open-weights frontier has caught up. Trained on 14.8 trillion tokens with Multi-Token Prediction and FP8 mixed precision, V3 sits at or above GPT-4o level on most general benchmarks while being fully open weights. The MoE architecture means you are actually running 37B active parameters per forward pass - roughly half the per-token compute of dense Llama 3 70B - while drawing on the knowledge capacity of a 671B-parameter model.
The training efficiency story is remarkable: the full pre-training run cost approximately $5.5M according to DeepSeek's technical report, an order of magnitude cheaper than comparable closed-model training runs. That cost structure is why DeepSeek can offer API access at rates that undercut OpenAI significantly.
Benchmark snapshot (April 2026): MMLU-Pro 81.2, HumanEval+ 77.3, LiveCodeBench 40.5, MATH 90.2
Choose DeepSeek V3 when: You need frontier general-purpose capability with commercial rights, plan to self-host at scale, or want the highest quality model you can run on an 8x H100 cluster. The model card, technical report, and training methodology are fully disclosed.
Deployment notes: The full BF16 weights are roughly 1.3TB, which means a 16x H100 80GB multi-node cluster; the native FP8 checkpoint halves that to about 670GB, and INT4 quantization brings it near 340GB. Compatible with vLLM and SGLang.
Skip it when: You need high-concurrency serving on modest hardware - the raw scale of a 671B model still requires serious infrastructure.
Kimi K2.5 - Best for Agentic Tasks and Coding at Frontier Scale
HuggingFace: moonshotai/Kimi-K2.5 | Parameters: ~1T MoE (32B active) | Context: 128K | License: Modified MIT (commercial OK)
Kimi K2.5 is Moonshot AI's flagship open-weights release and the model that most impressed me during testing for agentic tasks. It tops our Open Source LLM Leaderboard on function-calling and multi-step tool-use metrics. The context handling - up to 128K tokens with stable attention across the full window - makes it genuinely useful for long-document reasoning and complex agentic pipelines.
The modified MIT license is meaningfully permissive - commercial use is allowed, which makes Kimi K2.5 viable for products in a way that some of its competitors are not.
Benchmark snapshot: SWE-Bench Verified 65.8%, LiveCodeBench 47.2, MMLU-Pro 79.4, Chat Arena ELO ~1340
Choose Kimi K2.5 when: You are building AI agents, need reliable tool calling, or want the best open-weights model for coding tasks at scale.
Deployment notes: The active parameter count (32B) means per-token compute is manageable; the challenge is storing all weights across nodes. Official quantized variants at INT4/FP8 are available on HuggingFace.
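Since most servers hosting Kimi K2.5 expose an OpenAI-compatible chat endpoint, tool use reduces to declaring a JSON schema and routing the model's emitted calls to local functions. A minimal sketch - the get_weather tool and its handler are hypothetical examples, not part of any real API:

```python
import json

# OpenAI-style tool schema, the format accepted by most
# OpenAI-compatible endpoints serving open-weights models.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local implementation."""
    handlers = {"get_weather": lambda city: f"Sunny in {city}"}
    fn = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return handlers[fn](**args)

# Simulated tool call, shaped like an OpenAI-compatible API response:
call = {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
print(dispatch(call))  # -> Sunny in Paris
```

The dispatch step is where agentic reliability is won or lost: validate arguments before executing, and feed the handler's result back as a tool message.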
Llama 4 Maverick - Best Frontier Model for Ecosystem Compatibility
HuggingFace: meta-llama/Llama-4-Maverick-17B-128E-Instruct | Parameters: ~400B total MoE (17B active, 128 experts) | Context: 1M (theoretical), 128K practical | License: Llama 4 Community License (permissive for most commercial uses)
Meta's Llama 4 Maverick is the most practically deployable frontier model if you care about broad framework compatibility and the Llama ecosystem toolchain. The 128-expert MoE architecture with 17B active parameters gives it strong reasoning with a lower per-token compute cost than many competitors. The 1M token context is impressive on paper - in practice, most inference servers cap practical use at 128K before quality degrades.
Maverick is the sweet spot in the Llama 4 lineup. Scout (below) is smaller and faster; Behemoth is the massive unreleased/gated frontier model. Maverick hits the middle ground that makes it practical for most production deployments.
Benchmark snapshot: MMLU-Pro 74.3, HumanEval+ 71.5, IFEval 83.1
Choose Llama 4 Maverick when: License clarity matters, you want the broadest inference framework support, or you are building on top of the Llama ecosystem toolchain.
MiniMax M2 - The Long-Context Dark Horse
HuggingFace: MiniMaxAI/MiniMax-Text-01 | Parameters: ~456B MoE | Context: 1M | License: Apache 2.0
MiniMax's release strategy has been under the radar compared to DeepSeek and Meta, but M2 (Text-01 on HuggingFace) should not be dismissed. Its Lightning Attention architecture enables genuine million-token context at lower memory cost than standard attention - something that matters enormously for document analysis, legal review, and code understanding at scale. Apache 2.0 and a clean model card make it trustworthy for commercial use.
Choose MiniMax M2 when: Your use case genuinely requires million-token context and you cannot afford proprietary API costs for that scale.
Mid-Tier (70-200B Dense or Smaller MoE)
The 70-200B range is where most teams actually deploy. These models run on 1-4 high-end GPUs, offer excellent quality, and can be fine-tuned with LoRA on commodity hardware. This is the sweet spot for building products.
Llama 4 Scout - Best Mid-Tier for Cost-Performance
HuggingFace: meta-llama/Llama-4-Scout-17B-16E-Instruct | Parameters: ~109B total MoE (17B active, 16 experts) | Context: 10M (theoretical), 64K practical | License: Llama 4 Community License
Scout is the Llama 4 model most people will actually run day-to-day. At 17B active parameters it sits between Llama 3 70B and 8B in per-token compute cost, but with significantly better quality than both due to the improved architecture and training. The 16-expert MoE design is notably more deployable than the 128-expert Maverick - with INT4 quantization, it fits comfortably on a single H100 80GB.
Benchmark snapshot: MMLU-Pro 70.1, HumanEval+ 68.3, IFEval 80.2
Runner-up: Qwen 2.5 72B is the alternative worth benchmarking for your specific use case - it trades some general reasoning for excellent multilingual performance and a more permissive fine-tuning story.
Choose Llama 4 Scout when: You want a production-ready mid-tier model with clean licensing, strong ecosystem support, and the ability to run on a single H100.
Qwen 2.5 72B - Best for Fine-Tuning and Multilingual
HuggingFace: Qwen/Qwen2.5-72B-Instruct | Parameters: 72.7B | Context: 128K | License: Qwen License (Apache 2.0-like, commercial OK)
The Qwen 2.5 72B family from Alibaba has earned a strong reputation in the open-weights community for one reason: it fine-tunes exceptionally well. The model architecture, training data quality (particularly for code and math), and long context stability make it a popular base for domain-specific fine-tunes. The entire Qwen 2.5 family from 0.5B to 72B uses the same architecture, making it easy to develop a fine-tune at small scale and validate before moving to the 72B checkpoint.
Benchmark snapshot: MMLU-Pro 71.1, HumanEval+ 72.4, MATH 82.7
Choose Qwen 2.5 72B when: You are planning domain fine-tuning, need strong multilingual coverage (29+ languages), or want a model with excellent code and math performance in the 70B class.
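To see why LoRA makes a 72B base practical, count the trainable parameters: each adapted projection gains only two low-rank factors. A back-of-envelope sketch - the hidden size and layer count are assumed from a typical 72B config, and MLP targets and grouped-query attention are ignored for simplicity:

```python
def lora_trainable_params(hidden: int, layers: int, rank: int,
                          targets_per_layer: int = 4) -> int:
    """Trainable parameters for LoRA adapters on square h x h projections
    (simplified: ignores GQA head sizes and MLP targets)."""
    # Each adapted h x h matrix gains A (h x r) and B (r x h): 2*h*r params.
    return layers * targets_per_layer * 2 * hidden * rank

# Rough numbers for a Qwen2.5-72B-like config (hidden 8192, 80 layers - assumed)
full = 72_700_000_000
adapter = lora_trainable_params(hidden=8192, layers=80, rank=16)
print(f"{adapter:,} trainable ({adapter / full:.3%} of the base model)")
# -> 83,886,080 trainable (0.115% of the base model)
```

Training roughly 0.1% of the weights is what lets a 72B fine-tune run on commodity GPUs with QLoRA, and why the consistent Qwen architecture across sizes matters: the same adapter recipe transfers from the 7B checkpoint up to 72B.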
Gemma 3 27B - Best Open Mid-Tier for Multimodal
HuggingFace: google/gemma-3-27b-it | Parameters: 27B | Context: 128K | License: Gemma License (commercial OK)
Google's Gemma 3 27B is the smallest model on this list with genuine multimodal capability - it handles images, documents, and text in a single model. The instruction-tuned variant (gemma-3-27b-it) is the deployment target; the base model is for fine-tuning. At 27B parameters with native image understanding, it competes with much larger text-only models on combined text+vision tasks.
Benchmark snapshot: MMLU-Pro 67.5, HumanEval+ 64.8, MMMU 65.3 (vision)
Choose Gemma 3 27B when: You need multimodal capability on a budget, want a single model for both text and image tasks, or are building for devices where a separate vision encoder is too expensive.
Mistral Large 3 - Best for European Deployment
HuggingFace: mistralai/Mistral-Large-Instruct-2411 | Parameters: 123B | Context: 128K | License: Mistral Research License (free research; paid commercial)
Mistral Large 3 is the right choice when you need European data sovereignty and GDPR-aligned deployment. Mistral AI is a French company with EU infrastructure and transparent training data sourcing. The model quality is competitive with Llama 4 Scout on most tasks, though the commercial licensing requires paid tiers for production use.
Dark horse: For teams that can work with the licensing terms, Mistral's instruction-following quality and low hallucination rate on factual tasks are genuinely impressive.
Small Class (7-32B)
This is the tier that matters most for local deployment, developer workstations, and latency-sensitive production serving. Quantized, the 7-32B range fits on a single RTX 4090 or an M3 Max Mac and runs fast enough for real-time applications.
See our Home GPU LLM Leaderboard for specific throughput numbers by GPU.
Qwen 3.5 32B - Best Single-GPU All-Rounder
Parameters: 32B | Context: 32K | License: Qwen License (Apache 2.0-like)
Qwen 3.5 32B sits at the top of the small class for general instruction-following quality. At Q4_K_M quantization, it runs at approximately 20-25 tok/s on an RTX 4090 - fast enough for interactive use. The model scores exceptionally well on reasoning, code generation, and instruction following relative to its parameter count. For a single-GPU deployment that needs to handle diverse tasks, this is the default choice.
Best for: General-purpose single-GPU production serving, RAG pipelines, agentic workflows where you control the hardware.
Phi-4 - Best for Reasoning at Small Scale
HuggingFace: microsoft/phi-4 | Parameters: 14B | Context: 16K | License: MIT
Microsoft's Phi-4 continues the family's tradition of punching above its weight class. At 14B parameters, it matches or beats many 32B-70B models on reasoning and math benchmarks, attributable to Microsoft's focus on high-quality synthetic training data. The MIT license is the most permissive in this class - no restrictions, no commercial limits, no compliance overhead.
Benchmark snapshot: MMLU-Pro 74.4 (notably high for 14B), MATH 84.2, HumanEval+ 66.3
Choose Phi-4 when: You need strong reasoning in a small package, want MIT licensing, or are deploying to hardware where 14B is the ceiling.
Caveat: The 16K context window is a real limitation. If your workload involves long documents or extended conversations, Qwen 3.5 32B's larger context is worth the trade-off.
Gemma 3 27B - Repeated Pick for Multimodal at This Tier
As noted in the mid-tier section, Gemma 3 27B technically fits here too. The multimodal angle makes it particularly compelling if you want a single model for text and image tasks on a workstation with 2x RTX 4090s.
Mistral Small 3.2 - Best for European Commercial Inference at Small Scale
HuggingFace: mistralai/Mistral-Small-3.2-24B-Instruct-2506 | Parameters: 24B | Context: 128K | License: Apache 2.0
Mistral Small 3.2 is noteworthy for being one of the only models in the 20-30B range with Apache 2.0 and 128K context. The combination matters: you get long-document handling at a parameter scale that fits on a single A100 40GB (quantized) or a dual 3090/4090 setup. Quality is solid but not quite at Qwen 3.5 32B level for most tasks.
Choose Mistral Small 3.2 when: 128K context is a hard requirement in the small model tier, or you need clean Apache 2.0 for legal reasons.
Tiny Class (1-7B)
The sub-8B tier is where things get genuinely surprising in 2026. Models in this range that would have felt like toys two years ago now handle multi-step reasoning, code completion, and structured extraction with real accuracy. The benchmark leader at this scale has changed the calculus for edge deployment and battery-constrained use cases.
Qwen 2.5 7B - Best Quality Under 8B
HuggingFace: Qwen/Qwen2.5-7B-Instruct | Parameters: 7.6B | Context: 128K | License: Qwen License (Apache 2.0-like)
Qwen 2.5 7B is the answer to "what is the best 7B model?" for most use cases. It outperforms Llama 3 8B on virtually every benchmark, handles 128K tokens (remarkable at this scale), and offers multilingual coverage sufficient for most production needs. The Q4_K_M quantization runs at ~60-80 tok/s on an RTX 3060 12GB - fast enough for real-time applications on genuinely budget hardware.
Benchmark snapshot: MMLU-Pro 56.3, HumanEval+ 58.7, MATH 74.6
Choose Qwen 2.5 7B when: 8B parameter limit is a hard constraint, budget is the primary concern, or you need a fast local model for agentic routing tasks.
Gemma 3 4B - Best for Battery-Constrained Edge
HuggingFace: google/gemma-3-4b-it | Parameters: 4B | Context: 128K | License: Gemma License
Gemma 3 4B is where Google's training efficiency really shows. At 4 billion parameters with multimodal capability (text + images), it runs on devices with 4-6GB VRAM with headroom for context. The instruction-tuned version (gemma-3-4b-it) handles simple tasks well and is the right choice for Raspberry Pi clusters, mobile inference via MLC-LLM, or any scenario where power and memory are the binding constraints.
Choose Gemma 3 4B when: Running on-device, in-browser via WebGPU, or on embedded hardware where Qwen 2.5 7B is too large.
Phi-3.5-Mini - Best for Reasoning Under 4GB VRAM
HuggingFace: microsoft/phi-3.5-mini-instruct | Parameters: 3.8B | Context: 128K | License: MIT
Phi-3.5-Mini is the model I reach for when someone says "I have an 8GB GPU and need something that can reason." At 3.8B parameters with 128K context and Microsoft's quality-data training approach, it handles multi-step reasoning tasks well above what its size suggests. MIT license, no commercial restrictions.
Best for: Reasoning-heavy tasks under 4GB VRAM, local inference on integrated graphics or older discrete GPUs.
SmolLM3 - Best for Ultra-Low Memory Deployment
HuggingFace: HuggingFaceTB/SmolLM3-3B | Parameters: 3B | Context: 8K | License: Apache 2.0
HuggingFace's SmolLM3 is purpose-built for ultra-constrained deployment. It runs in under 2GB VRAM quantized to Q4, targets on-device use, and is genuinely good at instruction following for its size. The Apache 2.0 license and HuggingFace's transparency about training data sourcing make it the right pick for edge applications where trust in the model provenance matters.
Best for: Raspberry Pi, microcontrollers with NPU acceleration, in-browser deployment, and any scenario where 2GB is the memory ceiling.
Specialized Models
Coding: Qwen 2.5 Coder 32B
HuggingFace: Qwen/Qwen2.5-Coder-32B-Instruct | Parameters: 32B | Context: 128K | License: Qwen License
The coding specialist leaderboard is a crowded field, but Qwen 2.5 Coder 32B consistently tops it. It was explicitly trained on code with emphasis on long-context code understanding, repository-level completion, and debugging. On HumanEval+ it scores 87.2 - higher than GPT-4-Turbo at the time these numbers were published - while being fully open weights.
The 128K context window is critical for coding use cases: entire repositories, test suites, and dependencies can fit in context for analysis.
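Whether a repository actually fits is quick to estimate from its on-disk size, since source code averages a few bytes per token. A crude heuristic sketch - the 3.5 bytes/token figure is an assumption, and real tokenizers vary by language and code style:

```python
def fits_in_context(total_source_bytes: int, context_tokens: int = 128_000,
                    bytes_per_token: float = 3.5) -> bool:
    """Crude check: does a codebase fit in the context window?
    ~3.5 bytes/token is a rough average for source code (assumption)."""
    return total_source_bytes / bytes_per_token <= context_tokens

# A 400 KB repository comfortably fits in 128K tokens; 1 MB does not.
print(fits_in_context(400_000))    # -> True
print(fits_in_context(1_000_000))  # -> False
```

For repositories over the limit, the usual fallback is retrieval over files plus the 128K window for the working set, rather than truncating blindly.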
Runner-up for fill-in-middle: Codestral 22B (mistralai/Codestral-22B-v0.1) has the best fill-in-middle (FIM) performance in the open-weights space, making it the preferred choice for IDE autocomplete and copilot-style applications.
Dark horse for coding: DeepSeek Coder 33B (deepseek-ai/deepseek-coder-33b-instruct) is the strongest option if you need SWE-Bench-level repository repair capability rather than raw HumanEval score.
For full coding benchmark comparisons, see the Coding Benchmarks Leaderboard.
Reasoning and Math: QwQ-32B
HuggingFace: Qwen/QwQ-32B | Parameters: 32B | Context: 32K | License: Apache 2.0
QwQ-32B is the open-weights thinking model - it produces extended chain-of-thought reasoning before answering, similar to OpenAI's o1-series but fully open and running locally. For math olympiad problems, multi-step logical reasoning, and complex code analysis, the extended reasoning significantly improves accuracy over standard instruction-following models of the same size.
Benchmark snapshot: AIME 2025 50.0% (pass@1), MATH 94.5, LiveCodeBench 63.4 - figures that rival much larger closed models.
For maximum reasoning depth: DeepSeek-R1 (deepseek-ai/DeepSeek-R1) is the 671B MoE thinking model, producing the highest-quality reasoning in the open-weights space. The infrastructure requirements are the same as DeepSeek V3.
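In practice, thinking models interleave a reasoning block with the final answer, so user-facing output usually needs post-processing. A sketch assuming the <think>...</think> tag convention that QwQ-style models emit - verify the exact delimiters for your checkpoint and chat template:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> reasoning blocks (the tag format
    QwQ-style thinking models emit) so only the answer reaches users."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Let x = 3... check: 3*3 + 1 = 10. Yes.</think>The answer is 10."
print(strip_thinking(raw))  # -> The answer is 10.
```

Keep the raw reasoning in logs if you can afford the storage: it is the most useful signal when debugging why the model got an answer wrong.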
See the Reasoning Benchmarks Leaderboard and Math Olympiad AI Leaderboard for full numbers.
Multilingual: Aya 23 35B and Qwen 2.5 72B
For multilingual applications, the best open-weights choice depends on your language mix. Cohere's Aya family (CohereForAI/aya-23-35B) is the most explicitly multilingual model in the open-weights space, with training data and evaluations covering 100+ languages including many low-resource languages that other models handle poorly. It is the right choice for African languages, South Asian languages, and any language outside the top-20 training corpus languages.
For languages within mainstream training data (Chinese, Japanese, Korean, Arabic, European languages), Qwen 2.5 72B outperforms Aya on both quality and context length. BLOOM (bigscience/bloom) remains relevant as a historically important multilingual model with 46-language training data, but 2025-era models have surpassed it on performance metrics.
For multilingual benchmark details, see the Multilingual LLM Leaderboard.
Multimodal (Vision-Language): Qwen 2.5-VL 72B
HuggingFace: Qwen/Qwen2.5-VL-72B-Instruct | Parameters: 72B | Context: 128K (images + text) | License: Qwen License
Qwen 2.5-VL 72B is the dominant open-weights Vision-Language Model as of Q1 2026. It tops the OpenVLM leaderboard across document understanding, chart interpretation, and general visual question answering. The dynamic resolution handling means it processes images at their native resolution without losing detail, which is critical for OCR and document tasks.
For the smaller deployment tier, Gemma 3 27B (covered above) handles combined vision+text tasks well at a fraction of the memory cost.
For vision model benchmark details, see the Vision-Language Benchmarks Leaderboard.
Smaller open VLM: Qwen/Qwen2.5-VL-7B-Instruct is the 7B version and remains the strongest open VLM under 10B parameters.
InternVL3 for research: OpenGVLab/InternVL3-8B from Shanghai AI Lab remains a strong open-research alternative with well-documented evaluation methodology.
Speech: Whisper Large v3 and SenseVoice
Whisper Large-v3: openai/whisper-large-v3 | Parameters: 1.5B | License: MIT
OpenAI's Whisper Large-v3 remains the most reliable open-weights speech-to-text model for production use. It handles roughly 100 languages, processes audio in 30-second windows, and is well supported across inference frameworks, including faster-whisper (a CTranslate2 reimplementation that runs up to 4x faster on GPU).
For low-latency and multilingual streaming: SenseVoice Small (FunAudioLLM/SenseVoiceSmall) from Alibaba's FunAudio team is the right choice when latency matters more than maximum accuracy. It includes language identification, emotion recognition, and audio event detection in a single model under 250M parameters.
For full-duplex speech: The Kyutai Moshi model (github.com/kyutai-labs/moshi) is the most interesting speech frontier model - a real-time, full-duplex speech model capable of simultaneous listening and speaking. It is early-stage but the most advanced open research in this direction.
Benchmark Snapshot Table
Numbers from MMLU-Pro, HumanEval+, SWE-Bench Verified, and Chat Arena ELO as tracked on the Open Source LLM Leaderboard. All scores are April 2026; frontier model benchmarks shift quickly.
| Model | Size (Active B) | MMLU-Pro | HumanEval+ | SWE-Bench V | Arena ELO |
|---|---|---|---|---|---|
| DeepSeek V3 | 671B (37B) | 81.2 | 77.3 | 49.2 | ~1360 |
| Kimi K2.5 | ~1T MoE (32B) | 79.4 | - | 65.8 | ~1340 |
| Llama 4 Maverick | ~400B (17B) | 74.3 | 71.5 | 38.5 | ~1310 |
| Llama 4 Scout | ~109B (17B) | 70.1 | 68.3 | 34.1 | ~1270 |
| Qwen 2.5 72B | 72B | 71.1 | 72.4 | 23.6 | ~1270 |
| Gemma 3 27B | 27B | 67.5 | 64.8 | - | ~1230 |
| Qwen 3.5 32B | 32B | ~70.0 | ~71.0 | - | ~1240 |
| Phi-4 | 14B | 74.4 | 66.3 | - | ~1190 |
| QwQ-32B | 32B | 79.1 | 83.5 | - | ~1290 |
| Qwen 2.5 7B | 7.6B | 56.3 | 58.7 | - | ~1130 |
| Qwen 2.5 Coder 32B | 32B | - | 87.2 | - | - |
Note: "-" indicates benchmark not officially reported or not applicable to model class. Arena ELO figures are approximate and shift daily.
License Matrix
Understanding licenses is not optional. Running the wrong model in a commercial product can create legal exposure.
| Model Family | License Type | Commercial? | Distillation OK? | Fine-tune OK? |
|---|---|---|---|---|
| Llama 4 | Llama 4 Community | Yes, with terms | Yes | Yes |
| DeepSeek V3/R1 | DeepSeek License | Yes | No (restricted) | Yes |
| Qwen 2.5 / 3.5 | Qwen License | Yes | Yes | Yes |
| Gemma 3 | Gemma License | Yes | Restricted | Yes |
| Phi-4 | MIT | Yes | Yes | Yes |
| Mistral Small 3.2 | Apache 2.0 | Yes | Yes | Yes |
| Mistral Large 3 | Mistral Research | Research only | No | Research only |
| MiniMax M2 | Apache 2.0 | Yes | Yes | Yes |
| SmolLM3 | Apache 2.0 | Yes | Yes | Yes |
| Aya 23 | CC-BY-NC | No commercial | No | Research only |
| BLOOM | RAIL License | Limited | No | Limited |
Always read the full license text before production deployment. "Commercial OK" means broadly permitted - specific terms may apply.
"Best for X" Decision Matrix
Best for local inference on a 24GB GPU: Qwen 3.5 32B at Q4_K_M (fits in ~18-20GB VRAM). Runner-up: Gemma 3 27B for multimodal tasks.
Best for coding on a single GPU: Qwen 2.5 Coder 32B is the clear winner. For smaller hardware, Qwen 2.5 Coder 7B handles most IDE autocomplete tasks adequately.
Best for multilingual (major languages): Qwen 2.5 72B for quality; Mistral Small 3.2 for the 24GB GPU tier. For low-resource languages, Aya 23 35B.
Best under 4GB VRAM: Phi-3.5-Mini at Q4 fits in ~2.5GB and handles reasoning well. Gemma 3 4B for multimodal in 3-4GB.
Best for fine-tuning: Qwen 2.5 72B for full fine-tuning; the consistent architecture across the 0.5B-72B family makes it the most practical choice for iterative fine-tuning workflows. Use LoRA/QLoRA to start small and scale.
Best for agentic tool use: Kimi K2.5 at frontier scale; Qwen 3.5 32B for single-GPU deployments. Both have reliable function-calling schemas and tool-use behavior.
Best for reasoning/math on a single GPU: QwQ-32B. The extended thinking mode takes more tokens but meaningfully improves accuracy on complex multi-step problems.
Best for speech: Whisper Large-v3 for accuracy; SenseVoice Small for low-latency streaming.
Deployment Notes - Inference Framework Compatibility
All major open-weights models in 2026 run on the standard inference stack. Here is where each framework adds value for these specific models. Full server comparison is in our inference servers guide.
| Framework | Best for These Models | Notes |
|---|---|---|
| vLLM | Llama 4, Qwen 2.5, Gemma 3, Phi-4, Mistral | 200+ architectures; safest default for new deployments |
| SGLang | DeepSeek V3/R1, Kimi K2.5, agentic workloads | Best prefix caching; 3.1x faster on DeepSeek V3 |
| llama.cpp / Ollama | Any model up to 70B | GGUF quantization; CPU-capable; best for local dev |
| MLX | Any model on Apple Silicon | Native Apple M-series; best tokens/watt on Mac |
| TGI | Legacy deployments only | Maintenance mode since Dec 2025 - do not use for new projects |
| TensorRT-LLM | Fixed high-volume deployments | Maximum throughput; 28-minute compile per model |
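As a concrete starting point, a single-GPU vLLM deployment is usually one command exposing an OpenAI-compatible API. A hedged sketch - the model choice and flag values are illustrative, so tune them to your hardware:

```shell
# Serve Qwen2.5-7B-Instruct on port 8000 with an OpenAI-compatible API.
# --max-model-len trades context length for KV-cache headroom;
# --gpu-memory-utilization caps how much VRAM vLLM reserves.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Any OpenAI client pointed at http://localhost:8000/v1 can then talk to the model without code changes.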
For home GPU builds and workstation recommendations, see Best AI Home Workstations 2026.
FAQ
What does "open weights" actually mean vs. "open source"?
Open weights means the model parameters (the trained weights) are publicly available and can be downloaded, run locally, and modified. It does not automatically mean the training code, training data, or license terms are fully open. True open source (OSI definition) would require open training data and code. Most of the models in this guide are open weights with permissive-but-not-fully-OSI licenses. Apache 2.0 models (Mistral Small 3.2, SmolLM3, MiniMax M2, QwQ-32B) are the closest to genuine open source.
Can I run DeepSeek V3 locally?
Yes, but you need serious hardware. The full BF16 model requires approximately 1.3TB of GPU memory - a 16x H100 or A100 80GB cluster. The native FP8 weights are roughly 670GB, and INT4 quantization brings them near 340GB. For most teams, the practical answer is: use the API for production, run a smaller quantized version locally for development. Our Home GPU LLM Leaderboard tracks what actually runs on consumer hardware.
Which open-weights model is best for building a RAG pipeline?
For the retrieval-augmented generation use case, the key factors are long-context stability and instruction following. Qwen 2.5 72B or Llama 4 Scout are the recommended starting points for mid-tier hardware; DeepSeek V3 or Kimi K2.5 if you have the infrastructure. Pair with SGLang for serving - the prefix caching is significant for RAG workloads where many queries share the same document context.
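Prefix caching only pays off if the shared document text actually forms a common prompt prefix, so put the retrieved context before the per-query question. A minimal message-assembly sketch - framework-agnostic, and the function name is our own:

```python
def build_rag_prompt(system: str, document: str, question: str) -> list[dict]:
    """Assemble chat messages so the long shared document comes first -
    the layout that lets prefix caching reuse work across queries."""
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{document}"},
        {"role": "user", "content": question},
    ]

doc = "Qwen 2.5 72B supports a 128K context window."
msgs = build_rag_prompt("Answer only from the context.", doc,
                        "What context length does Qwen 2.5 72B support?")
print(msgs[0]["content"])
```

Many queries against the same document then share an identical system-message prefix, which is exactly what SGLang's radix cache exploits.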
How often do these rankings change?
Very frequently. New models release every few weeks; benchmark scores shift as evaluation methodology improves; quantization quality improves with llama.cpp and vLLM updates. Check the Open Source LLM Leaderboard and the Small Language Model Leaderboard for current scores. This article is verified as of April 19, 2026.
Is open weights always cheaper than APIs?
At scale, yes. For small-to-medium volume, the infrastructure and maintenance cost often exceeds API pricing. The break-even point is roughly 50-100 million tokens per month for GPT-4-class tasks - below that, pay-per-token APIs are usually cheaper. Above that threshold, self-hosted open-weights models become significantly more cost-effective.
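The break-even point above is one line of arithmetic: divide the monthly self-hosting cost by the API's per-token price. A sketch with illustrative prices - both numbers are assumptions, so plug in your own:

```python
def breakeven_tokens_per_month(api_price_per_m: float,
                               selfhost_monthly_cost: float) -> float:
    """Monthly token volume above which self-hosting is cheaper.
    Ignores engineering time; prices are illustrative only."""
    return selfhost_monthly_cost / api_price_per_m * 1_000_000

# e.g. $10 per 1M tokens via API vs ~$800/month for a rented GPU box (assumed)
print(f"{breakeven_tokens_per_month(10.0, 800.0):,.0f} tokens/month")
# -> 80,000,000 tokens/month
```

Remember to fold engineering and on-call time into the self-hosting cost; that overhead is usually what pushes the real break-even toward the high end of the 50-100M range.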
What is the best open-weights model for fine-tuning on proprietary data?
Qwen 2.5 72B is my top recommendation for the base checkpoint. The consistent architecture across the full family enables iterative experiments at 7B before scaling to 72B. Use QLoRA for memory-efficient fine-tuning. Llama 4 Scout is the alternative if license terms or ecosystem compatibility are constraints. Check the AI Fine-Tuning Platforms guide for the tooling side.
Sources
- DeepSeek V3 HuggingFace
- Kimi K2.5 HuggingFace
- Llama 4 Maverick HuggingFace
- Llama 4 Scout HuggingFace
- Qwen 2.5 72B HuggingFace
- Gemma 3 27B HuggingFace
- Mistral Large Instruct 2411 HuggingFace
- Mistral Small 3.2 HuggingFace
- Qwen 2.5 7B HuggingFace
- Gemma 3 4B HuggingFace
- Phi-4 HuggingFace
- Phi-3.5-Mini HuggingFace
- SmolLM3-3B HuggingFace
- SmolLM2-1.7B HuggingFace
- QwQ-32B HuggingFace
- DeepSeek-R1 HuggingFace
- Qwen 2.5 Coder 32B HuggingFace
- Codestral 22B HuggingFace
- DeepSeek Coder 33B HuggingFace
- Aya 23 35B HuggingFace
- BLOOM HuggingFace
- Qwen 2.5-VL 72B HuggingFace
- Qwen 2.5-VL 7B HuggingFace
- InternVL3-8B HuggingFace
- Whisper Large-v3 HuggingFace
- SenseVoice Small HuggingFace
- Moshi GitHub
- MiniMax Text-01 HuggingFace
✓ Last verified April 19, 2026
