Best Open-Weights AI Models 2026
The definitive guide to open-weights AI models in 2026 - top picks by size tier, use case, benchmark scores, and deployment hardware. From 400B+ MoE giants to 1B edge models.

Open-weights models now compete with closed APIs across nearly every task class.
The open-weights model landscape in 2026 looks nothing like 2024. Two years ago, "open source" meant "good enough for experimentation." Today, open-weights models are powering production deployments at Fortune 500 companies, topping public leaderboards, and in several cases outperforming their closed-API counterparts at a fraction of the inference cost.
The problem is choice paralysis. There are hundreds of models on HuggingFace, a dozen active families, and an entire ecosystem of quantized variants, fine-tunes, and merged checkpoints. Which one actually runs well on your hardware? Which one should you fine-tune for your domain? Which one is genuinely better at code versus which one is just marketed that way?
This guide cuts through it. I have run benchmarks, read every model card, and cross-referenced release papers to give you the honest picture - organized by size tier so you can find what fits your constraints without wading through specs for models you cannot run.
TL;DR
- Ultra-frontier (400B+ MoE): DeepSeek V3 is the open-weights king for general tasks; Kimi K2.5 takes the agentic and coding crown
- Mid-tier (70-200B): Llama 4 Scout 109B hits the best quality/cost balance; Qwen 2.5 72B is the fine-tuning-friendly alternative
- Small (7-32B): Qwen 3.5 32B is the best single-GPU all-rounder; Gemma 3 27B is the strongest open-weights multimodal in this class
- Tiny (1-7B): Qwen 2.5 7B remains the quality-per-parameter benchmark leader for the 1-8B class; Gemma 3 4B for battery-constrained edge
- Coding: Qwen 2.5 Coder 32B is the go-to; Codestral 22B has the best fill-in-middle performance
- Reasoning/Math: QwQ-32B is the thinking-model standard for open weights; DeepSeek-R1 for maximum depth
- Multimodal: Qwen 2.5-VL 72B is the dominant open VLM; Gemma 3 27B for combined text+vision at smaller scale
Size Class Overview
Before diving into picks, here is where each tier sits in the capability and hardware landscape:
| Tier | Parameter Range | Typical Hardware | Context | Inference Cost |
|---|---|---|---|---|
| Ultra-frontier | 400B+ MoE | 4-8x H100 or A100 80GB | 128K-1M | High |
| Mid-tier | 70-200B | 1-2x H100 80GB or 4x 3090/4090 | 32K-128K | Medium |
| Small | 7-32B | Single RTX 4090/3090 or Mac Studio M3 | 32K-128K | Low |
| Tiny | 1-7B | RTX 3060 or Apple M-series laptop | 8K-32K | Very low |
For hardware-specific guidance on which models run locally, see our Home GPU LLM Leaderboard and the Small Language Model Leaderboard.
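The VRAM requirements in the table above can be sanity-checked with simple arithmetic: weight memory scales linearly with parameter count and bits per weight, plus headroom for KV cache and activations. A back-of-envelope sketch - the 20% overhead factor and effective bit-widths are rough assumptions, not guarantees:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold the weights, with ~20% headroom
    for KV cache and activations (a crude heuristic)."""
    weight_gb = params_b * bits_per_weight / 8  # params in billions -> GB
    return weight_gb * overhead

# Weight size scales linearly with precision:
#   BF16 = 16 bits, FP8 = 8 bits, Q4_K_M ~ 4.5 bits effective
for name, params, bits in [("Qwen 2.5 7B", 7.6, 4.5),
                           ("Qwen 3.5 32B", 32, 4.5),
                           ("DeepSeek V3 (FP8)", 671, 8)]:
    print(f"{name}: ~{estimate_vram_gb(params, bits):.0f} GB")
# -> ~5 GB, ~22 GB, and ~805 GB respectively
```

The estimates line up with the tiers: a Q4 32B model lands on a 24GB card, while a 671B MoE needs a multi-GPU cluster even at FP8.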
Ultra-Frontier Class (400B+ MoE / Dense)
These are the models competing directly with GPT-5, Claude Opus, and Gemini Ultra for frontier capabilities. Most are Mixture-of-Experts architectures that activate only a fraction of their parameters per token, making them more feasible to serve than their total parameter count suggests.
DeepSeek V3 - Best Overall Open-Weights Frontier Model
HuggingFace: deepseek-ai/DeepSeek-V3 | Parameters: 671B total (37B active MoE) | Context: 128K | License: DeepSeek License (permissive commercial)
DeepSeek V3 is the clearest evidence that the open-weights frontier has caught up. Trained on 14.8 trillion tokens with Multi-Token Prediction and FP8 mixed precision, V3 sits at or above GPT-4o level on most general benchmarks while being fully open weights. The MoE architecture means you are actually running 37B active parameters per forward pass - roughly half the per-token compute of dense Llama 3 70B - while drawing on the knowledge capacity of a 671B-parameter model.
The training efficiency story is remarkable: the full pre-training run cost approximately $5.5M according to DeepSeek's technical report, an order of magnitude cheaper than comparable closed-model training runs. That cost structure is why DeepSeek can offer API access at rates that undercut OpenAI significantly.
Benchmark snapshot (April 2026): MMLU-Pro 81.2, HumanEval+ 77.3, LiveCodeBench 40.5, MATH 90.2
Choose DeepSeek V3 when: You need frontier general-purpose capability with commercial rights, plan to self-host at scale, or want the highest quality model you can run on an 8x H100 cluster. The model card, technical report, and training methodology are fully disclosed.
Deployment notes: The full BF16 weights are roughly 1.3TB, which means a 16x H100 80GB multi-node cluster; the native FP8 checkpoint halves that to about 670GB, and INT4 quantization brings it near 340GB. Compatible with vLLM and SGLang.
Skip it when: You need high-concurrency serving on modest hardware - the raw scale of a 671B model still requires serious infrastructure.
Kimi K2.5 - Best for Agentic Tasks and Coding at Frontier Scale
HuggingFace: moonshotai/Kimi-K2.5 | Parameters: ~1T MoE (32B active) | Context: 128K | License: Modified MIT (commercial OK)
Kimi K2.5 is Moonshot AI's flagship open-weights release and the model that most impressed me during testing for agentic tasks. It tops our Open Source LLM Leaderboard on function-calling and multi-step tool-use metrics. The context handling - up to 128K tokens with stable attention across the full window - makes it genuinely useful for long-document reasoning and complex agentic pipelines.
The modified MIT license is meaningfully permissive - commercial use is allowed, which makes Kimi K2.5 viable for products in a way that some of its competitors are not.
Benchmark snapshot: SWE-Bench Verified 65.8%, LiveCodeBench 47.2, MMLU-Pro 79.4, Chat Arena ELO ~1340
Choose Kimi K2.5 when: You are building AI agents, need reliable tool calling, or want the best open-weights model for coding tasks at scale.
Deployment notes: The active parameter count (32B) means per-token compute is manageable; the challenge is storing all weights across nodes. Official quantized variants at INT4/FP8 are available on HuggingFace.
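Since most servers hosting Kimi K2.5 expose an OpenAI-compatible chat endpoint, tool use reduces to declaring a JSON schema and routing the model's emitted calls to local functions. A minimal sketch - the get_weather tool and its handler are hypothetical examples, not part of any real API:

```python
import json

# OpenAI-style tool schema, the format accepted by most
# OpenAI-compatible endpoints serving open-weights models.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local implementation."""
    handlers = {"get_weather": lambda city: f"Sunny in {city}"}
    fn = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return handlers[fn](**args)

# Simulated tool call, shaped like an OpenAI-compatible API response:
call = {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
print(dispatch(call))  # -> Sunny in Paris
```

The dispatch step is where agentic reliability is won or lost: validate arguments before executing, and feed the handler's result back as a tool message.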
Llama 4 Maverick - Best Frontier Model for Ecosystem Compatibility
HuggingFace: meta-llama/Llama-4-Maverick-17B-128E-Instruct | Parameters: ~400B total MoE (17B active, 128 experts) | Context: 1M (theoretical), 128K practical | License: Llama 4 Community License (permissive for most commercial uses)
Meta's Llama 4 Maverick is the most practically deployable frontier model if you care about broad framework compatibility and the Llama ecosystem toolchain. The 128-expert MoE architecture with 17B active parameters gives it strong reasoning with a lower per-token compute cost than many competitors. The 1M token context is impressive on paper - in practice, most inference servers cap practical use at 128K before quality degrades.
Maverick is the sweet spot in the Llama 4 lineup. Scout (below) is smaller and faster; Behemoth is the massive unreleased/gated frontier model. Maverick hits the middle ground that makes it practical for most production deployments.
Benchmark snapshot: MMLU-Pro 74.3, HumanEval+ 71.5, IFEval 83.1
Choose Llama 4 Maverick when: License clarity matters, you want the broadest inference framework support, or you are building on top of the Llama ecosystem toolchain.
MiniMax M2 - The Long-Context Dark Horse
HuggingFace: MiniMaxAI/MiniMax-Text-01 | Parameters: ~456B MoE | Context: 1M | License: Apache 2.0
MiniMax's release strategy has been under the radar compared to DeepSeek and Meta, but M2 (Text-01 on HuggingFace) should not be dismissed. Its Lightning Attention architecture enables genuine million-token context at lower memory cost than standard attention - something that matters enormously for document analysis, legal review, and code understanding at scale. Apache 2.0 and a clean model card make it trustworthy for commercial use.
Choose MiniMax M2 when: Your use case genuinely requires million-token context and you cannot afford proprietary API costs for that scale.
Mid-Tier (70-200B Dense or Smaller MoE)
The 70-200B range is where most teams actually deploy. These models run on 1-4 high-end GPUs, offer excellent quality, and can be fine-tuned with LoRA on commodity hardware. This is the sweet spot for building products.
Llama 4 Scout - Best Mid-Tier for Cost-Performance
HuggingFace: meta-llama/Llama-4-Scout-17B-16E-Instruct | Parameters: ~109B total MoE (17B active, 16 experts) | Context: 10M (theoretical), 64K practical | License: Llama 4 Community License
Scout is the Llama 4 model most people will actually run day-to-day. At 17B active parameters it sits between Llama 3 70B and 8B in per-token compute cost, but with significantly better quality than both due to the improved architecture and training. The 16-expert MoE design is notably more deployable than the 128-expert Maverick - with INT4 quantization, it fits comfortably on a single H100 80GB.
Benchmark snapshot: MMLU-Pro 70.1, HumanEval+ 68.3, IFEval 80.2
Runner-up: Qwen 2.5 72B is the alternative worth benchmarking for your specific use case - it trades some general reasoning for excellent multilingual performance and a more permissive fine-tuning story.
Choose Llama 4 Scout when: You want a production-ready mid-tier model with clean licensing, strong ecosystem support, and the ability to run on a single H100.
Qwen 2.5 72B - Best for Fine-Tuning and Multilingual
HuggingFace: Qwen/Qwen2.5-72B-Instruct | Parameters: 72.7B | Context: 128K | License: Qwen License (Apache 2.0-like, commercial OK)
The Qwen 2.5 72B family from Alibaba has earned a strong reputation in the open-weights community for one reason: it fine-tunes exceptionally well. The model architecture, training data quality (particularly for code and math), and long context stability make it a popular base for domain-specific fine-tunes. The entire Qwen 2.5 family from 0.5B to 72B uses the same architecture, making it easy to develop a fine-tune at small scale and validate before moving to the 72B checkpoint.
Benchmark snapshot: MMLU-Pro 71.1, HumanEval+ 72.4, MATH 82.7
Choose Qwen 2.5 72B when: You are planning domain fine-tuning, need strong multilingual coverage (29+ languages), or want a model with excellent code and math performance in the 70B class.
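To see why LoRA makes a 72B base practical, count the trainable parameters: each adapted projection gains only two low-rank factors. A back-of-envelope sketch - the hidden size and layer count are assumed from a typical 72B config, and MLP targets and grouped-query attention are ignored for simplicity:

```python
def lora_trainable_params(hidden: int, layers: int, rank: int,
                          targets_per_layer: int = 4) -> int:
    """Trainable parameters for LoRA adapters on square h x h projections
    (simplified: ignores GQA head sizes and MLP targets)."""
    # Each adapted h x h matrix gains A (h x r) and B (r x h): 2*h*r params.
    return layers * targets_per_layer * 2 * hidden * rank

# Rough numbers for a Qwen2.5-72B-like config (hidden 8192, 80 layers - assumed)
full = 72_700_000_000
adapter = lora_trainable_params(hidden=8192, layers=80, rank=16)
print(f"{adapter:,} trainable ({adapter / full:.3%} of the base model)")
# -> 83,886,080 trainable (0.115% of the base model)
```

Training roughly 0.1% of the weights is what lets a 72B fine-tune run on commodity GPUs with QLoRA, and why the consistent Qwen architecture across sizes matters: the same adapter recipe transfers from the 7B checkpoint up to 72B.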
Gemma 3 27B - Best Open Mid-Tier for Multimodal
HuggingFace: google/gemma-3-27b-it | Parameters: 27B | Context: 128K | License: Gemma License (commercial OK)
Google's Gemma 3 27B is the smallest model on this list with genuine multimodal capability - it handles images, documents, and text in a single model. The instruction-tuned variant (gemma-3-27b-it) is the deployment target; the base model is for fine-tuning. At 27B parameters with native image understanding, it competes with much larger text-only models on combined text+vision tasks.
Benchmark snapshot: MMLU-Pro 67.5, HumanEval+ 64.8, MMMU 65.3 (vision)
Choose Gemma 3 27B when: You need multimodal capability on a budget, want a single model for both text and image tasks, or are building for devices where a separate vision encoder is too expensive.
Mistral Large 3 - Best for European Deployment
HuggingFace: mistralai/Mistral-Large-Instruct-2411 | Parameters: 123B | Context: 128K | License: Mistral Research License (free research; paid commercial)
Mistral Large 3 is the right choice when you need European data sovereignty and GDPR-aligned deployment. Mistral AI is a French company with EU infrastructure and transparent training data sourcing. The model quality is competitive with Llama 4 Scout on most tasks, though the commercial licensing requires paid tiers for production use.
Dark horse: For teams that can work with the licensing terms, Mistral's instruction-following quality and low hallucination rate on factual tasks are genuinely impressive.
Small Class (7-32B)
This is the tier that matters most for local deployment, developer workstations, and latency-sensitive production serving. Quantized, the 7-32B range fits on a single RTX 4090 or an M3 Max Mac and runs fast enough for real-time applications.
See our Home GPU LLM Leaderboard for specific throughput numbers by GPU.
Qwen 3.5 32B - Best Single-GPU All-Rounder
Parameters: 32B | Context: 32K | License: Qwen License (Apache 2.0-like)
Qwen 3.5 32B sits at the top of the small class for general instruction-following quality. At Q4_K_M quantization, it runs at approximately 20-25 tok/s on an RTX 4090 - fast enough for interactive use. The model scores exceptionally well on reasoning, code generation, and instruction following relative to its parameter count. For a single-GPU deployment that needs to handle diverse tasks, this is the default choice.
Best for: General-purpose single-GPU production serving, RAG pipelines, agentic workflows where you control the hardware.
Phi-4 - Best for Reasoning at Small Scale
HuggingFace: microsoft/phi-4 | Parameters: 14B | Context: 16K | License: MIT
Microsoft's Phi-4 continues the family's tradition of punching above its weight class. At 14B parameters, it matches or beats many 32B-70B models on reasoning and math benchmarks, attributable to Microsoft's focus on high-quality synthetic training data. The MIT license is the most permissive in this class - no restrictions, no commercial limits, no compliance overhead.
Benchmark snapshot: MMLU-Pro 74.4 (notably high for 14B), MATH 84.2, HumanEval+ 66.3
Choose Phi-4 when: You need strong reasoning in a small package, want MIT licensing, or are deploying to hardware where 14B is the ceiling.
Caveat: The 16K context window is a real limitation. If your workload involves long documents or extended conversations, Qwen 3.5 32B's larger context is worth the trade-off.
Gemma 3 27B - Repeated Pick for Multimodal at This Tier
As noted in the mid-tier section, Gemma 3 27B technically fits here too. The multimodal angle makes it particularly compelling if you want a single model for text and image tasks on a workstation with 2x RTX 4090s.
Mistral Small 3.2 - Best for European Commercial Inference at Small Scale
HuggingFace: mistralai/Mistral-Small-3.2-24B-Instruct-2506 | Parameters: 24B | Context: 128K | License: Apache 2.0
Mistral Small 3.2 is noteworthy for being one of the only models in the 20-30B range with Apache 2.0 and 128K context. The combination matters: you get long-document handling at a parameter scale that fits on a single A100 40GB (quantized) or a dual 3090/4090 setup. Quality is solid but not quite at Qwen 3.5 32B level for most tasks.
Choose Mistral Small 3.2 when: 128K context is a hard requirement in the small model tier, or you need clean Apache 2.0 for legal reasons.
Tiny Class (1-7B)
The sub-8B tier is where things get genuinely surprising in 2026. Models in this range that would have felt like toys two years ago now handle multi-step reasoning, code completion, and structured extraction with real accuracy. The benchmark leader at this scale has changed the calculus for edge deployment and battery-constrained use cases.
Qwen 2.5 7B - Best Quality Under 8B
HuggingFace: Qwen/Qwen2.5-7B-Instruct | Parameters: 7.6B | Context: 128K | License: Qwen License (Apache 2.0-like)
Qwen 2.5 7B is the answer to "what is the best 7B model?" for most use cases. It outperforms Llama 3 8B on virtually every benchmark, handles 128K tokens (remarkable at this scale), and offers multilingual coverage sufficient for most production needs. The Q4_K_M quantization runs at ~60-80 tok/s on an RTX 3060 12GB - fast enough for real-time applications on genuinely budget hardware.
Benchmark snapshot: MMLU-Pro 56.3, HumanEval+ 58.7, MATH 74.6
Choose Qwen 2.5 7B when: 8B parameter limit is a hard constraint, budget is the primary concern, or you need a fast local model for agentic routing tasks.
Gemma 3 4B - Best for Battery-Constrained Edge
HuggingFace: google/gemma-3-4b-it | Parameters: 4B | Context: 128K | License: Gemma License
Gemma 3 4B is where Google's training efficiency really shows. At 4 billion parameters with multimodal capability (text + images), it runs on devices with 4-6GB VRAM with headroom for context. The instruction-tuned version (gemma-3-4b-it) handles simple tasks well and is the right choice for Raspberry Pi clusters, mobile inference via MLC-LLM, or any scenario where power and memory are the binding constraints.
Choose Gemma 3 4B when: Running on-device, in-browser via WebGPU, or on embedded hardware where Qwen 2.5 7B is too large.
Phi-3.5-Mini - Best for Reasoning Under 4GB VRAM
HuggingFace: microsoft/phi-3.5-mini-instruct | Parameters: 3.8B | Context: 128K | License: MIT
Phi-3.5-Mini is the model I reach for when someone says "I have an 8GB GPU and need something that can reason." At 3.8B parameters with 128K context and Microsoft's quality-data training approach, it handles multi-step reasoning tasks well above what its size suggests. MIT license, no commercial restrictions.
Best for: Reasoning-heavy tasks under 4GB VRAM, local inference on integrated graphics or older discrete GPUs.
SmolLM3 - Best for Ultra-Low Memory Deployment
HuggingFace: HuggingFaceTB/SmolLM3-3B | Parameters: 3B | Context: 8K | License: Apache 2.0
HuggingFace's SmolLM3 is purpose-built for ultra-constrained deployment. It runs in under 2GB VRAM quantized to Q4, targets on-device use, and is genuinely good at instruction following for its size. The Apache 2.0 license and HuggingFace's transparency about training data sourcing make it the right pick for edge applications where trust in the model provenance matters.
Best for: Raspberry Pi, microcontrollers with NPU acceleration, in-browser deployment, and any scenario where 2GB is the memory ceiling.
Specialized Models
Coding: Qwen 2.5 Coder 32B
HuggingFace: Qwen/Qwen2.5-Coder-32B-Instruct | Parameters: 32B | Context: 128K | License: Qwen License
The coding specialist leaderboard is a crowded field, but Qwen 2.5 Coder 32B consistently tops it. It was explicitly trained on code with emphasis on long-context code understanding, repository-level completion, and debugging. On HumanEval+ it scores 87.2 - higher than GPT-4-Turbo at the time these numbers were published - while being fully open weights.
The 128K context window is critical for coding use cases: entire repositories, test suites, and dependencies can fit in context for analysis.
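Whether a repository actually fits is quick to estimate from its on-disk size, since source code averages a few bytes per token. A crude heuristic sketch - the 3.5 bytes/token figure is an assumption, and real tokenizers vary by language and code style:

```python
def fits_in_context(total_source_bytes: int, context_tokens: int = 128_000,
                    bytes_per_token: float = 3.5) -> bool:
    """Crude check: does a codebase fit in the context window?
    ~3.5 bytes/token is a rough average for source code (assumption)."""
    return total_source_bytes / bytes_per_token <= context_tokens

# A 400 KB repository comfortably fits in 128K tokens; 1 MB does not.
print(fits_in_context(400_000))    # -> True
print(fits_in_context(1_000_000))  # -> False
```

For repositories over the limit, the usual fallback is retrieval over files plus the 128K window for the working set, rather than truncating blindly.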
Runner-up for fill-in-middle: Codestral 22B (mistralai/Codestral-22B-v0.1) has the best fill-in-middle (FIM) performance in the open-weights space, making it the preferred choice for IDE autocomplete and copilot-style applications.
Dark horse for coding: DeepSeek Coder 33B (deepseek-ai/deepseek-coder-33b-instruct) is the strongest option if you need SWE-Bench-level repository repair capability rather than raw HumanEval score.
For full coding benchmark comparisons, see the Coding Benchmarks Leaderboard.
Reasoning and Math: QwQ-32B
HuggingFace: Qwen/QwQ-32B | Parameters: 32B | Context: 32K | License: Apache 2.0
QwQ-32B is the open-weights thinking model - it produces extended chain-of-thought reasoning before answering, similar to OpenAI's o1-series but fully open and running locally. For math olympiad problems, multi-step logical reasoning, and complex code analysis, the extended reasoning significantly improves accuracy over standard instruction-following models of the same size.
Benchmark snapshot: AIME 2025 50.0% (pass@1), MATH 94.5, LiveCodeBench 63.4 - figures that rival much larger closed models.
For maximum reasoning depth: DeepSeek-R1 (deepseek-ai/DeepSeek-R1) is the 671B MoE thinking model, producing the highest-quality reasoning in the open-weights space. The infrastructure requirements are the same as DeepSeek V3.
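In practice, thinking models interleave a reasoning block with the final answer, so user-facing output usually needs post-processing. A sketch assuming the <think>...</think> tag convention that QwQ-style models emit - verify the exact delimiters for your checkpoint and chat template:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> reasoning blocks (the tag format
    QwQ-style thinking models emit) so only the answer reaches users."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Let x = 3... check: 3*3 + 1 = 10. Yes.</think>The answer is 10."
print(strip_thinking(raw))  # -> The answer is 10.
```

Keep the raw reasoning in logs if you can afford the storage: it is the most useful signal when debugging why the model got an answer wrong.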
See the Reasoning Benchmarks Leaderboard and Math Olympiad AI Leaderboard for full numbers.
Multilingual: Aya 23 35B and Qwen 2.5 72B
For multilingual applications, the best open-weights choice depends on your language mix. Cohere's Aya family (CohereForAI/aya-23-35B) is the most explicitly multilingual model in the open-weights space, with training data and evaluations covering 100+ languages including many low-resource languages that other models handle poorly. It is the right choice for African languages, South Asian languages, and any language outside the top-20 training corpus languages.
For languages within mainstream training data (Chinese, Japanese, Korean, Arabic, European languages), Qwen 2.5 72B outperforms Aya on both quality and context length. BLOOM (bigscience/bloom) remains relevant as a historically important multilingual model with 46-language training data, but 2025-era models have surpassed it on performance metrics.
For multilingual benchmark details, see the Multilingual LLM Leaderboard.
Multimodal (Vision-Language): Qwen 2.5-VL 72B
HuggingFace: Qwen/Qwen2.5-VL-72B-Instruct | Parameters: 72B | Context: 128K (images + text) | License: Qwen License
Qwen 2.5-VL 72B is the dominant open-weights Vision-Language Model as of Q1 2026. It tops the OpenVLM leaderboard across document understanding, chart interpretation, and general visual question answering. The dynamic resolution handling means it processes images at their native resolution without losing detail, which is critical for OCR and document tasks.
For the smaller deployment tier, Gemma 3 27B (covered above) handles combined vision+text tasks well at a fraction of the memory cost.
For vision model benchmark details, see the Vision-Language Benchmarks Leaderboard.
Smaller open VLM: Qwen/Qwen2.5-VL-7B-Instruct is the 7B version and remains the strongest open VLM under 10B parameters.
InternVL3 for research: OpenGVLab/InternVL3-8B from Shanghai AI Lab remains a strong open-research alternative with well-documented evaluation methodology.
Speech: Whisper Large v3 and SenseVoice
Whisper Large-v3: openai/whisper-large-v3 | Parameters: 1.5B | License: MIT
OpenAI's Whisper Large-v3 remains the most reliable open-weights speech-to-text model for production use. It handles roughly 100 languages, processes audio in 30-second windows, and is well supported across inference frameworks, including faster-whisper (a CTranslate2 reimplementation that runs up to 4x faster on GPU).
For low-latency and multilingual streaming: SenseVoice Small (FunAudioLLM/SenseVoiceSmall) from Alibaba's FunAudio team is the right choice when latency matters more than maximum accuracy. It includes language identification, emotion recognition, and audio event detection in a single model under 250M parameters.
For full-duplex speech: The Kyutai Moshi model (github.com/kyutai-labs/moshi) is the most interesting speech frontier model - a real-time, full-duplex speech model capable of simultaneous listening and speaking. It is early-stage but the most advanced open research in this direction.
Benchmark Snapshot Table
Numbers from MMLU-Pro, HumanEval+, SWE-Bench Verified, and Chat Arena ELO as tracked on the Open Source LLM Leaderboard. All scores are April 2026; frontier model benchmarks shift quickly.
| Model | Size (Active B) | MMLU-Pro | HumanEval+ | SWE-Bench V | Arena ELO |
|---|---|---|---|---|---|
| DeepSeek V3 | 671B (37B) | 81.2 | 77.3 | 49.2 | ~1360 |
| Kimi K2.5 | ~1T MoE (32B) | 79.4 | - | 65.8 | ~1340 |
| Llama 4 Maverick | ~400B (17B) | 74.3 | 71.5 | 38.5 | ~1310 |
| Llama 4 Scout | ~109B (17B) | 70.1 | 68.3 | 34.1 | ~1270 |
| Qwen 2.5 72B | 72B | 71.1 | 72.4 | 23.6 | ~1270 |
| Gemma 3 27B | 27B | 67.5 | 64.8 | - | ~1230 |
| Qwen 3.5 32B | 32B | ~70.0 | ~71.0 | - | ~1240 |
| Phi-4 | 14B | 74.4 | 66.3 | - | ~1190 |
| QwQ-32B | 32B | 79.1 | 83.5 | - | ~1290 |
| Qwen 2.5 7B | 7.6B | 56.3 | 58.7 | - | ~1130 |
| Qwen 2.5 Coder 32B | 32B | - | 87.2 | - | - |
Note: "-" indicates benchmark not officially reported or not applicable to model class. Arena ELO figures are approximate and shift daily.
License Matrix
Understanding licenses is not optional. Running the wrong model in a commercial product can create legal exposure.
| Model Family | License Type | Commercial? | Distillation OK? | Fine-tune OK? |
|---|---|---|---|---|
| Llama 4 | Llama 4 Community | Yes, with terms | Yes | Yes |
| DeepSeek V3/R1 | DeepSeek License | Yes | No (restricted) | Yes |
| Qwen 2.5 / 3.5 | Qwen License | Yes | Yes | Yes |
| Gemma 3 | Gemma License | Yes | Restricted | Yes |
| Phi-4 | MIT | Yes | Yes | Yes |
| Mistral Small 3.2 | Apache 2.0 | Yes | Yes | Yes |
| Mistral Large 3 | Mistral Research | Research only | No | Research only |
| MiniMax M2 | Apache 2.0 | Yes | Yes | Yes |
| SmolLM3 | Apache 2.0 | Yes | Yes | Yes |
| Aya 23 | CC-BY-NC | No commercial | No | Research only |
| BLOOM | RAIL License | Limited | No | Limited |
Always read the full license text before production deployment. "Commercial OK" means broadly permitted - specific terms may apply.
"Best for X" Decision Matrix
Best for local inference on a 24GB GPU: Qwen 3.5 32B at Q4_K_M (fits in ~18-20GB VRAM). Runner-up: Gemma 3 27B for multimodal tasks.
Best for coding on a single GPU: Qwen 2.5 Coder 32B is the clear winner. For smaller hardware, Qwen 2.5 Coder 7B handles most IDE autocomplete tasks adequately.
Best for multilingual (major languages): Qwen 2.5 72B for quality; Mistral Small 3.2 for the 24GB GPU tier. For low-resource languages, Aya 23 35B.
Best under 4GB VRAM: Phi-3.5-Mini at Q4 fits in ~2.5GB and handles reasoning well. Gemma 3 4B for multimodal in 3-4GB.
Best for fine-tuning: Qwen 2.5 72B for full fine-tuning; the consistent architecture across the 0.5B-72B family makes it the most practical choice for iterative fine-tuning workflows. Use LoRA/QLoRA to start small and scale.
Best for agentic tool use: Kimi K2.5 at frontier scale; Qwen 3.5 32B for single-GPU deployments. Both have reliable function-calling schemas and tool-use behavior.
Best for reasoning/math on a single GPU: QwQ-32B. The extended thinking mode takes more tokens but meaningfully improves accuracy on complex multi-step problems.
Best for speech: Whisper Large-v3 for accuracy; SenseVoice Small for low-latency streaming.
Deployment Notes - Inference Framework Compatibility
All major open-weights models in 2026 run on the standard inference stack. Here is where each framework adds value for these specific models. Full server comparison is in our inference servers guide.
| Framework | Best for These Models | Notes |
|---|---|---|
| vLLM | Llama 4, Qwen 2.5, Gemma 3, Phi-4, Mistral | 200+ architectures; safest default for new deployments |
| SGLang | DeepSeek V3/R1, Kimi K2.5, agentic workloads | Best prefix caching; 3.1x faster on DeepSeek V3 |
| llama.cpp / Ollama | Any model up to 70B | GGUF quantization; CPU-capable; best for local dev |
| MLX | Any model on Apple Silicon | Native Apple M-series; best tokens/watt on Mac |
| TGI | Legacy deployments only | Maintenance mode since Dec 2025 - do not use for new projects |
| TensorRT-LLM | Fixed high-volume deployments | Maximum throughput; 28-minute compile per model |
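As a concrete starting point, a single-GPU vLLM deployment is usually one command exposing an OpenAI-compatible API. A hedged sketch - the model choice and flag values are illustrative, so tune them to your hardware:

```shell
# Serve Qwen2.5-7B-Instruct on port 8000 with an OpenAI-compatible API.
# --max-model-len trades context length for KV-cache headroom;
# --gpu-memory-utilization caps how much VRAM vLLM reserves.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Any OpenAI client pointed at http://localhost:8000/v1 can then talk to the model without code changes.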
For home GPU builds and workstation recommendations, see Best AI Home Workstations 2026.
FAQ
What does "open weights" actually mean vs. "open source"?
Open weights means the model parameters (the trained weights) are publicly available and can be downloaded, run locally, and modified. It does not automatically mean the training code, training data, or license terms are fully open. True open source (OSI definition) would require open training data and code. Most of the models in this guide are open weights with permissive-but-not-fully-OSI licenses. Apache 2.0 models (Mistral Small 3.2, SmolLM3, MiniMax M2, QwQ-32B) are the closest to genuine open source.
Can I run DeepSeek V3 locally?
Yes, but you need serious hardware. The full BF16 model requires approximately 1.3TB of GPU memory - a 16x H100 or A100 80GB cluster. The native FP8 weights are roughly 670GB, and INT4 quantization brings them near 340GB. For most teams, the practical answer is: use the API for production, run a smaller quantized version locally for development. Our Home GPU LLM Leaderboard tracks what actually runs on consumer hardware.
Which open-weights model is best for building a RAG pipeline?
For the retrieval-augmented generation use case, the key factors are long-context stability and instruction following. Qwen 2.5 72B or Llama 4 Scout are the recommended starting points for mid-tier hardware; DeepSeek V3 or Kimi K2.5 if you have the infrastructure. Pair with SGLang for serving - the prefix caching is significant for RAG workloads where many queries share the same document context.
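Prefix caching only pays off if the shared document text actually forms a common prompt prefix, so put the retrieved context before the per-query question. A minimal message-assembly sketch - framework-agnostic, and the function name is our own:

```python
def build_rag_prompt(system: str, document: str, question: str) -> list[dict]:
    """Assemble chat messages so the long shared document comes first -
    the layout that lets prefix caching reuse work across queries."""
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{document}"},
        {"role": "user", "content": question},
    ]

doc = "Qwen 2.5 72B supports a 128K context window."
msgs = build_rag_prompt("Answer only from the context.", doc,
                        "What context length does Qwen 2.5 72B support?")
print(msgs[0]["content"])
```

Many queries against the same document then share an identical system-message prefix, which is exactly what SGLang's radix cache exploits.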
How often do these rankings change?
Very frequently. New models release every few weeks; benchmark scores shift as evaluation methodology improves; quantization quality improves with llama.cpp and vLLM updates. Check the Open Source LLM Leaderboard and the Small Language Model Leaderboard for current scores. This article is verified as of April 19, 2026.
Is open weights always cheaper than APIs?
At scale, yes. For small-to-medium volume, the infrastructure and maintenance cost often exceeds API pricing. The break-even point is roughly 50-100 million tokens per month for GPT-4-class tasks - below that, pay-per-token APIs are usually cheaper. Above that threshold, self-hosted open-weights models become significantly more cost-effective.
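The break-even point above is one line of arithmetic: divide the monthly self-hosting cost by the API's per-token price. A sketch with illustrative prices - both numbers are assumptions, so plug in your own:

```python
def breakeven_tokens_per_month(api_price_per_m: float,
                               selfhost_monthly_cost: float) -> float:
    """Monthly token volume above which self-hosting is cheaper.
    Ignores engineering time; prices are illustrative only."""
    return selfhost_monthly_cost / api_price_per_m * 1_000_000

# e.g. $10 per 1M tokens via API vs ~$800/month for a rented GPU box (assumed)
print(f"{breakeven_tokens_per_month(10.0, 800.0):,.0f} tokens/month")
# -> 80,000,000 tokens/month
```

Remember to fold engineering and on-call time into the self-hosting cost; that overhead is usually what pushes the real break-even toward the high end of the 50-100M range.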
What is the best open-weights model for fine-tuning on proprietary data?
Qwen 2.5 72B is my top recommendation for the base checkpoint. The consistent architecture across the full family enables iterative experiments at 7B before scaling to 72B. Use QLoRA for memory-efficient fine-tuning. Llama 4 Scout is the alternative if license terms or ecosystem compatibility are constraints. Check the AI Fine-Tuning Platforms guide for the tooling side.
Sources
- DeepSeek V3 HuggingFace
- Kimi K2.5 HuggingFace
- Llama 4 Maverick HuggingFace
- Llama 4 Scout HuggingFace
- Qwen 2.5 72B HuggingFace
- Gemma 3 27B HuggingFace
- Mistral Large Instruct 2411 HuggingFace
- Mistral Small 3.2 HuggingFace
- Qwen 2.5 7B HuggingFace
- Gemma 3 4B HuggingFace
- Phi-4 HuggingFace
- Phi-3.5-Mini HuggingFace
- SmolLM3-3B HuggingFace
- SmolLM2-1.7B HuggingFace
- QwQ-32B HuggingFace
- DeepSeek-R1 HuggingFace
- Qwen 2.5 Coder 32B HuggingFace
- Codestral 22B HuggingFace
- DeepSeek Coder 33B HuggingFace
- Aya 23 35B HuggingFace
- BLOOM HuggingFace
- Qwen 2.5-VL 72B HuggingFace
- Qwen 2.5-VL 7B HuggingFace
- InternVL3-8B HuggingFace
- Whisper Large-v3 HuggingFace
- SenseVoice Small HuggingFace
- Moshi GitHub
- MiniMax Text-01 HuggingFace
✓ Last verified April 19, 2026
