# Mistral Small 4
Mistral AI's unified MoE model - 119B total parameters, 6B active per token, 128 experts, 256K context, configurable reasoning, Apache 2.0 license.

Mistral Small 4 is Mistral AI's first model to unify fast inference and deep reasoning in a single architecture. At 119 billion total parameters with only 6 billion active per token (128 experts, 4 active), it delivers the knowledge capacity of a large model at the inference cost of a small one. The Apache 2.0 license makes it fully open for commercial use, and the configurable reasoning_effort parameter lets developers toggle between speed and depth without switching models.
## TL;DR
- 119B total / 6B active MoE with 128 experts, 256K context, Apache 2.0
- Configurable reasoning via reasoning_effort parameter (none to high)
- 40% faster latency and 3x throughput vs Mistral Small 3
- Outperforms GPT-OSS 120B on LiveCodeBench with 20% less output
- First product of the new Mistral-NVIDIA Nemotron Coalition partnership
## Key Specifications
| Specification | Details |
|---|---|
| Provider | Mistral AI |
| Model Family | Mistral Small |
| Total Parameters | 119 billion |
| Active Parameters | 6B per token (8B with embeddings) |
| Architecture | Mixture of Experts (MoE) |
| Experts | 128 total, 4 active per token |
| Context Window | 256,000 tokens |
| Multimodal | Text + image input |
| Configurable Reasoning | reasoning_effort parameter (none/high) |
| Latency | 40% faster than Small 3 |
| Throughput | 3x more requests/sec than Small 3 |
| Release Date | March 16, 2026 |
| License | Apache 2.0 |
| Open Weights | Yes (Hugging Face) |
## Benchmark Performance
| Benchmark | Mistral Small 4 | GPT-OSS 120B | Notes |
|---|---|---|---|
| LiveCodeBench | Higher | Lower | Small 4 produces 20% less output |
| LCR (Logical Reasoning) | 0.72 accuracy | N/A | 1.6K chars vs Qwen's 5.8-6.1K |
| AIME 2025 | Competitive | Competitive | Comparable scores |
| Output Efficiency | 1.6K chars | N/A | 3.5-4x less than Qwen at same accuracy |
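The "3.5-4x" output-efficiency claim in the table can be sanity-checked with a line of arithmetic against the character counts quoted above:

```python
# Sanity check of the output-efficiency ratio quoted in the table:
# Qwen emits 5.8-6.1K chars where Mistral Small 4 emits 1.6K.
qwen_low, qwen_high, small4 = 5800, 6100, 1600
print(qwen_low / small4, qwen_high / small4)  # 3.625 4.x? -> 3.625 and 3.8125
```

The ratio works out to roughly 3.6x-3.8x, consistent with the "3.5-4x" figure.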
## Key Capabilities

### Configurable Reasoning
The defining feature. A single reasoning_effort parameter toggles between fast responses and deep chain-of-thought reasoning:
- reasoning_effort="none" - Direct, fast responses matching Small 3.2 style
- reasoning_effort="high" - Step-by-step reasoning equivalent to Magistral
One model, one deployment, one API. No routing between fast and reasoning models.
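From the client side, the toggle is just one field on the request. The sketch below builds a chat-completions payload, assuming an OpenAI-style schema; the model identifier and the exact placement of the reasoning_effort field are assumptions, not confirmed API details.

```python
# Sketch of toggling reasoning depth per request. Assumptions:
# OpenAI-style chat-completions schema, a top-level reasoning_effort
# field, and the model id "mistral-small-4" (hypothetical).
import json

def build_request(prompt: str, effort: str = "none") -> dict:
    """Build a chat-completions payload with a reasoning-effort toggle."""
    if effort not in {"none", "high"}:  # only the two documented values
        raise ValueError("reasoning_effort must be 'none' or 'high'")
    return {
        "model": "mistral-small-4",   # hypothetical model identifier
        "reasoning_effort": effort,   # assumed top-level parameter
        "messages": [{"role": "user", "content": prompt}],
    }

fast = build_request("Summarize this changelog.", effort="none")
deep = build_request("Prove the claim step by step.", effort="high")
print(json.dumps(deep, indent=2))
```

The same deployment serves both payloads; only the field value changes between a fast answer and a chain-of-thought one.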
### 128-Expert MoE
The 128-expert design (vs typical 8-16 in earlier MoE models) enables finer-grained specialization. Each token activates 4 experts, selected by the routing mechanism based on input content. The result: 119B of total knowledge capacity at 6B inference cost.
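The top-4-of-128 selection can be illustrated with a toy router. In the real model the router is a learned linear layer inside each MoE block; the logits below are random stand-ins, so everything here is illustrative only.

```python
# Toy top-4-of-128 expert routing. The real router is a trained layer
# producing per-token logits; random values stand in for them here.
import math
import random

NUM_EXPERTS, TOP_K = 128, 4
random.seed(0)

def route(token_logits):
    """Select the TOP_K highest-scoring experts, softmax-normalize weights."""
    top = sorted(range(NUM_EXPERTS), key=lambda i: token_logits[i],
                 reverse=True)[:TOP_K]
    exps = [math.exp(token_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
selected = route(logits)
print(selected)  # 4 (expert_id, weight) pairs; weights sum to 1
```

Because only 4 of 128 expert FFNs run per token, compute scales with the ~6B active parameters even though all 119B must be resident in memory.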
### Multimodal
Native text and image input support. The model processes visual inputs alongside text for document understanding, chart analysis, and multimodal reasoning tasks.
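A mixed text-plus-image request might be packaged as below, assuming OpenAI-style content parts with an inline data URL; the model's actual multimodal schema is an assumption here.

```python
# Sketch of a mixed text+image user message. Assumption: OpenAI-style
# content parts ("text" / "image_url") with a base64 data URL.
import base64

def image_message(text: str, image_bytes: bytes) -> dict:
    """Package a prompt plus one inline PNG as a single user message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

msg = image_message("What trend does this chart show?", b"\x89PNG...")
print(msg["content"][0]["text"])
```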
## Pricing and Availability
| Access Point | Pricing | Status |
|---|---|---|
| Open Weights (Hugging Face) | Free | Available |
| Mistral API / AI Studio | API pricing | Available |
| NVIDIA build.nvidia.com | Free prototyping | Available |
| NVIDIA NIM | Production deployment | Available |
| vLLM, SGLang, llama.cpp | Free (local) | Supported |
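For local serving with vLLM, an invocation might look like the following config fragment; the Hugging Face repo id is an assumption (the official id may differ), and the parallelism flag should match your GPU count.

```shell
# Hedged sketch: serve the open weights with vLLM across 4 GPUs.
# "mistralai/Mistral-Small-4" is an assumed repo id, not confirmed.
vllm serve mistralai/Mistral-Small-4 \
  --tensor-parallel-size 4 \
  --max-model-len 256000
```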
## Hardware Requirements
| Setup | Configuration |
|---|---|
| Minimum | 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200 |
| Recommended | 4x HGX H100, 4x HGX H200, or 2x DGX B200 |
## Strengths
- Configurable reasoning eliminates the need for separate fast/reasoning model deployments
- 128-expert MoE provides 119B knowledge at 6B inference cost
- Apache 2.0 license with full commercial rights
- 256K context window
- 40% faster and 3x higher throughput than predecessor
- Output-efficient: achieves comparable accuracy with significantly fewer tokens
- Backed by NVIDIA Nemotron Coalition partnership
## Weaknesses
- "Small" branding is misleading - requires 4x H100s minimum for local deployment
- Full 119B parameter set must reside in VRAM despite only 6B active
- Published benchmarks limited to reasoning-heavy tasks (LCR, LiveCodeBench, AIME)
- Broader evaluations (MMLU, GPQA Diamond, multilingual) not disclosed
- Configurable reasoning quality vs purpose-built reasoning models untested at scale
## Related Coverage
- Mistral Small 4: 128 Experts, 6B Active, Apache 2.0 - Our launch coverage
- Claude Opus 4.6 - Competing frontier model
- Codex vs Claude Code - AI coding tool comparison
✓ Last verified March 16, 2026
