Mistral Small 4

Mistral AI's unified MoE model - 119B total parameters, 6B active per token, 128 experts, 256K context, configurable reasoning, Apache 2.0 license.

Mistral Small 4 is Mistral AI's first model to unify fast inference and deep reasoning in a single architecture. At 119 billion total parameters with only 6 billion active per token (128 experts, 4 active), it delivers the knowledge capacity of a large model at the inference cost of a small one. The Apache 2.0 license makes it fully open for commercial use, and the configurable reasoning_effort parameter lets developers toggle between speed and depth without switching models.

TL;DR

  • 119B total / 6B active MoE with 128 experts, 256K context, Apache 2.0
  • Configurable reasoning via reasoning_effort parameter (none to high)
  • 40% lower latency and 3x higher throughput vs Mistral Small 3
  • Outperforms GPT-OSS 120B on LiveCodeBench with 20% less output
  • First product of the new Mistral-NVIDIA Nemotron Coalition partnership

Key Specifications

| Specification | Details |
|---|---|
| Provider | Mistral AI |
| Model Family | Mistral Small |
| Total Parameters | 119 billion |
| Active Parameters | 6B per token (8B with embeddings) |
| Architecture | Mixture of Experts (MoE) |
| Experts | 128 total, 4 active per token |
| Context Window | 256,000 tokens |
| Multimodal | Text + image input |
| Configurable Reasoning | reasoning_effort parameter (none/high) |
| Latency | 40% lower than Small 3 |
| Throughput | 3x more requests/sec than Small 3 |
| Release Date | March 16, 2026 |
| License | Apache 2.0 |
| Open Weights | Yes (Hugging Face) |

Benchmark Performance

| Benchmark | Mistral Small 4 | GPT-OSS 120B | Notes |
|---|---|---|---|
| LiveCodeBench | Higher | Lower | Small 4 produces 20% less output |
| LCR (Logical Reasoning) | 0.72 accuracy | N/A | 1.6K chars vs Qwen's 5.8-6.1K |
| AIME 2025 | Competitive | Competitive | Comparable scores |
| Output Efficiency | 1.6K chars | N/A | 3.5-4x less than Qwen at same accuracy |

Key Capabilities

Configurable Reasoning

The defining feature. A single reasoning_effort parameter toggles between fast responses and deep chain-of-thought reasoning:

  • reasoning_effort="none" - Direct, fast responses matching Small 3.2 style
  • reasoning_effort="high" - Step-by-step reasoning equivalent to Magistral

One model, one deployment, one API. No routing between fast and reasoning models.
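As a rough sketch, a request toggling reasoning depth might look like the following. The endpoint schema, the `mistral-small-4` model identifier, and the exact placement of the `reasoning_effort` field are assumptions for illustration, not confirmed API details:

```python
# Sketch: building an OpenAI-style chat payload that toggles reasoning
# depth via a reasoning_effort field. Model name and field placement are
# hypothetical, based on the article's description.

def build_chat_request(prompt: str, reasoning_effort: str = "none") -> dict:
    """Build one chat request with the desired reasoning depth."""
    if reasoning_effort not in ("none", "high"):
        raise ValueError("reasoning_effort must be 'none' or 'high'")
    return {
        "model": "mistral-small-4",  # hypothetical model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }

# Same model, same endpoint -- only the field changes.
fast = build_chat_request("Summarize this contract.", reasoning_effort="none")
deep = build_chat_request("Prove this lemma step by step.", reasoning_effort="high")
```

The point of the design is that the two payloads differ by one field, so no router or second deployment sits between "fast" and "deep" traffic.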

128-Expert MoE

The 128-expert design (vs typical 8-16 in earlier MoE models) enables finer-grained specialization. Each token activates 4 experts, selected by the routing mechanism based on input content. The result: 119B of total knowledge capacity at 6B inference cost.
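The routing idea can be sketched in a few lines of NumPy. This is an illustrative top-4-of-128 router, not Mistral's actual implementation; the expert "FFNs" here are placeholder matrices:

```python
import numpy as np

# Illustrative top-4-of-128 expert routing. A learned linear layer scores
# all 128 experts per token; only the 4 highest-scoring experts run, so
# compute scales with 4 experts while knowledge capacity scales with 128.

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 128, 4, 64

router_w = rng.standard_normal((D, NUM_EXPERTS))  # router weights
token = rng.standard_normal(D)                    # one token's hidden state

logits = token @ router_w                         # score every expert
top4 = np.argsort(logits)[-TOP_K:]                # indices of the 4 best
weights = np.exp(logits[top4] - logits[top4].max())
weights /= weights.sum()                          # softmax over the top 4 only

# Each selected expert is a small FFN; placeholder linear maps here.
experts = rng.standard_normal((NUM_EXPERTS, D, D))
output = sum(w * (token @ experts[i]) for w, i in zip(weights, top4))
# output has shape (D,): a weighted mix of just 4 of the 128 experts
```

Only 4 of the 128 expert matrices are ever multiplied per token, which is the sense in which 119B of stored parameters costs 6B of active compute.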

Multimodal

Native text and image input support. The model processes visual inputs alongside text for document understanding, chart analysis, and multimodal reasoning tasks.
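A mixed text-and-image turn might be assembled like this, using the OpenAI-style content-parts format that many multimodal endpoints accept; whether Mistral Small 4's API uses exactly this schema is an assumption:

```python
# Sketch of one user message combining a text question with an image
# reference, in the common content-parts format. The schema is assumed,
# not confirmed for Mistral Small 4; the URL is a placeholder.

def build_vision_message(question: str, image_url: str) -> dict:
    """One user turn pairing a text question with an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vision_message(
    "What trend does this chart show?",
    "https://example.com/chart.png",  # placeholder image URL
)
```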

Pricing and Availability

| Access Point | Pricing | Status |
|---|---|---|
| Open Weights (Hugging Face) | Free | Available |
| Mistral API / AI Studio | API pricing | Available |
| NVIDIA build.nvidia.com | Free prototyping | Available |
| NVIDIA NIM | Production deployment | Available |
| vLLM, SGLang, llama.cpp | Free (local) | Supported |

Hardware Requirements

| Setup | Configuration |
|---|---|
| Minimum | 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200 |
| Recommended | 4x HGX H100, 4x HGX H200, or 2x DGX B200 |
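A back-of-envelope calculation shows why the footprint is this large: even with only 6B parameters active per token, all 119B must sit in VRAM. Assuming bf16 weights (2 bytes per parameter) and 80 GB per H100, and ignoring KV cache and activations:

```python
# Rough VRAM estimate for holding the full weight set resident.
# Assumes bf16 (2 bytes/param); ignores KV cache, activations, and
# runtime overhead, which push the practical minimum higher.

TOTAL_PARAMS = 119e9
BYTES_PER_PARAM = 2          # bf16
H100_VRAM_GB = 80            # per-GPU HBM on an HGX H100

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
min_gpus = -(-weights_gb // H100_VRAM_GB)   # ceiling division

print(f"{weights_gb:.0f} GB of weights -> at least {min_gpus:.0f} x H100")
```

Weights alone come to about 238 GB, roughly three 80 GB H100s; KV cache at 256K context plus runtime overhead is consistent with the 4x H100 minimum in the table above.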

Strengths

  • Configurable reasoning eliminates the need for separate fast/reasoning model deployments
  • 128-expert MoE provides 119B knowledge at 6B inference cost
  • Apache 2.0 license with full commercial rights
  • 256K context window
  • 40% faster and 3x higher throughput than predecessor
  • Output-efficient: achieves comparable accuracy with significantly fewer tokens
  • Backed by NVIDIA Nemotron Coalition partnership

Weaknesses

  • "Small" branding is misleading - requires 4x H100s minimum for local deployment
  • Full 119B parameter set must reside in VRAM despite only 6B active
  • Published benchmarks limited to reasoning-heavy tasks (LCR, LiveCodeBench, AIME)
  • Broader evaluations (MMLU, GPQA Diamond, multilingual) not disclosed
  • Configurable reasoning quality vs purpose-built reasoning models untested at scale

Last verified March 16, 2026
About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.