Mistral Small 4

Mistral AI's unified MoE model - 119B total parameters, 6B active per token, 128 experts, 256K context, configurable reasoning, Apache 2.0 license.

Mistral Small 4 is Mistral AI's first model to unify fast inference and deep reasoning in a single architecture. At 119 billion total parameters with only 6 billion active per token (128 experts, 4 active), it delivers the knowledge capacity of a large model at the inference cost of a small one. The Apache 2.0 license makes it fully open for commercial use, and the configurable reasoning_effort parameter lets developers toggle between speed and depth without switching models.

TL;DR

  • 119B total / 6B active MoE with 128 experts, 256K context, Apache 2.0
  • Configurable reasoning via reasoning_effort parameter (none to high)
  • 40% lower latency and 3x higher throughput vs Mistral Small 3
  • Outperforms GPT-OSS 120B on LiveCodeBench with 20% less output
  • First product of the new Mistral-NVIDIA Nemotron Coalition partnership

Key Specifications

| Specification | Details |
|---|---|
| Provider | Mistral AI |
| Model Family | Mistral Small |
| Total Parameters | 119 billion |
| Active Parameters | 6B per token (8B with embeddings) |
| Architecture | Mixture of Experts (MoE) |
| Experts | 128 total, 4 active per token |
| Context Window | 256,000 tokens |
| Multimodal | Text + image input |
| Configurable Reasoning | reasoning_effort parameter (none/high) |
| Latency | 40% lower than Small 3 |
| Throughput | 3x more requests/sec than Small 3 |
| Release Date | March 16, 2026 |
| License | Apache 2.0 |
| Open Weights | Yes (Hugging Face) |

Benchmark Performance

| Benchmark | Mistral Small 4 | GPT-OSS 120B | Notes |
|---|---|---|---|
| LiveCodeBench | Higher | Lower | Small 4 produces 20% less output |
| LCR (Logical Reasoning) | 0.72 accuracy | N/A | 1.6K chars vs Qwen's 5.8-6.1K |
| AIME 2025 | Competitive | Competitive | Comparable scores |
| Output Efficiency | 1.6K chars | N/A | 3.5-4x less than Qwen at same accuracy |

Key Capabilities

Configurable Reasoning

The defining feature. A single reasoning_effort parameter toggles between fast responses and deep chain-of-thought reasoning:

  • reasoning_effort="none" - Direct, fast responses matching Small 3.2 style
  • reasoning_effort="high" - Step-by-step reasoning equivalent to Magistral

One model, one deployment, one API. No routing between fast and reasoning models.
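As a rough sketch, a request toggling reasoning depth might look like the following. The endpoint schema, the `mistral-small-4` model identifier, and the exact placement of the `reasoning_effort` field are assumptions for illustration, not confirmed API details:

```python
# Sketch: building an OpenAI-style chat payload that toggles reasoning
# depth via a reasoning_effort field. Model name and field placement are
# hypothetical, based on the article's description.

def build_chat_request(prompt: str, reasoning_effort: str = "none") -> dict:
    """Build one chat request with the desired reasoning depth."""
    if reasoning_effort not in ("none", "high"):
        raise ValueError("reasoning_effort must be 'none' or 'high'")
    return {
        "model": "mistral-small-4",  # hypothetical model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }

# Same model, same endpoint -- only the field changes.
fast = build_chat_request("Summarize this contract.", reasoning_effort="none")
deep = build_chat_request("Prove this lemma step by step.", reasoning_effort="high")
```

The point of the design is that the two payloads differ by one field, so no router or second deployment sits between "fast" and "deep" traffic.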

128-Expert MoE

The 128-expert design (vs typical 8-16 in earlier MoE models) enables finer-grained specialization. Each token activates 4 experts, selected by the routing mechanism based on input content. The result: 119B of total knowledge capacity at 6B inference cost.
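The routing idea can be sketched in a few lines of NumPy. This is an illustrative top-4-of-128 router, not Mistral's actual implementation; the expert "FFNs" here are placeholder matrices:

```python
import numpy as np

# Illustrative top-4-of-128 expert routing. A learned linear layer scores
# all 128 experts per token; only the 4 highest-scoring experts run, so
# compute scales with 4 experts while knowledge capacity scales with 128.

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 128, 4, 64

router_w = rng.standard_normal((D, NUM_EXPERTS))  # router weights
token = rng.standard_normal(D)                    # one token's hidden state

logits = token @ router_w                         # score every expert
top4 = np.argsort(logits)[-TOP_K:]                # indices of the 4 best
weights = np.exp(logits[top4] - logits[top4].max())
weights /= weights.sum()                          # softmax over the top 4 only

# Each selected expert is a small FFN; placeholder linear maps here.
experts = rng.standard_normal((NUM_EXPERTS, D, D))
output = sum(w * (token @ experts[i]) for w, i in zip(weights, top4))
# output has shape (D,): a weighted mix of just 4 of the 128 experts
```

Only 4 of the 128 expert matrices are ever multiplied per token, which is the sense in which 119B of stored parameters costs 6B of active compute.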

Multimodal

Native text and image input support. The model processes visual inputs alongside text for document understanding, chart analysis, and multimodal reasoning tasks.
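A mixed text-and-image turn might be assembled like this, using the OpenAI-style content-parts format that many multimodal endpoints accept; whether Mistral Small 4's API uses exactly this schema is an assumption:

```python
# Sketch of one user message combining a text question with an image
# reference, in the common content-parts format. The schema is assumed,
# not confirmed for Mistral Small 4; the URL is a placeholder.

def build_vision_message(question: str, image_url: str) -> dict:
    """One user turn pairing a text question with an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vision_message(
    "What trend does this chart show?",
    "https://example.com/chart.png",  # placeholder image URL
)
```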

Pricing and Availability

| Access Point | Pricing | Status |
|---|---|---|
| Open Weights (Hugging Face) | Free | Available |
| Mistral API / AI Studio | API pricing | Available |
| NVIDIA build.nvidia.com | Free prototyping | Available |
| NVIDIA NIM | Production deployment | Available |
| vLLM, SGLang, llama.cpp | Free (local) | Supported |

Hardware Requirements

| Setup | Configuration |
|---|---|
| Minimum | 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200 |
| Recommended | 4x HGX H100, 4x HGX H200, or 2x DGX B200 |
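A back-of-envelope calculation shows why the footprint is this large: even with only 6B parameters active per token, all 119B must sit in VRAM. Assuming bf16 weights (2 bytes per parameter) and 80 GB per H100, and ignoring KV cache and activations:

```python
# Rough VRAM estimate for holding the full weight set resident.
# Assumes bf16 (2 bytes/param); ignores KV cache, activations, and
# runtime overhead, which push the practical minimum higher.

TOTAL_PARAMS = 119e9
BYTES_PER_PARAM = 2          # bf16
H100_VRAM_GB = 80            # per-GPU HBM on an HGX H100

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
min_gpus = -(-weights_gb // H100_VRAM_GB)   # ceiling division

print(f"{weights_gb:.0f} GB of weights -> at least {min_gpus:.0f} x H100")
```

Weights alone come to about 238 GB, roughly three 80 GB H100s; KV cache at 256K context plus runtime overhead is consistent with the 4x H100 minimum in the table above.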

Strengths

  • Configurable reasoning eliminates the need for separate fast/reasoning model deployments
  • 128-expert MoE provides 119B knowledge at 6B inference cost
  • Apache 2.0 license with full commercial rights
  • 256K context window
  • 40% faster and 3x higher throughput than predecessor
  • Output-efficient: achieves comparable accuracy with significantly fewer tokens
  • Backed by NVIDIA Nemotron Coalition partnership

Weaknesses

  • "Small" branding is misleading - requires 4x H100s minimum for local deployment
  • Full 119B parameter set must reside in VRAM despite only 6B active
  • Published benchmarks limited to reasoning-heavy tasks (LCR, LiveCodeBench, AIME)
  • Broader evaluations (MMLU, GPQA Diamond, multilingual) not disclosed
  • Configurable reasoning quality vs purpose-built reasoning models untested at scale

Last verified March 16, 2026
About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.