Mistral Small 3.2
Mistral Small 3.2 is a 24B dense model with strong function calling, multimodal vision, and 128K context under Apache 2.0 - optimized for production tool-use pipelines and EU-compliant deployments.

TL;DR
- 24B dense model with multimodal (text + images) and robust function calling - Apache 2.0 license
- Significant improvements over 3.1: instruction following up from 82.75% to 84.78%, HumanEval Plus from 88.99% to 92.90%
- Infinite generation errors cut by nearly 40% (2.11% to 1.29%) - a real production reliability improvement
- API at $0.10/$0.30 per million tokens - one of the cheapest capable models available
- Fits in ~55 GB VRAM for self-hosting - practical for single-GPU production deployment
Overview
Mistral AI released Small 3.2 on June 20, 2025, and the changelog reads like a production reliability patch as much as a capability upgrade. Instruction accuracy goes from 82.75% to 84.78%. Infinite generation errors - where the model gets stuck in a repetition loop and never stops - drop from 2.11% to 1.29%. Function calling gets a more robust template format. These are not headline-grabbing benchmark jumps. They are the kind of improvements that matter when you are running a model in production and a 2% infinite generation rate is causing real customer-facing failures.
The 24B dense architecture means deployment is straightforward. At ~55 GB in bfloat16, it fits on a single A100 80GB or H100 with room for KV cache. With quantization, it can run on smaller GPUs. The Apache 2.0 license means no usage restrictions - you can deploy it commercially, modify it, and redistribute it. Combined with the $0.10/$0.30 API pricing on La Plateforme, Mistral Small 3.2 occupies a practical niche: the cheapest capable model with genuine function calling ability and multimodal support.
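The ~55 GB figure is roughly what the parameter count predicts. A back-of-envelope sketch - the overhead fraction below is our assumption for the vision encoder, activations, and framework buffers, not a published number:

```python
def vram_estimate_gb(params_b: float, bytes_per_param: float = 2.0,
                     overhead_frac: float = 0.15) -> float:
    """Rough VRAM needed to load model weights.

    params_b: parameter count in billions.
    bytes_per_param: 2 for bfloat16/fp16, 1 for int8, 0.5 for 4-bit.
    overhead_frac: assumed headroom (illustrative, not a measured figure).
    """
    weights_gb = params_b * bytes_per_param  # 1B params * 2 bytes ~= 2 GB
    return weights_gb * (1 + overhead_frac)

print(round(vram_estimate_gb(24), 1))       # bf16: ~55 GB, in line with the spec table
print(round(vram_estimate_gb(24, 0.5), 1))  # 4-bit quantized: well under 24 GB
```

The same arithmetic explains why quantized variants fit on consumer GPUs: at 4 bits per parameter, the weights alone drop to ~12 GB.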
Where Small 3.2 sits in the capability hierarchy is important to be honest about. MMLU at 80.50% and MMLU-Pro at 69.06% are solid for a 24B model but trail the Qwen 3.5 series and the frontier by a significant margin. GPQA Diamond at 46.13% confirms this is not a reasoning powerhouse. The model's competitive advantage is not raw intelligence - it is the combination of function calling reliability, production stability, multilingual support, multimodal capability, permissive licensing, and low cost. If you need a model that reliably calls tools, follows instructions precisely, and does not break in production - and you need it under Apache 2.0 with EU data sovereignty - Small 3.2 is the strongest option at this price point.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Mistral AI |
| Model Family | Mistral Small |
| Architecture | Dense Transformer |
| Parameters | 24B |
| Base Model | Mistral-Small-3.1-24B-Base-2503 |
| Context Window | 131,072 tokens |
| Input Modalities | Text, Images (up to 10 per prompt) |
| Output Modality | Text |
| Input Price | $0.10/M tokens |
| Output Price | $0.30/M tokens |
| Release Date | June 20, 2025 |
| License | Apache 2.0 |
| VRAM Requirement | ~55 GB (bfloat16/fp16) |
| Recommended Framework | vLLM >= 0.9.1, mistral-common >= 1.6.2 |
| Model ID | mistral-small-2506 |
Benchmark Performance
| Benchmark | Mistral Small 3.2 | Mistral Small 3.1 | Gemma 3 27B | Qwen 3.5-27B |
|---|---|---|---|---|
| MMLU | 80.50% | 79.80% | 78.60% | 82.10% |
| MMLU-Pro (5-shot CoT) | 69.06% | 66.76% | 67.50% | 74.80% |
| GPQA Diamond | 46.13% | 45.96% | 42.40% | 62.50% |
| HumanEval Plus (Pass@5) | 92.90% | 88.99% | 78.50% | 88.40% |
| MBPP Plus (Pass@5) | 78.33% | 76.80% | - | 80.10% |
| WildBench v2 (instruction following) | 65.33% | 55.60% | - | 62.80% |
| Arena Hard v2 | 43.10% | 38.50% | - | 48.20% |
| DocVQA (vision) | 94.86% | 93.20% | 85.60% | 84.20% |
| ChartQA (vision) | 87.40% | 85.10% | 76.30% | 74.80% |
| Infinite Gen Rate | 1.29% | 2.11% | - | - |
Three things stand out in this table. First, the HumanEval Plus score of 92.90% is best-in-class for this parameter range - Small 3.2 is a genuinely strong code generation model. Second, the vision benchmarks (DocVQA 94.86%, ChartQA 87.40%) are exceptional for a 24B model and outperform even Gemma 3 27B, which positions itself as a vision-first model. Third, the WildBench v2 jump from 55.60% to 65.33% is nearly a 10-point improvement in instruction following - the single largest gain in the 3.1 to 3.2 upgrade.
The GPQA Diamond score of 46.13% and Arena Hard v2 at 43.10% confirm the ceiling. Small 3.2 does not compete with larger models on hard reasoning. Qwen 3.5-27B outperforms it on MMLU-Pro (74.80% vs 69.06%) and GPQA Diamond (62.50% vs 46.13%) by wide margins. The tradeoff is that Small 3.2 offers stronger function calling, better production stability, and Apache 2.0 licensing from a European company.
Key Capabilities
Function Calling and Tool Use. The headline improvement in 3.2 is a more robust function calling template. The model supports auto tool choice through vLLM's OpenAI-compatible API, handles structured JSON output reliably, and processes multi-tool workflows. Mistral specifically optimized the function calling format for integration with existing tool-use pipelines. If you are building an agent that needs to call APIs, query databases, or interact with external services, Small 3.2's function calling is more reliable than its size would suggest. The production-grade tool-use capability at 24B parameters is the model's strongest differentiator from alternatives like Gemma 3 27B, which does not emphasize tool calling.
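As a sketch of what a tool-use request looks like against a vLLM OpenAI-compatible endpoint - the `get_weather` function and its schema are hypothetical, invented here for illustration:

```python
import json

# Hypothetical tool definition in the OpenAI-compatible schema that
# vLLM's server accepts; name, description, and parameters are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "mistral-small-2506",
    "messages": [{"role": "user", "content": "What's the weather in Lyon?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to emit a tool call
}

print(json.dumps(request_body, indent=2))
```

POST this body to the server's `/v1/chat/completions` route; with `tool_choice` set to `"auto"`, the model returns either a plain assistant message or a structured tool call for your pipeline to execute.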
Document Intelligence. The vision benchmarks tell a specific story: DocVQA at 94.86% and ChartQA at 87.40% make Small 3.2 one of the strongest document understanding models at any size. It accepts up to 10 images per prompt and handles OCR, annotation, bounding-box extraction, and document Q&A natively. For teams building document processing pipelines - invoice extraction, form analysis, chart interpretation - the combination of high accuracy and low deployment cost is compelling. The model beats Gemma 3 27B on both DocVQA (+9.3 points) and ChartQA (+11.1 points) despite being slightly smaller.
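A minimal sketch of building a multimodal message in the OpenAI-compatible format - the helper name is ours, and base64 data URIs are an assumption about your serving setup, which may accept plain URLs instead:

```python
import base64

def image_message(question: str, image_paths: list[str]) -> dict:
    """Build one OpenAI-style multimodal user message.

    Small 3.2 accepts up to 10 images per prompt; enforced here.
    Encoding images as base64 data URIs is the common pattern for
    OpenAI-compatible endpoints (exact serving setup may vary).
    """
    assert len(image_paths) <= 10, "model accepts at most 10 images per prompt"
    content = [{"type": "text", "text": question}]
    for path in image_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"role": "user", "content": content}
```

The returned dict drops straight into the `messages` list of a chat completions request, alongside any system prompt or prior turns.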
Production Reliability. The drop in infinite generation errors from 2.11% to 1.29% might seem like a minor detail, but at production scale it is meaningful. A 2% failure rate on a service handling thousands of requests per day means dozens of stuck responses that either time out or return garbage. Cutting that to 1.29% is a direct improvement to user experience and system reliability. Combined with the instruction following gains (WildBench v2 up nearly 10 points), Small 3.2 is measurably more predictable than its predecessor. In production systems, predictability is often more valuable than peak capability.
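Even at 1.29%, a production service still wants a guard against stuck generations. One common mitigation - our sketch, not something Mistral ships - is a cheap repetition check on accumulated output; the thresholds below are illustrative and should be tuned against your own traffic:

```python
def looks_stuck(text: str, ngram: int = 12, repeats: int = 4) -> bool:
    """Flag output that appears caught in a repetition loop.

    Heuristic: if the trailing `ngram`-word window already occurs at
    least `repeats` times in the text, the model is probably looping.
    Run this periodically on a streamed response and abort on True.
    """
    words = text.split()
    if len(words) < ngram * repeats:
        return False  # too short to loop at this window size
    tail = " ".join(words[-ngram:])
    return text.count(tail) >= repeats
```

Pairing a check like this with a hard `max_tokens` cap turns a silent timeout into a fast, observable failure you can retry.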
Pricing and Availability
| Tier | Input | Output |
|---|---|---|
| La Plateforme API | $0.10/M tokens | $0.30/M tokens |
Mistral Small 3.2 is available through Mistral's La Plateforme, Amazon Bedrock Marketplace, Amazon SageMaker JumpStart, and various third-party providers. The Apache 2.0 license means you can also download the weights from HuggingFace and self-host.
At $0.10/$0.30 per million tokens, the API is remarkably cheap. For comparison: Gemini 3.1 Pro costs $2/$12 (20x more on input), and even DeepSeek V3.2 at $0.28/$0.42 costs nearly 3x more on input. The catch is that those models are significantly more capable on reasoning benchmarks. Small 3.2's pricing makes it viable for high-volume, lower-complexity workloads - chatbots, document processing, function calling agents, and classification tasks where you need reliable performance at scale rather than maximum intelligence.
Self-hosting on a single A100 80GB is the natural deployment target. At roughly $2-3/hour for cloud A100 instances, the break-even versus the API depends on utilization - at a blended rate around $0.15 per million tokens, a $2.50/hour GPU only pays for itself at sustained throughput in the tens of millions of tokens per hour, so the API wins for all but the heaviest workloads. See our open source vs proprietary AI guide for detailed cost modeling.
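The break-even arithmetic can be sketched directly. The input/output split and the $2.50/hour GPU rate below are assumptions within the ranges mentioned above, not published figures:

```python
def api_cost_per_hour(tokens_per_hour: float, input_frac: float = 0.7,
                      in_price: float = 0.10, out_price: float = 0.30) -> float:
    """Hourly API spend in dollars at the $0.10/$0.30 per-M-token rates.

    input_frac is an assumed input/output token split, not a published figure.
    """
    m = tokens_per_hour / 1e6
    return m * (input_frac * in_price + (1 - input_frac) * out_price)

GPU_RATE = 2.50  # assumed cloud A100 80GB hourly rate, mid-range of $2-3/hr

# Throughput at which API spend matches the GPU rental.
breakeven = GPU_RATE / api_cost_per_hour(1e6)
print(f"break-even ~{breakeven:.1f}M tokens/hour")  # ~15.6M tokens/hour
```

Below that throughput the API is cheaper; above it, self-hosting wins - and the gap widens further if you can keep the GPU saturated with batched requests.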
Strengths
- Best-in-class function calling and tool use at the 24B parameter range
- Exceptional document vision (DocVQA 94.86%, ChartQA 87.40%) - beats larger models
- Strong code generation (HumanEval Plus 92.90%) for its size
- Infinite generation errors down from 2.11% to 1.29% versus its predecessor - real production reliability gain
- Apache 2.0 license with European data sovereignty advantages
- $0.10/$0.30 API pricing - among the cheapest capable models available
- Fits on a single A100 80GB for self-hosted deployment
Weaknesses
- GPQA Diamond (46.13%) and MMLU-Pro (69.06%) trail Qwen 3.5-27B by significant margins
- Arena Hard v2 (43.10%) confirms limited hard reasoning capability
- SimpleQA factual accuracy (12.10%) is low - not suitable for factual lookup tasks
- No video or audio input - multimodal is limited to static images
- 24B dense model lacks the efficiency advantages of MoE architectures at similar capability
- Released June 2025 - newer models in the 24-27B range are starting to surpass it on text benchmarks
Related Coverage
- Qwen 3.5-27B - The strongest dense 27B competitor from Alibaba
- Open Source LLM Leaderboard - Current rankings for open-weight models
- Coding Benchmarks Leaderboard - HumanEval and coding benchmark rankings
- Open Source vs Proprietary AI - Framework for choosing deployment strategies
Sources
- Mistral Small 3.2 Documentation - Mistral AI
- Mistral-Small-3.2-24B-Instruct-2506 Model Card (HuggingFace)
- Mistral AI Releases Small 3.2 - MarkTechPost
- Mistral Small 3.2 Intelligence & Performance Analysis - Artificial Analysis
- Mistral Small 3.2 on Amazon Bedrock - AWS Blog
- Mistral AI Pricing - La Plateforme
- Simon Willison on Mistral Small 3.2
