Mistral Small 3.2
Mistral Small 3.2 is a 24B dense model with strong function calling, multimodal vision, and 128K context under Apache 2.0 - optimized for production tool-use pipelines and EU-compliant deployments.

TL;DR
- 24B dense model with multimodal (text + images) and robust function calling - Apache 2.0 license
- Significant improvements over 3.1: instruction following up from 82.75% to 84.78%, HumanEval Plus from 88.99% to 92.90%
- Infinite generation errors cut by nearly 40% (2.11% to 1.29%) - a real production reliability improvement
- API at $0.10/$0.30 per million tokens - one of the cheapest capable models available
- Fits in ~55 GB VRAM for self-hosting - practical for single-GPU production deployment
Overview
Mistral AI released Small 3.2 on June 20, 2025, and the changelog reads like a production reliability patch as much as a capability upgrade. Instruction accuracy goes from 82.75% to 84.78%. Infinite generation errors - where the model gets stuck in a repetition loop and never stops - drop from 2.11% to 1.29%. Function calling gets a more robust template format. These are not headline-grabbing benchmark jumps. They are the kind of improvements that matter when you are running a model in production and a 2% infinite generation rate is causing real customer-facing failures.
The 24B dense architecture means deployment is straightforward. At ~55 GB in bfloat16, it fits on a single A100 80GB or H100 with room for KV cache. With quantization, it can run on smaller GPUs. The Apache 2.0 license means no usage restrictions - you can deploy it commercially, modify it, and redistribute it. Combined with the $0.10/$0.30 API pricing on La Plateforme, Mistral Small 3.2 occupies a practical niche: the cheapest capable model with genuine function calling ability and multimodal support.
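The ~55 GB figure is roughly what the parameter count predicts. A back-of-envelope sketch - the overhead fraction below is our assumption for the vision encoder, activations, and framework buffers, not a published number:

```python
def vram_estimate_gb(params_b: float, bytes_per_param: float = 2.0,
                     overhead_frac: float = 0.15) -> float:
    """Rough VRAM needed to load model weights.

    params_b: parameter count in billions.
    bytes_per_param: 2 for bfloat16/fp16, 1 for int8, 0.5 for 4-bit.
    overhead_frac: assumed headroom (illustrative, not a measured figure).
    """
    weights_gb = params_b * bytes_per_param  # 1B params * 2 bytes ~= 2 GB
    return weights_gb * (1 + overhead_frac)

print(round(vram_estimate_gb(24), 1))       # bf16: ~55 GB, in line with the spec table
print(round(vram_estimate_gb(24, 0.5), 1))  # 4-bit quantized: well under 24 GB
```

The same arithmetic explains why quantized variants fit on consumer GPUs: at 4 bits per parameter, the weights alone drop to ~12 GB.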
Where Small 3.2 sits in the capability hierarchy is important to be honest about. MMLU at 80.50% and MMLU-Pro at 69.06% are solid for a 24B model but trail the Qwen 3.5 series and the frontier by a significant margin. GPQA Diamond at 46.13% confirms this is not a reasoning powerhouse. The model's competitive advantage is not raw intelligence - it is the combination of function calling reliability, production stability, multilingual support, multimodal capability, permissive licensing, and low cost. If you need a model that reliably calls tools, follows instructions precisely, and does not break in production - and you need it under Apache 2.0 with EU data sovereignty - Small 3.2 is the strongest option at this price point.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Mistral AI |
| Model Family | Mistral Small |
| Architecture | Dense Transformer |
| Parameters | 24B |
| Base Model | Mistral-Small-3.1-24B-Base-2503 |
| Context Window | 131,072 tokens |
| Input Modalities | Text, Images (up to 10 per prompt) |
| Output Modality | Text |
| Input Price | $0.10/M tokens |
| Output Price | $0.30/M tokens |
| Release Date | June 20, 2025 |
| License | Apache 2.0 |
| VRAM Requirement | ~55 GB (bfloat16/fp16) |
| Recommended Framework | vLLM >= 0.9.1, mistral-common >= 1.6.2 |
| Model ID | mistral-small-2506 |
Benchmark Performance
| Benchmark | Mistral Small 3.2 | Mistral Small 3.1 | Gemma 3 27B | Qwen 3.5-27B |
|---|---|---|---|---|
| MMLU | 80.50% | 79.80% | 78.60% | 82.10% |
| MMLU-Pro (5-shot CoT) | 69.06% | 66.76% | 67.50% | 74.80% |
| GPQA Diamond | 46.13% | 45.96% | 42.40% | 62.50% |
| HumanEval Plus (Pass@5) | 92.90% | 88.99% | 78.50% | 88.40% |
| MBPP Plus (Pass@5) | 78.33% | 76.80% | - | 80.10% |
| WildBench v2 (instruction following) | 65.33% | 55.60% | - | 62.80% |
| Arena Hard v2 | 43.10% | 38.50% | - | 48.20% |
| DocVQA (vision) | 94.86% | 93.20% | 85.60% | 84.20% |
| ChartQA (vision) | 87.40% | 85.10% | 76.30% | 74.80% |
| Infinite Gen Rate | 1.29% | 2.11% | - | - |
Three things stand out in this table. First, the HumanEval Plus score of 92.90% is best-in-class for this parameter range - Small 3.2 is a genuinely strong code generation model. Second, the vision benchmarks (DocVQA 94.86%, ChartQA 87.40%) are exceptional for a 24B model and outperform even Gemma 3 27B, which positions itself as a vision-first model. Third, the WildBench v2 jump from 55.60% to 65.33% is nearly a 10-point improvement in instruction following - the single largest gain in the 3.1 to 3.2 upgrade.
The GPQA Diamond score of 46.13% and Arena Hard v2 at 43.10% confirm the ceiling. Small 3.2 does not compete with larger models on hard reasoning. Qwen 3.5-27B outperforms it on MMLU-Pro (74.80% vs 69.06%) and GPQA Diamond (62.50% vs 46.13%) by wide margins. The tradeoff is that Small 3.2 offers stronger function calling, better production stability, and Apache 2.0 licensing from a European company.
Key Capabilities
Function Calling and Tool Use. The headline improvement in 3.2 is a more robust function calling template. The model supports auto tool choice through vLLM's OpenAI-compatible API, handles structured JSON output reliably, and processes multi-tool workflows. Mistral specifically optimized the function calling format for integration with existing tool-use pipelines. If you are building an agent that needs to call APIs, query databases, or interact with external services, Small 3.2's function calling is more reliable than its size would suggest. The production-grade tool-use capability at 24B parameters is the model's strongest differentiator from alternatives like Gemma 3 27B, which does not emphasize tool calling.
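As a sketch of what a tool-use request looks like against a vLLM OpenAI-compatible endpoint - the `get_weather` function and its schema are hypothetical, invented here for illustration:

```python
import json

# Hypothetical tool definition in the OpenAI-compatible schema that
# vLLM's server accepts; name, description, and parameters are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "mistral-small-2506",
    "messages": [{"role": "user", "content": "What's the weather in Lyon?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to emit a tool call
}

print(json.dumps(request_body, indent=2))
```

POST this body to the server's `/v1/chat/completions` route; with `tool_choice` set to `"auto"`, the model returns either a plain assistant message or a structured tool call for your pipeline to execute.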
Document Intelligence. The vision benchmarks tell a specific story: DocVQA at 94.86% and ChartQA at 87.40% make Small 3.2 one of the strongest document understanding models at any size. It accepts up to 10 images per prompt and handles OCR, annotation, bounding-box extraction, and document Q&A natively. For teams building document processing pipelines - invoice extraction, form analysis, chart interpretation - the combination of high accuracy and low deployment cost is compelling. The model beats Gemma 3 27B on both DocVQA (+9.3 points) and ChartQA (+11.1 points) despite being slightly smaller.
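A minimal sketch of building a multimodal message in the OpenAI-compatible format - the helper name is ours, and base64 data URIs are an assumption about your serving setup, which may accept plain URLs instead:

```python
import base64

def image_message(question: str, image_paths: list[str]) -> dict:
    """Build one OpenAI-style multimodal user message.

    Small 3.2 accepts up to 10 images per prompt; enforced here.
    Encoding images as base64 data URIs is the common pattern for
    OpenAI-compatible endpoints (exact serving setup may vary).
    """
    assert len(image_paths) <= 10, "model accepts at most 10 images per prompt"
    content = [{"type": "text", "text": question}]
    for path in image_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"role": "user", "content": content}
```

The returned dict drops straight into the `messages` list of a chat completions request, alongside any system prompt or prior turns.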
Production Reliability. The drop in infinite generation errors from 2.11% to 1.29% might seem like a minor detail, but at production scale it is meaningful. A 2% failure rate on a service handling thousands of requests per day means dozens of stuck responses that either time out or return garbage. Cutting that to 1.29% is a direct improvement to user experience and system reliability. Combined with the instruction following gains (WildBench v2 up nearly 10 points), Small 3.2 is measurably more predictable than its predecessor. In production systems, predictability is often more valuable than peak capability.
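Even at 1.29%, a production service still wants a guard against stuck generations. One common mitigation - our sketch, not something Mistral ships - is a cheap repetition check on accumulated output; the thresholds below are illustrative and should be tuned against your own traffic:

```python
def looks_stuck(text: str, ngram: int = 12, repeats: int = 4) -> bool:
    """Flag output that appears caught in a repetition loop.

    Heuristic: if the trailing `ngram`-word window already occurs at
    least `repeats` times in the text, the model is probably looping.
    Run this periodically on a streamed response and abort on True.
    """
    words = text.split()
    if len(words) < ngram * repeats:
        return False  # too short to loop at this window size
    tail = " ".join(words[-ngram:])
    return text.count(tail) >= repeats
```

Pairing a check like this with a hard `max_tokens` cap turns a silent timeout into a fast, observable failure you can retry.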
Pricing and Availability
| Tier | Input | Output |
|---|---|---|
| La Plateforme API | $0.10/M tokens | $0.30/M tokens |
Mistral Small 3.2 is available through Mistral's La Plateforme, Amazon Bedrock Marketplace, Amazon SageMaker JumpStart, and various third-party providers. The Apache 2.0 license means you can also download the weights from HuggingFace and self-host.
At $0.10/$0.30 per million tokens, the API is remarkably cheap. For comparison: Gemini 3.1 Pro costs $2/$12 (20x more on input), and even DeepSeek V3.2 at $0.28/$0.42 costs nearly 3x more on input. The catch is that those models are significantly more capable on reasoning benchmarks. Small 3.2's pricing makes it viable for high-volume, lower-complexity workloads - chatbots, document processing, function calling agents, and classification tasks where you need reliable performance at scale rather than maximum intelligence.
Self-hosting on a single A100 80GB is the natural deployment target. At roughly $2-3/hour for cloud A100 instances, the break-even versus the API depends on utilization - at a blended rate around $0.15 per million tokens, a $2.50/hour GPU only pays for itself at sustained throughput in the tens of millions of tokens per hour, so the API wins for all but the heaviest workloads. See our open source vs proprietary AI guide for detailed cost modeling.
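The break-even arithmetic can be sketched directly. The input/output split and the $2.50/hour GPU rate below are assumptions within the ranges mentioned above, not published figures:

```python
def api_cost_per_hour(tokens_per_hour: float, input_frac: float = 0.7,
                      in_price: float = 0.10, out_price: float = 0.30) -> float:
    """Hourly API spend in dollars at the $0.10/$0.30 per-M-token rates.

    input_frac is an assumed input/output token split, not a published figure.
    """
    m = tokens_per_hour / 1e6
    return m * (input_frac * in_price + (1 - input_frac) * out_price)

GPU_RATE = 2.50  # assumed cloud A100 80GB hourly rate, mid-range of $2-3/hr

# Throughput at which API spend matches the GPU rental.
breakeven = GPU_RATE / api_cost_per_hour(1e6)
print(f"break-even ~{breakeven:.1f}M tokens/hour")  # ~15.6M tokens/hour
```

Below that throughput the API is cheaper; above it, self-hosting wins - and the gap widens further if you can keep the GPU saturated with batched requests.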
Strengths
- Best-in-class function calling and tool use at the 24B parameter range
- Exceptional document vision (DocVQA 94.86%, ChartQA 87.40%) - beats larger models
- Strong code generation (HumanEval Plus 92.90%) for its size
- Infinite generation errors down from 2.11% to 1.29% versus its predecessor - real production reliability gain
- Apache 2.0 license with European data sovereignty advantages
- $0.10/$0.30 API pricing - among the cheapest capable models available
- Fits on a single A100 80GB for self-hosted deployment
Weaknesses
- GPQA Diamond (46.13%) and MMLU-Pro (69.06%) trail Qwen 3.5-27B by significant margins
- Arena Hard v2 (43.10%) confirms limited hard reasoning capability
- SimpleQA factual accuracy (12.10%) is low - not suitable for factual lookup tasks
- No video or audio input - multimodal is limited to static images
- 24B dense model lacks the efficiency advantages of MoE architectures at similar capability
- Released June 2025 - newer models in the 24-27B range are starting to surpass it on text benchmarks
Related Coverage
- Qwen 3.5-27B - The strongest dense 27B competitor from Alibaba
- Open Source LLM Leaderboard - Current rankings for open-weight models
- Coding Benchmarks Leaderboard - HumanEval and coding benchmark rankings
- Open Source vs Proprietary AI - Framework for choosing deployment strategies
Sources
- Mistral Small 3.2 Documentation - Mistral AI
- Mistral-Small-3.2-24B-Instruct-2506 Model Card (HuggingFace)
- Mistral AI Releases Small 3.2 - MarkTechPost
- Mistral Small 3.2 Intelligence & Performance Analysis - Artificial Analysis
- Mistral Small 3.2 on Amazon Bedrock - AWS Blog
- Mistral AI Pricing - La Plateforme
- Simon Willison on Mistral Small 3.2
