Mistral Medium 3.5

Mistral's first flagship merged model: a dense 128B with configurable reasoning, vision, and 77.6% SWE-Bench Verified, self-hostable on 4 GPUs.
Mistral Medium 3.5 is Mistral AI's first "merged" flagship model - one set of weights that handles instruction-following, reasoning, and coding without switching between separate specialized models. Previous Mistral deployments required routing between Magistral for reasoning and Devstral for coding. Medium 3.5 replaces both, plus the older Medium 3.1, with a single 128B dense model.

TL;DR

  • Dense 128B, 256K context, open weights under modified MIT, self-hostable on 4 GPUs at FP8
  • 77.6% SWE-Bench Verified - clears Devstral 2 (72.2%) and Devstral Small 2 (68.0%) by a clean margin
  • $1.50/M input tokens, 40% cheaper than GPT-4o, OpenAI-compatible API

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Mistral AI |
| Model Family | Mistral Medium |
| Parameters | 128B (dense) |
| Architecture | Dense transformer (not MoE) |
| Context Window | 256,000 tokens |
| Tensor Formats | BF16, F8_E4M3 (FP8) |
| Multimodal | Text + image input |
| Configurable Reasoning | reasoning_effort parameter (none/high) |
| Multilingual | 24+ languages |
| Release Date | April 29, 2026 |
| License | Modified MIT (open weights) |
| Open Weights | Yes (Hugging Face) |

The "modified MIT" label needs a note: the license is permissive for most organizations but includes revenue-based restrictions for large companies. If your company clears a certain revenue threshold, check the license text before launching commercially. For startups and mid-size businesses the restrictions are effectively moot.

Benchmark Performance

Mistral published benchmark results across three categories: SWE-Bench coding, τ³ agentic task suites, and math/reasoning. MMLU and GPQA Diamond aren't disclosed.

Figure: SWE-Bench Verified scores - Medium 3.5 at 77.6% vs Devstral 2 at 72.2% vs Devstral Small 2 at 68.0%. Source: huggingface.co

Coding

| Model | SWE-Bench Verified |
| --- | --- |
| Mistral Medium 3.5 | 77.6% |
| Devstral 2 | 72.2% |
| Devstral Small 2 | 68.0% |

The 5.4-point gap over Devstral 2 is significant in SWE-Bench terms - it translates to noticeably fewer failed patches in a real agentic loop. Medium 3.5 also beats Qwen 3.5 397B A17B on this benchmark per Mistral's own reporting, though the 397B MoE is a much larger model by total parameter count.

Agentic Tasks (τ³ Suite)

Figure: τ³ agentic benchmark scores across five task domains (Telecom, Airline, Retail, Banking, BrowseComp) - Medium 3.5 leads every category versus Magistral Medium 1.2, Mistral Small 4, and Medium 3.1. Source: huggingface.co

| Benchmark | Mistral Medium 3.5 | Magistral Medium 1.2 | Mistral Small 4 | Medium 3.1 |
| --- | --- | --- | --- | --- |
| τ³-Telecom | 91.4 | 60.5 | 47.1 | 46.9 |
| τ³-Airline | 72.0 | 53.5 | 38.5 | 41.5 |
| τ³-Retail | 76.1 | 70.2 | 67.8 | 64.3 |
| τ³-Banking | 13.4 | 7.7 | 7.0 | 5.7 |
| BrowseComp | 48.6 | 10.0 | 21.3 | 7.8 |

The τ³ numbers are striking. Telecom jumps 31 points over Magistral Medium 1.2 - the previous dedicated reasoning model. BrowseComp, which tests open-ended web research tasks, goes from 10.0 to 48.6. These aren't incremental gains.

Math and Instruction Following

Figure: math and instruction-following benchmarks (AIME25, AllenAI IfBench, Collie, Beyond AIME) vs Kimi K2.5, GLM-5, and Qwen3.5 397B, all run at maximum reasoning settings. Source: huggingface.co

On AIME25 avg@16, Medium 3.5 scores 86.3, roughly matching Claude Sonnet 4.5 (86.7) and Sonnet 4.x (88.9). Kimi K2.5 at 84.8 and GLM-5 at 83.1 trail by a few points. On the Collie instruction-following benchmark, Medium 3.5 peaks at 95.8, ahead of competing models in that group. All math scores were measured at maximum reasoning settings.

Important caveat: Mistral ran these benchmarks themselves. Independent third-party replications aren't available at launch. The SWE-Bench score is more trustworthy since it uses standardized tooling, but the τ³ suite is Mistral-designed and Mistral-run.

Key Capabilities

Merged Architecture

The key architectural decision is merging instruction-following, reasoning, and coding into a single checkpoint. Earlier Mistral deployments split these across Magistral (reasoning), Devstral (coding), and Medium 3.1 (general). Now you deploy one model and control behavior via the reasoning_effort parameter:

# Fast, direct responses
reasoning_effort = "none"

# Step-by-step chain-of-thought for complex tasks
reasoning_effort = "high"

This matters operationally. One model to monitor, one model to version, one bill.
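In practice, the switch is a single request field. The sketch below builds an OpenAI-style chat payload with the reasoning_effort knob; the model id `mistral-medium-3.5` and the exact placement of the field are assumptions based on this article, so check Mistral's API reference before relying on them.

```python
# Sketch of a chat request for Mistral's OpenAI-compatible endpoint.
# The model id and the top-level reasoning_effort field are assumptions
# from the article, not confirmed API spec.

def build_chat_request(prompt: str, reasoning_effort: str = "none") -> dict:
    """Build an OpenAI-style chat.completions payload with reasoning control."""
    if reasoning_effort not in ("none", "high"):
        raise ValueError("reasoning_effort must be 'none' or 'high'")
    return {
        "model": "mistral-medium-3.5",   # hypothetical API model id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }

# Fast, direct response for a simple lookup
fast = build_chat_request("What is the capital of France?", "none")

# Full chain-of-thought for a multi-step task
deep = build_chat_request("Refactor this module and explain the plan.", "high")
```

The same payload shape works for both modes, which is the operational point: routing logic that used to pick between Magistral and Medium 3.1 collapses into one string parameter.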

Vision

Medium 3.5 includes a vision encoder trained from scratch. It accepts images at variable sizes and aspect ratios - no forced resizing to a fixed resolution. That's relevant for document parsing, where tall screenshots and wide tables lose information when forced into a square. The model supports OCR with structured annotations and bounding box extraction through the API.
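A minimal sketch of an image request, assuming the standard OpenAI-style multimodal message format (text part plus a base64 data-URL image part); the model id is hypothetical and no client-side resizing is done, consistent with the variable-resolution claim above.

```python
import base64

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """OpenAI-style multimodal message: one image plus a text question.
    The image is passed at its native size; no resizing client-side."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "mistral-medium-3.5",  # hypothetical API model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

req = build_vision_request(b"\x89PNG...", "Extract every table as JSON.")
```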

Agentic Tooling

Native function calling, parallel tool use, structured JSON output, and async cloud-based agent execution are all part of the base package. The model powers Mistral's own Vibe 2.0 remote agents - sessions that run in the cloud, can be spawned from the CLI or Le Chat, and let you inspect file diffs, tool calls, and progress states mid-run. This is the same model powering Le Chat's new Work Mode, which runs multi-step tasks by calling multiple tools in parallel.
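For the function-calling side, here is a sketch of a request carrying one OpenAI-style tool schema with parallel calls enabled. The tool itself (`get_weather`) and the model id are illustrative placeholders, not part of Mistral's shipped tooling.

```python
# Hypothetical tool schema in the standard OpenAI function-calling format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_tool_request(prompt: str) -> dict:
    """Chat payload that lets the model emit several tool calls per turn."""
    return {
        "model": "mistral-medium-3.5",   # hypothetical API model id
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
        "tool_choice": "auto",
        "parallel_tool_calls": True,
    }
```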

Self-Hosting

At FP8 precision, the 128B weights need roughly 128GB VRAM. Four H100 80GB cards (320GB total) are the practical minimum, leaving headroom for the KV cache during production inference. Mistral also ships an EAGLE speculative decoding model for local inference acceleration. Supported runtimes include vLLM, SGLang, Transformers, and Ollama.

# tensor-parallel-size 8 matches BF16 serving; drop to 4 for FP8 on H100s
vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 8 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.8

BF16 serving requires 8 GPUs to leave enough headroom. FP8 is the recommended path for most production setups on H100.
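The VRAM figures above follow from simple arithmetic on parameter count and bytes per parameter; this back-of-envelope sketch reproduces them (weights only, ignoring KV cache and activation overhead).

```python
def weight_vram_gb(params_b: float, bytes_per_param: float) -> float:
    """Rough VRAM needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bytes_per_param / 1e9

fp8 = weight_vram_gb(128, 1)   # FP8: 1 byte/param -> 128 GB
bf16 = weight_vram_gb(128, 2)  # BF16: 2 bytes/param -> 256 GB

# Four 80GB H100s give 320 GB: 128 GB of FP8 weights plus ~192 GB of
# headroom for KV cache and activations. BF16's 256 GB of weights alone
# nearly fills that budget, hence the 8-GPU requirement.
print(fp8, bf16)  # 128.0 256.0
```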

Pricing and Availability

| Access | Pricing |
| --- | --- |
| Mistral API | $1.50/M input, $7.50/M output |
| Open weights (Hugging Face) | Free |
| NVIDIA build.nvidia.com | Free (prototyping) |
| NVIDIA NIM | Production container |
| Le Chat Pro/Team/Enterprise | Included |

At $1.50/M input, Medium 3.5 is cheaper than GPT-4o ($2.50/M input) and Gemini 3.1 Pro ($2.00/M input). The output rate at $7.50/M is higher than some competitors, so workloads with long outputs should factor that in. The API is OpenAI-compatible - swap the base URL, keep your existing client code.
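To see how the input/output split plays out, here is a quick per-request cost sketch using the rates from the pricing table; the token counts are arbitrary examples.

```python
def request_cost(input_toks: int, output_toks: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in USD for one request, given per-million-token rates."""
    return input_toks / 1e6 * in_rate + output_toks / 1e6 * out_rate

# Example workload: 10K input tokens, 2K output tokens
medium = request_cost(10_000, 2_000, 1.50, 7.50)
print(round(medium, 4))  # 0.03
```

Note that at these rates the 2K output tokens cost as much as the 10K input tokens, which is why long-output workloads deserve a closer look before switching.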

For cost-efficiency comparisons at scale, see our cost efficiency leaderboard.

Strengths

  • Single merged model replaces three separate Mistral deployments
  • 77.6% SWE-Bench Verified - best in the Mistral family by a large margin
  • Large agentic performance gains over Magistral and Medium 3.1
  • Open weights under modified MIT, self-hostable on 4 GPUs (FP8)
  • Vision encoder handles variable image sizes and aspect ratios
  • OpenAI-compatible API, no client migration needed
  • $1.50/M input is competitive for a 128B model

Weaknesses

  • No published MMLU, GPQA Diamond, or broad general-knowledge benchmarks
  • τ³ benchmarks are Mistral-designed and Mistral-run - no independent replication at launch
  • "Modified MIT" has revenue-based restrictions that some large enterprise users need to read carefully
  • BF16 self-hosting needs 8x 80GB GPUs; FP8 minimum is 4x H100s - not accessible for most on-prem setups
  • Dense 128B is a larger footprint than MoE alternatives like Qwen 3.5 397B A17B (17B active per token)


✓ Last verified April 30, 2026

About the author: James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.