Mistral Medium 3.5

Mistral's first flagship merged model: a dense 128B with configurable reasoning, vision, and 77.6% SWE-Bench Verified, self-hostable on 4 GPUs.
Mistral Medium 3.5 is Mistral AI's first "merged" flagship model - one set of weights that handles instruction-following, reasoning, and coding without switching between separate specialized models. Previous Mistral deployments required routing between Magistral for reasoning and Devstral for coding. Medium 3.5 replaces both, plus the older Medium 3.1, with a single 128B dense model.

TL;DR

  • Dense 128B, 256K context, open weights under modified MIT, self-hostable on 4 GPUs at FP8
  • 77.6% SWE-Bench Verified - clears Devstral 2 (72.2%) and Devstral Small 2 (68.0%) by a clean margin
  • $1.50/M input tokens, 40% cheaper than GPT-4o, OpenAI-compatible API

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Mistral AI |
| Model Family | Mistral Medium |
| Parameters | 128B (dense) |
| Architecture | Dense transformer (not MoE) |
| Context Window | 256,000 tokens |
| Tensor Formats | BF16, F8_E4M3 (FP8) |
| Multimodal | Text + image input |
| Configurable Reasoning | reasoning_effort parameter (none/high) |
| Multilingual | 24+ languages |
| Release Date | April 29, 2026 |
| License | Modified MIT (open weights) |
| Open Weights | Yes (Hugging Face) |

The "modified MIT" label needs a note: the license is permissive for most organizations but includes revenue-based restrictions for large companies. If your company clears a certain revenue threshold, check the license text before launching commercially. For startups and mid-size businesses the restrictions are effectively moot.

Benchmark Performance

Mistral published benchmark results across three categories: SWE-Bench coding, τ³ agentic task suites, and math/reasoning. MMLU and GPQA Diamond aren't disclosed.

Figure: SWE-Bench Verified scores - Medium 3.5 at 77.6% vs Devstral 2 at 72.2% vs Devstral Small 2 at 68.0%. Source: huggingface.co

Coding

| Model | SWE-Bench Verified |
| --- | --- |
| Mistral Medium 3.5 | 77.6% |
| Devstral 2 | 72.2% |
| Devstral Small 2 | 68.0% |

The 5.4-point gap over Devstral 2 is significant in SWE-Bench terms - it translates to noticeably fewer failed patches in a real agentic loop. Medium 3.5 also beats Qwen 3.5 397B A17B on this benchmark per Mistral's own reporting, though the 397B MoE is a much larger model by total parameter count.

Agentic Tasks (τ³ Suite)

Figure: τ³ agentic benchmark scores across five task domains (Telecom, Airline, Retail, Banking, BrowseComp) - Medium 3.5 leads every category versus Magistral Medium 1.2, Mistral Small 4, and Medium 3.1. Source: huggingface.co

| Benchmark | Mistral Medium 3.5 | Magistral Medium 1.2 | Mistral Small 4 | Medium 3.1 |
| --- | --- | --- | --- | --- |
| τ³-Telecom | 91.4 | 60.5 | 47.1 | 46.9 |
| τ³-Airline | 72.0 | 53.5 | 38.5 | 41.5 |
| τ³-Retail | 76.1 | 70.2 | 67.8 | 64.3 |
| τ³-Banking | 13.4 | 7.7 | 7.0 | 5.7 |
| BrowseComp | 48.6 | 10.0 | 21.3 | 7.8 |

The τ³ numbers are striking. Telecom jumps 31 points over Magistral Medium 1.2 - the previous dedicated reasoning model. BrowseComp, which tests open-ended web research tasks, goes from 10.0 to 48.6. These aren't incremental gains.

Math and Instruction Following

Figure: math and instruction-following benchmarks (AIME25, AllenAI IfBench, Collie, Beyond AIME) vs Kimi K2.5, GLM-5, and Qwen3.5 397B, all run at maximum reasoning settings. Source: huggingface.co

On AIME25 avg@16, Medium 3.5 scores 86.3, roughly matching Claude Sonnet 4.5 (86.7) and Sonnet 4.x (88.9). Kimi K2.5 at 84.8 and GLM-5 at 83.1 trail by a few points. On the Collie instruction-following benchmark, Medium 3.5 peaks at 95.8, ahead of competing models in that group. All math scores were measured at maximum reasoning settings.

Important caveat: Mistral ran these benchmarks themselves. Independent third-party replications aren't available at launch. The SWE-Bench score is more trustworthy since it uses standardized tooling, but the τ³ suite is Mistral-designed and Mistral-run.

Key Capabilities

Merged Architecture

The key architectural decision is merging instruction-following, reasoning, and coding into a single checkpoint. Earlier Mistral deployments split these across Magistral (reasoning), Devstral (coding), and Medium 3.1 (general). Now you deploy one model and control behavior via the reasoning_effort parameter:

# Fast, direct responses
reasoning_effort = "none"

# Step-by-step chain-of-thought for complex tasks
reasoning_effort = "high"

This matters operationally. One model to monitor, one model to version, one bill.
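In practice, the switch is a single request field. The sketch below builds an OpenAI-style chat payload with the reasoning_effort knob; the model id `mistral-medium-3.5` and the exact placement of the field are assumptions based on this article, so check Mistral's API reference before relying on them.

```python
# Sketch of a chat request for Mistral's OpenAI-compatible endpoint.
# The model id and the top-level reasoning_effort field are assumptions
# from the article, not confirmed API spec.

def build_chat_request(prompt: str, reasoning_effort: str = "none") -> dict:
    """Build an OpenAI-style chat.completions payload with reasoning control."""
    if reasoning_effort not in ("none", "high"):
        raise ValueError("reasoning_effort must be 'none' or 'high'")
    return {
        "model": "mistral-medium-3.5",   # hypothetical API model id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }

# Fast, direct response for a simple lookup
fast = build_chat_request("What is the capital of France?", "none")

# Full chain-of-thought for a multi-step task
deep = build_chat_request("Refactor this module and explain the plan.", "high")
```

The same payload shape works for both modes, which is the operational point: routing logic that used to pick between Magistral and Medium 3.1 collapses into one string parameter.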

Vision

Medium 3.5 includes a vision encoder trained from scratch. It accepts images at variable sizes and aspect ratios - no forced resizing to a fixed resolution. That's relevant for document parsing, where tall screenshots and wide tables lose information when forced into a square. The model supports OCR with structured annotations and bounding box extraction through the API.
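A minimal sketch of an image request, assuming the standard OpenAI-style multimodal message format (text part plus a base64 data-URL image part); the model id is hypothetical and no client-side resizing is done, consistent with the variable-resolution claim above.

```python
import base64

def build_vision_request(image_bytes: bytes, question: str) -> dict:
    """OpenAI-style multimodal message: one image plus a text question.
    The image is passed at its native size; no resizing client-side."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "mistral-medium-3.5",  # hypothetical API model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

req = build_vision_request(b"\x89PNG...", "Extract every table as JSON.")
```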

Agentic Tooling

Native function calling, parallel tool use, structured JSON output, and async cloud-based agent execution are all part of the base package. The model powers Mistral's own Vibe 2.0 remote agents - sessions that run in the cloud, can be spawned from the CLI or Le Chat, and let you inspect file diffs, tool calls, and progress states mid-run. This is the same model powering Le Chat's new Work Mode, which runs multi-step tasks by calling multiple tools in parallel.
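For the function-calling side, here is a sketch of a request carrying one OpenAI-style tool schema with parallel calls enabled. The tool itself (`get_weather`) and the model id are illustrative placeholders, not part of Mistral's shipped tooling.

```python
# Hypothetical tool schema in the standard OpenAI function-calling format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_tool_request(prompt: str) -> dict:
    """Chat payload that lets the model emit several tool calls per turn."""
    return {
        "model": "mistral-medium-3.5",   # hypothetical API model id
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
        "tool_choice": "auto",
        "parallel_tool_calls": True,
    }
```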

Self-Hosting

At FP8 precision, the 128B weights need roughly 128GB VRAM. Four H100 80GB cards (320GB total) are the practical minimum, leaving headroom for the KV cache during production inference. Mistral also ships an EAGLE speculative decoding model for local inference acceleration. Supported runtimes include vLLM, SGLang, Transformers, and Ollama.

# tensor-parallel-size 8 matches BF16 serving; drop to 4 for FP8 on H100s
vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 8 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.8

BF16 serving requires 8 GPUs to leave enough headroom. FP8 is the recommended path for most production setups on H100.
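The VRAM figures above follow from simple arithmetic on parameter count and bytes per parameter; this back-of-envelope sketch reproduces them (weights only, ignoring KV cache and activation overhead).

```python
def weight_vram_gb(params_b: float, bytes_per_param: float) -> float:
    """Rough VRAM needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bytes_per_param / 1e9

fp8 = weight_vram_gb(128, 1)   # FP8: 1 byte/param -> 128 GB
bf16 = weight_vram_gb(128, 2)  # BF16: 2 bytes/param -> 256 GB

# Four 80GB H100s give 320 GB: 128 GB of FP8 weights plus ~192 GB of
# headroom for KV cache and activations. BF16's 256 GB of weights alone
# nearly fills that budget, hence the 8-GPU requirement.
print(fp8, bf16)  # 128.0 256.0
```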

Pricing and Availability

| Access | Pricing |
| --- | --- |
| Mistral API | $1.50/M input, $7.50/M output |
| Open weights (Hugging Face) | Free |
| NVIDIA build.nvidia.com | Free (prototyping) |
| NVIDIA NIM | Production container |
| Le Chat Pro/Team/Enterprise | Included |

At $1.50/M input, Medium 3.5 is cheaper than GPT-4o ($2.50/M input) and Gemini 3.1 Pro ($2.00/M input). The output rate at $7.50/M is higher than some competitors, so workloads with long outputs should factor that in. The API is OpenAI-compatible - swap the base URL, keep your existing client code.
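To see how the input/output split plays out, here is a quick per-request cost sketch using the rates from the pricing table; the token counts are arbitrary examples.

```python
def request_cost(input_toks: int, output_toks: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in USD for one request, given per-million-token rates."""
    return input_toks / 1e6 * in_rate + output_toks / 1e6 * out_rate

# Example workload: 10K input tokens, 2K output tokens
medium = request_cost(10_000, 2_000, 1.50, 7.50)
print(round(medium, 4))  # 0.03
```

Note that at these rates the 2K output tokens cost as much as the 10K input tokens, which is why long-output workloads deserve a closer look before switching.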

For cost-efficiency comparisons at scale, see our cost efficiency leaderboard.

Strengths

  • Single merged model replaces three separate Mistral deployments
  • 77.6% SWE-Bench Verified - best in the Mistral family by a large margin
  • Large agentic performance gains over Magistral and Medium 3.1
  • Open weights under modified MIT, self-hostable on 4 GPUs (FP8)
  • Vision encoder handles variable image sizes and aspect ratios
  • OpenAI-compatible API, no client migration needed
  • $1.50/M input is competitive for a 128B model

Weaknesses

  • No published MMLU, GPQA Diamond, or broad general-knowledge benchmarks
  • τ³ benchmarks are Mistral-designed and Mistral-run - no independent replication at launch
  • "Modified MIT" has revenue-based restrictions that some large enterprise users need to read carefully
  • BF16 self-hosting needs 8x 80GB GPUs; FP8 minimum is 4x H100s - not accessible for most on-prem setups
  • Dense 128B is a larger footprint than MoE alternatives like Qwen 3.5 397B A17B (17B active per token)


✓ Last verified April 30, 2026

About the author: James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.