Mistral Medium 3.5 Review: Open Agent, Sharp Teeth

Mistral's 128B open-weight model consolidates reasoning, coding, and vision into one checkpoint, with remote agents that file pull requests autonomously.

Mistral Medium 3.5 Review: Open Agent, Sharp Teeth

When Mistral released Medium 3.5 on April 29, 2026, the pitch was disarmingly simple: stop maintaining three separate model checkpoints and collapse them into one. Magistral (reasoning), Devstral 2 (coding), and Mistral Medium 3.1 (instruction-following) are all gone. Medium 3.5 replaces every one of them.

TL;DR

  • 8.3/10 - the best open-weight coding agent you can actually self-host
  • 77.6% on SWE-Bench Verified, async PR-filing via Vibe remote agents, and a custom vision encoder in a single 128B dense checkpoint
  • API pricing ($1.50/$7.50 per million tokens) is steep for an open model, and inference speed runs below the median for its class
  • Strong choice for teams that want Claude Sonnet 4.6-tier coding performance at half the API cost, or who need to run frontier weights on-premises

That consolidation is more interesting than it sounds. The AI world has spent the last two years teaching frontier labs to specialize - one model for chat, a reasoning variant for hard problems, a coding model for software tasks. Mistral did the opposite and trained all three capabilities into one dense 128B set of weights, with a per-request toggle to dial reasoning effort up or down. That choice has real consequences for teams that actually deploy models.

Background: Three Models, One Checkpoint

The previous Mistral lineup had a coherent logic: you picked the right model for the right job and paid accordingly. Medium 3.1 was the generalist. Magistral was the thinking model. Devstral 2 was the software engineer. The problem was that real workloads don't stay in one lane. An agentic coding session needs instruction-following, reasoning, and coding ability in the same turn, often switching between all three mid-task.

Medium 3.5 solves this by abandoning the specialization strategy entirely. The model is dense - not a mixture-of-experts architecture - so the full 128B parameters are always active. That keeps routing simple and removes the token-routing overhead you see in MoE models, but it also means you pay the full compute cost on every request regardless of task complexity.

The tradeoff is visible in the speed numbers. The model generates output at 76.7 tokens per second on typical inference hardware, against a median of 81.6 t/s for open-weight models of similar size. Time to first token averages 2.26 seconds, above the 1.57-second median for its class. Dense 128B is fast enough for interactive use, but not a low-latency champion.

The Mistral Vibe interface showing a remote coding session in progress The Vibe Code interface launches remote agents that run in isolated cloud sandboxes, with full session history preserved if you beam a local CLI session to the cloud mid-task. Source: heise.de

Benchmarks in Context

The headline number is 77.6% on SWE-Bench Verified, which measures whether a model can resolve real GitHub issues in Python repositories - ambiguous specs, missing context, test suites that have to actually pass. That score sits above the 63.4% average across the 99 models currently on the leaderboard, though it trails Claude Sonnet 4.6 (79.6%), Gemini 3.5 Flash (78.8%), and the top-tier closed models like Claude Fable 5 (95.0%) and Claude Opus 4.8 (88.6%).

The more meaningful comparison is within open-weight models. Medium 3.5 beats Devstral 2 (Mistral's previous open coding specialist) and Qwen 3.5 397B A17B - a model with three times the parameter count in a mixture-of-experts configuration. That's the actual headline: a 128B dense model beating a 397B sparse one on coding tasks.

On agentic benchmarks the picture brightens further. The τ³-Telecom score is 91.4, measuring multi-tool calling and long-horizon task completion. Artificial Analysis gives it an Intelligence Index of 30, placing it well above the median for open-weight models at comparable size. For tool use, function calling, and multi-step reasoning, this model outperforms its SWE-Bench number suggests.

One caveat worth noting: SWE-Bench Verified is showing signs of inflation. OpenAI stopped publishing Verified scores in early 2026 and shifted to SWE-bench Pro, where the gap between labs narrows sharply - Claude Opus 4.5 scores 80.9% on Verified but only 45.9% on Pro. Mistral hasn't published Pro scores yet. Keep that in mind when benchmarks get cited in vendor pitches.


Vibe Remote Agents: The Part That Actually Matters

The raw model is only half the story. Alongside Medium 3.5, Mistral shipped remote agents in Vibe - its coding agent platform - and this is where the product becomes interesting for engineering teams.

Sessions run in isolated cloud sandboxes. You can spawn them from the CLI or Le Chat, run multiple jobs in parallel, and receive the output as a GitHub pull request.

The workflow is more practical than most agentic coding demos. You hand the agent a task from the CLI, it queues and executes asynchronously while you work on something else, and when it finishes it opens a PR for review. The agent can also connect to Sentry (for debugging production errors), Linear and Jira (for issue tracking), Slack and Microsoft Teams (for notifications), and GitHub for all version control operations.

The teleportation feature is genuinely useful. If you start a coding session locally and need to leave your desk, you can beam the session state to Mistral's cloud mid-task - full history, pending approvals, and tool state transfer intact. This is a meaningfully different model from "run the agent, wait, check back."

Mistral Vibe CLI running on Linux, showing agent session management Mistral Vibe CLI on Linux - remote agent sessions can be started, paused, and teleported to cloud without losing history or task context. Source: heise.de

The constraint is cost. Vibe remote agents require a Pro, Team, or Enterprise subscription. The open-weight checkpoint is free to download, but running it efficiently enough for production agent workloads requires serious infrastructure.

Reasoning on Demand

Configurable reasoning effort is one of the more practical design decisions in this release. Set reasoning_effort="high" for complex agentic runs, reasoning_effort="none" for fast chat replies. The temperature recommendations reflect this: 0.7 with high reasoning effort, 0.0-0.7 depending on task with reasoning off.

This is more useful in practice than it sounds on paper. A model that forces extended thinking on every request adds latency everywhere. A model that never thinks deeply enough misses hard problems. The per-request toggle lets you pay for reasoning only where it helps, which matters when you're running thousands of API calls in an agentic loop.

The published model card recommends keeping top-p at 0.95 in reasoning mode, which is standard. One technical note for self-hosters: the original Transformers config shipped with an incorrect entry that caused long-context performance degradation above certain sequence lengths. Mistral patched this, but make sure your inference framework is pulling the corrected checkpoint version before assuming full 256k context fidelity.

Vision: A Custom Encoder for Real-World Images

Mistral trained the vision encoder from scratch rather than grafting on an existing one, and the reason is practical: most vision encoders assume a fixed input resolution, which means they either distort tall screenshots or lose information in wide diagrams. Medium 3.5's encoder handles variable image sizes and aspect ratios natively.

For software teams, this matters specifically for visual debugging workflows - passing mobile screenshots with unusual aspect ratios, reading wide architecture diagrams, or processing system UI images that don't fit a 1:1 crop. The model outputs text only, so it won't create images or modify them, but the perception quality on non-standard inputs is meaningfully better than retrofitted encoders.

Self-Hosting: The Four-GPU Reality

The model ships under a modified MIT license with open weights on HuggingFace. Self-hosting is real but it comes with a price: the practical minimum is four H100 80GB (or H200) data center GPUs to hold the 128B parameters in FP8 precision, plus additional VRAM headroom for the 256K context window's KV cache.

That's roughly $400K-$600K in hardware at current market rates, or around $8-15 per hour on GPU cloud providers. For most teams, the hosted API will be cheaper below a certain usage threshold. The crossover point - where buying the infrastructure beats the API bill - sits somewhere around 5-10 billion tokens per month depending on cloud pricing.

For regulated industries (healthcare, finance, defense contracting) where data can't leave your perimeter, the self-hosting path is truly valuable regardless of cost. Medium 3.5 is one of very few frontier-tier open-weight models where that option exists at 77%+ SWE-Bench capability.

Pricing vs. the Competition

ModelInput (per 1M)Output (per 1M)SWE-BenchContextOpen Weights
Mistral Medium 3.5$1.50$7.5077.6%256KYes (modified MIT)
Claude Sonnet 4.6$3.00$15.0079.6%200K in / 64K outNo
Qwen 3.6 Max$2.50$7.50-1MNo
Mistral Small 4$0.10$0.30~55%32KYes (Apache 2.0)

The comparison with Claude Sonnet 4.6 is where the value case crystallizes. You get 2x cheaper tokens, a longer context window (256K vs 200K input, and crucially 256K output vs 64K), and roughly equal coding performance at the SWE-Bench level - with the option to self-host entirely. The gap in raw benchmark performance is real but narrow enough that cost and deployment flexibility dominate the decision for most teams.

Strengths

  • Single checkpoint replaces three separate Mistral models with no capability regressions
  • 77.6% SWE-Bench Verified is the highest published score among self-hostable open-weight models
  • Configurable reasoning effort lets you tune latency vs. depth per request
  • Vibe remote agents with GitHub PR integration are truly usable, not a demo
  • Custom vision encoder handles non-standard image aspect ratios cleanly
  • Modified MIT license allows commercial use with self-hosting

Weaknesses

  • 76.7 t/s generation speed falls below the open-weight median for this size class
  • API pricing at $7.50/M output tokens is expensive relative to comparable open models
  • SWE-Bench Verified may overstate real-world advantage; Pro scores not yet published
  • Four H100 80GB GPUs is a significant barrier for most self-hosting teams
  • Officially noted weakness in non-agentic benchmarks vs. closed-source competitors
  • Vibe remote agents locked behind paid subscription tiers

Verdict

Mistral Medium 3.5 makes a coherent argument. It's the highest-capability open-weight model for software engineering tasks at a price point meaningfully below Claude Sonnet 4.6, and it ships with a remote agent system that does more than fill a benchmark. The decision to unify three models into one dense checkpoint is truly useful for agentic deployments where the same session needs instruction-following, reasoning, and coding in rapid succession.

The limits are real. It isn't the fastest model in its class. The API cost is high for an open-weight model - you're partly paying for the agent infrastructure, not just the tokens. And the four-GPU self-hosting requirement is a practical ceiling for all but well-resourced engineering teams.

For teams that need self-hosted frontier coding capability, it's the best option available. For teams comfortable with API inference, it offers Claude Sonnet 4.6-grade coding at half the price, with a longer output context window. At 8.3/10, it earns that score on the weight of what it actually delivers - not what the benchmark sheet implies.

Sources

Elena Marchetti
About the author Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.