Name: Devstral 2
Author: Mistral AI

Devstral 2 is Mistral AI's second-generation agentic coding model, released December 9, 2025. It's a 123-billion parameter dense transformer that scores 72.2% on SWE-bench Verified, placing it among the top open-weight coding models. Unlike Codestral, which handles autocomplete tasks, Devstral 2 is built to operate as a software engineering agent - reading codebases, editing multiple files, calling tools, and closing tasks without human intervention mid-flight.

TL;DR

123B dense transformer for agentic software engineering, scores 72.2% on SWE-bench Verified
256K token context window, $0.40/M input and $2.00/M output tokens via Mistral API
Beats models 5-28x larger (DeepSeek V3.2, Kimi K2) on cost-efficiency; slightly behind Claude Sonnet 4.5 (77.2%) on SWE-bench

Mistral launched the model with Devstral Small 2 (24B, Apache 2.0) and a companion open-source CLI called Mistral Vibe. Both model sizes share the same 256K context window - wide enough to hold roughly 500-700 source files in a single pass. The 123B model is released under a modified MIT license that permits commercial use; on-premises deployment is supported.

Key Specifications

Specification	Details
Provider	Mistral AI
Model Family	Devstral
Parameters	123B (dense transformer)
Context Window	256K tokens
Input Price	$0.40/M tokens
Output Price	$2.00/M tokens
Release Date	2025-12-09
License	Modified MIT (commercial use allowed)
Open Source	Yes (weights on HuggingFace)
Quantization	BF16, FP8 supported
Companion Small	Devstral Small 2 (24B, Apache 2.0)

Benchmark Performance

Devstral 2 is assessed mostly on SWE-bench Verified and SWE-bench Multilingual, the two benchmarks most directly tied to autonomous software engineering. The results below come from Mistral's published evaluation using the OpenHands scaffold (provided by All Hands AI).

Benchmark	Devstral 2 (123B)	Devstral Small 2 (24B)	Claude Sonnet 4.5	DeepSeek V3.2
SWE-bench Verified	72.2%	68.0%	77.2%	73.1%
SWE-bench Multilingual	61.3%	Not disclosed	Not disclosed	Not disclosed
Terminal Bench 2	32.6%	Not disclosed	Not disclosed	Not disclosed
LiveCodeBench V6	44.8	Not disclosed	Not disclosed	Not disclosed
MMLU-Pro	76.2	Not disclosed	Not disclosed	Not disclosed
GPQA	59.4	Not disclosed	Not disclosed	Not disclosed

The 72.2% SWE-bench score puts Devstral 2 above DeepSeek V3.2 (73.1% is the closed-weight version) but below Claude Sonnet 4.5 at 77.2%. For a model that fits on two A100 80GB GPUs and runs fully on-premises, that gap is narrow enough to matter. The 68.0% scored by Devstral Small 2 is a bigger surprise - a 24B model within 4 percentage points of the 123B version, running on consumer hardware.

The Terminal Bench 2 score of 32.6% is worth flagging. When tasks shift from application coding toward DevOps and infrastructure automation, Devstral 2 loses ground to some proprietary models. If your agent workflows lean heavily on shell commands and sysadmin tasks rather than Python or TypeScript, the numbers don't favor it as clearly.

Mistral also ran a human preference evaluation comparing Devstral 2 to DeepSeek V3.2. Devstral 2 won 42.8% of head-to-head comparisons, losing 28.6%, with the remainder as ties. Claude Sonnet 4.5 remained preferred overall in that study, consistent with the SWE-bench gap.

For broader context on how these scores rank across the field, see our SWE-bench coding agent leaderboard.

Key Capabilities

Devstral 2 was fine-tuned for agentic software engineering rather than chat or autocomplete. It calls tools to read directory trees, open files, run tests, and write patches - the model loops through plan-execute-verify cycles without needing explicit user prompts between steps. The 256K context window means it can load an entire mid-size repository into a single inference pass, which matters for tasks like cross-file refactoring or understanding a codebase's internal dependency graph.

Multi-file and Multi-language Editing

The model supports multi-file edits natively through function calling and can track changes across files within the same context window. Mistral's announcement claims it outperforms models that are 5-28x larger on cost-efficiency metrics. The SWE-bench Multilingual score of 61.3% - measuring ability to close GitHub issues across repositories in non-English codebases - isn't directly comparable across models yet, but it's a metric worth watching for teams running in Python, Go, Java, or Rust.

Framework Integration

Devstral 2 integrates with the major agentic coding scaffolds: OpenHands, SWE-agent, Cline, Kilo Code, and Claude Code (as the model backend). For AI coding tools that support custom model endpoints, Devstral 2 slots in via the Mistral API or via self-hosted vLLM using the standard OpenAI-compatible interface.

The Mistral Vibe CLI - released alongside this model - is an open-source terminal agent that wraps Devstral 2 with project-aware context scanning, git integration, and IDE support via a Zed extension. It's Apache 2.0 and installable via pip.

Local Deployment

Running the 123B model on-premises requires about 128GB of combined RAM/VRAM. In practice that means either 2x A100 80GB GPUs or a high-memory CPU node for quantized inference. The Devstral Small 2 variant drops that to around 25GB VRAM, which fits a single consumer-grade H100 or A10. Both models are available as GGUF quantizations on HuggingFace for llama.cpp and LM Studio.

Pricing and Availability

Devstral 2 is available via the Mistral API at $0.40/M input tokens and $2.00/M output tokens. Devstral Small 2 costs $0.10/M input and $0.30/M output. By comparison, GPT-5.3 Codex and Claude Sonnet 4.5 are priced much higher per token, which is where Mistral's 7x cost-efficiency claim comes from.

The model was free during the initial launch period; current pricing shown above is the standard API rate as of verification date.

For on-premises deployment, HuggingFace hosts the weights at mistralai/Devstral-2-123B-Instruct-2512 (BF16 and FP8). vLLM is the recommended inference backend, with tensor-parallel serving across 8 GPUs for maximum throughput. SGLang and HuggingFace Transformers are also supported, and community GGUF quantizations are available for CPU-based inference.

Strengths

72.2% SWE-bench Verified is competitive with top proprietary models at a fraction of the cost
256K context window - wider than most models in its class, handles large codebases in a single pass
Fully open weights under a permissive modified MIT license; self-hostable on two A100s
Devstral Small 2 companion at 24B/Apache 2.0 closes most of the gap at a fraction of the compute
Native function calling and tool use across all major agentic frameworks

Weaknesses

SWE-bench score (72.2%) is still 5 percentage points below Claude Sonnet 4.5 (77.2%)
Terminal Bench 2 score of 32.6% shows clear weakness in DevOps and infrastructure automation tasks
123B weight size requires significant GPU infrastructure for local inference
Human preference evaluations still favor Claude Sonnet 4.5 over Devstral 2 on head-to-head tasks
No multimodal support in the 123B model (Devstral Small 2 has image input capability)

Mistral Vibe 2.0 Review - Our hands-on with the CLI coding agent powered by Devstral 2
SWE-bench Coding Agent Leaderboard - Full rankings across all models
Coding Benchmarks Leaderboard - Broader code benchmark comparison
Mistral Medium 3.5 - Mistral's flagship merged model at 77.6% SWE-bench
Mistral Small 4 - Mistral's lightweight general-purpose model
DeepSeek V3.2 - Primary open-weight competitor

FAQ

What is Devstral 2 best for?

Autonomous software engineering tasks: closing GitHub issues, multi-file refactoring, debugging, and running agentic workflows that require reading and editing an entire codebase without human prompts between steps.

Can Devstral 2 run locally?

Yes. The 123B model requires roughly 128GB VRAM (2x A100 80GB). Devstral Small 2 runs in 25GB VRAM. Community GGUF quantizations enable CPU inference via llama.cpp or LM Studio.

How does Devstral 2 compare to Claude Code?

On SWE-bench Verified, Devstral 2 scores 72.2% vs Claude Sonnet 4.5 at 77.2%. In head-to-head human evaluation, Claude Sonnet 4.5 is still preferred. Devstral 2 wins on cost and local deployment flexibility.

What license does Devstral 2 use?

A modified MIT license that allows commercial use. Devstral Small 2 uses Apache 2.0. Neither requires attribution or restricts commercial deployment.

What is the context window?

256K tokens, the same for both Devstral 2 (123B) and Devstral Small 2 (24B). That's large enough to hold roughly 500-700 typical source files in one pass.

Is Devstral 2 the same as Codestral?

No. Codestral is optimized for code autocomplete and fill-in-the-middle tasks in IDEs. Devstral 2 is a distinct model trained for multi-step agentic workflows with tool use and multi-file editing.

Sources: