Devstral 2

Mistral's open-weight coding agent model - 123B parameters, 256K context window, 72.2% on SWE-bench Verified, priced at $0.40/M input tokens.

Devstral 2

Devstral 2 is Mistral AI's second-generation agentic coding model, released December 9, 2025. It's a 123-billion parameter dense transformer that scores 72.2% on SWE-bench Verified, placing it among the top open-weight coding models. Unlike Codestral, which handles autocomplete tasks, Devstral 2 is built to operate as a software engineering agent - reading codebases, editing multiple files, calling tools, and closing tasks without human intervention mid-flight.

TL;DR

  • 123B dense transformer for agentic software engineering, scores 72.2% on SWE-bench Verified
  • 256K token context window, $0.40/M input and $2.00/M output tokens via Mistral API
  • Beats models 5-28x larger (DeepSeek V3.2, Kimi K2) on cost-efficiency; slightly behind Claude Sonnet 4.5 (77.2%) on SWE-bench

Mistral launched the model with Devstral Small 2 (24B, Apache 2.0) and a companion open-source CLI called Mistral Vibe. Both model sizes share the same 256K context window - wide enough to hold roughly 500-700 source files in a single pass. The 123B model is released under a modified MIT license that permits commercial use; on-premises deployment is supported.

Key Specifications

SpecificationDetails
ProviderMistral AI
Model FamilyDevstral
Parameters123B (dense transformer)
Context Window256K tokens
Input Price$0.40/M tokens
Output Price$2.00/M tokens
Release Date2025-12-09
LicenseModified MIT (commercial use allowed)
Open SourceYes (weights on HuggingFace)
QuantizationBF16, FP8 supported
Companion SmallDevstral Small 2 (24B, Apache 2.0)

Benchmark Performance

Devstral 2 is assessed mostly on SWE-bench Verified and SWE-bench Multilingual, the two benchmarks most directly tied to autonomous software engineering. The results below come from Mistral's published evaluation using the OpenHands scaffold (provided by All Hands AI).

BenchmarkDevstral 2 (123B)Devstral Small 2 (24B)Claude Sonnet 4.5DeepSeek V3.2
SWE-bench Verified72.2%68.0%77.2%73.1%
SWE-bench Multilingual61.3%Not disclosedNot disclosedNot disclosed
Terminal Bench 232.6%Not disclosedNot disclosedNot disclosed
LiveCodeBench V644.8Not disclosedNot disclosedNot disclosed
MMLU-Pro76.2Not disclosedNot disclosedNot disclosed
GPQA59.4Not disclosedNot disclosedNot disclosed

The 72.2% SWE-bench score puts Devstral 2 above DeepSeek V3.2 (73.1% is the closed-weight version) but below Claude Sonnet 4.5 at 77.2%. For a model that fits on two A100 80GB GPUs and runs fully on-premises, that gap is narrow enough to matter. The 68.0% scored by Devstral Small 2 is a bigger surprise - a 24B model within 4 percentage points of the 123B version, running on consumer hardware.

The Terminal Bench 2 score of 32.6% is worth flagging. When tasks shift from application coding toward DevOps and infrastructure automation, Devstral 2 loses ground to some proprietary models. If your agent workflows lean heavily on shell commands and sysadmin tasks rather than Python or TypeScript, the numbers don't favor it as clearly.

Mistral also ran a human preference evaluation comparing Devstral 2 to DeepSeek V3.2. Devstral 2 won 42.8% of head-to-head comparisons, losing 28.6%, with the remainder as ties. Claude Sonnet 4.5 remained preferred overall in that study, consistent with the SWE-bench gap.

For broader context on how these scores rank across the field, see our SWE-bench coding agent leaderboard.

Key Capabilities

Devstral 2 was fine-tuned for agentic software engineering rather than chat or autocomplete. It calls tools to read directory trees, open files, run tests, and write patches - the model loops through plan-execute-verify cycles without needing explicit user prompts between steps. The 256K context window means it can load an entire mid-size repository into a single inference pass, which matters for tasks like cross-file refactoring or understanding a codebase's internal dependency graph.

Multi-file and Multi-language Editing

The model supports multi-file edits natively through function calling and can track changes across files within the same context window. Mistral's announcement claims it outperforms models that are 5-28x larger on cost-efficiency metrics. The SWE-bench Multilingual score of 61.3% - measuring ability to close GitHub issues across repositories in non-English codebases - isn't directly comparable across models yet, but it's a metric worth watching for teams running in Python, Go, Java, or Rust.

Framework Integration

Devstral 2 integrates with the major agentic coding scaffolds: OpenHands, SWE-agent, Cline, Kilo Code, and Claude Code (as the model backend). For AI coding tools that support custom model endpoints, Devstral 2 slots in via the Mistral API or via self-hosted vLLM using the standard OpenAI-compatible interface.

The Mistral Vibe CLI - released alongside this model - is an open-source terminal agent that wraps Devstral 2 with project-aware context scanning, git integration, and IDE support via a Zed extension. It's Apache 2.0 and installable via pip.

Local Deployment

Running the 123B model on-premises requires about 128GB of combined RAM/VRAM. In practice that means either 2x A100 80GB GPUs or a high-memory CPU node for quantized inference. The Devstral Small 2 variant drops that to around 25GB VRAM, which fits a single consumer-grade H100 or A10. Both models are available as GGUF quantizations on HuggingFace for llama.cpp and LM Studio.

Pricing and Availability

Devstral 2 is available via the Mistral API at $0.40/M input tokens and $2.00/M output tokens. Devstral Small 2 costs $0.10/M input and $0.30/M output. By comparison, GPT-5.3 Codex and Claude Sonnet 4.5 are priced much higher per token, which is where Mistral's 7x cost-efficiency claim comes from.

The model was free during the initial launch period; current pricing shown above is the standard API rate as of verification date.

For on-premises deployment, HuggingFace hosts the weights at mistralai/Devstral-2-123B-Instruct-2512 (BF16 and FP8). vLLM is the recommended inference backend, with tensor-parallel serving across 8 GPUs for maximum throughput. SGLang and HuggingFace Transformers are also supported, and community GGUF quantizations are available for CPU-based inference.


Strengths

  • 72.2% SWE-bench Verified is competitive with top proprietary models at a fraction of the cost
  • 256K context window - wider than most models in its class, handles large codebases in a single pass
  • Fully open weights under a permissive modified MIT license; self-hostable on two A100s
  • Devstral Small 2 companion at 24B/Apache 2.0 closes most of the gap at a fraction of the compute
  • Native function calling and tool use across all major agentic frameworks

Weaknesses

  • SWE-bench score (72.2%) is still 5 percentage points below Claude Sonnet 4.5 (77.2%)
  • Terminal Bench 2 score of 32.6% shows clear weakness in DevOps and infrastructure automation tasks
  • 123B weight size requires significant GPU infrastructure for local inference
  • Human preference evaluations still favor Claude Sonnet 4.5 over Devstral 2 on head-to-head tasks
  • No multimodal support in the 123B model (Devstral Small 2 has image input capability)

FAQ

What is Devstral 2 best for?

Autonomous software engineering tasks: closing GitHub issues, multi-file refactoring, debugging, and running agentic workflows that require reading and editing an entire codebase without human prompts between steps.

Can Devstral 2 run locally?

Yes. The 123B model requires roughly 128GB VRAM (2x A100 80GB). Devstral Small 2 runs in 25GB VRAM. Community GGUF quantizations enable CPU inference via llama.cpp or LM Studio.

How does Devstral 2 compare to Claude Code?

On SWE-bench Verified, Devstral 2 scores 72.2% vs Claude Sonnet 4.5 at 77.2%. In head-to-head human evaluation, Claude Sonnet 4.5 is still preferred. Devstral 2 wins on cost and local deployment flexibility.

What license does Devstral 2 use?

A modified MIT license that allows commercial use. Devstral Small 2 uses Apache 2.0. Neither requires attribution or restricts commercial deployment.

What is the context window?

256K tokens, the same for both Devstral 2 (123B) and Devstral Small 2 (24B). That's large enough to hold roughly 500-700 typical source files in one pass.

Is Devstral 2 the same as Codestral?

No. Codestral is optimized for code autocomplete and fill-in-the-middle tasks in IDEs. Devstral 2 is a distinct model trained for multi-step agentic workflows with tool use and multi-file editing.


Sources:

✓ Last verified June 8, 2026

James Kowalski
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.