
MiniMax M2.5

MiniMax M2.5 is a 230B MoE model (10B active) that scores 80.2% on SWE-Bench Verified while costing 1/10th to 1/20th of frontier competitors like Claude Opus 4.6 and GPT-5.2.


Overview

MiniMax M2.5 is the latest language model from Shanghai-based MiniMax, a company that IPO'd on the Hong Kong Stock Exchange in January 2026 at a valuation exceeding HK$90 billion. Released on February 12, 2026, M2.5 is a Mixture-of-Experts model with 230 billion total parameters but only 10 billion active during inference - a design that delivers frontier-level coding and agentic performance at a fraction of the cost of proprietary competitors.

TL;DR

  • Best open-weights score on SWE-Bench Verified (80.2%), within 0.6 percentage points of Claude Opus 4.6
  • 230B MoE (10B active), 200K context, $0.15/M input and $1.20/M output tokens
  • 1/10th to 1/20th the output price of Opus, Gemini 3 Pro, and GPT-5 - the strongest cost-performance ratio of any frontier-class model

The headline number is 80.2% on SWE-Bench Verified - the highest of any open-weights model and just 0.6 points behind Claude Opus 4.6 (80.8%). But what makes M2.5 genuinely interesting is the economics: output tokens cost $1.20 per million, compared to $25/M for Opus 4.6 and $60/M for GPT-5.2. MiniMax claims you can run M2.5-Lightning continuously for an hour at 100 tokens/second for about $1.
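That hour-for-a-dollar claim is easy to sanity-check against the Lightning rate card. The sketch below counts output tokens only and assumes sustained generation at the advertised throughput; input tokens and request overhead would add slightly more:

```python
# Back-of-envelope check of "an hour of M2.5-Lightning for about $1".
# Assumes sustained 100 tok/s generation; output tokens only.

LIGHTNING_OUTPUT_PRICE = 2.40   # USD per million output tokens
THROUGHPUT = 100                # tokens per second (Lightning endpoint)

tokens_per_hour = THROUGHPUT * 3600                       # 360,000 tokens
cost_per_hour = tokens_per_hour / 1_000_000 * LIGHTNING_OUTPUT_PRICE

print(f"{tokens_per_hour:,} tokens/hour -> ${cost_per_hour:.2f}")
```

At the listed rate this works out to roughly $0.86/hour, consistent with MiniMax's "about $1" figure.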

M2.5 ships in two API variants - a standard endpoint at 50 tokens/second and a Lightning endpoint at 100 tokens/second - and the model weights are fully open-sourced on HuggingFace under a Modified-MIT license. The underlying parameters are identical between the two; Lightning simply allocates more inference compute for higher throughput.

Key Specifications

| Specification            | Details                                             |
|--------------------------|-----------------------------------------------------|
| Provider                 | MiniMax (Shanghai, China)                           |
| Model Family             | M2 series                                           |
| Parameters               | 230B total / 10B active (Mixture-of-Experts)        |
| Context Window           | 200K tokens (architecture supports up to 1M)        |
| Max Output               | 128K tokens                                         |
| Input Price (Standard)   | $0.15/M tokens                                      |
| Output Price (Standard)  | $1.20/M tokens                                      |
| Input Price (Lightning)  | $0.30/M tokens                                      |
| Output Price (Lightning) | $2.40/M tokens                                      |
| Release Date             | February 12, 2026                                   |
| License                  | Modified-MIT (open weights, commercial use allowed) |
| Inference Frameworks     | SGLang, vLLM, Transformers, KTransformers           |

Benchmark Performance

MiniMax self-reports comparisons against major frontier models. I have cross-checked these against third-party evaluations where available, but note that community members on Hacker News have flagged MiniMax's history of benchmark optimization with earlier M2 and M2.1 releases - so independent verification matters here.

Coding and Agentic Benchmarks

| Benchmark            | MiniMax M2.5 | Claude Opus 4.6 | Gemini 3 Pro | GPT-5.2 |
|----------------------|--------------|-----------------|--------------|---------|
| SWE-Bench Verified   | 80.2%        | 80.8%           | -            | -       |
| Multi-SWE-Bench      | 51.3%        | 50.3%           | -            | -       |
| Droid (SWE-Bench)    | 79.7%        | 78.9%           | -            | -       |
| OpenCode (SWE-Bench) | 76.1%        | 75.9%           | -            | -       |
| BrowseComp           | 76.3%        | -               | -            | -       |
| BFCL Multi-Turn      | 76.8%        | 63.3%           | -            | -       |

Reasoning and Knowledge Benchmarks

| Benchmark    | MiniMax M2.5 | Claude Opus 4.6 | Gemini 3 Pro | GPT-5.2 |
|--------------|--------------|-----------------|--------------|---------|
| AIME 2025    | 86.3         | 95.6            | 96.0         | 98.0    |
| GPQA Diamond | 85.2         | 90.0            | 91.0         | 90.0    |
| SciCode      | 44.4         | 52.0            | 56.0         | 52.0    |
| IFBench      | 70.0         | 53.0            | 70.0         | 75.0    |

The pattern is clear: M2.5 is optimized heavily for coding and agentic tasks, where it trades blows with Opus 4.6 at a 20x lower cost. On pure reasoning benchmarks like AIME 2025 (86.3 vs 95.6+) and scientific knowledge like SciCode (44.4 vs 52+), it falls meaningfully behind frontier proprietary models. This is a model built for developers who need a fast, cheap coding agent - not a general-purpose reasoning powerhouse.

Artificial Analysis ranks M2.5 at an Intelligence Index of 42, placing it #5 among 66 open-weights models and well above the median of 26. Its output speed of 53.9 tokens/second on the standard endpoint ranks #19 across all models they track.

Key Capabilities

Coding and software engineering is where M2.5 earns its keep. Trained with reinforcement learning across 200,000+ real-world environments using MiniMax's Forge RL framework, the model excels at resolving real GitHub issues, multi-file refactoring, and agentic development workflows. It completed SWE-Bench Verified tasks 37% faster than its predecessor M2.1 (22.8 minutes on average vs 31.3), using 20% fewer agentic rounds and fewer tokens per task (3.52M vs 3.72M). MiniMax reports that 80% of newly committed code at the company is now generated by M2.5.

Tool calling and search is another strong suit. The BFCL Multi-Turn score of 76.8% significantly outpaces Opus 4.6's 63.3%, meaning M2.5 handles function calls, file operations, and API interactions more reliably in multi-step workflows. On BrowseComp, which tests autonomous web search and information synthesis, M2.5 scores 76.3% when context management is enabled.
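To make the multi-turn tool-calling workflow concrete, here is a minimal agent-loop sketch of the kind of exchange BFCL-style evaluations exercise. The model is mocked with a canned responder, and the tool names and message format are illustrative, not MiniMax's actual API:

```python
# Minimal multi-turn tool-calling loop. The "model" is a stub that issues
# tool calls then answers; in practice this role is played by the M2.5 endpoint.
import json

def list_files(path):            # illustrative tool, not a real API
    return ["README.md", "main.py"]

def read_file(name):             # illustrative tool
    return f"<contents of {name}>"

TOOLS = {"list_files": list_files, "read_file": read_file}

def fake_model(messages):
    """Stand-in for the LLM: two tool calls, then a final answer."""
    tool_turns = [m for m in messages if m["role"] == "tool"]
    if len(tool_turns) == 0:
        return {"tool": "list_files", "args": {"path": "."}}
    if len(tool_turns) == 1:
        return {"tool": "read_file", "args": {"name": "main.py"}}
    return {"answer": "main.py read successfully"}

def run_agent(task):
    messages = [{"role": "user", "content": task}]
    for _ in range(10):                                   # cap agentic rounds
        reply = fake_model(messages)
        if "answer" in reply:
            return reply["answer"], messages
        result = TOOLS[reply["tool"]](**reply["args"])    # dispatch the call
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not terminate")

answer, transcript = run_agent("Summarize main.py")
print(answer)
```

The structure is the same regardless of which model sits behind `fake_model`: the benchmark measures how reliably the model picks the right tool with the right arguments over several such rounds.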

Office productivity is a less common but notable capability. MiniMax evaluated M2.5 on GDPval-MM, a benchmark for Excel, PowerPoint, and Word tasks, where it achieved a 59.0% average win rate. The model also integrates "Experts" - user-created agent templates on MiniMax's platform, with over 10,000 already deployed.

The model supports 10+ programming languages including Python, TypeScript, Go, Rust, C/C++, Java, Kotlin, and Ruby. It uses the CISPO algorithm for MoE training stability and a process reward mechanism to handle credit assignment in long-context agent rollouts.

Pricing and Availability

M2.5's pricing is its most disruptive feature. A workload of 10 million input tokens and 2 million output tokens per day costs roughly $3.90/day with M2.5 Standard, compared to about $100/day with Claude Opus 4.6 - a roughly 26x difference.

| Variant        | Input   | Output  | Throughput | Hourly Cost |
|----------------|---------|---------|------------|-------------|
| M2.5 Standard  | $0.15/M | $1.20/M | ~50 tok/s  | ~$0.30      |
| M2.5 Lightning | $0.30/M | $2.40/M | ~100 tok/s | ~$1.00      |

For context, Gemini 3.1 Pro charges $2.00/M input and $8.00/M output. Opus 4.6 runs $5.00/M input and $25.00/M output. Even among the recent wave of aggressively priced Chinese models like the Qwen 3.5 series, M2.5 Standard is among the cheapest options available at the frontier performance tier.
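The per-day economics can be reproduced directly from these rate cards. This is a rough sketch using the per-million-token prices quoted above; real bills vary with prompt caching, retries, and request overhead:

```python
# Daily cost for a workload of 10M input + 2M output tokens, using the
# per-million-token rates quoted in this article.
PRICES = {  # model: (input $/M, output $/M)
    "M2.5 Standard":   (0.15, 1.20),
    "M2.5 Lightning":  (0.30, 2.40),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro":  (2.00, 8.00),
}

def daily_cost(model, input_mtok=10, output_mtok=2):
    inp, out = PRICES[model]
    return input_mtok * inp + output_mtok * out

for model in PRICES:
    print(f"{model:16s} ${daily_cost(model):7.2f}/day")
```

Under these assumptions the Standard endpoint comes out near $4/day against roughly $100/day for Opus 4.6, which is where the order-of-magnitude gap in the text comes from.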

The model is available through MiniMax's own API platform, OpenRouter, Together AI, and Lambda. Open weights on HuggingFace mean you can self-host with SGLang or vLLM, though at 230B total parameters (457 GB in bf16), you will need serious hardware. Community-provided GGUF quantizations from Unsloth are available for more accessible deployments. Check our guide to running open-source LLMs locally for hardware requirements at this scale.
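A quick way to gauge what self-hosting takes at different quantization levels is parameter count times bytes per weight. This is an approximation only; real deployments also need memory for the KV cache, activations, and framework overhead, and the exact parameter count is slightly under 230B (hence the 457 GB bf16 figure):

```python
# Approximate weight-storage footprint for a ~230B-parameter model at
# common precisions. Ignores KV cache, activations, and runtime overhead.
PARAMS = 230e9

def weights_gb(bits_per_param):
    return PARAMS * bits_per_param / 8 / 1e9   # decimal gigabytes

for label, bits in [("bf16", 16), ("int8", 8), ("q4 (GGUF-style)", 4)]:
    print(f"{label:16s} ~{weights_gb(bits):4.0f} GB")
```

Even a 4-bit quantization lands around 115 GB of weights alone, which is why the GGUF builds target multi-GPU or large unified-memory machines rather than single consumer cards.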

Strengths and Weaknesses

Strengths

  • SWE-Bench leader among open models - 80.2% verified, matching Opus 4.6 within measurement noise on coding tasks
  • Extraordinary cost-performance ratio - 10-20x cheaper than comparable frontier models on output pricing
  • Fast inference - Lightning variant at 100 tok/s is roughly double typical frontier model throughput
  • Strong tool calling - BFCL 76.8% significantly outperforms Opus 4.6 on multi-turn function calling
  • Open weights - Modified-MIT license allows commercial use, self-hosting, and fine-tuning
  • Efficient agentic execution - 20% fewer rounds and 37% faster than predecessor M2.1

Weaknesses

  • Weaker reasoning - AIME 2025 at 86.3 lags well behind Opus 4.6 (95.6), Gemini 3 Pro (96.0), and GPT-5.2 (98.0)
  • Limited scientific knowledge - SciCode at 44.4 trails frontier models by 8-12 points
  • Verbose outputs - Artificial Analysis measured 56M tokens during evaluation vs a 15M median, inflating effective costs
  • Non-deterministic quality - Users report meaningful variation in output quality across identical prompts
  • Benchmark trust concerns - Community skepticism persists from M2/M2.1 benchmark reward-hacking history
  • Large self-hosting footprint - 457 GB unquantized makes local deployment impractical without multi-GPU setups

About the author: James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.