
MiniMax M2.5

MiniMax M2.5 is a 230B MoE model (10B active) that scores 80.2% on SWE-Bench Verified while costing 1/10th to 1/20th of frontier competitors like Claude Opus 4.6 and GPT-5.2.


Overview

MiniMax M2.5 is the latest language model from Shanghai-based MiniMax, a company that IPO'd on the Hong Kong Stock Exchange in January 2026 at a valuation exceeding HK$90 billion. Released on February 12, 2026, M2.5 is a Mixture-of-Experts model with 230 billion total parameters but only 10 billion active during inference - a design that delivers frontier-level coding and agentic performance at a fraction of the cost of proprietary competitors.

TL;DR

  • Best open-weights score on SWE-Bench Verified (80.2%), within 0.6 percentage points of Claude Opus 4.6
  • 230B MoE (10B active), 200K context, $0.15/M input and $1.20/M output tokens
  • 1/10th to 1/20th the output price of Opus, Gemini 3 Pro, and GPT-5 - the strongest cost-performance ratio of any frontier-class model

The headline number is 80.2% on SWE-Bench Verified - the highest of any open-weights model and just 0.6 points behind Claude Opus 4.6 (80.8%). But what makes M2.5 genuinely interesting is the economics: output tokens cost $1.20 per million, compared to $25/M for Opus 4.6 and $60/M for GPT-5.2. MiniMax claims you can run M2.5-Lightning continuously for an hour at 100 tokens/second for about $1.
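That hour-for-a-dollar claim is easy to sanity-check against the Lightning rate card. The sketch below counts output tokens only and assumes sustained generation at the advertised throughput; input tokens and request overhead would add slightly more:

```python
# Back-of-envelope check of "an hour of M2.5-Lightning for about $1".
# Assumes sustained 100 tok/s generation; output tokens only.

LIGHTNING_OUTPUT_PRICE = 2.40   # USD per million output tokens
THROUGHPUT = 100                # tokens per second (Lightning endpoint)

tokens_per_hour = THROUGHPUT * 3600                       # 360,000 tokens
cost_per_hour = tokens_per_hour / 1_000_000 * LIGHTNING_OUTPUT_PRICE

print(f"{tokens_per_hour:,} tokens/hour -> ${cost_per_hour:.2f}")
```

At the listed rate this works out to roughly $0.86/hour, consistent with MiniMax's "about $1" figure.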

M2.5 ships in two API variants - a standard endpoint at 50 tokens/second and a Lightning endpoint at 100 tokens/second - and the model weights are fully open-sourced on HuggingFace under a Modified-MIT license. The underlying parameters are identical between the two; Lightning simply allocates more inference compute for higher throughput.

Key Specifications

| Specification            | Details                                             |
|--------------------------|-----------------------------------------------------|
| Provider                 | MiniMax (Shanghai, China)                           |
| Model Family             | M2 series                                           |
| Parameters               | 230B total / 10B active (Mixture-of-Experts)        |
| Context Window           | 200K tokens (architecture supports up to 1M)        |
| Max Output               | 128K tokens                                         |
| Input Price (Standard)   | $0.15/M tokens                                      |
| Output Price (Standard)  | $1.20/M tokens                                      |
| Input Price (Lightning)  | $0.30/M tokens                                      |
| Output Price (Lightning) | $2.40/M tokens                                      |
| Release Date             | February 12, 2026                                   |
| License                  | Modified-MIT (open weights, commercial use allowed) |
| Inference Frameworks     | SGLang, vLLM, Transformers, KTransformers           |

Benchmark Performance

MiniMax self-reports comparisons against major frontier models. I have cross-checked these against third-party evaluations where available, but note that community members on Hacker News have flagged MiniMax's history of benchmark optimization with earlier M2 and M2.1 releases - so independent verification matters here.

Coding and Agentic Benchmarks

| Benchmark            | MiniMax M2.5 | Claude Opus 4.6 | Gemini 3 Pro | GPT-5.2 |
|----------------------|--------------|-----------------|--------------|---------|
| SWE-Bench Verified   | 80.2%        | 80.8%           | -            | -       |
| Multi-SWE-Bench      | 51.3%        | 50.3%           | -            | -       |
| Droid (SWE-Bench)    | 79.7%        | 78.9%           | -            | -       |
| OpenCode (SWE-Bench) | 76.1%        | 75.9%           | -            | -       |
| BrowseComp           | 76.3%        | -               | -            | -       |
| BFCL Multi-Turn      | 76.8%        | 63.3%           | -            | -       |

Reasoning and Knowledge Benchmarks

| Benchmark    | MiniMax M2.5 | Claude Opus 4.6 | Gemini 3 Pro | GPT-5.2 |
|--------------|--------------|-----------------|--------------|---------|
| AIME 2025    | 86.3         | 95.6            | 96.0         | 98.0    |
| GPQA Diamond | 85.2         | 90.0            | 91.0         | 90.0    |
| SciCode      | 44.4         | 52.0            | 56.0         | 52.0    |
| IFBench      | 70.0         | 53.0            | 70.0         | 75.0    |

The pattern is clear: M2.5 is optimized heavily for coding and agentic tasks, where it trades blows with Opus 4.6 at a 20x lower cost. On pure reasoning benchmarks like AIME 2025 (86.3 vs 95.6+) and scientific knowledge like SciCode (44.4 vs 52+), it falls meaningfully behind frontier proprietary models. This is a model built for developers who need a fast, cheap coding agent - not a general-purpose reasoning powerhouse.

Artificial Analysis ranks M2.5 at an Intelligence Index of 42, placing it #5 among 66 open-weights models and well above the median of 26. Its output speed of 53.9 tokens/second on the standard endpoint ranks #19 across all models they track.

Key Capabilities

Coding and software engineering is where M2.5 earns its keep. Trained with reinforcement learning across 200,000+ real-world environments using MiniMax's Forge RL framework, the model excels at resolving real GitHub issues, multi-file refactoring, and agentic development workflows. It completed SWE-Bench Verified tasks 37% faster than its predecessor M2.1 (22.8 minutes on average vs 31.3), using 20% fewer agentic rounds and fewer tokens per task (3.52M vs 3.72M). MiniMax reports that 80% of newly committed code at the company is now generated by M2.5.

Tool calling and search is another strong suit. The BFCL Multi-Turn score of 76.8% significantly outpaces Opus 4.6's 63.3%, meaning M2.5 handles function calls, file operations, and API interactions more reliably in multi-step workflows. On BrowseComp, which tests autonomous web search and information synthesis, M2.5 scores 76.3% when context management is enabled.
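To make the multi-turn tool-calling workflow concrete, here is a minimal agent-loop sketch of the kind of exchange BFCL-style evaluations exercise. The model is mocked with a canned responder, and the tool names and message format are illustrative, not MiniMax's actual API:

```python
# Minimal multi-turn tool-calling loop. The "model" is a stub that issues
# tool calls then answers; in practice this role is played by the M2.5 endpoint.
import json

def list_files(path):            # illustrative tool, not a real API
    return ["README.md", "main.py"]

def read_file(name):             # illustrative tool
    return f"<contents of {name}>"

TOOLS = {"list_files": list_files, "read_file": read_file}

def fake_model(messages):
    """Stand-in for the LLM: two tool calls, then a final answer."""
    tool_turns = [m for m in messages if m["role"] == "tool"]
    if len(tool_turns) == 0:
        return {"tool": "list_files", "args": {"path": "."}}
    if len(tool_turns) == 1:
        return {"tool": "read_file", "args": {"name": "main.py"}}
    return {"answer": "main.py read successfully"}

def run_agent(task):
    messages = [{"role": "user", "content": task}]
    for _ in range(10):                                   # cap agentic rounds
        reply = fake_model(messages)
        if "answer" in reply:
            return reply["answer"], messages
        result = TOOLS[reply["tool"]](**reply["args"])    # dispatch the call
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not terminate")

answer, transcript = run_agent("Summarize main.py")
print(answer)
```

The structure is the same regardless of which model sits behind `fake_model`: the benchmark measures how reliably the model picks the right tool with the right arguments over several such rounds.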

Office productivity is a less common but notable capability. MiniMax evaluated M2.5 on GDPval-MM, a benchmark for Excel, PowerPoint, and Word tasks, where it achieved a 59.0% average win rate. The model also integrates "Experts" - user-created agent templates on MiniMax's platform, with over 10,000 already deployed.

The model supports 10+ programming languages including Python, TypeScript, Go, Rust, C/C++, Java, Kotlin, and Ruby. It uses the CISPO algorithm for MoE training stability and a process reward mechanism to handle credit assignment in long-context agent rollouts.

Pricing and Availability

M2.5's pricing is its most disruptive feature. A workload of 10 million input tokens and 2 million output tokens per day costs roughly $3.90/day with M2.5 Standard, compared to about $100/day with Claude Opus 4.6 - a roughly 26x difference.

| Variant        | Input   | Output  | Throughput | Hourly Cost |
|----------------|---------|---------|------------|-------------|
| M2.5 Standard  | $0.15/M | $1.20/M | ~50 tok/s  | ~$0.30      |
| M2.5 Lightning | $0.30/M | $2.40/M | ~100 tok/s | ~$1.00      |

For context, Gemini 3.1 Pro charges $2.00/M input and $8.00/M output. Opus 4.6 runs $5.00/M input and $25.00/M output. Even among the recent wave of aggressively priced Chinese models like the Qwen 3.5 series, M2.5 Standard is among the cheapest options available at the frontier performance tier.
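The per-day economics can be reproduced directly from these rate cards. This is a rough sketch using the per-million-token prices quoted above; real bills vary with prompt caching, retries, and request overhead:

```python
# Daily cost for a workload of 10M input + 2M output tokens, using the
# per-million-token rates quoted in this article.
PRICES = {  # model: (input $/M, output $/M)
    "M2.5 Standard":   (0.15, 1.20),
    "M2.5 Lightning":  (0.30, 2.40),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro":  (2.00, 8.00),
}

def daily_cost(model, input_mtok=10, output_mtok=2):
    inp, out = PRICES[model]
    return input_mtok * inp + output_mtok * out

for model in PRICES:
    print(f"{model:16s} ${daily_cost(model):7.2f}/day")
```

Under these assumptions the Standard endpoint comes out near $4/day against roughly $100/day for Opus 4.6, which is where the order-of-magnitude gap in the text comes from.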

The model is available through MiniMax's own API platform, OpenRouter, Together AI, and Lambda. Open weights on HuggingFace mean you can self-host with SGLang or vLLM, though at 230B total parameters (457 GB in bf16), you will need serious hardware. Community-provided GGUF quantizations from Unsloth are available for more accessible deployments. Check our guide to running open-source LLMs locally for hardware requirements at this scale.
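A quick way to gauge what self-hosting takes at different quantization levels is parameter count times bytes per weight. This is an approximation only; real deployments also need memory for the KV cache, activations, and framework overhead, and the exact parameter count is slightly under 230B (hence the 457 GB bf16 figure):

```python
# Approximate weight-storage footprint for a ~230B-parameter model at
# common precisions. Ignores KV cache, activations, and runtime overhead.
PARAMS = 230e9

def weights_gb(bits_per_param):
    return PARAMS * bits_per_param / 8 / 1e9   # decimal gigabytes

for label, bits in [("bf16", 16), ("int8", 8), ("q4 (GGUF-style)", 4)]:
    print(f"{label:16s} ~{weights_gb(bits):4.0f} GB")
```

Even a 4-bit quantization lands around 115 GB of weights alone, which is why the GGUF builds target multi-GPU or large unified-memory machines rather than single consumer cards.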

Strengths and Weaknesses

Strengths

  • SWE-Bench leader among open models - 80.2% verified, matching Opus 4.6 within measurement noise on coding tasks
  • Extraordinary cost-performance ratio - 10-20x cheaper than comparable frontier models on output pricing
  • Fast inference - Lightning variant at 100 tok/s is roughly double typical frontier model throughput
  • Strong tool calling - BFCL 76.8% significantly outperforms Opus 4.6 on multi-turn function calling
  • Open weights - Modified-MIT license allows commercial use, self-hosting, and fine-tuning
  • Efficient agentic execution - 20% fewer rounds and 37% faster than predecessor M2.1

Weaknesses

  • Weaker reasoning - AIME 2025 at 86.3 lags well behind Opus 4.6 (95.6), Gemini 3 Pro (96.0), and GPT-5.2 (98.0)
  • Limited scientific knowledge - SciCode at 44.4 trails frontier models by 8-12 points
  • Verbose outputs - Artificial Analysis measured 56M tokens during evaluation vs a 15M median, inflating effective costs
  • Non-deterministic quality - Users report meaningful variation in output quality across identical prompts
  • Benchmark trust concerns - Community skepticism persists from M2/M2.1 benchmark reward-hacking history
  • Large self-hosting footprint - 457 GB unquantized makes local deployment impractical without multi-GPU setups

About the author: James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.