Kimi K2.5 vs MiniMax M2.5: Chinese MoE Rivals Battle for the Coding Crown

Kimi K2.5 and MiniMax M2.5 compared side by side - two Chinese MoE models where the smaller, cheaper one actually wins on SWE-bench. A detailed analysis of when each model delivers more value.

This is the most interesting matchup in the Chinese open-weight MoE space right now. Kimi K2.5 from Moonshot AI brings 1 trillion total parameters, 384 experts, frontier-class benchmarks across math and reasoning, a dedicated vision encoder, and an Agent Swarm system. MiniMax M2.5 from MiniMax brings 230 billion total parameters, 10 billion active per token, a SWE-bench Verified score of 80.2%, and pricing that undercuts K2.5 by a factor of four on input.

Read that SWE-bench number again. MiniMax M2.5 scores 80.2% on SWE-bench Verified. Kimi K2.5 scores 76.8%. The model with 4.3x fewer total parameters and 3.2x fewer active parameters per token beats the trillion-parameter giant on the benchmark that most directly measures real-world software engineering capability. This is not what the parameter counts would predict.

The question is whether SWE-bench is all that matters to you - because on everything else, K2.5 pulls ahead, often by significant margins.

TL;DR

  • Choose Kimi K2.5 if you need the best overall reasoning across math, science, coding, and vision, plus agentic capabilities. K2.5 wins on nearly every benchmark except SWE-bench.
  • Choose MiniMax M2.5 if your primary workload is software engineering and you want the highest SWE-bench score in the open-weight space at 4x lower input cost.

Quick Comparison

| Feature | Kimi K2.5 | MiniMax M2.5 |
|---|---|---|
| Developer | Moonshot AI | MiniMax |
| Architecture | MoE (384 experts, 8 active) | MoE |
| Total Parameters | 1T | 230B |
| Active Parameters | 32B | 10B |
| License | Modified MIT | Open Source |
| Context Window | 256K | 200K |
| API Pricing (Input) | $0.60/1M tokens | $0.15/1M tokens |
| API Pricing (Output) | $3.00/1M tokens | $1.20/1M tokens |
| SWE-bench Verified | 76.8% | 80.2% |
| AIME 2025 | 96.1 | Not published |
| GPQA Diamond | 87.6 | Not published |
| MMLU-Pro | 87.1 | Not published |
| Vision | MoonViT-3D (400M params) | No |
| Agentic | Agent Swarm (up to 100) | No |

Kimi K2.5: The Generalist That Does Everything Well

K2.5 is Moonshot AI's statement that you can build a single model that excels at math, science, coding, vision, and agentic tasks simultaneously. The 384-expert architecture with 8 active per token across 61 layers gives the model an enormous capacity for specialization - each forward pass selects a small, task-relevant slice of the full 1 trillion parameters.
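The routing idea is easy to picture with a minimal top-k gating sketch. This is illustrative only: the real router is a learned layer inside the transformer, and the only figures taken from the model specs are 384 experts with 8 active. The random scores stand in for the router's learned logits.

```python
import math
import random

def route_top_k(logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their scores.

    In a real MoE layer each chosen expert is a large feed-forward block;
    the token's output is the weighted sum of the chosen experts' outputs.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                    # subtract max for stability
    exps = [math.exp(logits[i] - m) for i in top]
    s = sum(exps)
    return top, [e / s for e in exps]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(384)]      # one router score per expert
chosen, weights = route_top_k(logits)
print(len(chosen))                                     # 8 experts active out of 384
```

The point of the sketch is the cost model: only 8 of 384 expert blocks run per token, which is how a 1T-parameter model ends up with a 32B active footprint.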

The math benchmarks are where K2.5 flexes hardest. AIME 2025 at 96.1 and HMMT at 95.4 are competition-grade scores that rival the best from OpenAI and Google. GPQA Diamond 87.6 extends this to graduate-level scientific reasoning. MMLU-Pro 87.1 confirms broad knowledge competency. These are not narrow wins - they represent a consistently high ceiling across diverse intellectual tasks. For a full rundown on reasoning performance across models, see our reasoning benchmarks leaderboard.

On coding specifically, K2.5 posts 85.0 on LiveCodeBench v6 and 76.8% on SWE-bench Verified. The LiveCodeBench score is excellent - among the best available. But SWE-bench is where M2.5 has the edge, and that edge matters because SWE-bench tests the ability to resolve real GitHub issues in real codebases, not just solve algorithm puzzles. For the latest coding comparisons, check our coding benchmarks leaderboard.

The multimodal system is a clear differentiator. MoonViT-3D is a purpose-built 400-million-parameter vision encoder handling native-resolution images and video. MMMU-Pro 78.5 and OCRBench 92.3 show strong visual reasoning and document understanding. M2.5 does not offer vision capabilities at all - if your workflow involves images, charts, or documents with visual elements, K2.5 is the only option here.

Agent Swarm adds another dimension. Trained with PARL, K2.5 can coordinate up to 100 sub-agents, pushing BrowseComp from 60.6% single-agent to 78.4% with the swarm. OSWorld 63.3 and WebArena 58.9 demonstrate real agentic task completion. M2.5 has no comparable multi-agent capability. For context on agent frameworks, see our best AI agent frameworks guide.

MiniMax M2.5: The SWE-bench Champion

M2.5's headline number is 80.2% on SWE-bench Verified. That is the highest score among open-weight models - higher than K2.5's 76.8%, higher than DeepSeek V3.2's 67.8%, and competitive with the best proprietary models. For a model with only 230B total parameters and 10B active per token, this is a remarkable achievement in software engineering capability.

The architecture is lean. At 230B total / 10B active, M2.5 processes each token with roughly 3.2x fewer active parameters than K2.5. This translates directly to lower inference costs: $0.15 per million input tokens and $1.20 per million output tokens. Input is 4x cheaper than K2.5. Output is 2.5x cheaper. For teams running coding assistants, automated PR review, or code generation pipelines at scale, the cost savings are substantial.

The 200K context window is slightly shorter than K2.5's 256K but still generous enough for most software engineering tasks. You can fit large codebases, lengthy documentation, and extensive conversation histories within the window without issue.

What makes M2.5's SWE-bench performance especially interesting is the efficiency angle. With 10B active parameters, M2.5 is delivering 80.2% SWE-bench accuracy at a fraction of K2.5's compute cost. Per active parameter, M2.5 is extracting roughly 3.3x more SWE-bench performance. This suggests that MiniMax's training methodology - whatever they did for code-specific capability - is exceptionally efficient. For broader context on coding model performance, see our review of DeepSeek V3.2 which benchmarks the leading code models.
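The efficiency claim is simple arithmetic on the numbers already quoted above, and it is worth making explicit:

```python
# SWE-bench points per billion active parameters, using the figures from this article
k25 = {"swe": 76.8, "active_b": 32}
m25 = {"swe": 80.2, "active_b": 10}

eff_k25 = k25["swe"] / k25["active_b"]   # ~2.4 points per B active params
eff_m25 = m25["swe"] / m25["active_b"]   # ~8.0 points per B active params
ratio = eff_m25 / eff_k25
print(round(ratio, 1))                   # ~3.3x more SWE-bench per active parameter
```

This is a crude metric - benchmark points do not scale linearly with parameters - but it captures why M2.5's result reads as a training-methodology win rather than a scale win.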

The limitations are clear, though. M2.5 does not publish competitive benchmarks on math (AIME, HMMT), scientific reasoning (GPQA), or general knowledge (MMLU-Pro). It has no vision capabilities. It has no agentic features. This is a model that appears to be heavily optimized for software engineering at the expense of breadth. If your workload is purely coding, that trade-off is excellent. If you need a generalist, it is a limitation.

Benchmark Comparison

| Benchmark | Kimi K2.5 | MiniMax M2.5 | Delta |
|---|---|---|---|
| SWE-bench Verified | 76.8% | 80.2% | M2.5 +3.4 |
| AIME 2025 | 96.1 | Not published | K2.5 by default |
| HMMT | 95.4 | Not published | K2.5 by default |
| GPQA Diamond | 87.6 | Not published | K2.5 by default |
| MMLU-Pro | 87.1 | Not published | K2.5 by default |
| LiveCodeBench v6 | 85.0 | Not published | K2.5 by default |
| MMMU-Pro | 78.5 | No vision | K2.5 by default |
| BrowseComp (Swarm) | 78.4% | No agentic | K2.5 by default |
| Active Params | 32B | 10B | M2.5 (3.2x fewer) |
| Total Params | 1T | 230B | M2.5 (4.3x fewer) |

The single most important row in this table is SWE-bench Verified. M2.5 wins by 3.4 percentage points on the benchmark that most directly maps to real-world software development. For a model with 3.2x fewer active parameters, this is a significant upset. On everything else - math, science, general reasoning, vision, agentic tasks - K2.5 wins by default because M2.5 simply does not compete in those categories.

Pricing Analysis

| Cost Factor | Kimi K2.5 | MiniMax M2.5 |
|---|---|---|
| API Input (per 1M tokens) | $0.60 | $0.15 |
| API Output (per 1M tokens) | $3.00 | $1.20 |
| 1M tokens round-trip estimate | ~$3.60 | ~$1.35 |
| Context Window | 256K | 200K |
| License | Modified MIT | Open Source |
| SWE-bench per dollar | ~21.3% per $1 | ~59.4% per $1 |

The "SWE-bench per dollar" metric tells the story. For every dollar spent on a round-trip million-token request, M2.5 delivers about 2.8x more SWE-bench performance than K2.5. If your workload is primarily software engineering - code generation, bug fixing, code review - M2.5 gives you more capability per dollar spent. For latest pricing across all providers, see our cost efficiency leaderboard.
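The "SWE-bench per dollar" rows can be reproduced from the published prices. The round-trip estimate here assumes 1M input tokens plus 1M output tokens, matching the table above:

```python
# SWE-bench points per dollar of a 1M-in / 1M-out round trip (prices per 1M tokens)
def swe_per_dollar(swe_pct, input_price, output_price):
    round_trip_cost = input_price + output_price
    return swe_pct / round_trip_cost

k25 = swe_per_dollar(76.8, 0.60, 3.00)   # ~21.3 points per dollar
m25 = swe_per_dollar(80.2, 0.15, 1.20)   # ~59.4 points per dollar
print(round(k25, 1), round(m25, 1), round(m25 / k25, 1))  # 21.3 59.4 2.8
```

Real workloads are rarely 1:1 input-to-output, so rerun the function with your own token mix; because M2.5 is cheaper on both sides, the direction of the result does not change, only the magnitude.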

Kimi K2.5: Pros and Cons

Pros:

  • AIME 96.1, GPQA Diamond 87.6 - frontier reasoning across math and science
  • Agent Swarm with up to 100 sub-agents for complex task decomposition
  • MoonViT-3D provides native-resolution vision for images and video
  • LiveCodeBench 85.0 shows strong algorithmic coding capability
  • 384-expert MoE enables deep per-task specialization
  • Terminal Bench 50.8 demonstrates real-world tool usage proficiency

Cons:

  • SWE-bench 76.8% trails M2.5's 80.2% on real-world coding
  • 4x more expensive on input tokens than M2.5
  • 1T total parameters makes self-hosting prohibitive
  • Modified MIT license is more restrictive than fully open alternatives
  • Overkill for pure coding workloads where M2.5 is both cheaper and better
  • Smaller ecosystem than established Chinese AI providers

MiniMax M2.5: Pros and Cons

Pros:

  • SWE-bench Verified 80.2% - highest among open-weight models
  • $0.15/$1.20 per million tokens - 4x cheaper input than K2.5
  • 230B/10B MoE is compact and relatively efficient
  • 200K context window is sufficient for most coding tasks
  • Exceptional parameter efficiency on software engineering tasks
  • Open-source license for commercial use

Cons:

  • No published benchmarks on math, science, or general reasoning
  • No multimodal or vision capabilities
  • No agentic features or multi-agent coordination
  • Narrower capability profile limits use beyond coding
  • 200K context is shorter than K2.5's 256K
  • Smaller community and ecosystem compared to larger providers

Verdict

Choose Kimi K2.5 if you need a model that handles everything - math, science, coding, vision, and agentic tasks - at a consistently high level. K2.5 is the right choice when your workload is diverse, when you need multimodal understanding, or when you are building agent pipelines that benefit from swarm coordination. The SWE-bench gap is 3.4 points in M2.5's favor, but K2.5 compensates with dramatically broader capability. See the full Kimi K2.5 model card for specifications.

Choose MiniMax M2.5 if software engineering is your primary use case and you want the best open-weight model for the job at the lowest cost. The 80.2% SWE-bench score is not a rounding error - it represents a meaningful advantage over K2.5 on real GitHub issues. At 4x cheaper input pricing, M2.5 gives you better coding performance for less money. If you are running automated code review, PR generation, or bug-fixing pipelines, this is the model to benchmark first. Full details at the MiniMax M2.5 model card.

The bottom line: For coding-focused teams, M2.5 is the surprising winner - smaller, cheaper, and empirically better on the benchmark that matters most. For everyone else, K2.5's breadth of capability across reasoning, vision, and agentic tasks makes it the more versatile investment. For how both fit into the broader open-source landscape, see our open-source LLM leaderboard.

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.