Kimi K2.5

Moonshot AI's Kimi K2.5 is a 1T-parameter MoE model that activates 32B parameters per token, pairs native multimodal vision via MoonViT-3D with Agent Swarm coordination of up to 100 sub-agents trained via PARL, and posts top-tier math and coding benchmarks under a modified MIT license.

TL;DR

  • 1 trillion total parameters / 32B active per token MoE with 384 experts (8 active per token) across 61 layers
  • Native multimodal via MoonViT-3D (400M parameter vision encoder) - processes images and video at native resolution with 4x temporal compression
  • Agent Swarm trained with PARL (Parallel-Agent Reinforcement Learning) - coordinates up to 100 sub-agents across 1,500 steps, reducing end-to-end runtime by 80%
  • Best-in-class open-weight math: AIME 2025 96.1%, HMMT 95.4%, GPQA Diamond 87.6%
  • Strong agentic scores: SWE-bench Verified 76.8%, BrowseComp 78.4% (Agent Swarm), OSWorld 63.3%

Overview

Kimi K2.5 is Moonshot AI's flagship open-weight model, released on January 27, 2026. It extends the Kimi K2 base through continual pretraining on approximately 15 trillion mixed visual and text tokens, producing a natively multimodal system that handles text, images, and video without bolted-on adapters. The architecture runs 1 trillion total parameters through a Mixture-of-Experts design with 384 experts across 61 layers, activating only 32 billion parameters and 8 experts per token. The result is a model that competes with the proprietary frontier on reasoning, coding, and agentic tasks while remaining open-weight under a modified MIT license.

The headline capability is Agent Swarm - a self-directed multi-agent coordination system trained with PARL (Parallel-Agent Reinforcement Learning). Instead of running tasks sequentially, K2.5 learns to decompose complex goals into parallel subtasks executed by up to 100 frozen sub-agents simultaneously. This is not a wrapper or scaffolding - parallelism is a learned skill baked into the model weights. On BrowseComp, Agent Swarm mode pushes K2.5 from 60.6% (single-agent) to 78.4%, closing the gap with Claude Opus 4.6's 84.0%. On complex web research tasks, the swarm approach reduces minimum critical steps by 3-4.5x compared to single-agent execution.
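The orchestrator/sub-agent pattern described above can be pictured as a fan-out/fan-in loop. The sketch below is purely conceptual: in K2.5 the parallelism lives in the model weights, not in external scheduling code, and `run_subagent` is a hypothetical stand-in for a frozen sub-agent call.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask: str) -> str:
    # Placeholder for one frozen sub-agent completing its subtask;
    # a real deployment would invoke the model here.
    return f"result({subtask})"

def orchestrate(task: str, subtasks: list[str], max_agents: int = 100) -> list[str]:
    # The orchestrator decomposes `task` into parallelizable subtasks and
    # fans them out, capped at the 100-sub-agent limit from the model card.
    with ThreadPoolExecutor(max_workers=min(len(subtasks), max_agents)) as pool:
        return list(pool.map(run_subagent, subtasks))

results = orchestrate("survey recent MoE releases",
                      ["search arXiv", "search blogs", "collect benchmarks"])
print(results)
```

The fan-in step (merging sub-agent results back into a single answer) is where the learned coordination matters most; the wrapper above only illustrates the shape of the execution graph.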

The math and reasoning numbers are genuinely exceptional. AIME 2025 at 96.1% beats every other open-weight model and most proprietary ones - only Gemini 3.1 Pro is competitive at that tier. GPQA Diamond at 87.6% and MMLU-Pro at 87.1% put K2.5 firmly in frontier territory. On coding, SWE-bench Verified at 76.8% trails Claude Opus 4.6's 80.8% but beats DeepSeek V3.2's 73.1% and most other open-weight alternatives. LiveCodeBench v6 at 85.0% confirms the coding strength holds on fresh evaluations.

Where K2.5 falls short is on single-agent agentic execution without the swarm. BrowseComp drops to 60.6% in single-agent mode, and Terminal Bench 2.0 at 50.8% lags behind GPT-5.3 Codex's 77.3%. The model is designed around multi-agent coordination - if your deployment requires a single model instance making sequential decisions, the proprietary frontier still holds advantages on sustained agentic reliability.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Moonshot AI (Dark Side of the Moon) |
| Model Family | Kimi K2 |
| Architecture | Transformer MoE with MoonViT-3D vision encoder |
| Total Parameters | ~1T |
| Active Parameters | 32B per token |
| Experts | 384 total, 8 active per token |
| Layers | 61 |
| Vision Encoder | MoonViT-3D (400M parameters) |
| Context Window | 256,000 tokens |
| Input Price | $0.60/M tokens |
| Output Price | $3.00/M tokens |
| Release Date | January 27, 2026 |
| License | Modified MIT |
| Input Modalities | Text, Image, Video |
| Output Modality | Text |
| Agent Swarm | Up to 100 sub-agents, 1,500 coordinated steps |
| Model ID | kimi-k2.5 (Moonshot API) / moonshotai/kimi-k2.5 (OpenRouter, NVIDIA NIM) |

Benchmark Performance

| Benchmark | Kimi K2.5 | Claude Opus 4.6 | GPT-5.3 Codex | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| MMLU-Pro (knowledge/reasoning) | 87.1 | 85.8 | 86.2 | 90.1 |
| GPQA Diamond (PhD-level science) | 87.6 | 91.3 | 93.2 | 94.3 |
| AIME 2025 (competition math) | 96.1 | 87.2 | 88.5 | 91.0 |
| HMMT Feb 2025 (math olympiad) | 95.4 | - | - | 93.8 |
| SWE-bench Verified (GitHub issues) | 76.8% | 80.8% | 80.0% | 76.2% |
| LiveCodeBench v6 (coding) | 85.0 | 78.5 | 79.2 | 81.0 |
| BrowseComp (web research) | 78.4* | 84.0 | 77.9 | 59.2 |
| OSWorld-Verified (computer use) | 63.3 | - | - | - |
| WebArena (web agent) | 58.9 | - | - | - |
| MMMU-Pro (multimodal reasoning) | 78.5 | - | - | - |
| OCRBench (text recognition) | 92.3 | - | - | - |
| Terminal Bench 2.0 (agentic coding) | 50.8 | - | 77.3 | - |

*Agent Swarm mode; single-agent score is 60.6%

K2.5's benchmark profile has two distinct peaks. On competition math (AIME 96.1%, HMMT 95.4%) and coding (LiveCodeBench 85.0%), it leads the field outright - including the proprietary models. On GPQA Diamond (87.6%) and MMLU-Pro (87.1%), it competes with but does not quite match the top proprietary offerings. The vision benchmarks are strong: MMMU-Pro 78.5% and OCRBench 92.3% demonstrate that the MoonViT-3D encoder is not a token add-on but a genuinely capable vision system.

The Agent Swarm scores are the most interesting story. The 17.8-point jump from 60.6% to 78.4% on BrowseComp shows that PARL training produces real gains in multi-agent coordination. This is not prompt engineering or tool scaffolding - it is a fundamental shift in how the model approaches complex tasks when given the ability to parallelize.

Key Capabilities

MoonViT-3D. The vision encoder is a 400M parameter model continually pre-trained from SigLIP on image-text and video-text pairs. It uses the NaViT patch packing strategy to process images at their native resolution without resizing, and extends to video through a 3D spatial-temporal compression mechanism that groups four consecutive frames and temporally averages at the patch level. This means K2.5 can process videos up to 4x longer than competitors within the same context window. The complete weight sharing between image and video encoders keeps the architecture clean - there is no separate video model.
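The 4x temporal compression described above can be sketched in a few lines. This is an illustrative reconstruction based only on the description (group four consecutive frames, average at the patch level), not Moonshot's actual implementation; the frame and embedding shapes are toy-sized assumptions.

```python
def compress_temporal(frames: list[list[float]], group: int = 4) -> list[list[float]]:
    """frames: one patch-embedding vector per frame; returns one vector per group of 4."""
    out = []
    # Walk the frame sequence in groups of `group` consecutive frames,
    # dropping any incomplete trailing group.
    for i in range(0, len(frames) - len(frames) % group, group):
        chunk = frames[i:i + group]
        # Temporal average at the patch level: mean each embedding dimension
        # across the four frames in the group.
        out.append([sum(vals) / group for vals in zip(*chunk)])
    return out

# 8 frames with 2-dim "patch embeddings" -> 2 compressed tokens (4x fewer).
frames = [[float(t), float(t) * 2] for t in range(8)]
print(compress_temporal(frames))
```

Averaging rather than concatenating is what keeps the token count down, which is why K2.5 can fit videos roughly 4x longer into the same 256K context window.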

Agent Swarm via PARL. Parallel-Agent Reinforcement Learning is the training methodology that makes K2.5's multi-agent coordination work. A trainable orchestrator learns to decompose tasks into parallelizable subtasks, each executed by dynamically instantiated frozen sub-agents. The key innovation is that parallelism itself is a learned skill - the orchestrator is rewarded for creating sub-agents, successfully completing sub-tasks, and overall task performance. Staged reward shaping encourages parallelism early in training and gradually shifts focus toward task success. In practice, this produces an 80% reduction in end-to-end runtime on complex tasks.
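The staged reward shaping can be made concrete with a toy schedule. The linear interpolation and the specific weights below are illustrative assumptions, not Moonshot's published recipe; the only sourced idea is that the reward emphasizes parallelism early and task success late.

```python
def parl_reward(parallelism: float, task_success: float,
                step: int, total_steps: int) -> float:
    """parallelism, task_success in [0, 1]; step indexes training progress."""
    progress = min(step / total_steps, 1.0)
    w_parallel = 1.0 - progress   # parallelism bonus decays over training
    w_success = progress          # task-success weight grows over training
    return w_parallel * parallelism + w_success * task_success

# Early on, spawning many sub-agents is rewarded even if the task fails...
early = parl_reward(parallelism=0.9, task_success=0.0, step=0, total_steps=1000)
# ...while at the end of training only task success matters.
late = parl_reward(parallelism=0.9, task_success=0.0, step=1000, total_steps=1000)
print(early, late)
```

The point of the schedule is exploration: without the early parallelism bonus, an orchestrator that defaults to sequential execution never discovers that decomposition pays off.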

Thinking and Instant Modes. K2.5 ships with two inference modes. Thinking mode (recommended temperature 1.0) activates extended reasoning chains for complex problems. Instant mode (recommended temperature 0.6) provides faster responses for simpler tasks. Both modes are accessible through the same API endpoint with different configuration parameters.
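A minimal sketch of selecting a mode via the OpenAI-compatible API. The model ID and recommended temperatures come from this page; the helper function, its name, and the endpoint path in the comment are assumptions for illustration.

```python
import json

def build_request(prompt: str, mode: str = "thinking") -> dict:
    """Build a chat-completion payload for Kimi K2.5 in thinking or instant mode."""
    if mode not in ("thinking", "instant"):
        raise ValueError(f"unknown mode: {mode}")
    return {
        "model": "kimi-k2.5",  # Moonshot API model ID
        "messages": [{"role": "user", "content": prompt}],
        # Recommended temperatures: 1.0 (thinking), 0.6 (instant).
        "temperature": 1.0 if mode == "thinking" else 0.6,
    }

# The payload can be POSTed to any OpenAI-compatible chat-completions
# endpoint (exact Moonshot endpoint path not specified here).
payload = build_request("Prove that sqrt(2) is irrational.", mode="thinking")
print(json.dumps(payload, indent=2))
```

Since both modes share one endpoint, switching is a single parameter change rather than a different model deployment.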

Open Weights and Broad API Access. The model weights are available on HuggingFace under a modified MIT license. API access is available through Moonshot's own platform, NVIDIA NIM, OpenRouter, Together AI, and other providers. The API is OpenAI/Anthropic-compatible, making integration straightforward for existing codebases.

Pricing and Availability

| Provider | Input | Output |
| --- | --- | --- |
| Moonshot API | $0.60/M tokens | $3.00/M tokens |
| OpenRouter | $0.45/M tokens | $2.20/M tokens |
| NVIDIA NIM | NVIDIA pricing | NVIDIA pricing |
| Together AI | Together pricing | Together pricing |
| Self-hosted | Free (Modified MIT) | Free (Modified MIT) |

K2.5 sits in the mid-range of frontier model pricing. At $0.60/$3.00 on the Moonshot API, it is significantly cheaper than Claude Opus 4.6 ($5.00/$25.00) and GPT-5.3 Codex ($3.50/$28.00), roughly comparable to Gemini 3.1 Pro ($2.00/$12.00) on input but cheaper on output, and more expensive than DeepSeek V3.2 ($0.28/$0.42) and the smaller open-weight models.
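The price gaps compound quickly at volume. A quick comparison for a hypothetical workload of 10M input and 2M output tokens, using the per-million prices quoted above:

```python
# (input $/M tokens, output $/M tokens), from the prices quoted in this article
prices = {
    "Kimi K2.5 (Moonshot)":   (0.60, 3.00),
    "Kimi K2.5 (OpenRouter)": (0.45, 2.20),
    "Claude Opus 4.6":        (5.00, 25.00),
    "GPT-5.3 Codex":          (3.50, 28.00),
    "Gemini 3.1 Pro":         (2.00, 12.00),
    "DeepSeek V3.2":          (0.28, 0.42),
}

def workload_cost(input_m: float, output_m: float,
                  in_price: float, out_price: float) -> float:
    """Total cost in dollars for input_m / output_m millions of tokens."""
    return input_m * in_price + output_m * out_price

for name, (inp, outp) in prices.items():
    print(f"{name:24s} ${workload_cost(10, 2, inp, outp):8.2f}")
```

On this workload K2.5 via Moonshot lands at $12, versus $100 for Claude Opus 4.6 and $91 for GPT-5.3 Codex, while DeepSeek V3.2 stays under $4.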

The self-hosting option is where the economics get interesting for large-scale deployments. At 1T total parameters, K2.5 requires significant GPU infrastructure - you are looking at multiple A100 or H100 GPUs with FP8 quantization. This is substantially more demanding than Qwen3.5-122B-A10B (60-70 GB at FP8) but comparable to DeepSeek V3.2 (685B total) and Mistral Large 3 (675B total). For teams already running multi-GPU inference infrastructure, the modified MIT license makes K2.5 a viable option for unlimited on-premises deployment.
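The GPU count follows from simple arithmetic. A back-of-envelope estimate, assuming 1 byte per parameter at FP8 and counting weights only (KV cache, activations, and framework overhead push real requirements higher):

```python
import math

def weight_memory_gb(total_params: float, bytes_per_param: float = 1.0) -> float:
    """Approximate weight memory in GB (FP8 = 1 byte/param by default)."""
    return total_params * bytes_per_param / 1e9

def min_gpus(total_params: float, gpu_mem_gb: float = 80.0) -> int:
    """Lower bound on GPUs needed to hold the weights alone."""
    return math.ceil(weight_memory_gb(total_params) / gpu_mem_gb)

# Kimi K2.5 at FP8: ~1 TB of weights, so at least 13 x 80 GB (A100/H100)
# cards before any KV cache or serving overhead.
print(weight_memory_gb(1e12))  # 1000.0
print(min_gpus(1e12))          # 13
```

In practice a serving stack needs headroom well beyond this floor, which is why K2.5 self-hosting only makes sense for teams already operating multi-GPU inference clusters.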

Strengths

  • Best-in-class open-weight competition math (AIME 96.1%, HMMT 95.4%) - beats most proprietary models
  • Agent Swarm via PARL is a genuine architectural innovation - 80% runtime reduction with learned parallelism
  • Native multimodal with MoonViT-3D - vision is not an afterthought, it is baked into the architecture
  • Strong coding: SWE-bench 76.8% and LiveCodeBench 85.0% compete with the proprietary frontier
  • Modified MIT license with broad commercial permissions
  • Broad API availability - Moonshot, OpenRouter, NVIDIA NIM, Together AI
  • Video understanding with 4x temporal compression enables longer video processing

Weaknesses

  • Single-agent agentic performance (BrowseComp 60.6%) drops significantly without Agent Swarm
  • Terminal Bench 2.0 at 50.8% trails GPT-5.3 Codex by 26.5 points on agentic coding
  • GPQA Diamond 87.6% is strong but trails the top proprietary models by 4-7 points
  • 1T total parameters makes self-hosting demanding - multi-GPU infrastructure required
  • Modified MIT license has some restrictions compared to pure MIT or Apache 2.0
  • Agent Swarm deployment complexity - running 100 sub-agents requires orchestration infrastructure
  • Context window at 256K is shorter than Gemini 3.1 Pro (1M) and Claude Opus 4.6 (1M beta)
