Kimi K2.5 vs GLM-4.7-Flash: China's Open-Source Heavyweight vs the Free API Underdog
Comparing two Chinese AI models with MIT-family licenses - Moonshot AI's trillion-parameter Kimi K2.5 against Zhipu AI's ultra-efficient GLM-4.7-Flash that punches well above its weight on coding and agentic tasks.

China's open-source AI ecosystem produced two models that share an MIT-family license and a commitment to accessibility but approach the problem from opposite ends of the scale spectrum. Kimi K2.5 is Moonshot AI's trillion-parameter flagship - 384 experts, 32 billion active parameters per token, 256K context, and benchmark scores that rival the best proprietary models. GLM-4.7-Flash is Zhipu AI's efficiency play - 30 billion total parameters with roughly 3 billion active, a free API, and SWE-bench results that have no business being as strong as they are for a model this small.
The 33x parameter gap between them tells an obvious story on raw benchmarks: K2.5 wins everywhere by large margins. But GLM-4.7-Flash's SWE-bench Verified score of 59.2% and tau2-Bench score of 79.5% reveal a model that was engineered specifically for agentic coding tasks, and at its price point - which is free - it is doing something remarkable. This is a comparison between China's best open model and China's best free model, and the distance between them is smaller than you might expect on the tasks that matter most for practical AI development.
TL;DR
- Choose Kimi K2.5 if you need top-tier performance on math, science, vision, coding, and complex multi-agent workflows. You are willing to pay for API access or invest in cluster infrastructure for the best open-weight model available.
- Choose GLM-4.7-Flash if you want a free, fast, MIT-licensed model that delivers surprisingly strong coding and agentic performance on consumer hardware. Budget-constrained teams and individual developers get real value here.
Quick Comparison
| Feature | Kimi K2.5 | GLM-4.7-Flash |
|---|---|---|
| Developer | Moonshot AI | Z.AI (Zhipu AI) |
| Architecture | MoE (384 experts, 8 active) | MoE (~30B total, ~3B active) |
| Total Parameters | 1T | ~30B |
| Active Parameters | 32B | ~3B |
| License | Modified MIT | MIT |
| Context Window | 256K | 128K |
| API Pricing (Input) | $0.60/1M tokens | Free (Z.AI) |
| API Pricing (Output) | $3.00/1M tokens | Free (Z.AI) |
| SWE-bench Verified | 76.8% | 59.2% |
| GPQA Diamond | 87.6 | Not published |
| MMLU-Pro | 87.1 | Not published |
| tau2-Bench | Not published | 79.5% |
| Self-host Feasibility | Very Low | Very High (single RTX 4090) |
Kimi K2.5: The Open-Weight Ceiling
Kimi K2.5 sets the ceiling for what open-weight models can achieve in early 2026. The architecture is built for coverage - 384 specialized experts across 61 layers, with 8 experts activating per token, trained using PARL (a reinforcement learning method for learning expert routing policies). This is not just scale for the sake of scale. The expert diversity means K2.5 has dedicated parameter subsets for mathematical reasoning, code generation, scientific analysis, vision processing, and agentic planning.
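The top-k routing this architecture relies on can be sketched in a few lines. This is an illustrative sketch of generic MoE softmax gating, not Moonshot's PARL-trained router, whose implementation has not been published:

```python
import math
import random

def route_token(logits, k=8):
    """Pick the top-k experts for one token via a softmax gate.

    logits: per-expert router scores for this token (len = num_experts)
    Returns the k chosen expert indices and their normalized mixing weights.
    """
    top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top_k)
    exps = [math.exp(logits[i] - m) for i in top_k]   # numerically stable softmax
    total = sum(exps)
    return top_k, [e / total for e in exps]           # weights sum to 1

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(384)]     # one token's 384 router scores
experts, weights = route_token(logits, k=8)
print(len(experts), round(sum(weights), 6))           # 8 1.0
```

The token's output is then the weighted sum of the 8 selected experts' outputs, which is how a 1T-parameter model keeps per-token compute at the 32B-active level.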
The math and science benchmarks confirm this breadth. AIME 2025 at 96.1 is near-perfect on competition mathematics. HMMT at 95.4 shows similar strength on harder olympiad-style problems. GPQA Diamond at 87.6 tests graduate-level science across physics, chemistry, and biology. These scores are competitive with GPT-5 and Claude Opus 4, which cost significantly more per token. For context on how K2.5 ranks against these proprietary models, see our math olympiad leaderboard.
On coding, K2.5 scores 76.8% on SWE-bench Verified - resolving more than three-quarters of real GitHub issues from actual open-source repositories. LiveCodeBench v6 at 85.0 confirms strong competitive programming performance. Terminal Bench 2.0 at 50.8 shows it can operate in terminal environments for system administration and DevOps tasks. These are not isolated benchmark wins - they represent a consistent pattern of capability across the full software development lifecycle.
The Agent Swarm feature adds another dimension. K2.5 can orchestrate up to 100 sub-agents simultaneously, each specializing in different aspects of a complex task. BrowseComp at 78.4% in swarm mode versus 60.6% in single-model mode shows how much this multi-agent architecture contributes to real-world capability. See our guide to building AI agents for more on how these orchestration patterns work.
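Moonshot has not published the Agent Swarm internals, but the fan-out/fan-in pattern a 100-sub-agent swarm implies looks roughly like the sketch below; `sub_agent` is a hypothetical stand-in for a real K2.5 API call:

```python
import asyncio

async def sub_agent(task: str) -> str:
    # Placeholder: a real swarm member would call the K2.5 API here
    # with its own specialized prompt for this subtask.
    await asyncio.sleep(0)          # simulate I/O-bound inference
    return f"result for {task}"

async def orchestrate(tasks: list[str], max_agents: int = 100) -> list[str]:
    # Cap concurrency the way a 100-sub-agent swarm would.
    sem = asyncio.Semaphore(max_agents)

    async def bounded(task: str) -> str:
        async with sem:
            return await sub_agent(task)

    # Fan out all subtasks, then fan results back in, preserving order.
    return await asyncio.gather(*(bounded(t) for t in tasks))

results = asyncio.run(orchestrate([f"subtask-{i}" for i in range(5)]))
print(results[0])  # result for subtask-0
```

The hard part in practice is not the concurrency but the decomposition and merge steps, which is where a strong orchestrating model earns its benchmark lead.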
GLM-4.7-Flash: Free and Surprisingly Capable
GLM-4.7-Flash is Zhipu AI's statement that useful AI does not need to be expensive. The free API through Z.AI removes the primary barrier to AI adoption for individual developers, students, researchers, and startups in markets where even $0.60 per million tokens represents a meaningful cost. The MIT license means you can also download the weights and self-host on a single RTX 4090: quantized to 4-bit, the roughly 30 billion total parameters occupy about 15 GB, fitting comfortably in 24 GB of VRAM, while the ~3 billion active parameters per token keep inference fast.
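One caveat worth making explicit: weight memory for an MoE model scales with total parameters, since every expert must be resident in VRAM, so quantization is what makes a single 24 GB card workable. A back-of-envelope sketch, ignoring KV cache and activation memory:

```python
def weight_vram_gb(total_params_b: float, bits_per_param: int) -> float:
    """Approximate GB needed just to hold the model weights."""
    return total_params_b * 1e9 * bits_per_param / 8 / 1e9

for bits, name in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{name}: {weight_vram_gb(30, bits):.0f} GB")
# FP16: 60 GB  -> needs multiple GPUs
# INT8: 30 GB  -> still over a 4090's 24 GB
# INT4: 15 GB  -> fits a single RTX 4090 with room for KV cache
```

The active-parameter count governs per-token compute, not weight storage, which is why a 30B-total model is a single-GPU proposition only after quantization.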
The standout metric is SWE-bench Verified at 59.2%. To put this in context, that score means GLM-4.7-Flash resolves nearly 60% of real-world GitHub issues - real bugs, real feature requests, real code in production repositories. This is achieved with approximately 3 billion active parameters. K2.5 with 32 billion active parameters scores 76.8%. The delta is 17.6 points, but GLM is accomplishing this with roughly 10.7x fewer active parameters per token. Per-parameter efficiency on coding tasks is genuinely impressive.
The tau2-Bench score of 79.5% adds another data point. This benchmark measures agentic task completion - the ability to follow multi-step instructions, use tools, and complete complex workflows. A 79.5% score from a 3B-active model suggests that Zhipu AI specifically optimized for agentic capability during training, likely through targeted instruction tuning and tool-use data.
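The tool-use loop such benchmarks exercise reduces to a simple dispatch cycle: the model emits a structured call, the harness executes it, and the result is fed back as an observation. A minimal sketch with hypothetical tools (not the actual tau2-Bench harness):

```python
import json

# Hypothetical tool registry; tau2-Bench-style tasks chain calls like these.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "send_refund": lambda order_id, amount: {"order_id": order_id, "refunded": amount},
}

def run_agent_step(model_reply: str) -> str:
    """Dispatch one tool call the model emitted as JSON."""
    call = json.loads(model_reply)
    result = TOOLS[call["tool"]](**call["args"])
    return json.dumps(result)   # fed back to the model as an observation

obs = run_agent_step('{"tool": "lookup_order", "args": {"order_id": "A1"}}')
print(obs)  # {"order_id": "A1", "status": "shipped"}
```

A high tau2-Bench score means the model reliably produces valid calls like these across many steps without losing track of the overall goal.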
The 128K context window is half of K2.5's 256K but sufficient for most practical applications. You can process long codebases, multi-file pull requests, and extended conversations without hitting limits. For most coding-focused workflows, 128K is more than enough. Our how to choose an LLM guide provides more guidance on matching context requirements to models.
Benchmark Comparison
| Benchmark | Kimi K2.5 | GLM-4.7-Flash | Delta |
|---|---|---|---|
| SWE-bench Verified | 76.8% | 59.2% | K2.5 +17.6 |
| GPQA Diamond | 87.6 | Not published | K2.5 by default |
| MMLU-Pro | 87.1 | Not published | K2.5 by default |
| AIME 2025 | 96.1 | Not published | K2.5 by default |
| LiveCodeBench v6 | 85.0 | Not published | K2.5 by default |
| tau2-Bench | Not published | 79.5% | GLM by default |
| BrowseComp | 78.4% (swarm) | Not published | K2.5 by default |
| Terminal Bench 2.0 | 50.8 | Not published | K2.5 by default |
| Context Window | 256K | 128K | K2.5 (2x longer) |
| Active Params | 32B | ~3B | K2.5 (10.7x more) |
| Total Params | 1T | ~30B | K2.5 (33x more) |
The benchmark table has a lot of "not published" entries for GLM-4.7-Flash, which itself tells a story. Zhipu AI focused their evaluation on the specific capabilities they optimized for - coding and agentic tasks - rather than publishing across the full benchmark suite. This is common with smaller, specialized models and reflects an honest assessment of where the model excels versus where it would not be competitive.
The SWE-bench comparison is the most meaningful direct comparison available. K2.5's 17.6-point lead is significant in absolute terms - roughly one in six additional issues resolved. But GLM achieves its 59.2% with 10.7x fewer active parameters. If you normalize by compute, GLM is extracting far more SWE-bench performance per FLOP. For teams that need good-enough coding assistance rather than the absolute best, GLM's efficiency-adjusted performance is remarkable. For a comprehensive view of how coding models rank, see our coding benchmarks leaderboard.
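That normalization is easy to make concrete, using active parameters as a rough proxy for per-token compute:

```python
def score_per_active_b(score: float, active_params_b: float) -> float:
    """SWE-bench points per billion active parameters (rough efficiency proxy)."""
    return score / active_params_b

k25 = score_per_active_b(76.8, 32)   # ~2.4 points per active-B
glm = score_per_active_b(59.2, 3)    # ~19.7 points per active-B
print(f"K2.5: {k25:.1f}, GLM: {glm:.1f}, ratio: {glm / k25:.1f}x")
# K2.5: 2.4, GLM: 19.7, ratio: 8.2x
```

This is a crude proxy - it ignores training compute, data quality, and the nonlinear difficulty of the last benchmark points - but an 8x efficiency gap is hard to explain away.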
Pricing Analysis
| Cost Factor | Kimi K2.5 | GLM-4.7-Flash |
|---|---|---|
| API Input (per 1M tokens) | $0.60 | Free |
| API Output (per 1M tokens) | $3.00 | Free |
| Self-host VRAM | Multi-node cluster | ~15 GB (INT4 quantized) |
| Self-host Hardware | Enterprise GPU cluster | Single RTX 4090 or equivalent |
| License | Modified MIT | MIT |
| Marginal Inference Cost | $0.60-$3.00/1M tokens | Zero |
This is the starkest pricing contrast in any comparison I have written. GLM-4.7-Flash's free API from Z.AI means your inference budget is literally zero. Process a billion tokens and you have paid nothing. Self-host it and your only cost is a single GPU and electricity. K2.5's API runs $0.60 input and $3.00 output per million tokens - reasonable for frontier quality but infinitely more expensive than free.
For a team processing 10 million input and 10 million output tokens daily, K2.5 costs $36 per day ($6 for input, $30 for output), or about $1,080 per month. GLM costs nothing. Over a year, that is more than $13,000 saved - enough to buy the RTX 4090 you would use to self-host GLM several times over. The question is whether the 17.6-point SWE-bench gap and the broader capability advantages of K2.5 justify that spending. For many coding-focused applications, they might not. For a broader comparison of model economics, see our cost efficiency leaderboard.
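A small calculator makes the daily arithmetic reproducible; the traffic split (10M input, 10M output per day) is an assumption, and real workloads skew heavily toward input:

```python
def daily_cost(input_m: float, output_m: float,
               in_price: float, out_price: float) -> float:
    """USD cost for one day's traffic; prices are per 1M tokens."""
    return input_m * in_price + output_m * out_price

k25 = daily_cost(10, 10, in_price=0.60, out_price=3.00)
glm = daily_cost(10, 10, in_price=0.0, out_price=0.0)
print(f"K2.5: ${k25:.2f}/day, ${k25 * 365:,.0f}/year; GLM: ${glm:.2f}")
# K2.5: $36.00/day, $13,140/year; GLM: $0.00
```

Because output tokens cost 5x input tokens on K2.5, output-heavy workloads (code generation, long explanations) widen the gap further.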
Kimi K2.5: Pros and Cons
Pros:
- SWE-bench Verified 76.8% - among the best in any weight class
- AIME 2025 96.1 and GPQA Diamond 87.6 - frontier reasoning
- MoonViT-3D vision with native-resolution image and video input
- Agent Swarm with up to 100 sub-agents for complex orchestration
- 256K context window handles long documents and codebases
- Terminal Bench 2.0 shows system administration capability
- Modified MIT license for commercial use
Cons:
- $3.00/1M output tokens versus free for GLM
- 1T parameters requires enterprise cluster infrastructure
- Not self-hostable for individuals or small teams
- API limited to Moonshot's infrastructure
- Overkill for straightforward coding assistance tasks
GLM-4.7-Flash: Pros and Cons
Pros:
- Completely free API through Z.AI - zero cost at any volume
- SWE-bench Verified 59.2% from only ~3B active parameters
- tau2-Bench 79.5% demonstrates strong agentic task completion
- Runs on a single RTX 4090 for self-hosting
- MIT license - fully permissive, no restrictions
- 128K context window covers most practical use cases
- Ideal for budget-constrained developers and startups
Cons:
- No published scores on GPQA, MMLU-Pro, AIME, or other broad benchmarks
- No vision or multimodal capability
- 33x smaller total parameter count limits knowledge breadth
- No multi-agent orchestration features
- Smaller international community and ecosystem
- Free API may have rate limits or availability constraints
Verdict
Choose Kimi K2.5 if you are building production systems where AI quality directly impacts outcomes and you can justify the cost. Complex multi-step coding workflows, scientific reasoning, multimodal document processing, and agentic orchestration all benefit from K2.5's breadth and depth. The 17.6-point SWE-bench lead matters when you need the highest possible resolution rate on real codebases. See our open-source LLM leaderboard for the full ranking.
Choose GLM-4.7-Flash if you need a capable coding and agentic model at zero cost. Individual developers, students, open-source projects, startups pre-revenue, and teams in cost-sensitive markets all benefit from a model that delivers 59.2% SWE-bench performance for free. The tau2-Bench score of 79.5% means it handles multi-step agentic tasks well enough for most development workflows. For practical guidance on setting up a free AI coding environment, see our free AI coding setup guide.
The bottom line: These two models represent opposite strategies for making AI accessible. K2.5 does it through raw capability - a model so capable it competes with proprietary offerings at a fraction of their price. GLM-4.7-Flash does it through extreme efficiency and zero cost - a model that gives every developer access to real coding assistance without a credit card. Both approaches are valid, and the best choice depends on whether your constraint is quality or budget.
