Qwen3.5 MoE vs Kimi K2.5 for Coding - Price Breakdown
Kimi K2.5 leads every coding benchmark, but Qwen3.5-35B-A3B delivers 87-90% of that performance on the headline benchmarks at roughly a third of the API input price - and runs on a single consumer GPU. Here is the full breakdown.

Three open-weight MoE models, one question: which gives you the best coding quality per dollar? Kimi K2.5 wins every benchmark it reports. Qwen3.5-122B-A10B splits the difference. Qwen3.5-35B-A3B runs on a single gaming GPU and costs roughly a third as much as K2.5 on input tokens - while scoring within about 10 points of it on SWE-bench, LiveCodeBench, and Terminal Bench.
The answer to "which is better for coding" depends almost entirely on which side of that cost-performance tradeoff you sit on.
TL;DR
- Kimi K2.5 wins all coding benchmarks outright: LiveCodeBench 85.0, SWE-bench Verified 76.8%, OJBench 57.4% - but costs $0.45-$0.60/M input and needs datacenter hardware to self-host
- Qwen3.5-122B-A10B is the middle option: LiveCodeBench 78.9, SWE-bench 72.0%, Codeforces 2100 - at $0.26/M input on OpenRouter, self-hostable on an A100 80GB at Q4
- Qwen3.5-35B-A3B is the cost-efficiency winner: LiveCodeBench 74.6, SWE-bench 69.2% - at $0.163/M input, runs on a single RTX 3090/4090 at Q4
- For high-volume coding pipelines where self-hosting is feasible, Qwen3.5-35B-A3B's marginal cost drops to electricity
Quick Comparison
| | Kimi K2.5 | Qwen3.5-122B-A10B | Qwen3.5-35B-A3B |
|---|---|---|---|
| Total / Active Params | 1T / 32B | 122B / 10B | 35B / 3B |
| SWE-bench Verified | 76.8% | 72.0% | 69.2% |
| LiveCodeBench v6 | 85.0 | 78.9 | 74.6 |
| Terminal Bench 2.0 | 50.8 | 49.4 | 40.5 |
| OJBench (C++) | 57.4 | 39.5 | 36.0 |
| Codeforces Rating | - | 2100 | 2028 |
| FullStackBench (en) | - | 62.6 | 58.1 |
| Context Window | 256K | 262K (1M ext.) | 262K (1M ext.) |
| API Input (best rate) | $0.45/M | $0.26/M | $0.163/M |
| API Output (best rate) | $2.20/M | $2.08/M | $1.30/M |
| Self-host VRAM (Q4) | ~315+ GB | ~70-95 GB | ~22-25 GB |
| Self-host GPU | 4x H200 | A100 80GB | RTX 3090/4090 |
| License | Modified MIT | Apache 2.0 | Apache 2.0 |
Coding Benchmarks
SWE-bench Verified: Real GitHub Issues
SWE-bench Verified is the most operationally relevant coding benchmark - it measures a model's ability to resolve real GitHub issues across production codebases.
Kimi K2.5 leads at 76.8%, ahead of Qwen3.5-122B-A10B at 72.0% and Qwen3.5-35B-A3B at 69.2%. That 7.6-point spread between the top and bottom is meaningful but not decisive - all three models are competitive with proprietary frontier offerings.
Worth noting: the dense Qwen3.5-27B actually matches the 122B-A10B exactly at 72.0% on SWE-bench, suggesting the MoE architecture doesn't give the 122B an edge on this particular benchmark over a smaller dense sibling.
LiveCodeBench v6: Algorithmic Problem Solving
LiveCodeBench v6 uses fresh competition problems, making it resistant to training contamination. K2.5 leads strongly at 85.0, with the 122B-A10B at 78.9 and the 35B-A3B at 74.6.
The 10.4-point gap between K2.5 and the 35B-A3B is one of the widest spreads in the benchmark table - only OJBench shows a larger one. For workloads centered on algorithmic problem solving - competitive programming assistants, code generation for data processing pipelines, or interview prep tools - K2.5's advantage here is real.
OJBench: Competitive Programming
OJBench assesses competitive programming across multiple languages. K2.5's 57.4% on C++ stands well above the Qwen models at 39.5% (122B) and 36.0% (35B-A3B). This is K2.5's clearest domain win - if competitive-grade algorithmic coding matters to your use case, it's the model to reach for.
FullStackBench: Real-World Multi-Language Coding
FullStackBench covers real-world coding tasks across multiple programming languages. Here the Qwen models publish scores while K2.5 doesn't. The 122B-A10B leads at 62.6% (en), ahead of the 35B-A3B at 58.1%. On multilingual tasks (zh), scores are 58.7% and 55.0% respectively. For teams building coding tools that need to handle multiple languages reliably, this benchmark favors the 122B-A10B.
Codeforces: Competitive Ratings
The 122B-A10B holds a Codeforces rating of 2100 against the 35B-A3B's 2028. Both are strong - a 2100 rating is in the top few percent of competitive programmers globally. K2.5 doesn't publish a Codeforces rating.
Benchmark Summary
| Benchmark | K2.5 | 122B-A10B | 35B-A3B | K2.5 gap vs 35B |
|---|---|---|---|---|
| SWE-bench Verified | 76.8% | 72.0% | 69.2% | +7.6 pts |
| LiveCodeBench v6 | 85.0 | 78.9 | 74.6 | +10.4 pts |
| Terminal Bench 2.0 | 50.8 | 49.4 | 40.5 | +10.3 pts |
| OJBench (C++) | 57.4 | 39.5 | 36.0 | +21.4 pts |
| Codeforces | - | 2100 | 2028 | - |
| FullStackBench (en) | - | 62.6 | 58.1 | - |
K2.5 leads everywhere it reports. The margin is smallest on SWE-bench (+7.6 pts) and widest on OJBench (+21.4 pts). On FullStackBench and Codeforces, only the Qwen models publish scores - K2.5's absence there is a reporting gap, not evidence it would lose.
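For readers who prefer ratios to point gaps, here is the quick arithmetic behind the "87-90% of K2.5" framing used in this post - scores taken straight from the summary table above:

```python
# Relative performance of the 35B-A3B against K2.5, from the summary table.
scores = {  # benchmark: (K2.5, 35B-A3B)
    "SWE-bench Verified": (76.8, 69.2),
    "LiveCodeBench v6": (85.0, 74.6),
    "Terminal Bench 2.0": (50.8, 40.5),
    "OJBench (C++)": (57.4, 36.0),
}
for bench, (k25, qwen) in scores.items():
    print(f"{bench}: 35B-A3B at {qwen / k25:.0%} of K2.5")
# SWE-bench Verified: 35B-A3B at 90% of K2.5
# LiveCodeBench v6: 35B-A3B at 88% of K2.5
# Terminal Bench 2.0: 35B-A3B at 80% of K2.5
# OJBench (C++): 35B-A3B at 63% of K2.5
```

The headline claim holds on SWE-bench and LiveCodeBench; Terminal Bench and especially OJBench are where the gap widens.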
Pricing Analysis
All three models are available via API. Here is the current pricing across major providers:
API Pricing
| Provider | Kimi K2.5 In | Kimi K2.5 Out | Qwen 122B In | Qwen 122B Out | Qwen 35B In | Qwen 35B Out |
|---|---|---|---|---|---|---|
| OpenRouter | $0.45 | $2.20 | $0.26 | $2.08 | $0.163 | $1.30 |
| DeepInfra | $0.45 | $2.25 | - | - | - | - |
| Alibaba Cloud | - | - | $0.40* | $1.20-$2.40* | - | - |
| Moonshot API | $0.60 | $3.00 | - | - | - | - |
*DashScope "qwen3.5-plus" endpoint; output varies by thinking mode ($1.20 standard / $2.40 thinking)
Cost Per Coding Task
For a typical agentic coding session (8K input, 4K output tokens):
| Model | Session Cost | Relative Cost |
|---|---|---|
| Qwen3.5-35B-A3B (OpenRouter) | ~$0.0065 | 1x baseline |
| Qwen3.5-122B-A10B (OpenRouter) | ~$0.010 | 1.6x |
| Kimi K2.5 (DeepInfra) | ~$0.013 | 1.9x |
| Kimi K2.5 (Moonshot) | ~$0.017 | 2.6x |
At low-volume use, the per-session difference is a fraction of a cent - negligible. At scale (100K sessions/month), the 35B-A3B costs roughly $650 vs K2.5's $1,250-$1,300 at best-rate providers. That is a ~$600/month difference for about 88% of the LiveCodeBench performance.
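These session costs are straightforward to reproduce - a small sketch using the per-million-token rates from the pricing table (8K input, 4K output per session):

```python
# Per-session cost: 8K input + 4K output tokens at each provider's
# per-million-token rates from the table above.
SESSION_IN, SESSION_OUT = 8_000, 4_000

rates = {  # provider: (input $/M, output $/M)
    "Qwen3.5-35B-A3B (OpenRouter)": (0.163, 1.30),
    "Qwen3.5-122B-A10B (OpenRouter)": (0.26, 2.08),
    "Kimi K2.5 (DeepInfra)": (0.45, 2.25),
    "Kimi K2.5 (Moonshot)": (0.60, 3.00),
}

baseline = None
for name, (rate_in, rate_out) in rates.items():
    cost = SESSION_IN / 1e6 * rate_in + SESSION_OUT / 1e6 * rate_out
    baseline = baseline if baseline is not None else cost  # cheapest listed first
    print(f"{name}: ${cost:.4f}/session ({cost / baseline:.1f}x)")
# Qwen3.5-35B-A3B: $0.0065 (1.0x) ... Kimi K2.5 (Moonshot): $0.0168 (2.6x)
```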
Thinking Mode Costs
All three models have thinking/non-thinking modes. Thinking mode generates a reasoning trace before the answer - it improves correctness on complex coding problems but increases token usage notably.
- Qwen3.5 models: thinking enabled by default (`enable_thinking: True`). Recommended settings for complex coding tasks: temperature 0.6, top_p 0.95. Budget 2-4x the tokens versus non-thinking mode.
- Kimi K2.5: thinking mode at temperature 1.0; instant mode at 0.6. The Moonshot API prices thinking and instant output identically at $3.00/M - some third-party providers differentiate.
For high-accuracy code generation where correctness matters more than latency, thinking mode is worth the additional token cost on all three models. For code completion or simple generation tasks, instant mode is sufficient.
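A minimal sketch of toggling thinking mode on a Qwen3.5 deployment behind an OpenAI-compatible API. The `enable_thinking` flag comes from the Qwen documentation, but the endpoint URL and model ID below are placeholders, and how the flag is transported varies by provider (some deployments expect it nested under `chat_template_kwargs`) - check your provider's docs:

```python
# Placeholder endpoint and model ID - substitute your provider's values.
# Some deployments expect the flag as
# extra_body={"chat_template_kwargs": {"enable_thinking": False}} instead.
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="sk-...")

resp = client.chat.completions.create(
    model="qwen3.5-35b-a3b",  # placeholder model ID
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.6,  # Qwen's recommended coding settings
    top_p=0.95,
    extra_body={"enable_thinking": False},  # False = instant, True = reasoning trace
)
print(resp.choices[0].message.content)
```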
Self-Hosting
The hardware gap between these three models is dramatic - and it changes the economics completely for teams that can provision their own infrastructure.
Hardware Requirements
| Model | Quantization | VRAM | Single GPU Option | Gen Speed |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | Q4_K_M | ~22-25 GB | RTX 3090 / RTX 4090 | ~111-165 tok/s |
| Qwen3.5-35B-A3B | Q8_0 | ~38-41 GB | A100 40GB | ~60-80 tok/s |
| Qwen3.5-122B-A10B | Q4_K_M | ~70-95 GB | A100 80GB (tight) | ~30-50 tok/s |
| Qwen3.5-122B-A10B | Q8_0 | ~130-155 GB | 2x A100 80GB | ~15-25 tok/s |
| Kimi K2.5 | INT4 | ~315 GB | 4x H200 (141 GB each) | ~40 tok/s |
| Kimi K2.5 | UD-Q2_K_XL | ~375 GB | Not single-GPU | ~10 tok/s |
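As a back-of-envelope sanity check on the Q4 rows above - assuming roughly 4.5 effective bits per weight for Q4_K_M plus a few GB of runtime overhead, both my assumptions rather than published figures:

```python
# Approximate weight memory for a quantized checkpoint.
# Assumption (mine): Q4_K_M averages ~4.5 bits/weight; add ~2-4 GB for
# runtime buffers and short-context KV cache.
def weight_vram_gb(total_params_b: float, bits_per_weight: float) -> float:
    return total_params_b * 1e9 * bits_per_weight / 8 / 1024**3

for name, params_b in [("Qwen3.5-35B-A3B", 35), ("Qwen3.5-122B-A10B", 122)]:
    print(f"{name}: ~{weight_vram_gb(params_b, 4.5):.0f} GB weights at Q4")
# Qwen3.5-35B-A3B: ~18 GB weights -> ~22-25 GB total with overhead
# Qwen3.5-122B-A10B: ~64 GB weights -> consistent with the ~70-95 GB range
```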
A key architectural advantage for Qwen3.5 models: the Gated DeltaNet hybrid means context scaling is unusually cheap. Going from 4K to 262K context costs only ~3 GB of additional VRAM on the 35B-A3B - remarkable for a quarter-million token window.
For Kimi K2.5, the only realistic consumer-hardware path is a Mac with 256 GB unified memory running 2-bit quantization at around 10 tokens per second. That is borderline usable for interactive sessions but not production coding pipelines.
The Self-Hosting Verdict
- Qwen3.5-35B-A3B is the only model in this comparison that a solo developer or small team can self-host on affordable hardware. Once the GPU is paid for, marginal cost is electricity (rough numbers in the sketch after this list).
- Qwen3.5-122B-A10B is realistic for labs or teams with a single A100 80GB or H100.
- Kimi K2.5 requires datacenter infrastructure. Self-hosting only makes sense for organizations running large-scale inference already.
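To put "marginal cost is electricity" in numbers, a rough sketch for the 35B-A3B on an RTX 4090 - the throughput is the mid-range of the table above, while the power draw and electricity price are my assumptions:

```python
# Electricity cost per million output tokens, self-hosted 35B-A3B.
# Assumptions (mine): ~130 tok/s sustained, ~350 W under load, $0.15/kWh.
TOK_PER_SEC = 130
WATTS = 350
USD_PER_KWH = 0.15

tokens_per_kwh = TOK_PER_SEC * 3600 / (WATTS / 1000)
usd_per_m_tokens = 1e6 / tokens_per_kwh * USD_PER_KWH
print(f"~${usd_per_m_tokens:.2f} per million output tokens in electricity")
# ~$0.11/M vs $1.30/M on OpenRouter - an order of magnitude,
# before hardware amortization
```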
Kimi Code CLI: A Practical Differentiator
Kimi K2.5 ships with an official agentic coding CLI called Kimi Code (Apache 2.0, 6,400+ GitHub stars). It reads and edits code, executes shell commands, supports MCP tools, and runs in a local terminal, similar to Claude Code or OpenCode. This isn't a third-party integration - it is Moonshot's own product, built specifically around K2.5's strengths.
If your primary use case is interactive agentic coding sessions rather than raw API throughput, the Kimi Code CLI is a meaningful practical advantage that benchmark tables don't capture.
Verdict
Choose Kimi K2.5 when:
- Raw coding benchmark scores are the primary criterion
- You need competitive programming or algorithmic problem solving at the frontier (OJBench 57.4, LiveCodeBench 85.0)
- The Kimi Code CLI fits your workflow and you want an integrated agentic coding experience
- API cost difference vs Qwen 35B-A3B ($0.013 vs $0.007 per session) isn't a deciding factor
- Self-hosting is not a requirement
Choose Qwen3.5-122B-A10B when:
- You want the best Qwen coding performance (Codeforces 2100, FullStackBench 62.6)
- Multi-language and multi-task coding coverage matters (FullStackBench includes more language diversity)
- You have an A100 80GB and want to self-host at reasonable quality
- Apache 2.0 licensing with zero commercial friction is required
Choose Qwen3.5-35B-A3B when:
- Cost efficiency is the primary constraint - roughly 3x cheaper input tokens than K2.5 and about half the cost per session
- You can self-host: RTX 3090/4090 is sufficient at Q4, with 262K context for only +3 GB VRAM
- SWE-bench 69.2% and LiveCodeBench 74.6 are sufficient for your use case (they usually are)
- High-volume coding pipelines where per-token cost compounds over millions of requests
The practical reality: for most coding applications, Qwen3.5-35B-A3B delivers 87-90% of Kimi K2.5's performance on the headline benchmarks at a third of the input price and roughly half the cost per session - and it's the only model in this comparison that runs well on a single consumer GPU. K2.5's advantage is real and consistent, but the gap rarely justifies the cost difference unless you specifically need frontier-level competitive programming or want the Kimi Code ecosystem.
Sources: Kimi K2.5 model card (HuggingFace) - Qwen3.5-122B-A10B model card (HuggingFace) - Qwen3.5-35B-A3B model card (HuggingFace) - OpenRouter Kimi K2.5 pricing - OpenRouter Qwen3.5-122B-A10B pricing - OpenRouter Qwen3.5-35B-A3B pricing - InsiderLLM Qwen3.5 local guide - Unsloth Kimi K2.5 local guide
