Qwen3.5 MoE vs Kimi K2.5 for Coding - Price Breakdown
Kimi K2.5 leads every coding benchmark, but Qwen3.5-35B-A3B delivers 87-90% of that performance on the headline benchmarks at roughly a third of the API input price - and runs on a single consumer GPU. Here is the full breakdown.

Three open-weight MoE models, one question: which gives you the best coding quality per dollar? Kimi K2.5 wins every benchmark it reports. Qwen3.5-122B-A10B splits the difference. Qwen3.5-35B-A3B runs on a single gaming GPU and costs roughly a third as much as K2.5 on input tokens - while scoring within about 10 points of it on SWE-bench, LiveCodeBench, and Terminal Bench.
The answer to "which is better for coding" depends almost entirely on which side of that cost-performance tradeoff you sit on.
TL;DR
- Kimi K2.5 wins all coding benchmarks outright: LiveCodeBench 85.0, SWE-bench Verified 76.8%, OJBench 57.4% - but costs $0.45-$0.60/M input and needs datacenter hardware to self-host
- Qwen3.5-122B-A10B is the middle option: LiveCodeBench 78.9, SWE-bench 72.0%, Codeforces 2100 - at $0.26/M input on OpenRouter, self-hostable on an A100 80GB at Q4
- Qwen3.5-35B-A3B is the cost-efficiency winner: LiveCodeBench 74.6, SWE-bench 69.2% - at $0.163/M input, runs on a single RTX 3090/4090 at Q4
- For high-volume coding pipelines where self-hosting is feasible, Qwen3.5-35B-A3B's marginal cost drops to electricity
Quick Comparison
| | Kimi K2.5 | Qwen3.5-122B-A10B | Qwen3.5-35B-A3B |
|---|---|---|---|
| Total / Active Params | 1T / 32B | 122B / 10B | 35B / 3B |
| SWE-bench Verified | 76.8% | 72.0% | 69.2% |
| LiveCodeBench v6 | 85.0 | 78.9 | 74.6 |
| Terminal Bench 2.0 | 50.8 | 49.4 | 40.5 |
| OJBench (C++) | 57.4 | 39.5 | 36.0 |
| Codeforces Rating | - | 2100 | 2028 |
| FullStackBench (en) | - | 62.6 | 58.1 |
| Context Window | 256K | 262K (1M ext.) | 262K (1M ext.) |
| API Input (best rate) | $0.45/M | $0.26/M | $0.163/M |
| API Output (best rate) | $2.20/M | $2.08/M | $1.30/M |
| Self-host VRAM (Q4) | ~315+ GB | ~70-95 GB | ~22-25 GB |
| Self-host GPU | 4x H200 | A100 80GB | RTX 3090/4090 |
| License | Modified MIT | Apache 2.0 | Apache 2.0 |
Coding Benchmarks
SWE-bench Verified: Real GitHub Issues
SWE-bench Verified is the most operationally relevant coding benchmark - it measures a model's ability to resolve real GitHub issues across production codebases.
Kimi K2.5 leads at 76.8%, ahead of Qwen3.5-122B-A10B at 72.0% and Qwen3.5-35B-A3B at 69.2%. That 7.6-point spread between the top and bottom is meaningful but not decisive - all three models are competitive with proprietary frontier offerings.
Worth noting: the dense Qwen3.5-27B actually matches the 122B-A10B exactly at 72.0% on SWE-bench, suggesting the MoE architecture doesn't give the 122B an edge on this particular benchmark over a smaller dense sibling.
LiveCodeBench v6: Algorithmic Problem Solving
LiveCodeBench v6 uses fresh competition problems, making it resistant to training contamination. K2.5 leads strongly at 85.0, with the 122B-A10B at 78.9 and the 35B-A3B at 74.6.
The 10.4-point gap between K2.5 and the 35B-A3B is one of the widest spreads in the benchmark table - only OJBench shows a larger one. For workloads centered on algorithmic problem solving - competitive programming assistants, code generation for data processing pipelines, or interview prep tools - K2.5's advantage here is real.
OJBench: Competitive Programming
OJBench assesses competitive programming across multiple languages. K2.5's 57.4% on C++ stands well above the Qwen models at 39.5% (122B) and 36.0% (35B-A3B). This is K2.5's clearest domain win - if competitive-grade algorithmic coding matters to your use case, it's the model to reach for.
FullStackBench: Real-World Multi-Language Coding
FullStackBench covers real-world coding tasks across multiple programming languages. Here the Qwen models publish scores while K2.5 doesn't. The 122B-A10B leads at 62.6% (en), ahead of the 35B-A3B at 58.1%. On multilingual tasks (zh), scores are 58.7% and 55.0% respectively. For teams building coding tools that need to handle multiple languages reliably, this benchmark favors the 122B-A10B.
Codeforces: Competitive Ratings
The 122B-A10B holds a Codeforces rating of 2100 against the 35B-A3B's 2028. Both are strong - a 2100 rating is in the top few percent of competitive programmers globally. K2.5 doesn't publish a Codeforces rating.
Benchmark Summary
| Benchmark | K2.5 | 122B-A10B | 35B-A3B | K2.5 gap vs 35B |
|---|---|---|---|---|
| SWE-bench Verified | 76.8% | 72.0% | 69.2% | +7.6 pts |
| LiveCodeBench v6 | 85.0 | 78.9 | 74.6 | +10.4 pts |
| Terminal Bench 2.0 | 50.8 | 49.4 | 40.5 | +10.3 pts |
| OJBench (C++) | 57.4 | 39.5 | 36.0 | +21.4 pts |
| Codeforces | - | 2100 | 2028 | - |
| FullStackBench (en) | - | 62.6 | 58.1 | - |
K2.5 leads everywhere it reports. The margin is smallest on SWE-bench (+7.6 pts) and widest on OJBench (+21.4 pts). On FullStackBench and Codeforces, only the Qwen models publish scores - K2.5's absence there is a reporting gap, not evidence it would lose.
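For readers who prefer ratios to point gaps, here is the quick arithmetic behind the "87-90% of K2.5" framing used in this post - scores taken straight from the summary table above:

```python
# Relative performance of the 35B-A3B against K2.5, from the summary table.
scores = {  # benchmark: (K2.5, 35B-A3B)
    "SWE-bench Verified": (76.8, 69.2),
    "LiveCodeBench v6": (85.0, 74.6),
    "Terminal Bench 2.0": (50.8, 40.5),
    "OJBench (C++)": (57.4, 36.0),
}
for bench, (k25, qwen) in scores.items():
    print(f"{bench}: 35B-A3B at {qwen / k25:.0%} of K2.5")
# SWE-bench Verified: 35B-A3B at 90% of K2.5
# LiveCodeBench v6: 35B-A3B at 88% of K2.5
# Terminal Bench 2.0: 35B-A3B at 80% of K2.5
# OJBench (C++): 35B-A3B at 63% of K2.5
```

The headline claim holds on SWE-bench and LiveCodeBench; Terminal Bench and especially OJBench are where the gap widens.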
Pricing Analysis
All three models are available via API. Here is the current pricing across major providers:
API Pricing
| Provider | Kimi K2.5 In | Kimi K2.5 Out | Qwen 122B In | Qwen 122B Out | Qwen 35B In | Qwen 35B Out |
|---|---|---|---|---|---|---|
| OpenRouter | $0.45 | $2.20 | $0.26 | $2.08 | $0.163 | $1.30 |
| DeepInfra | $0.45 | $2.25 | - | - | - | - |
| Alibaba Cloud | - | - | $0.40* | $1.20-$2.40* | - | - |
| Moonshot API | $0.60 | $3.00 | - | - | - | - |
*DashScope "qwen3.5-plus" endpoint; output varies by thinking mode ($1.20 standard / $2.40 thinking)
Cost Per Coding Task
For a typical agentic coding session (8K input, 4K output tokens):
| Model | Session Cost | Relative Cost |
|---|---|---|
| Qwen3.5-35B-A3B (OpenRouter) | ~$0.0065 | 1x baseline |
| Qwen3.5-122B-A10B (OpenRouter) | ~$0.010 | 1.6x |
| Kimi K2.5 (DeepInfra) | ~$0.013 | 1.9x |
| Kimi K2.5 (Moonshot) | ~$0.017 | 2.6x |
At low-volume use, the per-session difference is a fraction of a cent - negligible. At scale (100K sessions/month), the 35B-A3B costs roughly $650 vs K2.5's $1,250-$1,300 at best-rate providers. That is a ~$600/month difference for about 88% of the LiveCodeBench performance.
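These session costs are straightforward to reproduce - a small sketch using the per-million-token rates from the pricing table (8K input, 4K output per session):

```python
# Per-session cost: 8K input + 4K output tokens at each provider's
# per-million-token rates from the table above.
SESSION_IN, SESSION_OUT = 8_000, 4_000

rates = {  # provider: (input $/M, output $/M)
    "Qwen3.5-35B-A3B (OpenRouter)": (0.163, 1.30),
    "Qwen3.5-122B-A10B (OpenRouter)": (0.26, 2.08),
    "Kimi K2.5 (DeepInfra)": (0.45, 2.25),
    "Kimi K2.5 (Moonshot)": (0.60, 3.00),
}

baseline = None
for name, (rate_in, rate_out) in rates.items():
    cost = SESSION_IN / 1e6 * rate_in + SESSION_OUT / 1e6 * rate_out
    baseline = baseline if baseline is not None else cost  # cheapest listed first
    print(f"{name}: ${cost:.4f}/session ({cost / baseline:.1f}x)")
# Qwen3.5-35B-A3B: $0.0065 (1.0x) ... Kimi K2.5 (Moonshot): $0.0168 (2.6x)
```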
Thinking Mode Costs
All three models have thinking/non-thinking modes. Thinking mode generates a reasoning trace before the answer - it improves correctness on complex coding problems but increases token usage notably.
- Qwen3.5 models: thinking enabled by default (`enable_thinking: True`). Recommended settings for complex coding tasks: temperature 0.6, top_p 0.95. Budget 2-4x the tokens versus non-thinking mode.
- Kimi K2.5: thinking mode at temperature 1.0; instant mode at 0.6. The Moonshot API prices thinking and instant output identically at $3.00/M - some third-party providers differentiate.
For high-accuracy code generation where correctness matters more than latency, thinking mode is worth the additional token cost on all three models. For code completion or simple generation tasks, instant mode is sufficient.
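A minimal sketch of toggling thinking mode on a Qwen3.5 deployment behind an OpenAI-compatible API. The `enable_thinking` flag comes from the Qwen documentation, but the endpoint URL and model ID below are placeholders, and how the flag is transported varies by provider (some deployments expect it nested under `chat_template_kwargs`) - check your provider's docs:

```python
# Placeholder endpoint and model ID - substitute your provider's values.
# Some deployments expect the flag as
# extra_body={"chat_template_kwargs": {"enable_thinking": False}} instead.
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="sk-...")

resp = client.chat.completions.create(
    model="qwen3.5-35b-a3b",  # placeholder model ID
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.6,  # Qwen's recommended coding settings
    top_p=0.95,
    extra_body={"enable_thinking": False},  # False = instant, True = reasoning trace
)
print(resp.choices[0].message.content)
```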
Self-Hosting
The hardware gap between these three models is dramatic - and it changes the economics completely for teams that can provision their own infrastructure.
Hardware Requirements
| Model | Quantization | VRAM | Single GPU Option | Gen Speed |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | Q4_K_M | ~22-25 GB | RTX 3090 / RTX 4090 | ~111-165 tok/s |
| Qwen3.5-35B-A3B | Q8_0 | ~38-41 GB | A100 40GB | ~60-80 tok/s |
| Qwen3.5-122B-A10B | Q4_K_M | ~70-95 GB | A100 80GB (tight) | ~30-50 tok/s |
| Qwen3.5-122B-A10B | Q8_0 | ~130-155 GB | 2x A100 80GB | ~15-25 tok/s |
| Kimi K2.5 | INT4 | ~315 GB | 4x H200 (141 GB each) | ~40 tok/s |
| Kimi K2.5 | UD-Q2_K_XL | ~375 GB | Not single-GPU | ~10 tok/s |
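As a back-of-envelope sanity check on the Q4 rows above - assuming roughly 4.5 effective bits per weight for Q4_K_M plus a few GB of runtime overhead, both my assumptions rather than published figures:

```python
# Approximate weight memory for a quantized checkpoint.
# Assumption (mine): Q4_K_M averages ~4.5 bits/weight; add ~2-4 GB for
# runtime buffers and short-context KV cache.
def weight_vram_gb(total_params_b: float, bits_per_weight: float) -> float:
    return total_params_b * 1e9 * bits_per_weight / 8 / 1024**3

for name, params_b in [("Qwen3.5-35B-A3B", 35), ("Qwen3.5-122B-A10B", 122)]:
    print(f"{name}: ~{weight_vram_gb(params_b, 4.5):.0f} GB weights at Q4")
# Qwen3.5-35B-A3B: ~18 GB weights -> ~22-25 GB total with overhead
# Qwen3.5-122B-A10B: ~64 GB weights -> consistent with the ~70-95 GB range
```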
A key architectural advantage for Qwen3.5 models: the Gated DeltaNet hybrid means context scaling is unusually cheap. Going from 4K to 262K context costs only ~3 GB of additional VRAM on the 35B-A3B - remarkable for a quarter-million token window.
For Kimi K2.5, the only realistic consumer-hardware path is a Mac with 256 GB unified memory running 2-bit quantization at around 10 tokens per second. That is borderline usable for interactive sessions but not production coding pipelines.
The Self-Hosting Verdict
- Qwen3.5-35B-A3B is the only model in this comparison that a solo developer or small team can self-host on affordable hardware. Once the GPU is paid for, marginal cost is electricity (rough numbers in the sketch after this list).
- Qwen3.5-122B-A10B is realistic for labs or teams with a single A100 80GB or H100.
- Kimi K2.5 requires datacenter infrastructure. Self-hosting only makes sense for organizations running large-scale inference already.
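To put "marginal cost is electricity" in numbers, a rough sketch for the 35B-A3B on an RTX 4090 - the throughput is the mid-range of the table above, while the power draw and electricity price are my assumptions:

```python
# Electricity cost per million output tokens, self-hosted 35B-A3B.
# Assumptions (mine): ~130 tok/s sustained, ~350 W under load, $0.15/kWh.
TOK_PER_SEC = 130
WATTS = 350
USD_PER_KWH = 0.15

tokens_per_kwh = TOK_PER_SEC * 3600 / (WATTS / 1000)
usd_per_m_tokens = 1e6 / tokens_per_kwh * USD_PER_KWH
print(f"~${usd_per_m_tokens:.2f} per million output tokens in electricity")
# ~$0.11/M vs $1.30/M on OpenRouter - an order of magnitude,
# before hardware amortization
```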
Kimi Code CLI: A Practical Differentiator
Kimi K2.5 ships with an official agentic coding CLI called Kimi Code (Apache 2.0, 6,400+ GitHub stars). It reads and edits code, executes shell commands, supports MCP tools, and runs in a local terminal, similar to Claude Code or OpenCode. This isn't a third-party integration - it is Moonshot's own product, built specifically around K2.5's strengths.
If your primary use case is interactive agentic coding sessions rather than raw API throughput, the Kimi Code CLI is a meaningful practical advantage that benchmark tables don't capture.
Verdict
Choose Kimi K2.5 when:
- Raw coding benchmark scores are the primary criterion
- You need competitive programming or algorithmic problem solving at the frontier (OJBench 57.4, LiveCodeBench 85.0)
- The Kimi Code CLI fits your workflow and you want an integrated agentic coding experience
- API cost difference vs Qwen 35B-A3B ($0.013 vs $0.007 per session) isn't a deciding factor
- Self-hosting is not a requirement
Choose Qwen3.5-122B-A10B when:
- You want the best Qwen coding performance (Codeforces 2100, FullStackBench 62.6)
- Multi-language and multi-task coding coverage matters (FullStackBench includes more language diversity)
- You have an A100 80GB and want to self-host at reasonable quality
- Apache 2.0 licensing with zero commercial friction is required
Choose Qwen3.5-35B-A3B when:
- Cost efficiency is the primary constraint - roughly 3x cheaper input tokens than K2.5 and about half the cost per session
- You can self-host: RTX 3090/4090 is sufficient at Q4, with 262K context for only +3 GB VRAM
- SWE-bench 69.2% and LiveCodeBench 74.6 are sufficient for your use case (they usually are)
- High-volume coding pipelines where per-token cost compounds over millions of requests
The practical reality: for most coding applications, Qwen3.5-35B-A3B delivers 87-90% of Kimi K2.5's performance on the headline benchmarks at a third of the input price and roughly half the cost per session - and it's the only model in this comparison that runs well on a single consumer GPU. K2.5's advantage is real and consistent, but the gap rarely justifies the cost difference unless you specifically need frontier-level competitive programming or want the Kimi Code ecosystem.
Sources: Kimi K2.5 model card (HuggingFace) - Qwen3.5-122B-A10B model card (HuggingFace) - Qwen3.5-35B-A3B model card (HuggingFace) - OpenRouter Kimi K2.5 pricing - OpenRouter Qwen3.5-122B-A10B pricing - OpenRouter Qwen3.5-35B-A3B pricing - InsiderLLM Qwen3.5 local guide - Unsloth Kimi K2.5 local guide
