Kimi K2.7-Code - Moonshot's Open-Weight Coding Leap
Moonshot AI ships Kimi K2.7-Code with 30% fewer reasoning tokens and a 21.8% gain on its own coding benchmarks, but the model still trails Claude Opus 4.8 on most tests in the same table.

Moonshot AI shipped Kimi K2.7-Code on June 12, a coding-focused update to its open-weight K2.6 model that burns roughly 30% fewer reasoning tokens while posting double-digit gains on the company's internal benchmark suite. The headline numbers are real. The competitive picture they suggest isn't.
TL;DR
- K2.7-Code gains 21.8% on Kimi Code Bench v2 over K2.6, but trails Claude Opus 4.8 and GPT-5.5 on five of the six benchmarks in the same table
- 30% cut in reasoning token usage is the more useful improvement for developers running agentic pipelines at scale
- Same 1T total / 32B active MoE architecture and pricing as K2.6 - now trained specifically for coding and long-horizon agent tasks
Moonshot positions the release around "real-world long-horizon coding tasks," with MCP tooling, pull request review, and repo-scale refactoring as primary targets. The model is available on HuggingFace under a Modified MIT license and through the Kimi API at unchanged pricing.
The Benchmark Table - Read It Carefully
Moonshot published a six-benchmark comparison against K2.6, Claude Opus 4.8, and GPT-5.5. Every benchmark in the table is proprietary to Moonshot:
| Benchmark | K2.6 | K2.7-Code | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|---|---|
| Kimi Code Bench v2 | 50.9 | 62.0 | 67.4 | 69.0 |
| Program Bench | 48.3 | 53.6 | 63.8 | 69.1 |
| MLS Bench Lite | 26.7 | 35.1 | 42.8 | 35.5 |
| Kimi Claw 24/7 Bench | 42.9 | 46.9 | 50.4 | 52.8 |
| MCP Atlas | 69.4 | 76.0 | 81.3 | 79.4 |
| MCP Mark Verified | 72.8 | 81.1 | 76.4 | 92.9 |
K2.7-Code beats Claude Opus 4.8 on exactly one benchmark: MCP Mark Verified, where it scores 81.1 versus Claude's 76.4. On the other five, Claude leads by margins ranging from 5 to 10 points. GPT-5.5 leads on every single benchmark.
The 21.8% improvement on Kimi Code Bench v2 is real - it's a genuine leap from 50.9 to 62.0. But it's a gain over K2.6, not a claim to overall dominance. For an open-weight model at this price point, that progress matters. The question is whether the benchmark itself measures what Moonshot implies it measures.
The Kimi K2.7-Code model card on HuggingFace, released June 12 under a Modified MIT license.
Source: huggingface.co
Architecture - What Changed
Same Foundation, Sharpened Objective
K2.7-Code shares its skeleton with the K2.6 agent swarm release: 1T-parameter MoE with 384 experts, 32B active parameters per token, 256K context window, MLA attention, SwiGLU feed-forward, and MoonViT for multimodal input. The structure is identical. What changed is the training objective - Moonshot fine-tuned the base on coding and agentic tasks. Full specs are in the Kimi K2.7-Code model card.
The key technical improvement is the 30% reduction in reasoning tokens compared to K2.6. Thinking mode is mandatory and can't be disabled. Sampling is fixed at temperature 1.0 and top_p 0.95.
Why the Token Efficiency Matters
In a single API call, 30% fewer thinking tokens produces modest savings. In an agentic loop running dozens of steps - CI check runs, ticket updates, multi-file refactors - the effect compounds. Reasoning tokens bill as output tokens at $4.00 per million on the Kimi API. A pipeline that previously spent $10 on a task now costs closer to $7. At scale, that gap is meaningful.
The tradeoff is inflexibility. Developers who need deterministic outputs or custom sampling configurations don't have that option with K2.7-Code. Fixed decoding parameters suit some use cases - automated code review pipelines, for instance - and constrain others.
Deployment Options
Self-hosting requires about 595 GB on disk in full precision. The model runs on vLLM, SGLang, and KTransformers. Moonshot also provides a Kimi Code CLI as a recommended agent framework for direct coding workflows. Notably, the model requires transformers >= 4.57.1 and below 5.0.0 - a narrow version window that teams running pinned environments will need to account for.
Via the Kimi API, access uses the model string kimi-k2.7-code on an OpenAI-compatible endpoint. Input tokens cost $0.95 per million on a cache miss, dropping to $0.19 per million for cached input - a rate that rewards long agent sessions that reuse context. Output tokens, including reasoning, cost $4.00 per million.
For reference, Claude Opus 4.8 charges $25.00 per million output tokens - more than six times K2.7-Code's $4.00 rate. Combined with the 30% reduction in reasoning tokens, the cost-per-task gap between the two models widens further for agentic workloads. For teams where cost per task matters more than top benchmark scores, that gap is hard to ignore. For teams where task quality is the deciding factor, the missing independent evaluations make that comparison harder to close.
The Kimi Code product page, which positions K2.7-Code as the default agent for coding workflows and the Kimi Code CLI.
Source: kimi.com
What It Does Not Tell You
All six benchmarks in Moonshot's comparison table are proprietary. There are no results on standard independent evaluations - no SWE-Bench Pro, no LiveCodeBench, no HumanEval++. That doesn't mean the internal numbers are wrong. It does mean there's no external reference point to calibrate them against.
The K2.6 model card showed a SWE-Bench Pro score of 58.6%, which placed it at the top among open-weight models at the time. K2.7-Code's model card doesn't include a SWE-Bench Pro score. Whether that reflects benchmark strategy or genuine performance limits isn't clear from the published information.
MCP Mark Verified, the one test where K2.7-Code edges Claude Opus 4.8, is itself a Moonshot-run evaluation. The methodology isn't published. That makes the comparison difficult to verify independently.
For developers already running K2.6 in production, K2.7-Code is a drop-in upgrade with real efficiency gains and improved performance on coding tasks. For teams assessing the model against Claude Opus 4.8 or GPT-5.5, the benchmark table is a starting point, not a verdict. Independent results on public benchmarks are absent from this release, and that gap matters for any serious head-to-head comparison.
Sources:
