Qwen3-Coder-Next
Qwen3-Coder-Next is an 80B MoE coding model from Alibaba that activates just 3B parameters per forward pass. It scores over 70% on SWE-Bench Verified with agent scaffolding and is released under Apache 2.0.

Qwen3-Coder-Next is Alibaba's open-weight coding specialist built on the Qwen3 architecture, released February 4, 2026. It uses a hybrid Mixture-of-Experts design that keeps only 3 billion parameters active during inference from a total of 80 billion - a ratio of about 1:27 that lets it run notably faster and cheaper than dense models at comparable quality. It's designed specifically for coding agents: multi-step tool use, long-horizon task execution, and repository-scale reasoning across a 256K-token context window.
TL;DR
- 80B MoE model with just 3B active params - competes with much larger models on coding benchmarks
- 70.6% on SWE-Bench Verified (with SWE-Agent), 62.8% multilingual, 93.3 tokens/sec throughput
- Undercuts Claude Sonnet 4.6 on base input price by roughly 90% while scoring higher on SWE-Bench Verified
The model has been sitting in my testing queue since launch. What finally pushed it to the top is the performance-to-cost story. Claude Sonnet 4.6 charges $3/M input tokens and scores 62.4% on SWE-Bench Verified. Qwen3-Coder-Next starts at $0.30/M input tokens and scores 70.6% on the same benchmark, measured with the same SWE-Agent scaffolding. An 8.2-point lead at a tenth of the base input price is too large to dismiss as noise, and it justifies a serious look for any team running coding agents at scale.
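The arithmetic behind that comparison is easy to check. The sketch below uses only the four figures quoted above; the constant and function names are my own, and the Qwen price is the base tier only.

```python
# Figures quoted in the paragraph above.
SONNET_INPUT_PRICE = 3.00   # USD per million input tokens
QWEN_INPUT_PRICE = 0.30     # USD per million input tokens, base tier only
SONNET_SCORE = 62.4         # SWE-Bench Verified, SWE-Agent scaffold (%)
QWEN_SCORE = 70.6           # SWE-Bench Verified, SWE-Agent scaffold (%)


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` undercuts `baseline`."""
    return (baseline - candidate) / baseline


price_saving = relative_saving(SONNET_INPUT_PRICE, QWEN_INPUT_PRICE)
score_gap = QWEN_SCORE - SONNET_SCORE

print(f"Input price saving: {price_saving:.0%}")
print(f"SWE-Bench Verified gap: {score_gap:.1f} points")
```

The saving shrinks at the longer context tiers, so treat the 90% figure as a best case for short requests.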
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba (Qwen Team) |
| Model Family | Qwen3 |
| Total Parameters | 80B |
| Active Parameters | 3B per forward pass |
| Architecture | Hybrid MoE - Gated DeltaNet + Gated Attention |
| Experts | 512 total, 10 activated + 1 shared per token |
| Context Window | 256K tokens native (1M via YaRN extension) |
| Languages Supported | 358 programming languages |
| Input Price | $0.30-$0.80/M tokens (tiered by context) |
| Output Price | $1.50-$4.00/M tokens (tiered by context) |
| Release Date | February 4, 2026 |
| License | Apache 2.0 |
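To make the "10 activated + 1 shared" row concrete, here is a toy sketch of top-k expert routing with an always-on shared expert. Only the counts (512 and 10) come from the table. The softmax over the selected logits, the scalar stand-in experts, and every function name are assumptions for illustration, not the Qwen team's implementation.

```python
import math
import random

NUM_EXPERTS = 512   # routed experts, from the table
TOP_K = 10          # routed experts activated per token, from the table


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=router_logits.__getitem__, reverse=True)
    top = order[:top_k]
    # Assumption: weights are renormalised over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_output(token, router_logits, routed_experts, shared_expert):
    """Shared expert output plus the weighted sum of the selected routed experts."""
    out = shared_expert(token)
    for idx, weight in route_token(router_logits):
        out += weight * routed_experts[idx](token)
    return out


random.seed(0)
# Toy scalar "experts" that just scale their input (stand-ins for FFN blocks).
routed_experts = [lambda x, s=i / NUM_EXPERTS: s * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits)
print(len(selected), "routed experts + 1 shared, out of", NUM_EXPERTS)
print(moe_output(1.0, logits, routed_experts, shared_expert))
```

Only 11 of the expert functions run per token in this sketch, which is the mechanism behind the small active-parameter count, even though all 512 experts must exist in memory.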
Benchmark Performance
The numbers below come from the technical report (arXiv:2603.00729). SWE-Bench scores use the agent scaffold noted in parentheses. "N/A" means I couldn't find a verified figure for that model/benchmark combination.
| Benchmark | Qwen3-Coder-Next | DeepSeek V4 Pro | Claude Sonnet 4.6 |
|---|---|---|---|
| SWE-Bench Verified (SWE-Agent) | 70.6% | 80.6% | 62.4% |
| SWE-Bench Verified (MiniSWE-Agent) | 71.1% | N/A | N/A |
| SWE-Bench Verified (OpenHands) | 71.3% | N/A | N/A |
| SWE-Bench Multilingual | 62.8% | N/A | N/A |
| SWE-Bench Pro | 56.2% | N/A | N/A |
| Terminal-Bench 2.0 | 36.2% | N/A | N/A |
| AIME 2024 | 89.01% | N/A | N/A |
| Aider | 66.2% | N/A | N/A |
A few things stand out. First, the three agent scaffolds (SWE-Agent, MiniSWE-Agent, OpenHands) all land between 70.6% and 71.3%, which is unusually consistent - it suggests the model adapts well to different tool-calling formats rather than being tuned for one scaffold in particular. Second, SWE-Bench Multilingual at 62.8% is the most interesting number for teams whose codebases span several programming languages: it measures bug-fixing in languages such as Java, TypeScript, C++, and Go. Third, AIME 2024 at 89% isn't a coding benchmark at all - it's math competition problems. The Qwen team included it to show that the mid-training on code also improved mathematical reasoning, which is believable given the overlap between formal reasoning and algorithmic thinking.
DeepSeek V4 Pro leads on SWE-Bench Verified at 80.6%, but it activates 37B parameters per pass and requires substantially more GPU memory. For our full rankings, see the SWE-Bench coding agent leaderboard.
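The consistency claim and the gap to DeepSeek V4 Pro can both be read straight off the table. A minimal sketch, using only the scores listed above (variable names are mine):

```python
from statistics import mean

# SWE-Bench Verified scores from the table above, keyed by agent scaffold.
qwen_scores = {"SWE-Agent": 70.6, "MiniSWE-Agent": 71.1, "OpenHands": 71.3}
DEEPSEEK_V4_PRO = 80.6  # SWE-Agent row

spread = max(qwen_scores.values()) - min(qwen_scores.values())
average = mean(qwen_scores.values())
gap_to_deepseek = DEEPSEEK_V4_PRO - qwen_scores["SWE-Agent"]

print(f"Spread across scaffolds: {spread:.1f} points")
print(f"Average across scaffolds: {average:.1f}%")
print(f"Gap to DeepSeek V4 Pro (SWE-Agent): {gap_to_deepseek:.1f} points")
```

A spread under one point across three scaffolds is small next to the ten-point gap between models, which is why the scaffold choice matters less here than the model choice.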
Key Capabilities
Agentic Tool Use
The training pipeline for Qwen3-Coder-Next centered on roughly 800,000 verifiable coding tasks paired with executable environments. Rather than training solely on human demonstrations, the model learned from direct environment feedback - running code, reading error messages, and iterating. That process shows: on multi-step agent tasks, the model handles tool call failures and partial results more gracefully than most models its size.
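The run, read, iterate cycle described above is simple to picture in code. The sketch below is not the training pipeline; it is a toy agent loop in which `propose_patch` is a hypothetical stand-in for a model call, and the "environment" is just a fresh Python interpreter.

```python
import subprocess
import sys


def run_candidate(source: str, timeout: float = 10.0):
    """Execute candidate code in a fresh interpreter; return (ok, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0, proc.stderr.strip()


def propose_patch(source: str, feedback: str) -> str:
    """Hypothetical stand-in for the model: a real agent would call an LLM here."""
    if "NameError" in feedback:
        return "total = 0\n" + source
    return source


def solve(source: str, max_steps: int = 3):
    """Run the code, read the error, patch, and repeat until it passes."""
    for step in range(1, max_steps + 1):
        ok, feedback = run_candidate(source)
        if ok:
            return source, step
        source = propose_patch(source, feedback)
    return source, None  # gave up within the step budget


fixed, steps = solve("total += 1\nassert total == 1")
print(f"passed on attempt {steps}")
print(fixed)
```

The useful property is that the pass/fail signal comes from execution rather than from a human label, which is what makes tasks "verifiable" in the sense used above.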
It supports every major tool-calling format - JSON function calls, XML tags, Python function signatures - which means it drops into Claude Code, Qwen Code, Cline, Cursor, and similar coding tools without configuration changes.
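For readers who haven't wired up tool calling before, this is roughly what the JSON function-call variant looks like on the host side. The tool schema follows the widely used JSON-schema convention; the `read_file` tool, the in-memory repo, and the `dispatch` helper are all hypothetical and not taken from any Qwen documentation.

```python
import json

# Hypothetical tool definition in the common JSON-schema style.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repository.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

# Toy in-memory repository so the example runs without touching disk.
FAKE_REPO = {"src/app.py": "print('hello')\n"}


def read_file(path: str) -> str:
    """Illustrative handler backed by FAKE_REPO."""
    return FAKE_REPO[path]


HANDLERS = {"read_file": read_file}


def dispatch(tool_call_json: str) -> str:
    """Parse a JSON function call emitted by a model and run the matching handler."""
    call = json.loads(tool_call_json)
    handler = HANDLERS.get(call["name"])
    if handler is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    try:
        return json.dumps({"result": handler(**call["arguments"])})
    except Exception as exc:  # report failures to the model instead of crashing
        return json.dumps({"error": str(exc)})


print(dispatch('{"name": "read_file", "arguments": {"path": "src/app.py"}}'))
print(dispatch('{"name": "read_file", "arguments": {"path": "missing.py"}}'))
```

Returning errors as structured results matters for agents: the model can only recover from a failed call if the failure is fed back to it.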
Context and Codebase Reasoning
The native 256K-token context holds roughly 20,000 lines of code in working memory simultaneously - enough for a medium-sized monorepo. The YaRN extension to 1M tokens is available but comes with the usual caveats about attention degradation at extreme lengths. For most practical codebases, 256K is plenty.
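The 20,000-line figure implies a particular tokens-per-line ratio, which is worth making explicit before you budget a real repository. A rough sketch, treating "256K" as 256,000 tokens for simplicity; the `reserve` parameter is a hypothetical allowance for instructions and the model's reply.

```python
CONTEXT_TOKENS = 256_000   # "256K" taken as 256,000 for simplicity
LINES_QUOTED = 20_000      # line count quoted above

# Ratio implied by the two figures above (about 12.8 tokens per line).
tokens_per_line = CONTEXT_TOKENS / LINES_QUOTED


def fits_in_context(line_count, per_line=tokens_per_line, reserve=16_000):
    """Rough check that `line_count` lines plus a reply budget fit in the window."""
    return line_count * per_line + reserve <= CONTEXT_TOKENS


print(f"Implied tokens per line: {tokens_per_line:.1f}")
print("15,000 lines fit:", fits_in_context(15_000))
print("20,000 lines fit:", fits_in_context(20_000))
```

Real ratios vary by language and formatting, so measure with the model's tokenizer before relying on the estimate.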
Fill-in-the-middle completion (FIM) is supported natively, which matters for integration with tools like Cline that rely on infill rather than append-only generation.
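A FIM request gives the model the code before and after the cursor and asks for the missing span. The sketch below assumes the sentinel tokens used by earlier Qwen coder releases (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`); I have not confirmed them for this model, so check the tokenizer config before using them.

```python
# Assumed sentinel tokens, carried over from earlier Qwen coder models.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange code before and after the cursor in prefix-suffix-middle order."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def splice(prefix: str, completion: str, suffix: str) -> str:
    """Insert the model's completion back between the two halves."""
    return prefix + completion + suffix


before = "def add(a, b):\n    return "
after = "\n\nprint(add(1, 2))\n"

prompt = build_fim_prompt(before, after)
# A plausible completion, hard-coded here because no model is being called.
print(splice(before, "a + b", after))
```

The model generates after the middle sentinel, so the editor only has to splice the returned text in at the cursor position.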
Efficiency Advantages
At 93.3 tokens per second on the hosted API (Artificial Analysis measurement, May 2026) and a 1.65-second time to first token, the model is fast enough for interactive coding sessions where response latency matters. For comparison, the category median for similar-tier models is 1.78s TTFT. The 3B active parameter count also keeps per-token compute low, which puts local deployment within reach: a single RTX 4090 can handle quantized versions, though all 80B weights still have to be stored somewhere, so expect to offload part of the model to system RAM on a 24 GB card.
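Two back-of-the-envelope numbers help here: how long a reply takes at the quoted speed, and how much memory the weights need at a given quantization. The sketch uses the throughput, TTFT, and parameter count quoted in this article; it ignores KV cache and runtime overhead, so the memory figures are lower bounds.

```python
TTFT_SECONDS = 1.65        # time to first token, quoted above
TOKENS_PER_SECOND = 93.3   # hosted API throughput, quoted above
TOTAL_PARAMS = 80e9        # total parameters; all must be stored, not just the active 3B


def response_time(output_tokens: int) -> float:
    """Seconds until the last token arrives, assuming steady throughput."""
    return TTFT_SECONDS + output_tokens / TOKENS_PER_SECOND


def weight_memory_gb(bits_per_param: float, params: float = TOTAL_PARAMS) -> float:
    """Memory for weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


print(f"500-token reply: {response_time(500):.1f} s")
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(bits):.0f} GB")
```

The active-parameter count determines speed, while the total count determines storage, which is why a fast MoE model can still be demanding to host locally.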
Pricing and Availability
Alibaba Cloud's international pricing is tiered by context length within a single request:
| Context Range | Input | Output |
|---|---|---|
| 0 - 32K tokens | $0.30/M | $1.50/M |
| 32K - 128K tokens | $0.50/M | $2.50/M |
| 128K - 256K tokens | $0.80/M | $4.00/M |
For short-context tasks (most code completion and debugging queries), you're paying $0.30/M input. At the longest context tier, costs rise to $0.80/M - still a fraction of what Claude Sonnet 4.6 charges at any context length.
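A small calculator makes the tier jumps easier to reason about. The rates come from the table above. Two things are assumptions on my part: that the tier is chosen by the request's input token count, and that the whole request is billed at that tier's rates. Check Alibaba Cloud's billing documentation before relying on either.

```python
# (upper bound on input tokens, input USD per M, output USD per M), from the table above.
TIERS = [
    (32_000, 0.30, 1.50),
    (128_000, 0.50, 2.50),
    (256_000, 0.80, 4.00),
]


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, billed entirely at the tier its input falls in."""
    for limit, in_rate, out_rate in TIERS:
        if input_tokens <= limit:
            return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    raise ValueError("input exceeds the 256K context window")


for n_in in (10_000, 100_000, 200_000):
    print(f"{n_in:>7} in / 2,000 out: ${request_cost(n_in, 2_000):.4f}")
```

Under these assumptions a 200K-token request costs far more than twenty 10K-token requests, so trimming context before each call is the main cost lever.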
The model is available via the Alibaba Cloud Model Studio API, OpenRouter (blended $0.35/M input, $1.20/M output), and as open weights on Hugging Face in full-precision, FP8, and GGUF quantized variants. The Apache 2.0 license allows commercial use, modification, and redistribution, provided the license and attribution notices are preserved.
For the broader picture of cost-efficiency across models, see our open source LLM leaderboard.
Strengths
- Clearly better cost-efficiency than proprietary alternatives on coding benchmarks
- Consistent performance across three different agent scaffolds (SWE-Agent, MiniSWE-Agent, OpenHands)
- 256K native context with 358-language support covers almost any real codebase scenario
- Fast inference at 93.3 tok/s; local deployment possible on a single high-end GPU
- Apache 2.0 license - commercial use, modification, and fine-tuning permitted
Weaknesses
- DeepSeek V4 Pro leads by 10 points on SWE-Bench Verified, and that gap matters on truly novel or complex codebases
- Tiered pricing means long-context use cases cost significantly more than the base rate suggests
- No multimodal input - screenshot or UI-driven debugging workflows aren't supported
- Higher verbosity than competitors: Artificial Analysis measured it producing 26M tokens during evaluation (category median is lower), which inflates output costs in production
Related Coverage
- SWE-Bench Coding Agent Leaderboard - full rankings for coding models
- Coding Benchmarks Leaderboard - broader benchmark coverage including MBPP, HumanEval, and more
- Open Source LLM Leaderboard - where Qwen3-Coder-Next sits vs. Llama 4, Mistral, and others
- DeepSeek V4 - the leading open-weight alternative on SWE-Bench
✓ Last verified May 21, 2026
