Qwen3-Coder-Next is Alibaba's open-weight coding specialist built on the Qwen3 architecture, released February 4, 2026. It uses a hybrid Mixture-of-Experts design that keeps only 3 billion parameters active during inference from a total of 80 billion - a ratio of about 1:27 that lets it run notably faster and cheaper than dense models at comparable quality. It's designed specifically for coding agents: multi-step tool use, long-horizon task execution, and repository-scale reasoning across a 256K-token context window.

TL;DR

80B MoE model with just 3B active params - competes with much larger models on coding benchmarks
70.6% on SWE-Bench Verified (with SWE-Agent), 62.8% on SWE-Bench Multilingual, 93.3 tokens/sec throughput
Undercuts Claude Sonnet 4.6 on input price by roughly 90% while scoring higher on SWE-Bench Verified in the figures cited here

The model has been sitting in my testing queue since launch. What finally pushed it to the top is the performance-to-cost story. Claude Sonnet 4.6 prices input at $3/M tokens and scores 62.4% on SWE-Bench Verified. Qwen3-Coder-Next prices input at $0.30/M in its lowest context tier and scores 70.6% on the same benchmark, measured with the same SWE-Agent scaffolding. An 8.2-point lead at a tenth of the input price is too large to dismiss as noise, and it justifies a serious look for any team running coding agents at scale.

Key Specifications

Specification	Details
Provider	Alibaba (Qwen Team)
Model Family	Qwen3
Total Parameters	80B
Active Parameters	3B per forward pass
Architecture	Hybrid MoE - Gated DeltaNet + Gated Attention
Experts	512 total, 10 activated + 1 shared per token
Context Window	256K tokens native (1M via YaRN extension)
Languages Supported	358 programming languages
Input Price	$0.30-$0.80/M tokens (tiered by context)
Output Price	$1.50-$4.00/M tokens (tiered by context)
Release Date	February 4, 2026
License	Apache 2.0
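
The sparsity figures in the table can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and it assumes the shared expert is counted separately from the 10 routed experts.

```python
# Figures quoted in the specification table above.
TOTAL_PARAMS_B = 80      # total parameters, in billions
ACTIVE_PARAMS_B = 3      # parameters active per forward pass, in billions
TOTAL_EXPERTS = 512
ROUTED_EXPERTS = 10      # experts selected per token (shared expert counted separately)

# Share of the weights that participate in each forward pass.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
ratio = TOTAL_PARAMS_B / ACTIVE_PARAMS_B

# Share of the expert pool that is routed to for each token.
expert_fraction = ROUTED_EXPERTS / TOTAL_EXPERTS

print(f"active share of weights: {active_fraction:.2%} (about 1:{ratio:.0f})")
print(f"routed experts used per token: {expert_fraction:.2%}")
```

Both fractions are small, which is where the speed and cost advantage over a dense 80B model comes from: most of the network sits idle on any given token.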

Benchmark Performance

The numbers below come from the technical report (arXiv:2603.00729). SWE-Bench scores use the agent scaffold noted in parentheses. "N/A" means I couldn't find a verified figure for that model/benchmark combination.

Benchmark	Qwen3-Coder-Next	DeepSeek V4 Pro	Claude Sonnet 4.6
SWE-Bench Verified (SWE-Agent)	70.6%	80.6%	62.4%
SWE-Bench Verified (MiniSWE-Agent)	71.1%	N/A	N/A
SWE-Bench Verified (OpenHands)	71.3%	N/A	N/A
SWE-Bench Multilingual	62.8%	N/A	N/A
SWE-Bench Pro	56.2%	N/A	N/A
Terminal-Bench 2.0	36.2%	N/A	N/A
AIME 2024	89.01%	N/A	N/A
Aider	66.2%	N/A	N/A

A few things stand out. First, the three agent scaffolds (SWE-Agent, MiniSWE-Agent, OpenHands) all land between 70.6% and 71.3%, which is unusually consistent - it suggests the model adapts well to different tool-calling formats rather than being tuned for one scaffold in particular. Second, SWE-Bench Multilingual at 62.8% is the most interesting number for international teams: it measures bug-fixing across codebases in Python, Java, TypeScript, C++, and Go. Third, AIME 2024 at 89% isn't a coding benchmark at all - it's math competition problems. The Qwen team included it to show that the mid-training on code also improved mathematical reasoning, which is believable given the overlap between formal reasoning and algorithmic thinking.
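
To put a number on the scaffold consistency, here is a quick check of the spread across the three SWE-Bench Verified rows. It only restates the table values; nothing here is a new measurement.

```python
from statistics import mean

# SWE-Bench Verified scores by agent scaffold, copied from the table above.
scores = {"SWE-Agent": 70.6, "MiniSWE-Agent": 71.1, "OpenHands": 71.3}

# Rounded to one decimal to avoid floating-point noise in the subtraction.
spread = round(max(scores.values()) - min(scores.values()), 1)
average = round(mean(scores.values()), 1)

print(f"spread: {spread} points, mean: {average}%")
```

A spread of under one point across scaffolds is the basis for the "unusually consistent" reading above.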

DeepSeek V4 Pro leads on SWE-Bench Verified at 80.6%, but it activates 37B parameters per pass and requires substantially more GPU memory. For our full rankings, see the SWE-Bench coding agent leaderboard.

Key Capabilities

Agentic Tool Use

The training pipeline for Qwen3-Coder-Next centered on roughly 800,000 verifiable coding tasks paired with executable environments. Rather than training solely on human demonstrations, the model learned from direct environment feedback - running code, reading error messages, and iterating. That process shows: on multi-step agent tasks, the model handles tool call failures and partial results more gracefully than most models its size.
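
The run, read the error, revise cycle described above can be sketched in miniature. This is not the Qwen training pipeline; it only illustrates the kind of execution signal involved. `propose_fix` is a hypothetical stand-in for a model call, hard-coded to handle one typo so the example runs on its own.

```python
import subprocess
import sys

def run_candidate(source: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Run a Python snippet in a subprocess; return (succeeded, stderr text)."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.returncode == 0, result.stderr

def propose_fix(source: str, error_text: str) -> str:
    """Hypothetical stand-in for a model that revises code from an error message."""
    if "NameError" in error_text and "totl" in error_text:
        return source.replace("totl", "total")
    return source

def solve_with_feedback(source: str, max_attempts: int = 3) -> tuple[str, int]:
    """Iterate run -> read error -> revise until the code exits cleanly."""
    for attempt in range(1, max_attempts + 1):
        ok, stderr = run_candidate(source)
        if ok:
            return source, attempt
        source = propose_fix(source, stderr)
    raise RuntimeError("no passing candidate within the attempt budget")

buggy = "total = sum(range(5))\nprint(totl)\n"
fixed, attempts = solve_with_feedback(buggy)
print(f"passed on attempt {attempts}")
```

The point of the design is that success is decided by the environment (the exit code), not by resemblance to a reference solution.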

It supports the common tool-calling formats (JSON function calls, XML tags, Python function signatures), which means it should drop into Claude Code, Qwen Code, Cline, Cursor, and similar coding tools with little or no configuration.
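
For readers who haven't worked with JSON function calls, here is a minimal sketch of the host side: a tool schema in the widely used OpenAI-style layout and a dispatcher that routes a call to a handler. The `read_file` tool, its fields, and the sample call are hypothetical; the exact wire format any given client uses with this model is an assumption.

```python
import json

# An OpenAI-style tool schema sent alongside the prompt (hypothetical tool).
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

def read_file(path: str) -> str:
    """Toy implementation backed by an in-memory workspace."""
    workspace = {"README.md": "# demo project"}
    return workspace.get(path, f"error: {path} not found")

HANDLERS = {"read_file": read_file}

def dispatch(tool_call_json: str) -> str:
    """Parse one JSON tool call and route it to the matching handler."""
    call = json.loads(tool_call_json)
    handler = HANDLERS.get(call["name"])
    if handler is None:
        # Returned to the model as text so it can recover on the next turn.
        return f"error: unknown tool {call['name']}"
    return handler(**call["arguments"])

# A tool call as a model might emit it (hand-written here, not real model output).
print(dispatch('{"name": "read_file", "arguments": {"path": "README.md"}}'))
```

Returning errors as plain strings rather than raising lets the agent loop feed failures back to the model.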

Context and Codebase Reasoning

The native 256K-token context holds roughly 20,000 lines of code in working memory simultaneously - enough for a medium-sized monorepo. The YaRN extension to 1M tokens is available but comes with the usual caveats about attention degradation at extreme lengths. For most practical codebases, 256K is plenty.
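
The 20,000-line figure implies a tokens-per-line rate, which makes for a rough fit check on your own repository. This sketch treats "256K" as 256,000 tokens and reserves a hypothetical 16,000 tokens for instructions and output; real tokenization varies a lot by language and code style.

```python
NATIVE_WINDOW_TOKENS = 256_000   # "256K" taken as 256,000 for simplicity
QUOTED_LINES = 20_000            # line capacity quoted in the text

# Tokens per line implied by the article's own figures.
IMPLIED_TOKENS_PER_LINE = NATIVE_WINDOW_TOKENS / QUOTED_LINES

def fits_in_window(line_count: int, reserve_tokens: int = 16_000,
                   tokens_per_line: float = IMPLIED_TOKENS_PER_LINE) -> bool:
    """Rough check: does a codebase fit, leaving room for prompt and output?"""
    needed = line_count * tokens_per_line + reserve_tokens
    return needed <= NATIVE_WINDOW_TOKENS

print(f"implied tokens per line: {IMPLIED_TOKENS_PER_LINE}")
print("15,000 lines fit:", fits_in_window(15_000))
print("20,000 lines fit:", fits_in_window(20_000))
```

Once you reserve space for the conversation itself, the usable capacity is somewhat below the headline line count.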

Fill-in-the-middle completion (FIM) is supported natively, which matters for integration with tools like Cline that rely on infill rather than append-only generation.
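
As an illustration of what infill looks like at the prompt level, the sketch below assembles a FIM prompt around a cursor position. The special tokens are the ones documented for Qwen2.5-Coder; that they carry over unchanged to Qwen3-Coder-Next is an assumption, so check the model card before relying on them.

```python
# Special tokens as documented for Qwen2.5-Coder (assumed, not confirmed, here).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor."""
    return source[:cursor], source[cursor:]

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle layout: the model generates what follows FIM_MIDDLE."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")
prefix, suffix = split_at_cursor(source, cursor)
prompt = build_fim_prompt(prefix, suffix)
print(prompt)
```

The model sees code on both sides of the gap, which is what separates infill from append-only completion.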

Efficiency Advantages

At 93.3 tokens per second on the hosted API (Artificial Analysis measurement, May 2026) and a 1.65-second time to first token (TTFT), the model is fast enough for interactive coding sessions where response latency matters. For comparison, the category median for similar-tier models is 1.78s TTFT. The 3B active parameter count keeps per-token compute low, which helps local deployment: a single RTX 4090 can handle quantized versions, though all 80B weights still have to be stored, so expect to offload part of the model to system RAM.
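
Two back-of-the-envelope estimates follow from these figures: how long a reply takes on the hosted API, and how much memory the weights need locally. Both are simplifications. The latency model assumes constant throughput after the first token, and the memory figure counts weights only, ignoring KV cache and runtime overhead.

```python
TTFT_S = 1.65          # time to first token quoted above
TOKENS_PER_S = 93.3    # hosted throughput quoted above
TOTAL_PARAMS = 80e9    # every expert must be stored, not just the active 3B

def response_time_s(output_tokens: int) -> float:
    """Approximate wall-clock time: first-token delay plus steady streaming."""
    return TTFT_S + output_tokens / TOKENS_PER_S

def weight_memory_gb(bits_per_weight: float) -> float:
    """Weight storage only; ignores KV cache, activations, and overhead."""
    return TOTAL_PARAMS * bits_per_weight / 8 / 1e9

print(f"500-token reply: {response_time_s(500):.1f} s")
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(bits):.0f} GB")
```

Under these assumptions the 4-bit figure is roughly 40 GB, against 24 GB of VRAM on an RTX 4090.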

Pricing and Availability

Alibaba Cloud's international pricing is tiered by context length within a single request:

Context Range	Input	Output
0 - 32K tokens	$0.30/M	$1.50/M
32K - 128K tokens	$0.50/M	$2.50/M
128K - 256K tokens	$0.80/M	$4.00/M

For short-context tasks (most code completion and debugging queries), you're paying $0.30/M input. At the longest context tier, costs rise to $0.80/M - still a fraction of what Claude Sonnet 4.6 charges at any context length.
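
The tier table translates into a simple per-request cost estimate. This sketch assumes the tier is chosen by the request's input length and then applies to every input and output token in that request, and that tier boundaries are inclusive round thousands. Those details are my reading of "tiered by context length", not something confirmed from Alibaba Cloud's billing documentation.

```python
# (tier upper bound in input tokens, input $/M, output $/M), from the table above.
TIERS = [
    (32_000, 0.30, 1.50),
    (128_000, 0.50, 2.50),
    (256_000, 0.80, 4.00),
]

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under the assumed whole-request tiering rule."""
    for upper, in_rate, out_rate in TIERS:
        if input_tokens <= upper:
            return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    raise ValueError("input exceeds the 256K native context window")

short = request_cost_usd(20_000, 2_000)
long = request_cost_usd(100_000, 2_000)
print(f"20K-token prompt:  ${short:.4f}")
print(f"100K-token prompt: ${long:.4f}")
```

Five times the input costs about six times as much here, because the longer prompt also moves the output tokens into a more expensive tier.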

The model is available via the Alibaba Cloud Model Studio API, OpenRouter (blended $0.35/$1.20), and as open weights on Hugging Face in full-precision, FP8, and GGUF quantized variants. The Apache 2.0 license allows commercial use, modification, and redistribution, subject to its attribution and notice requirements.

For the broader picture of cost-efficiency across models, see our open source LLM leaderboard.

Strengths

Clearly better cost-efficiency than proprietary alternatives on coding benchmarks
Consistent performance across three different agent scaffolds (SWE-Agent, MiniSWE-Agent, OpenHands)
256K native context with 358-language support covers almost any real codebase scenario
Fast inference at 93.3 tok/s; local deployment possible on a single high-end GPU
Apache 2.0 license - commercial use and fine-tuning permitted

Weaknesses

DeepSeek V4 Pro leads by 10 points on SWE-Bench Verified, and that gap matters on truly novel or complex codebases
Tiered pricing means long-context use cases cost significantly more than the base rate suggests
No multimodal input - screenshot or UI-driven debugging workflows aren't supported
Higher verbosity than competitors: Artificial Analysis measured it producing 26M tokens during evaluation (category median is lower), which inflates output costs in production

SWE-Bench Coding Agent Leaderboard - full rankings for coding models
Coding Benchmarks Leaderboard - broader benchmark coverage including MBPP, HumanEval, and more
Open Source LLM Leaderboard - where Qwen3-Coder-Next sits vs. Llama 4, Mistral, and others
DeepSeek V4 - the leading open-weight alternative on SWE-Bench

Sources: