Qwen3-Coder-Next

Qwen3-Coder-Next is an 80B MoE coding model from Alibaba that activates just 3B parameters per forward pass. It scores over 70% on SWE-Bench Verified with agent scaffolding and is released under Apache 2.0.


Qwen3-Coder-Next is Alibaba's open-weight coding specialist built on the Qwen3 architecture, released February 4, 2026. It uses a hybrid Mixture-of-Experts design that keeps only 3 billion parameters active during inference from a total of 80 billion - a ratio of about 1:27 that lets it run notably faster and cheaper than dense models at comparable quality. It's designed specifically for coding agents: multi-step tool use, long-horizon task execution, and repository-scale reasoning across a 256K-token context window.

TL;DR

  • 80B MoE model with just 3B active params - competes with much larger models on coding benchmarks
  • 70.6% on SWE-Bench Verified (with SWE-Agent), 62.8% on SWE-Bench Multilingual, 93.3 tokens/sec throughput
  • Undercuts Claude Sonnet 4.6 on input price by roughly 90% while scoring higher on SWE-Bench Verified in this comparison

The model has been sitting in my testing queue since launch. What finally pushed it to the top is the performance-to-cost story. Claude Sonnet 4.6 prices input at $3/M tokens and scores 62.4% on SWE-Bench Verified. Qwen3-Coder-Next prices at $0.30/M and scores 70.6% on the same benchmark - measured with the same SWE-Agent scaffolding. That isn't a minor difference worth dismissing as noise. The gap is wide enough to justify a serious look for any team running coding agents at scale.

Key Specifications

Specification | Details
Provider | Alibaba (Qwen Team)
Model Family | Qwen3
Total Parameters | 80B
Active Parameters | 3B per forward pass
Architecture | Hybrid MoE - Gated DeltaNet + Gated Attention
Experts | 512 total, 10 activated + 1 shared per token
Context Window | 256K tokens native (1M via YaRN extension)
Languages Supported | 358 programming languages
Input Price | $0.30-$0.80/M tokens (tiered by context)
Output Price | $1.50-$4.00/M tokens (tiered by context)
Release Date | February 4, 2026
License | Apache 2.0
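
To make the "10 activated + 1 shared" row concrete, here is a minimal sketch of generic top-k expert routing in Python. It is not the Qwen team's implementation: the softmax-over-selected-logits weighting and the function names `route_token` and `active_expert_fraction` are assumptions for illustration, and only the expert counts come from the table above.

```python
import math
import random

NUM_EXPERTS = 512   # routed experts, from the spec table
TOP_K = 10          # routed experts activated per token, from the spec table


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This renormalization choice is an assumption, not taken from the model card.
    """
    # Indices of the top_k largest logits, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_expert_fraction(num_experts=NUM_EXPERTS, top_k=TOP_K, shared=1):
    """Fraction of expert blocks that run for one token (routed + shared)."""
    return (top_k + shared) / (num_experts + shared)


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = route_token(logits)
    print(len(chosen), "routed experts selected")
    print(f"{active_expert_fraction():.1%} of expert blocks active per token")
```

The shared expert is modeled only as a count here; in a real layer it would run for every token and its output would be added to the weighted sum of the routed experts.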

Benchmark Performance

The numbers below come from the technical report (arXiv:2603.00729). SWE-Bench scores use the agent scaffold noted in parentheses. "N/A" means I couldn't find a verified figure for that model/benchmark combination.

Benchmark | Qwen3-Coder-Next | DeepSeek V4 Pro | Claude Sonnet 4.6
SWE-Bench Verified (SWE-Agent) | 70.6% | 80.6% | 62.4%
SWE-Bench Verified (MiniSWE-Agent) | 71.1% | N/A | N/A
SWE-Bench Verified (OpenHands) | 71.3% | N/A | N/A
SWE-Bench Multilingual | 62.8% | N/A | N/A
SWE-Bench Pro | 56.2% | N/A | N/A
Terminal-Bench 2.0 | 36.2% | N/A | N/A
AIME 2024 | 89.01% | N/A | N/A
Aider | 66.2% | N/A | N/A

A few things stand out. First, the three agent scaffolds (SWE-Agent, MiniSWE-Agent, OpenHands) all land between 70.6% and 71.3%, which is unusually consistent - it suggests the model adapts well to different tool-calling formats rather than being tuned for one scaffold in particular. Second, SWE-Bench Multilingual at 62.8% is the most interesting number for international teams: it measures bug-fixing across codebases in Python, Java, TypeScript, C++, and Go. Third, AIME 2024 at 89% isn't a coding benchmark at all - it's math competition problems. The Qwen team included it to show that the mid-training on code also improved mathematical reasoning, which is believable given the overlap between formal reasoning and algorithmic thinking.

DeepSeek V4 Pro leads on SWE-Bench Verified at 80.6%, but it activates 37B parameters per pass and requires substantially more GPU memory. For our full rankings, see the SWE-Bench coding agent leaderboard.

[Image: developer working with AI coding tools in a multi-monitor setup. Caption: Qwen3-Coder-Next is designed for agentic coding workflows - multi-step task execution, terminal use, and repository-scale reasoning. Source: unsplash.com]

Key Capabilities

Agentic Tool Use

The training pipeline for Qwen3-Coder-Next centered on roughly 800,000 verifiable coding tasks paired with executable environments. Rather than training solely on human demonstrations, the model learned from direct environment feedback - running code, reading error messages, and iterating. That process shows: on multi-step agent tasks, the model handles tool call failures and partial results more gracefully than most models its size.

It supports every major tool-calling format - JSON function calls, XML tags, Python function signatures - which means it drops into Claude Code, Qwen Code, Cline, Cursor, and similar coding tools without configuration changes.
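
As a rough illustration of the JSON function-call path, the sketch below shows how a scaffold might execute one tool call and hand failures back to the model as text instead of crashing. The call shape (`name` plus a JSON-encoded `arguments` string) follows the common OpenAI-style convention and is an assumption here; the `run_tests` tool and the `TOOLS` registry are hypothetical.

```python
import json


def run_tests(path: str) -> str:
    """Hypothetical tool: a real scaffold would shell out to a test runner."""
    return f"ran tests in {path}: 0 failures"


TOOLS = {"run_tests": run_tests}  # hypothetical tool registry


def dispatch(tool_call: dict) -> dict:
    """Execute one tool call and always return a tool-result message.

    Errors are returned as text rather than raised, so the model can read
    the failure and try again on its next turn.
    """
    name = tool_call.get("name")
    try:
        args = json.loads(tool_call.get("arguments") or "{}")
        tool = TOOLS[name]
        content = tool(**args)
    except KeyError:
        content = f"tool error: unknown tool {name!r}"
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, non-object arguments, or wrong parameter names.
        content = f"tool error: {exc}"
    return {"role": "tool", "name": name, "content": content}


if __name__ == "__main__":
    print(dispatch({"name": "run_tests", "arguments": '{"path": "tests/"}'}))
    print(dispatch({"name": "run_tests", "arguments": "{not json"}))
```

Returning the error as a normal tool message is what lets an environment-trained model exercise the recovery behavior described above: it sees the failure in context and can reissue a corrected call.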

Context and Codebase Reasoning

The native 256K-token context holds roughly 20,000 lines of code in working memory simultaneously - enough for a medium-sized monorepo. The YaRN extension to 1M tokens is available but comes with the usual caveats about attention degradation at extreme lengths. For most practical codebases, 256K is plenty.
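
A back-of-the-envelope check on that figure: 256K tokens over 20,000 lines implies about 12.8 tokens per line. The sketch below turns that into a quick budget check. It reads "256K" as 256,000 tokens, and the 8,000-token output reserve is an arbitrary assumption; real token counts depend on the tokenizer and the code style, so treat the result as a first estimate only.

```python
CONTEXT_TOKENS = 256_000             # "256K" read as 256,000; the exact limit may differ
TOKENS_PER_LINE = 256_000 / 20_000   # 12.8, implied by the 20,000-line figure


def estimate_tokens(line_count: int, tokens_per_line: float = TOKENS_PER_LINE) -> int:
    """Estimate the token count for a given number of source lines."""
    return round(line_count * tokens_per_line)


def fits_in_context(line_count: int, reserved_for_output: int = 8_000) -> bool:
    """True if the code plus a reserved output budget fits in the native window."""
    return estimate_tokens(line_count) + reserved_for_output <= CONTEXT_TOKENS


if __name__ == "__main__":
    for lines in (5_000, 19_000, 20_000):
        print(lines, "lines ->", estimate_tokens(lines), "tokens, fits:", fits_in_context(lines))
```

Once room for the model's own output is reserved, the usable code budget is somewhat below the headline 20,000 lines.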

Fill-in-the-middle completion (FIM) is supported natively, which matters for integration with tools like Cline that rely on infill rather than append-only generation.
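
For readers who haven't used FIM, a minimal sketch of how an infill prompt is typically assembled: the text before and after the cursor is wrapped in special tokens, and the model generates the missing middle. The token strings below are the ones documented for earlier Qwen coder models (Qwen2.5-Coder); that they carry over unchanged to Qwen3-Coder-Next is an assumption, so check the model's tokenizer configuration before relying on them. The helper names are hypothetical.

```python
# FIM special tokens as documented for Qwen2.5-Coder; assumed, not confirmed,
# to be the same for Qwen3-Coder-Next.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor offset."""
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle ordering: the model continues after FIM_MIDDLE."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    buffer = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
    cursor = buffer.index("    ") + 4  # cursor sits at the end of the indented blank line
    prefix, suffix = split_at_cursor(buffer, cursor)
    print(build_fim_prompt(prefix, suffix))
```

The prompt is sent to a raw completion endpoint rather than a chat endpoint, since a chat template would wrap it in additional tokens.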

Efficiency Advantages

At 93.3 tokens per second on the hosted API (Artificial Analysis measurement, May 2026) and a 1.65-second time to first token (TTFT), the model is fast enough for interactive coding sessions where response latency matters. For comparison, the category median for similar-tier models is 1.78s TTFT. The 3B active parameter count keeps per-token compute low, which helps put local deployment within reach: a single RTX 4090 can handle quantized versions, although all 80B weights still have to be stored somewhere, so working within the card's 24 GB of VRAM typically means offloading part of the model to system RAM.
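
Two quick estimates help put these numbers in context. The sketch below computes the storage needed for the weights alone at a given quantization width, and the wall-clock time for a response at the measured throughput. It ignores KV cache, activations, and quantization metadata, and it assumes throughput stays constant over the whole response, so both results are lower bounds rather than measurements.

```python
TOTAL_PARAMS = 80e9        # total parameters, from the spec table
RTX_4090_VRAM_GB = 24      # VRAM on a single RTX 4090


def weight_memory_gb(bits_per_weight: float, total_params: float = TOTAL_PARAMS) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def response_seconds(output_tokens: int,
                     tokens_per_second: float = 93.3,
                     ttft_seconds: float = 1.65) -> float:
    """Time to first token plus steady-state generation time."""
    return ttft_seconds + output_tokens / tokens_per_second


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_memory_gb(bits)
        verdict = "fits in" if gb <= RTX_4090_VRAM_GB else "exceeds"
        print(f"{bits}-bit weights: {gb:.0f} GB ({verdict} {RTX_4090_VRAM_GB} GB of VRAM)")
    print(f"500-token reply: about {response_seconds(500):.1f} s")
```

Only about 3B parameters are computed per token, but every expert still has to be resident somewhere, which is why the memory estimate uses the full 80B.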

[Image: terminal interface showing AI agent code generation and tool calls. Caption: With native support for 358 programming languages and multi-turn tool calling, the model targets full agentic coding workflows rather than single-shot completion. Source: unsplash.com]

Pricing and Availability

Alibaba Cloud's international pricing is tiered by context length within a single request:

Context Range | Input | Output
0 - 32K tokens | $0.30/M | $1.50/M
32K - 128K tokens | $0.50/M | $2.50/M
128K - 256K tokens | $0.80/M | $4.00/M

For short-context tasks (most code completion and debugging queries), you're paying $0.30/M input. At the longest context tier, costs rise to $0.80/M - still a fraction of what Claude Sonnet 4.6 charges at any context length.
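
The tier table can be turned into a small cost estimator. The sketch below makes three assumptions that the table itself does not spell out: the tier is selected by the request's input token count, the selected tier's rates apply to the whole request (not marginally per band), and "K" means 1,000 with inclusive upper bounds. Check the provider's billing documentation before using it for budgeting.

```python
# (upper bound on input tokens, input $/M tokens, output $/M tokens)
TIERS = [
    (32_000, 0.30, 1.50),
    (128_000, 0.50, 2.50),
    (256_000, 0.80, 4.00),
]


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request under the assumptions stated above."""
    for limit, input_price, output_price in TIERS:
        if input_tokens <= limit:
            return (input_tokens * input_price + output_tokens * output_price) / 1e6
    raise ValueError("input exceeds the 256K native context window")


if __name__ == "__main__":
    for tokens_in, tokens_out in ((10_000, 2_000), (100_000, 2_000), (200_000, 2_000)):
        print(f"{tokens_in:>7} in / {tokens_out} out: ${request_cost(tokens_in, tokens_out):.4f}")
```

Because output is priced at five times the input rate in every tier, verbose responses move the total more than the input price alone suggests.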

The model is available via the Alibaba Cloud Model Studio API, OpenRouter (blended $0.35/$1.20), and as open weights on Hugging Face in full-precision, FP8, and GGUF quantized variants. The Apache 2.0 license permits commercial use, modification, and redistribution, subject to its attribution and notice requirements.

For the broader picture of cost-efficiency across models, see our open source LLM leaderboard.


Strengths

  • Clearly better cost-efficiency than proprietary alternatives on coding benchmarks
  • Consistent performance across three different agent scaffolds (SWE-Agent, MiniSWE-Agent, OpenHands)
  • 256K native context with 358-language support covers almost any real codebase scenario
  • Fast inference at 93.3 tok/s; local deployment possible on a single high-end GPU
  • Apache 2.0 license - no usage restrictions, fine-tuning permitted

Weaknesses

  • DeepSeek V4 Pro leads by 10 points on SWE-Bench Verified, and that gap matters on truly novel or complex codebases
  • Tiered pricing means long-context use cases cost significantly more than the base rate suggests
  • No multimodal input - screenshot or UI-driven debugging workflows aren't supported
  • Higher verbosity than competitors: Artificial Analysis measured it producing 26M tokens during evaluation (category median is lower), which inflates output costs in production


✓ Last verified May 21, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.