Alibaba's Qwen3.6 Coder: 73.4 SWE-bench, 22GB VRAM
Qwen3.6-35B-A3B lands with 73.4 on SWE-bench Verified and Apache 2.0 weights, all from 3 billion active parameters routed through a 256-expert MoE. Fits on a single consumer GPU.

Alibaba's Qwen team shipped Qwen3.6-35B-A3B on 16 April under Apache 2.0. The benchmark line reads hard: 73.4 on SWE-bench Verified, 51.5 on Terminal-Bench 2.0, 92.7 on AIME 2026. What doesn't appear in the headline is the number that makes the rest of it load-bearing: 3 billion active parameters. The model routes through a 256-expert Mixture-of-Experts layer that activates roughly 8.6% of its weights per token, which is how a checkpoint nominally 35B in size fits on a single 4090 at 4-bit and codes at a weight class it has no business reaching.
Qwen3.6-35B-A3B at a glance
| Spec | Value |
|---|---|
| Total parameters | 35B (256 experts) |
| Active per token | 3B (8 routed + 1 shared) |
| Native context | 262,144 tokens |
| Extensible context | 1,010,000 tokens (YaRN) |
| Modalities | Text, images, video |
| License | Apache 2.0 |
| Released | 16 April 2026 |
| SWE-bench Verified | 73.4 |
| Terminal-Bench 2.0 | 51.5 |
| 4-bit VRAM | ~22 GB |
See the full model card for the extended benchmark grid.
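The headline sparsity ratio follows directly from the spec-table numbers (a quick sketch, nothing measured):

```python
# The headline sparsity ratio, from the spec-table numbers alone.
total_params, active_params = 35e9, 3e9
active_fraction = active_params / total_params
print(f"{active_fraction:.1%}")   # 8.6% of weights touched per token
```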
The release also closes the loop on last year's Qwen 3.5-35B-A3B - same parameter envelope, same MoE topology on paper, but a different attention stack underneath and a large jump in coding-agent scores.
What Alibaba Shipped
Weights, a modeling file, and a tokenizer on HuggingFace. No API-first launch, no closed preview, no waitlist. Qwen/Qwen3.6-35B-A3B on the Hub, Apache-2.0 license in the repo root, the reference serving configs pinned to SGLang (>= 0.5.10) and vLLM (>= 0.19.0) with eight-way tensor parallel and a 262K context ceiling. Instruct and thinking variants live in the same checkpoint and switch by sampling parameters.
The coverage here is the open-source release only. Alibaba also maintains a closed Qwen 3.6 Max tier served through Alibaba Cloud; that's a different model with different numbers and isn't what shipped on the Hub.
The Architecture
Three choices carry most of the weight.
Gated DeltaNet
The attention stack is a hybrid. The model stacks 40 layers as 10 repetitions of a four-layer block: three layers of Gated DeltaNet followed by one layer of Gated Attention, each layer followed by a MoE feedforward. DeltaNet is a linear-attention variant that keeps the attention state as a running matrix, updated incrementally per token rather than recomputing pairwise similarities at full quadratic cost. Qwen's configuration runs 32 heads on V and 16 on QK at 128 head dimension. The practical payoff is long-context scaling: the linear term dominates as context grows, so reading a 262K-token repository costs roughly linearly in tokens rather than quadratically.
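A toy recurrence makes the cost argument concrete. This is NOT Qwen's Gated DeltaNet kernel (no gating, no delta rule) - just the plain linear-attention update it builds on, showing why per-token work stays constant:

```python
import numpy as np

# Toy linear-attention recurrence. The running state S plays the role
# of the implicit similarity matrix: it is folded forward one token at
# a time instead of being recomputed against all previous tokens.
def linear_attention(q, k, v):
    T, d = q.shape
    S = np.zeros((d, d))            # running key-value state
    out = np.empty_like(v)
    for t in range(T):              # O(d^2) work per token -> O(T) total
        S += np.outer(k[t], v[t])   # fold the new token into the state
        out[t] = q[t] @ S           # read out against the current state
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)   # (8, 4): same output shape as softmax attention
```

The state never grows with sequence length, which is exactly the property that makes a 262K-token read affordable on the DeltaNet layers.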
Gated Attention
Every fourth attention block is full, not linear. This is where the standard softmax attention lives - 16 Q heads, 2 KV heads (grouped-query, so the KV cache stays small), 256 head dimension with a 64-dim rotary embedding. The model needs a full attention block to pick up relationships the linear pass can miss, especially for pointer-style work like reference resolution across a diff. Running it once every four layers keeps the quadratic cost bounded while preserving fidelity.
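The GQA choice is what keeps the cache small. A back-of-envelope sizing of the KV cache for just the ten full-attention layers, assuming an fp16 cache and the config above (the DeltaNet layers hold a constant-size state instead of a cache):

```python
# Back-of-envelope KV-cache sizing for the full-attention layers only.
# Config values are the ones described above; fp16 cache assumed.
kv_heads, head_dim = 2, 256        # grouped-query attention
full_attn_layers = 40 // 4         # one Gated Attention layer in four
bytes_fp16 = 2
ctx = 262_144                      # native context length

per_token = 2 * kv_heads * head_dim * bytes_fp16 * full_attn_layers  # K + V
print(per_token)                   # 20480 bytes per token
print(ctx * per_token / 2**30)     # 5.0 GiB at full native context
```

With 16 Q heads but only 2 KV heads, the cache is an eighth of what full multi-head attention would demand at the same head dimension.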
Mixture of Experts
Each feedforward is a 256-expert MoE with 8 experts routed plus 1 shared. Each expert carries a 512-dim intermediate projection. For a given token the router picks 8 of the 256 experts by learned scoring, and the shared expert runs unconditionally. That's where the 3B active count comes from: the attention stack accounts for roughly 1B of the active weights, and the MoE routes the rest from the sparse 35B pool.
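A minimal sketch of the routing step, with a random gate standing in for the learned scorer (the expert bodies and the shared expert are omitted; only the top-k selection shape mirrors the model):

```python
import numpy as np

# Minimal top-k routing sketch matching the topology above: 256 experts,
# 8 routed per token. The gate weights here are random placeholders.
N_EXPERTS, TOP_K, D = 256, 8, 64

def route(x, gate_w, k=TOP_K):
    logits = gate_w @ x                          # per-expert scores
    top = np.argpartition(logits, -k)[-k:]       # indices of the k best
    w = np.exp(logits[top] - logits[top].max())  # stable softmax ...
    return top, w / w.sum()                      # ... over selected only

rng = np.random.default_rng(0)
x = rng.standard_normal(D)
gate_w = rng.standard_normal((N_EXPERTS, D))
experts, weights = route(x, gate_w)
print(sorted(experts.tolist()))   # the 8 expert ids chosen for this token
# (the shared expert would run on top of these, unconditionally)
```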
The training pipeline also includes a multi-token-prediction objective, which buys a small but cumulative speedup at inference time.
The DeltaNet/MoE hybrid isn't new; Mamba, RWKV, and the Qwen 3.5 series all draw on the same body of linear-attention work. What changed in the 3.6 release is that the ablation curves finally beat the dense baselines across the coding suite rather than merely matching them.
Benchmarks Against the Field
The comparison that matters for this release isn't "does it beat GPT-5 on MMLU" - it doesn't. It's whether the coding numbers sit in a class where users running agents would seriously consider a local deployment over a paid API call.
| Benchmark | Qwen3.6-35B-A3B | DeepSeek V4 | Gemma4-31B |
|---|---|---|---|
| SWE-bench Verified | 73.4 | 83.7 | not published |
| SWE-bench Pro | 49.5 | not published | 35.7 |
| Terminal-Bench 2.0 | 51.5 | not published | 42.9 |
| MMLU-Pro | 85.2 | not published | not published |
| GPQA | 86.0 | not published | not published |
| AIME 2026 | 92.7 | not published | not published |
| MMMU | 81.7 | not published | not published |
DeepSeek V4 leads SWE-bench Verified by roughly 10 points, but V4 is a ~1T-parameter dense model that you self-host only if you already own tensor parallel across a multi-node GPU cluster. The comparable deployment target for Qwen3.6-35B-A3B is a single node, sometimes a single card. Against the field it is actually competing with - dense 27-35B open-source releases and Llama 4's open weights - it wins decisively on that axis. A 38% relative lead over Gemma 4-31B on SWE-bench Pro, at roughly a tenth of the active compute, is a larger architectural delta than any single dense scale-up release of the last twelve months has delivered.
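For the record, the 38% figure is the relative SWE-bench Pro gap from the table:

```python
# The relative SWE-bench Pro gap over Gemma 4-31B, from the table above.
qwen, gemma = 49.5, 35.7
rel = (qwen - gemma) / gemma
print(f"{rel:.1%}")   # 38.7%, quoted as "38%" in the text
```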
What It Takes to Run It
At 4-bit the checkpoint fits in the 24 GB envelope of an RTX 4090 with working room for context. At 8-bit you need two cards or a single H100 slice. The thinking mode preserves reasoning context across turns, which raises the effective prompt length over long agent sessions.
| Precision | VRAM (weights) | Target hardware |
|---|---|---|
| INT4 (GGUF / AWQ) | ~18-22 GB | RTX 4090 / 3090 24GB |
| INT8 | ~36 GB | 2x 24GB or 1x A100 40GB |
| FP16 / BF16 | ~72 GB | 1x H100 80GB or 4x 24GB |
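The weights-only footprint can be sanity-checked from the parameter count alone; real deployments add KV cache, activations, and quantization overhead, which is why the table rows run a little higher:

```python
# Weights-only footprint per precision, before KV cache, activations,
# and quantization overhead.
params = 35e9
for name, nbytes in [("INT4", 0.5), ("INT8", 1.0), ("FP16", 2.0)]:
    print(f"{name}: {params * nbytes / 1e9:.1f} GB weights")
# INT4: 17.5 GB  INT8: 35.0 GB  FP16: 70.0 GB
```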
Qwen's reference configs assume 8-way tensor parallel to serve the full 262K context - treat that as the requirement for full-context serving, not the minimum to run the model at all. At 4-bit and a 64K context ceiling the model serves from a single consumer GPU at a few tokens per second at batch size 1. That's slow for an interactive pair-programming loop, but fast enough for an agent running patch-and-test cycles in a terminal with tool use.
The thinking mode uses aggressive sampling (temperature 1.0, top_p 0.95, top_k 20, min_p 0.0) with presence_penalty 1.5. The instruct mode runs cooler (temperature 0.7). Swapping modes is a sampling-parameter change, not a checkpoint change.
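The two modes as plain sampling dicts - values are the ones quoted above; the key names follow common OpenAI-style serving APIs and may differ by server, and the instruct mode's remaining knobs aren't quoted in the card:

```python
# Mode switch as a sampling-parameter swap; the checkpoint stays the same.
THINKING = dict(temperature=1.0, top_p=0.95, top_k=20,
                min_p=0.0, presence_penalty=1.5)
INSTRUCT = dict(temperature=0.7)   # other knobs not quoted in the card

print(THINKING["presence_penalty"], INSTRUCT["temperature"])  # 1.5 0.7
```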
What To Watch
Four things are worth pricing in before treating the numbers as settled.
Independent SWE-bench reproduction. Qwen's SWE-bench agent scaffold uses temperature 1.0, top_p 0.95, a 200K context window, and an internal bash-plus-file-edit tool runner. The 73.4 is the Qwen-scaffold number. Community reproductions on the public SWE-bench harness have run several points below first-party scores for every open-weights model published this year. The real-world operating number is more likely to sit in the high 60s once third-party evaluators publish runs.
The Max-tier shadow. Alibaba is maintaining a closed Qwen 3.6 Max tier against this Apache-2.0 release. History with the 3.5 series suggests the Max-tier numbers will sit 5 to 8 points higher on coding benchmarks and will consume the headline coverage in Chinese AI media. Buyers who want a self-hosted model should be careful not to conflate the two lines.
Vision evals are light. The multimodal numbers (MMMU 81.7, VideoMMU 83.7, RefCOCO 92.0) come from Qwen's internal eval. The vision-language community has been faster to publish independent evals than the coding community, so expect those to land within a month.
Long-tail regressions. The release notes name a few places the model regresses relative to 3.5: pure math word problems without thinking mode, and long-form creative writing at temperature 0.7. Neither is the target workload here, but users pushing the model into generalist roles should benchmark first.
The release is open-weights, permissive-license, consumer-hardware-deployable, and competitive on the coding axis that most closed-source agent products charge for. That last fact is the load-bearing one.
Sources:
- Qwen3.6-35B-A3B on HuggingFace - official model card
- Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All - Qwen team blog
- Qwen3.6-35B-A3B: 73.4% SWE-Bench, Runs Locally - BuildFastWithAI, April 2026
- Qwen3.6-35B-A3B Complete Review - Dev.to
- Qwen 3.6 vs Gemma 4 vs Llama 4 vs GLM-5.1 vs DeepSeek V4 - Lushbinary
- QwenLM/Qwen3.6 on GitHub - reference implementation
