Alibaba's Qwen3.6 Coder: 73.4 SWE-bench, 22GB VRAM

Qwen3.6-35B-A3B lands with 73.4 on SWE-bench Verified and Apache 2.0 weights, all from 3 billion active parameters routed through a 256-expert MoE. Fits on a single consumer GPU.

Alibaba's Qwen team shipped Qwen3.6-35B-A3B on 16 April under Apache 2.0. The benchmark line reads hard: 73.4 on SWE-bench Verified, 51.5 on Terminal-Bench 2.0, 92.7 on AIME 2026. What doesn't appear in the headline is the number that makes the rest of it load-bearing: 3 billion active parameters. The model routes through a 256-expert Mixture-of-Experts layer that activates roughly 8.6% of its weights per token, which is how a checkpoint nominally 35B in size fits on a single 4090 at 4-bit and codes at a weight class it has no business reaching.

Qwen3.6-35B-A3B at a glance

| Spec | Value |
|---|---|
| Total parameters | 35B (256 experts) |
| Active per token | 3B (8 routed + 1 shared) |
| Native context | 262,144 tokens |
| Extensible context | 1,010,000 tokens (YaRN) |
| Modalities | Text, images, video |
| License | Apache 2.0 |
| Released | 16 April 2026 |
| SWE-bench Verified | 73.4 |
| Terminal-Bench 2.0 | 51.5 |
| 4-bit VRAM | ~22 GB |

See the full model card for the extended benchmark grid.

The release also closes the loop on last year's Qwen 3.5-35B-A3B - same parameter envelope, same MoE topology on paper, but a different attention stack underneath and a large jump in coding-agent scores.

What Alibaba Shipped

Weights, a modeling file, and a tokenizer on HuggingFace. No API-first launch, no closed preview, no waitlist. Qwen/Qwen3.6-35B-A3B on the Hub, Apache-2.0 license in the repo root, the reference serving configs pinned to SGLang (>= 0.5.10) and vLLM (>= 0.19.0) with eight-way tensor parallel and a 262K context ceiling. Instruct and thinking variants live in the same checkpoint and switch by sampling parameters.
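As a sketch, a single-node launch against those reference pins might look like the following. The model ID is the Hub path above; the flags are standard vLLM serve options, and the values mirror the stated 8-way, 262K reference config rather than anything Qwen publishes verbatim:

```shell
# Hypothetical launch line; assumes a node with 8 GPUs as in the
# reference config. Smaller boxes would lower both values.
pip install "vllm>=0.19.0"
vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 8 \
  --max-model-len 262144
```

Single-card 4-bit serving (see the VRAM table below) would instead go through a quantized GGUF/AWQ build with a reduced context ceiling.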

The coverage here is the open-source release only. Alibaba also maintains a closed Qwen 3.6 Max tier served through Alibaba Cloud; that's a different model with different numbers and isn't what shipped on the Hub.

The Architecture

Three choices carry most of the weight.

Gated DeltaNet

The attention stack is a hybrid. The model stacks 40 layers in a fixed 10-block pattern: three layers of Gated DeltaNet followed by one layer of Gated Attention, each followed by a MoE feedforward. DeltaNet is a linear attention variant where the similarity matrix is kept implicit and updated incrementally rather than recomputed at full quadratic cost. Qwen's configuration runs 32 heads on V and 16 on QK at 128 head dimension. The practical payoff is long-context scaling: the linear term leads as context grows, so reading a 262K-token repository costs roughly linearly in tokens rather than quadratically.
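A toy illustration of the incremental update: a plain delta rule in NumPy, where the state matrix stands in for the implicit similarity matrix. The real layer adds per-head gating, normalization, and the 32V/16QK head layout, none of which is modeled here:

```python
import numpy as np

def delta_net_step(S, k, v, q, beta=1.0):
    """One recurrent step of a simplified delta-rule linear attention.

    S: (d_v, d_k) running state, updated in place of a full attention
    matrix; k, q: (d_k,) key/query; v: (d_v,) value; beta: write strength.
    """
    pred = S @ k                          # what the state currently retrieves for k
    S = S + beta * np.outer(v - pred, k)  # delta-rule correction toward v
    return S, S @ q                       # new state and the output for q

d_k, d_v = 16, 16
S = np.zeros((d_v, d_k))
rng = np.random.default_rng(0)
outs = []
for _ in range(5):                        # per-token cost is O(d_k * d_v), independent of sequence length
    k, v, q = rng.standard_normal((3, d_k))
    S, o = delta_net_step(S, k, v, q, beta=0.5)
    outs.append(o)
```

The point of the sketch is the cost profile: each token touches only the fixed-size state, so total decode cost grows linearly with sequence length.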

Gated Attention

Every fourth attention block is full, not linear. This is where the standard softmax attention lives - 16 Q heads, 2 KV heads (grouped-query, so the KV cache stays small), 256 head dimension with a 64-dim rotary embedding. The model needs a full attention block to pick up relationships the linear pass can miss, especially for pointer-style work like reference resolution across a diff. Running it once every four layers keeps the quadratic cost bounded while preserving fidelity.
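A back-of-envelope for why the 2-KV-head choice matters: only the ten Gated Attention layers keep a KV cache (the DeltaNet layers hold constant-size recurrent state instead), and grouped-query heads shrink that cache 8x versus a hypothetical 16 full KV heads. The arithmetic below assumes an fp16 cache:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    # K and V each store kv_heads * head_dim values per layer per token
    return tokens * layers * kv_heads * head_dim * 2 * dtype_bytes

full_attn_layers = 40 // 4   # one Gated Attention layer per 4-layer group
gqa = kv_cache_bytes(262_144, full_attn_layers, kv_heads=2, head_dim=256)
mha = kv_cache_bytes(262_144, full_attn_layers, kv_heads=16, head_dim=256)
print(f"GQA cache at 262K: {gqa / 2**30:.1f} GiB "
      f"(vs {mha / 2**30:.1f} GiB with 16 KV heads)")
```

That works out to about 5 GiB of cache at the full native context, which is what leaves "working room" next to ~18-22 GB of 4-bit weights on shorter windows.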

Mixture of Experts

Each feedforward is a 256-expert MoE with 8 experts routed plus 1 shared. Each expert carries a 512-dim intermediate projection. For a given token the router picks 8 of the 256 experts by learned scoring and the shared expert runs unconditionally. That's where the 3B active count comes from: the attention stack accounts for roughly 1B of the active weights, the MoE routes the rest from the sparse 35B pool.
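A minimal sketch of that routing step, with toy single-matrix experts standing in for the real gated 512-dim projections (expert internals, load-balancing losses, and batched dispatch are all elided):

```python
import numpy as np

def route(router_logits, experts, shared_expert, x, k=8):
    """Mix the top-k experts by router score, plus one always-on shared expert."""
    topk = np.argsort(router_logits)[-k:]        # pick 8 of 256 by learned score
    w = np.exp(router_logits[topk])
    w /= w.sum()                                 # softmax over the chosen experts only
    routed = sum(wi * experts[i](x) for wi, i in zip(w, topk))
    return routed + shared_expert(x)             # shared expert runs unconditionally

rng = np.random.default_rng(0)
d, n_experts = 32, 256
# Toy experts: one small matrix each, in place of real up/down projections
Ws = rng.standard_normal((n_experts, d, d)) * 0.02
experts = [lambda x, W=W: W @ x for W in Ws]
shared = lambda x: 0.1 * x
y = route(rng.standard_normal(n_experts), experts, shared, rng.standard_normal(d))
```

Only 9 of the 257 expert computations run per token, which is the mechanism behind the 3B-active / 35B-total split.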

The training pipeline also includes a multi-token-prediction objective, which enables a small but cumulative speculative-style speedup at serving time.

The DeltaNet/MoE hybrid isn't new: Mamba, RWKV, and the Qwen 3.5 series all drew from the same body of linear-attention work. What's changed in the 3.6 release is that the ablation curves finally beat the dense baselines across the coding suite rather than just matching them.

Benchmarks Against the Field

The comparison that matters for this release isn't "does it beat GPT-5 on MMLU" - it doesn't. It is whether the coding numbers sit in a class where agentic users seriously consider a local deployment over a paid API call.

| Benchmark | Qwen3.6-35B-A3B | DeepSeek V4 | Gemma4-31B |
|---|---|---|---|
| SWE-bench Verified | 73.4 | 83.7 | not published |
| SWE-bench Pro | 49.5 | not published | 35.7 |
| Terminal-Bench 2.0 | 51.5 | not published | 42.9 |
| MMLU-Pro | 85.2 | not published | not published |
| GPQA | 86.0 | not published | not published |
| AIME 2026 | 92.7 | not published | not published |
| MMMU | 81.7 | not published | not published |

DeepSeek V4 leads SWE-bench Verified by roughly 10 points, but V4 is a ~1T-parameter dense model you self-host only if you already run tensor parallel across a multi-node GPU cluster. The comparable deployment target for Qwen3.6-35B-A3B is a single node, sometimes a single card. On that axis, against the field it is actually competing with (dense 27-35B open-source releases and Llama 4's open weights), it wins decisively. A 38% relative lead over Gemma 4-31B on SWE-bench Pro at a third of the active compute is a larger architectural delta than any single dense scale-up release of the last twelve months.

What It Takes to Run It

At 4-bit the checkpoint fits in the 24GB frame of an RTX 4090 with working room for context. At 8-bit you need two cards or a single H100 slice. The thinking mode preserves reasoning context across turns, which raises the effective prompt length across long agent sessions.

| Precision | VRAM (weights) | Target hardware |
|---|---|---|
| INT4 (GGUF / AWQ) | ~18-22 GB | RTX 4090 / 3090 24GB |
| INT8 | ~36 GB | 2x 24GB or 1x A100 40GB |
| FP16 / BF16 | ~72 GB | 1x H100 80GB or 4x 24GB |

Qwen's reference configs expect 8-way tensor parallel for the full 262K context, which is the lower bound. At 4-bit and a 64K context ceiling the model serves from a single consumer GPU at a few tokens per second on batch size 1. That's slow for an interactive pair-programming loop, but fast enough for an agent running patch-and-test cycles in a terminal with tool use.
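The weights-only arithmetic behind the table is just parameter count times bits per weight; the published figures run a little higher because quantization scales, layers kept in higher precision, and runtime buffers add overhead:

```python
def weight_gb(params_b, bits):
    # Weights only: billions of parameters * bits per weight / 8 bits per byte
    return params_b * bits / 8

for bits, label in [(4, "INT4"), (8, "INT8"), (16, "BF16")]:
    print(f"{label}: ~{weight_gb(35, bits):.1f} GB of weights")
```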

The thinking mode uses aggressive sampling (temperature 1.0, top_p 0.95, top_k 20, min_p 0.0) with presence_penalty 1.5. The instruct mode runs cooler (temperature 0.7). Swapping modes is a sampling-parameter change, not a checkpoint change.
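Sketched as configs (field names follow common vLLM-style sampling vocabulary; only the values stated above are filled in, and anything unlisted for instruct mode is left at defaults):

```python
# The two documented modes as sampling configs. Same checkpoint either way.
THINKING = {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
            "min_p": 0.0, "presence_penalty": 1.5}
INSTRUCT = {"temperature": 0.7}

def sampling_params(mode: str) -> dict:
    """Mode switch is a sampling-parameter change, not a checkpoint swap."""
    return THINKING if mode == "thinking" else INSTRUCT
```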

What To Watch

Four things are worth pricing in before treating the numbers as settled.

Independent SWE-bench reproduction. Qwen's SWE-bench agent scaffold uses temperature 1.0, top_p 0.95, a 200K context window, and an internal bash-plus-file-edit tool runner. The 73.4 is the Qwen-scaffold number. Community reproductions on the public SWE-bench harness have run several points below first-party scores for every open-weights model published this year. The real-world operating number is more likely to sit in the high 60s once third-party evaluators publish runs.

The Max-tier shadow. Alibaba is maintaining a closed Qwen 3.6 Max tier against this Apache-2.0 release. History with the 3.5 series suggests the Max-tier numbers will sit 5 to 8 points higher on coding benchmarks and will consume the headline coverage in Chinese AI media. Buyers who want a self-hosted model should be careful not to conflate the two lines.

Vision evals are light. The multimodal numbers (MMMU 81.7, VideoMMU 83.7, RefCOCO 92.0) come from Qwen's internal eval. The vision-language community has been faster to publish independent evals than the coding community, so expect those to land within a month.

Long-tail regressions. The release notes name a few places the model regresses relative to 3.5: pure math word problems without thinking mode, and long-form creative writing at temperature 0.7. Neither is the target workload here, but users pushing the model into generalist roles should benchmark first.

The release is open-weights, permissive-license, consumer-hardware-deployable, and competitive on the coding axis that most closed-source agent products charge for. That last fact is the load-bearing one.

About the author: AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.