Alibaba Drops Qwen3.5: 397B Parameters, 17B Active, and It Trades Blows with GPT-5.2
Alibaba's Qwen team releases Qwen3.5-397B-A17B, an open-weight mixture-of-experts model with native multimodal support and a hybrid attention architecture that runs 8x faster than its predecessor. Apache 2.0 licensed.

Alibaba's Qwen team released Qwen3.5-397B-A17B on February 16 - a 397 billion parameter mixture-of-experts model that activates only 17 billion parameters per token and ships under an Apache 2.0 license. The model is natively multimodal (text, images, video), supports 201 languages, and uses a hybrid attention architecture that makes it 8x faster and roughly 60% cheaper to run than its predecessor.
The benchmarks are the headline: Qwen3.5 matches or exceeds GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro across a majority of evaluated tasks, with particular dominance in vision-language understanding. On those reported numbers, it is the strongest open-weight model released to date.
TL;DR
| Detail | Value |
|---|---|
| Total parameters | 397B |
| Active parameters | 17B (~4.3% activation ratio) |
| Architecture | Hybrid MoE: Gated DeltaNet + Gated Attention (3:1 ratio) |
| Context length | 256K native, extensible to 1M |
| Modalities | Text, images, video (native early fusion) |
| Languages | 201 |
| License | Apache 2.0 |
| Thinking mode | Built-in, togglable |
| Inference speed | 8.6x faster at 32K, 19x faster at 256K vs Qwen3-Max |
The architecture: attention is no longer just attention
The most technically significant detail in Qwen3.5 is not its size. It's how the model handles attention.
Across its 60 layers, Qwen3.5 uses a 3:1 hybrid ratio of two different attention mechanisms. Three out of every four blocks use Gated DeltaNet, a linear attention variant based on research published in December 2024 that combines Mamba2's gated decay mechanism with delta rule updates for hidden states. These layers scale sub-quadratically with sequence length. The fourth block in each cycle uses Gated Attention with full softmax, providing the model with periodic access to exact attention when it needs it.
The formal layout: `15 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))`
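That cycle can be sketched directly. The snippet below is a toy schedule builder, not actual model code, and the string labels are illustrative rather than Qwen's real module names:

```python
# Sketch of the published layer pattern: 15 cycles of three Gated DeltaNet
# blocks followed by one Gated Attention block (each block is paired with an
# MoE FFN, omitted here). Labels are illustrative, not real module names.

def build_layer_schedule(cycles: int = 15, linear_per_cycle: int = 3) -> list[str]:
    """Return the attention mechanism used by each transformer layer, in order."""
    schedule: list[str] = []
    for _ in range(cycles):
        schedule += ["gated_deltanet"] * linear_per_cycle
        schedule.append("gated_attention")
    return schedule

layers = build_layer_schedule()
print(len(layers), layers.count("gated_deltanet"), layers.count("gated_attention"))
# 60 layers total: 45 linear-attention layers, 15 full softmax-attention layers
```

The 3:1 split means only a quarter of the layers ever touch the quadratic-cost softmax path, which is where the long-context speedups below come from.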
This is not the same approach as DeepSeek's Multi-head Latent Attention (MLA), which competitors like Kimi and Zhipu have adopted. As mlabonne observed in a HuggingFace blog post, the AI field is entering a period where "nobody agrees on attention anymore" - and Qwen is betting on a fundamentally different direction.
The practical payoff is speed. At 32K context, Qwen3.5 decodes 8.6x faster than Qwen3-Max on the same hardware. At 256K context, the gap widens to 19x. This is what makes a 397B-parameter model economically viable to serve: most of those parameters are dormant on any given token.
MoE breakdown
| Component | Value |
|---|---|
| Total experts | 512 |
| Routed experts per token | 10 |
| Shared expert | 1 (always active) |
| Expert intermediate dimension | 1,024 |
| Hidden dimension | 4,096 |
| Vocabulary size | 248,320 |
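These numbers can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes each expert is a standard SwiGLU FFN with three weight matrices (gate, up, down), which the spec table does not confirm; on that assumption, the routed experts alone account for roughly 387B of the 397B total, and the 11 experts active per layer contribute about 8.3B of the 17B active budget (attention, DeltaNet, and embedding parameters make up the rest):

```python
# Back-of-envelope check of the MoE table. Assumes each expert is a SwiGLU FFN
# with 3 weight matrices (gate/up/down) - an assumption, not a stated fact.

hidden = 4096            # hidden dimension
expert_inter = 1024      # expert intermediate dimension
layers = 60              # transformer layers
total_experts = 512      # experts per layer
active_experts = 10 + 1  # 10 routed + 1 shared per token

params_per_expert = 3 * hidden * expert_inter  # ~12.6M under the SwiGLU assumption
total_expert_params = params_per_expert * total_experts * layers
active_expert_params = params_per_expert * active_experts * layers

print(f"total expert params:  ~{total_expert_params / 1e9:.0f}B")   # ~387B of 397B total
print(f"active expert params: ~{active_expert_params / 1e9:.1f}B")  # ~8.3B of 17B active
```

The embedding table alone (248,320 x 4,096) adds about 1B parameters, which is why the non-expert share of the active budget is non-trivial.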
Benchmarks: where it wins and where it does not
Alibaba claims Qwen3.5 outperforms GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro on roughly 80% of their evaluated benchmarks. That is a strong claim for an open-weight model. The numbers deserve a closer look.
Language and reasoning
| Benchmark | Category | Qwen3.5 | GPT-5.2 | Claude 4.5 | Gemini 3 Pro |
|---|---|---|---|---|---|
| MMLU-Pro | Knowledge | 87.8 | 87.4 | 89.5 | 89.8 |
| SuperGPQA | Knowledge | 70.4 | 67.9 | 70.6 | 74.0 |
| IFBench | Instruction following | 76.5 | 75.4 | 58.0 | 70.4 |
| MultiChallenge | Instruction following | 67.6 | 57.9 | 54.2 | 64.2 |
| LongBench v2 | Long context | 63.2 | 54.5 | 64.4 | 68.2 |
| AIME26 | Math competition | 91.3 | 96.7 | 93.3 | 90.6 |
| LiveCodeBench v6 | Coding | 83.6 | 87.7 | 84.8 | 90.7 |
| SWE-bench Verified | Coding agent | 76.4 | 80.0 | 80.9 | 76.2 |
| BFCL-V4 | Tool use | 72.9 | 63.1 | 77.5 | 72.5 |
Qwen3.5 dominates on instruction following (IFBench, MultiChallenge) and holds its own on knowledge benchmarks. It is weaker on math competitions (AIME26, HMMT) and coding (LiveCodeBench, SWE-bench), where GPT-5.2 and Claude 4.5 still lead. The Qwen team, to their credit, does not overclaim on coding.
Vision-language: where it pulls away
| Benchmark | Category | Qwen3.5 | GPT-5.2 | Claude 4.5 | Gemini 3 Pro |
|---|---|---|---|---|---|
| MathVision | Math + vision | 88.6 | 83.0 | 74.3 | 86.6 |
| MathVista | Math + vision | 90.3 | 83.1 | 80.0 | 87.9 |
| MMBench EN | General VQA | 93.7 | 88.2 | 89.2 | 93.7 |
| OmniDocBench | Document understanding | 90.8 | 85.7 | 87.7 | 88.5 |
| OCRBench | OCR | 93.1 | 80.7 | 85.8 | 90.4 |
| V* (with CI) | Spatial intelligence | 95.8 | 75.9 | 67.0 | 88.0 |
| MLVU | Video understanding | 86.7 | 85.6 | 81.7 | 83.0 |
This is where the "native multimodal" label earns its weight. Unlike Qwen3, which shipped text and vision as separate model lines (Qwen3 and Qwen3-VL), Qwen3.5 was trained with early fusion from the ground up - images and video are not bolted on through an adapter. The result is a roughly 20-point lead over GPT-5.2 on spatial intelligence (V*) and a 12-point lead on OCR; against Claude 4.5, the V* gap is nearly 29 points.
What changed from Qwen3
The jump from Qwen3 to Qwen3.5 is not incremental. It is architectural.
| | Qwen3 (May 2025) | Qwen3.5 (Feb 2026) |
|---|---|---|
| Attention | Standard transformer | Hybrid DeltaNet + softmax (3:1) |
| Total parameters | 235B | 397B |
| Active parameters | 22B | 17B (fewer active, more total) |
| Multimodality | Separate models (Qwen3-VL) | Native early fusion |
| Languages | 119 | 201 |
| Context | 128K | 256K (extensible to 1M) |
| Speed | Baseline | 8x faster |
| Cost | Baseline | ~60% cheaper |
The counterintuitive part: Qwen3.5 activates fewer parameters per token (17B vs 22B) despite having nearly twice the total parameter count (397B vs 235B). That means a wider expert pool with sparser activation: each token selects from 512 smaller, more specialized experts, so a lower share of the model fires on any given step. The tradeoff works because the Gated DeltaNet layers carry the bulk of sequential context cheaply, while the periodic full-attention layers correct for the approximation error that accumulates in the linear-attention state.
Running it
The BF16 weights need roughly 800 GB of VRAM (think 10x H100 80GB). The FP8 quantization cuts that to ~400 GB (8x H100). Community GGUF quantizations from Unsloth go down to Q4_K_M (~220 GB, fits on a 256GB Apple M3 Ultra) and even 3-bit variants.
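Those footprints follow from simple weights-only arithmetic (KV cache, activations, and runtime overhead excluded; the ~4.5 effective bits/weight for Q4_K_M is my assumption for the mixed-precision format, not an official figure):

```python
# Weights-only memory math behind the footprints quoted above. Ignores KV
# cache, activations, and runtime overhead. The ~4.5 bits/weight effective
# average for Q4_K_M is an assumption, not an official number.

total_params = 397e9

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight storage in GB at a given precision."""
    return total_params * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("Q4_K_M", 4.5)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB")
# BF16 ~794 GB, FP8 ~397 GB, Q4_K_M ~223 GB - in line with the
# ~800 / ~400 / ~220 GB figures quoted above.
```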
For developers who want to kick the tires without a GPU cluster, the model is available through vLLM, SGLang, Ollama, LM Studio, and as a hosted API via Alibaba's Model Studio (as Qwen3.5-Plus, with a default 1M token context window).
A terminal-based coding agent, Qwen Code, was released alongside the model.
The competitive picture
Qwen3.5 landed during a week when multiple Chinese AI labs released new flagship models. The timing is not coincidental. What makes this release stand out is the combination of three factors that no other model currently matches simultaneously:
- Open weights under Apache 2.0 - not a restricted license, not API-only
- Native multimodal - not a text model with a vision adapter
- Frontier-competitive benchmarks - within striking distance of GPT-5.2 and Claude 4.5 on text, and ahead of both on vision
The activation ratio (4.3%) is also worth watching. Kimi K2.5 pushed this even further with a 3.25% ratio, suggesting that extreme sparsity in MoE models is becoming the default design philosophy for making large models economically deployable.
Smaller Qwen3.5 variants (9B and 35B-A3B have been spotted in pre-release) should follow, which will bring this architecture to hardware that most developers actually have access to.
Sources:
- Qwen3.5: Towards Native Multimodal Agents - Qwen Blog
- Qwen3.5-397B-A17B - HuggingFace
- Qwen3.5-397B-A17B-FP8 - HuggingFace
- Alibaba's Qwen 3.5 397B-A17B beats its larger trillion-parameter model - VentureBeat
- Alibaba unveils Qwen3.5 as China's chatbot race shifts to AI agents - CNBC
- Nobody Agrees on Attention Anymore - mlabonne, HuggingFace Blog
- Qwen3.5 GGUF quantizations - Unsloth
- QwenLM/Qwen3.5 - GitHub
- Gated Delta Networks: Improving Mamba2 with Delta Rule - arXiv
