Alibaba Drops Qwen3.5: 397B Parameters, 17B Active, and It Trades Blows with GPT-5.2
Alibaba's Qwen team releases Qwen3.5-397B-A17B, an open-weight mixture-of-experts model with native multimodal support and a hybrid attention architecture that runs 8x faster than its predecessor. Apache 2.0 licensed.

Alibaba's Qwen team released Qwen3.5-397B-A17B on February 16 - a 397 billion parameter mixture-of-experts model that activates only 17 billion parameters per token and ships under an Apache 2.0 license. The model is natively multimodal (text, images, video), supports 201 languages, and uses a hybrid attention architecture that makes it 8x faster and roughly 60% cheaper to run than its predecessor.
The benchmarks are the headline: Qwen3.5 matches or exceeds GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro across a majority of evaluated tasks, with particular dominance in vision-language understanding. On those reported numbers, it is the strongest open-weight model released to date.
TL;DR
| Detail | Value |
|---|---|
| Total parameters | 397B |
| Active parameters | 17B (~4.3% activation ratio) |
| Architecture | Hybrid MoE: Gated DeltaNet + Gated Attention (3:1 ratio) |
| Context length | 256K native, extensible to 1M |
| Modalities | Text, images, video (native early fusion) |
| Languages | 201 |
| License | Apache 2.0 |
| Thinking mode | Built-in, togglable |
| Inference speed | 8.6x faster at 32K, 19x faster at 256K vs Qwen3-Max |
The architecture: attention is no longer just attention
The most technically significant detail in Qwen3.5 is not its size. It's how the model handles attention.
Across its 60 layers, Qwen3.5 uses a 3:1 hybrid ratio of two different attention mechanisms. Three out of every four blocks use Gated DeltaNet, a linear attention variant based on research published in December 2024 that combines Mamba2's gated decay mechanism with delta rule updates for hidden states. These layers scale sub-quadratically with sequence length. The fourth block in each cycle uses Gated Attention with full softmax, providing the model with periodic access to exact attention when it needs it.
The formal layout: `15 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))`
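That cycle can be sketched directly. The snippet below is a toy schedule builder, not actual model code, and the string labels are illustrative rather than Qwen's real module names:

```python
# Sketch of the published layer pattern: 15 cycles of three Gated DeltaNet
# blocks followed by one Gated Attention block (each block is paired with an
# MoE FFN, omitted here). Labels are illustrative, not real module names.

def build_layer_schedule(cycles: int = 15, linear_per_cycle: int = 3) -> list[str]:
    """Return the attention mechanism used by each transformer layer, in order."""
    schedule: list[str] = []
    for _ in range(cycles):
        schedule += ["gated_deltanet"] * linear_per_cycle
        schedule.append("gated_attention")
    return schedule

layers = build_layer_schedule()
print(len(layers), layers.count("gated_deltanet"), layers.count("gated_attention"))
# 60 layers total: 45 linear-attention layers, 15 full softmax-attention layers
```

The 3:1 split means only a quarter of the layers ever touch the quadratic-cost softmax path, which is where the long-context speedups below come from.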
This is not the same approach as DeepSeek's Multi-head Latent Attention (MLA), which competitors like Kimi and Zhipu have adopted. As mlabonne observed in a HuggingFace blog post, the AI field is entering a period where "nobody agrees on attention anymore" - and Qwen is betting on a fundamentally different direction.
The practical payoff is speed. At 32K context, Qwen3.5 decodes 8.6x faster than Qwen3-Max on the same hardware. At 256K context, the gap widens to 19x. This is what makes a 397B-parameter model economically viable to serve: most of those parameters are dormant on any given token.
MoE breakdown
| Component | Value |
|---|---|
| Total experts | 512 |
| Routed experts per token | 10 |
| Shared expert | 1 (always active) |
| Expert intermediate dimension | 1,024 |
| Hidden dimension | 4,096 |
| Vocabulary size | 248,320 |
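These numbers can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes each expert is a standard SwiGLU FFN with three weight matrices (gate, up, down), which the spec table does not confirm; on that assumption, the routed experts alone account for roughly 387B of the 397B total, and the 11 experts active per layer contribute about 8.3B of the 17B active budget (attention, DeltaNet, and embedding parameters make up the rest):

```python
# Back-of-envelope check of the MoE table. Assumes each expert is a SwiGLU FFN
# with 3 weight matrices (gate/up/down) - an assumption, not a stated fact.

hidden = 4096            # hidden dimension
expert_inter = 1024      # expert intermediate dimension
layers = 60              # transformer layers
total_experts = 512      # experts per layer
active_experts = 10 + 1  # 10 routed + 1 shared per token

params_per_expert = 3 * hidden * expert_inter  # ~12.6M under the SwiGLU assumption
total_expert_params = params_per_expert * total_experts * layers
active_expert_params = params_per_expert * active_experts * layers

print(f"total expert params:  ~{total_expert_params / 1e9:.0f}B")   # ~387B of 397B total
print(f"active expert params: ~{active_expert_params / 1e9:.1f}B")  # ~8.3B of 17B active
```

The embedding table alone (248,320 x 4,096) adds about 1B parameters, which is why the non-expert share of the active budget is non-trivial.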
Benchmarks: where it wins and where it does not
Alibaba claims Qwen3.5 outperforms GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro on roughly 80% of their evaluated benchmarks. That is a strong claim for an open-weight model. The numbers deserve a closer look.
Language and reasoning
| Benchmark | Category | Qwen3.5 | GPT-5.2 | Claude 4.5 | Gemini 3 Pro |
|---|---|---|---|---|---|
| MMLU-Pro | Knowledge | 87.8 | 87.4 | 89.5 | 89.8 |
| SuperGPQA | Knowledge | 70.4 | 67.9 | 70.6 | 74.0 |
| IFBench | Instruction following | 76.5 | 75.4 | 58.0 | 70.4 |
| MultiChallenge | Instruction following | 67.6 | 57.9 | 54.2 | 64.2 |
| LongBench v2 | Long context | 63.2 | 54.5 | 64.4 | 68.2 |
| AIME26 | Math competition | 91.3 | 96.7 | 93.3 | 90.6 |
| LiveCodeBench v6 | Coding | 83.6 | 87.7 | 84.8 | 90.7 |
| SWE-bench Verified | Coding agent | 76.4 | 80.0 | 80.9 | 76.2 |
| BFCL-V4 | Tool use | 72.9 | 63.1 | 77.5 | 72.5 |
Qwen3.5 dominates on instruction following (IFBench, MultiChallenge) and holds its own on knowledge benchmarks. It is weaker on math competitions (AIME26, HMMT) and coding (LiveCodeBench, SWE-bench), where GPT-5.2 and Claude 4.5 still lead. The Qwen team, to their credit, does not overclaim on coding.
Vision-language: where it pulls away
| Benchmark | Category | Qwen3.5 | GPT-5.2 | Claude 4.5 | Gemini 3 Pro |
|---|---|---|---|---|---|
| MathVision | Math + vision | 88.6 | 83.0 | 74.3 | 86.6 |
| MathVista | Math + vision | 90.3 | 83.1 | 80.0 | 87.9 |
| MMBench EN | General VQA | 93.7 | 88.2 | 89.2 | 93.7 |
| OmniDocBench | Document understanding | 90.8 | 85.7 | 87.7 | 88.5 |
| OCRBench | OCR | 93.1 | 80.7 | 85.8 | 90.4 |
| V* (with CI) | Spatial intelligence | 95.8 | 75.9 | 67.0 | 88.0 |
| MLVU | Video understanding | 86.7 | 85.6 | 81.7 | 83.0 |
This is where the "native multimodal" label earns its weight. Unlike Qwen3, which shipped text and vision as separate model lines (Qwen3 and Qwen3-VL), Qwen3.5 was trained with early fusion from the ground up - images and video are not bolted on through an adapter. The result is a roughly 20-point lead over GPT-5.2 on spatial intelligence (V*) and a 12-point lead on OCR; against Claude 4.5, the V* gap is nearly 29 points.
What changed from Qwen3
The jump from Qwen3 to Qwen3.5 is not incremental. It is architectural.
| | Qwen3 (May 2025) | Qwen3.5 (Feb 2026) |
|---|---|---|
| Attention | Standard transformer | Hybrid DeltaNet + softmax (3:1) |
| Total parameters | 235B | 397B |
| Active parameters | 22B | 17B (fewer active, more total) |
| Multimodality | Separate models (Qwen3-VL) | Native early fusion |
| Languages | 119 | 201 |
| Context | 128K | 256K (extensible to 1M) |
| Speed | Baseline | 8x faster |
| Cost | Baseline | ~60% cheaper |
The counterintuitive part: Qwen3.5 activates fewer parameters per token (17B vs 22B) despite having nearly twice the total parameter count (397B vs 235B). That means a wider expert pool with sparser activation: each token selects from 512 smaller, more specialized experts, so a lower share of the model fires on any given step. The tradeoff works because the Gated DeltaNet layers carry the bulk of sequential context cheaply, while the periodic full-attention layers correct for the approximation error that accumulates in the linear-attention state.
Running it
The BF16 weights need roughly 800 GB of VRAM (think 10x H100 80GB). The FP8 quantization cuts that to ~400 GB (8x H100). Community GGUF quantizations from Unsloth go down to Q4_K_M (~220 GB, fits on a 256GB Apple M3 Ultra) and even 3-bit variants.
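Those footprints follow from simple weights-only arithmetic (KV cache, activations, and runtime overhead excluded; the ~4.5 effective bits/weight for Q4_K_M is my assumption for the mixed-precision format, not an official figure):

```python
# Weights-only memory math behind the footprints quoted above. Ignores KV
# cache, activations, and runtime overhead. The ~4.5 bits/weight effective
# average for Q4_K_M is an assumption, not an official number.

total_params = 397e9

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight storage in GB at a given precision."""
    return total_params * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("Q4_K_M", 4.5)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB")
# BF16 ~794 GB, FP8 ~397 GB, Q4_K_M ~223 GB - in line with the
# ~800 / ~400 / ~220 GB figures quoted above.
```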
For developers who want to kick the tires without a GPU cluster, the model is available through vLLM, SGLang, Ollama, LM Studio, and as a hosted API via Alibaba's Model Studio (as Qwen3.5-Plus, with a default 1M token context window).
A terminal-based coding agent, Qwen Code, was released alongside the model.
The competitive picture
Qwen3.5 landed during a week when multiple Chinese AI labs released new flagship models. The timing is not coincidental. What makes this release stand out is the combination of three factors that no other model currently matches simultaneously:
- Open weights under Apache 2.0 - not a restricted license, not API-only
- Native multimodal - not a text model with a vision adapter
- Frontier-competitive benchmarks - within striking distance of GPT-5.2 and Claude 4.5 on text, and ahead of both on vision
The activation ratio (4.3%) is also worth watching. Kimi K2.5 pushed this even further with a 3.25% ratio, suggesting that extreme sparsity in MoE models is becoming the default design philosophy for making large models economically deployable.
Smaller Qwen3.5 variants (9B and 35B-A3B have been spotted in pre-release) should follow, which will bring this architecture to hardware that most developers actually have access to.
Sources:
- Qwen3.5: Towards Native Multimodal Agents - Qwen Blog
- Qwen3.5-397B-A17B - HuggingFace
- Qwen3.5-397B-A17B-FP8 - HuggingFace
- Alibaba's Qwen 3.5 397B-A17B beats its larger trillion-parameter model - VentureBeat
- Alibaba unveils Qwen3.5 as China's chatbot race shifts to AI agents - CNBC
- Nobody Agrees on Attention Anymore - mlabonne, HuggingFace Blog
- Qwen3.5 GGUF quantizations - Unsloth
- QwenLM/Qwen3.5 - GitHub
- Gated Delta Networks: Improving Mamba2 with Delta Rule - arXiv
