Alibaba Drops Qwen3.5: 397B Parameters, 17B Active, and It Trades Blows with GPT-5.2

Alibaba's Qwen team releases Qwen3.5-397B-A17B, an open-weight mixture-of-experts model with native multimodal support and a hybrid attention architecture that runs 8x faster than its predecessor. Apache 2.0 licensed.

Alibaba's Qwen team released Qwen3.5-397B-A17B on February 16: a 397-billion-parameter mixture-of-experts model that activates only 17 billion parameters per token and ships under an Apache 2.0 license. The model is natively multimodal (text, images, video), supports 201 languages, and uses a hybrid attention architecture that makes it 8x faster and roughly 60% cheaper to run than its predecessor.

The benchmarks are the headline: Qwen3.5 matches or exceeds GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro across a majority of evaluated tasks, with particular dominance in vision-language understanding. By those numbers, it is the strongest open-weight model released to date.


TL;DR

| Detail | Value |
| --- | --- |
| Total parameters | 397B |
| Active parameters | 17B (~4.3% activation ratio) |
| Architecture | Hybrid MoE: Gated DeltaNet + Gated Attention (3:1 ratio) |
| Context length | 256K native, extensible to 1M |
| Modalities | Text, images, video (native early fusion) |
| Languages | 201 |
| License | Apache 2.0 |
| Thinking mode | Built-in, togglable |
| Inference speed | 8.6x faster at 32K, 19x faster at 256K vs Qwen3-Max |

The architecture: attention is no longer just attention

The most technically significant detail in Qwen3.5 is not its size. It's how the model handles attention.

Across its 60 layers, Qwen3.5 uses a 3:1 hybrid ratio of two different attention mechanisms. Three out of every four blocks use Gated DeltaNet, a linear attention variant based on research published in December 2024 that combines Mamba2's gated decay mechanism with delta rule updates for hidden states. These layers scale sub-quadratically with sequence length. The fourth block in each cycle uses Gated Attention with full softmax, providing the model with periodic access to exact attention when it needs it.

The formal layout: 15 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
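That layout can be sketched as a simple layer schedule. A minimal Python sketch (the layer names are illustrative labels, not Qwen's internal module names):

```python
# Build the 60-layer schedule: 15 cycles, each consisting of
# 3 x (Gated DeltaNet -> MoE) followed by 1 x (Gated Attention -> MoE).
def build_layer_schedule(cycles=15, deltanet_per_cycle=3):
    schedule = []
    for _ in range(cycles):
        schedule += ["gated_deltanet"] * deltanet_per_cycle  # linear attention
        schedule.append("gated_attention")  # full softmax attention
    return schedule

layers = build_layer_schedule()
print(len(layers))                      # 60 layers total
print(layers.count("gated_deltanet"))   # 45 sub-quadratic blocks
print(layers.count("gated_attention"))  # 15 exact-attention blocks
```

Three out of every four blocks scale sub-quadratically; only 15 of the 60 layers pay the full softmax cost.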

This is not the same approach as DeepSeek's Multi-head Latent Attention (MLA), which competitors like Kimi and Zhipu have adopted. As HuggingFace's mlabonne noted, the AI field is entering a period where "nobody agrees on attention anymore" - and Qwen is betting on a fundamentally different direction.

The practical payoff is speed. At 32K context, Qwen3.5 decodes 8.6x faster than Qwen3-Max on the same hardware. At 256K context, the gap widens to 19x. This is what makes a 397B-parameter model economically viable to serve: most of those parameters are dormant on any given token.

MoE breakdown

| Component | Value |
| --- | --- |
| Total experts | 512 |
| Routed experts per token | 10 |
| Shared expert | 1 (always active) |
| Expert intermediate dimension | 1,024 |
| Hidden dimension | 4,096 |
| Vocabulary size | 248,320 |
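To make the numbers concrete, here is a hedged sketch of top-k expert routing as it typically works in MoE layers of this shape (pure Python, illustrative only; Qwen's actual router code is not described in this article, and the shared expert simply runs on every token in addition to the routed ones):

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=10):
    """Pick the top_k routed experts for one token and renormalize
    their weights. The 1 shared expert is not routed: it runs on
    every token unconditionally."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(probs[i] for i in chosen)
    weights = {i: probs[i] / total for i in chosen}
    return chosen, weights

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(512)]  # one logit per routed expert
experts, weights = route_token(logits, top_k=10)
print(len(experts))  # 10 routed experts per token
```

With 512 routed experts and only 10 chosen per token, roughly 2% of the routed expert pool fires on any given forward pass.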

Benchmarks: where it wins and where it does not

Alibaba claims Qwen3.5 outperforms GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro on roughly 80% of their evaluated benchmarks. That is a strong claim for an open-weight model. The numbers deserve a closer look.

Language and reasoning

| Benchmark | Category | Qwen3.5 | GPT-5.2 | Claude 4.5 | Gemini 3 Pro |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro | Knowledge | 87.8 | 87.4 | 89.5 | 89.8 |
| SuperGPQA | Knowledge | 70.4 | 67.9 | 70.6 | 74.0 |
| IFBench | Instruction following | 76.5 | 75.4 | 58.0 | 70.4 |
| MultiChallenge | Instruction following | 67.6 | 57.9 | 54.2 | 64.2 |
| LongBench v2 | Long context | 63.2 | 54.5 | 64.4 | 68.2 |
| AIME26 | Math competition | 91.3 | 96.7 | 93.3 | 90.6 |
| LiveCodeBench v6 | Coding | 83.6 | 87.7 | 84.8 | 90.7 |
| SWE-bench Verified | Coding agent | 76.4 | 80.0 | 80.9 | 76.2 |
| BFCL-V4 | Tool use | 72.9 | 63.1 | 77.5 | 72.5 |

Qwen3.5 dominates on instruction following (IFBench, MultiChallenge) and holds its own on knowledge benchmarks. It is weaker on math competitions (AIME26, HMMT) and coding (LiveCodeBench, SWE-bench), where GPT-5.2 and Claude 4.5 still lead. The Qwen team, to their credit, does not overclaim on coding.

Vision-language: where it pulls away

| Benchmark | Category | Qwen3.5 | GPT-5.2 | Claude 4.5 | Gemini 3 Pro |
| --- | --- | --- | --- | --- | --- |
| MathVision | Math + vision | 88.6 | 83.0 | 74.3 | 86.6 |
| MathVista | Math + vision | 90.3 | 83.1 | 80.0 | 87.9 |
| MMBench EN | General VQA | 93.7 | 88.2 | 89.2 | 93.7 |
| OmniDocBench | Document understanding | 90.8 | 85.7 | 87.7 | 88.5 |
| OCRBench | OCR | 93.1 | 80.7 | 85.8 | 90.4 |
| V* (with CI) | Spatial intelligence | 95.8 | 75.9 | 67.0 | 88.0 |
| MLVU | Video understanding | 86.7 | 85.6 | 81.7 | 83.0 |

This is where the "native multimodal" label earns its weight. Unlike Qwen3, which shipped text and vision as separate model lines (Qwen3 and Qwen3-VL), Qwen3.5 was trained with early fusion from the ground up: images and video are not bolted on through an adapter. The result is a nearly 29-point lead over Claude 4.5 on spatial intelligence (V*) and a 12-point lead over GPT-5.2 on OCR.

What changed from Qwen3

The jump from Qwen3 to Qwen3.5 is not incremental. It is architectural.

| | Qwen3 (May 2025) | Qwen3.5 (Feb 2026) |
| --- | --- | --- |
| Attention | Standard transformer | Hybrid DeltaNet + softmax (3:1) |
| Total parameters | 235B-A22B | 397B-A17B |
| Active parameters | 22B | 17B (fewer active, more total) |
| Multimodality | Separate models (Qwen3-VL) | Native early fusion |
| Languages | 119 | 201 |
| Context | 128K | 256K (extensible to 1M) |
| Speed | Baseline | 8x faster |
| Cost | Baseline | ~60% cheaper |

The counterintuitive part: Qwen3.5 activates fewer parameters per token (17B vs 22B) but has nearly twice the total parameter count (397B vs 235B). This means a wider expert pool with sparser activation: each token draws from a larger pool of more specialized experts while touching a smaller share of the total. The tradeoff works because the Gated DeltaNet layers handle the bulk of sequential context cheaply, while the periodic full-attention layers correct for any accumulation errors.
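The activation-ratio arithmetic behind that tradeoff is straightforward, taking the parameter counts from the comparison table above:

```python
# Activation ratio = active parameters / total parameters.
qwen3_ratio = 22 / 235    # Qwen3-235B-A22B
qwen35_ratio = 17 / 397   # Qwen3.5-397B-A17B
print(f"Qwen3:   {qwen3_ratio:.1%}")   # 9.4%
print(f"Qwen3.5: {qwen35_ratio:.1%}")  # 4.3%
```

The ratio more than halves between generations, which is exactly what makes the larger total count serveable.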

Running it

The BF16 weights need roughly 800 GB of VRAM (think 10x H100 80GB). The FP8 quantization cuts that to ~400 GB (8x H100). Community GGUF quantizations from Unsloth go down to Q4_K_M (~220 GB, fits on a 256GB Apple M3 Ultra) and even 3-bit variants.
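Those figures follow directly from parameter count times bytes per weight. A back-of-the-envelope sketch that counts weights only (it ignores KV cache and activation memory, and the ~4.5 bits-per-weight figure for Q4_K_M is an approximation, since K-quants mix block sizes):

```python
def weight_gb(params_b=397, bits_per_weight=16):
    """Rough weight-only memory in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(round(weight_gb(bits_per_weight=16)))   # BF16:    794 GB
print(round(weight_gb(bits_per_weight=8)))    # FP8:     397 GB
print(round(weight_gb(bits_per_weight=4.5)))  # ~Q4_K_M: 223 GB
```

Add headroom for the KV cache at long contexts and the "roughly 800 GB / ~400 GB / ~220 GB" figures line up.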

For developers who want to kick the tires, the model runs locally through vLLM, SGLang, Ollama, and LM Studio, and is available without any GPUs as a hosted API via Alibaba's Model Studio (as Qwen3.5-Plus, with a default 1M-token context window).

A terminal-based coding agent, Qwen Code, was released alongside the model.

The competitive picture

Qwen3.5 landed during a week when multiple Chinese AI labs released new flagship models. The timing is not coincidental. What makes this release stand out is the combination of three factors that no other model currently matches simultaneously:

  1. Open weights under Apache 2.0 - not a restricted license, not API-only
  2. Native multimodal - not a text model with a vision adapter
  3. Frontier-competitive benchmarks - within striking distance of GPT-5.2 and Claude 4.5 on text, and ahead of both on vision

The activation ratio (4.3%) is also worth watching. Kimi K2.5 pushed this even further with a 3.25% ratio, suggesting that extreme sparsity in MoE models is becoming the default design philosophy for making large models economically deployable.

Smaller Qwen3.5 variants (9B and 35B-A3B have been spotted in pre-release) should follow, which will bring this architecture to hardware that most developers actually have access to.


About the author: AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.