Qwen 3.5 Small Series Ships Four Models From 0.8B to 9B
Alibaba completes the Qwen 3.5 lineup with four small models - 0.8B, 2B, 4B, and 9B - all natively multimodal, 262K context, Apache 2.0. The 9B outperforms last-gen Qwen3-30B and beats GPT-5-Nano on vision benchmarks.

TL;DR
- Alibaba releases four Qwen 3.5 small models - 0.8B, 2B, 4B, 9B - all Apache 2.0, natively multimodal, 262K context
- Qwen3.5-9B beats the previous-gen Qwen3-30B on most benchmarks and beats GPT-5-Nano on MMMU-Pro (70.1 vs 57.2) and MathVision (78.9 vs 62.2)
- Same Gated DeltaNet hybrid architecture from the 397B flagship - 3:1 linear-to-full attention ratio, multi-token prediction, 201 languages
- All four models handle text, images, and video natively - no separate vision line needed
- Base models also released for research and fine-tuning
A week after the medium series and two weeks after the 397B flagship, Alibaba's Qwen team has shipped the final piece: four small models that bring the Qwen 3.5 architecture to edge devices, consumer hardware, and resource-constrained environments.
🚀 Introducing the Qwen 3.5 Small Model Series
Qwen3.5-0.8B · Qwen3.5-2B · Qwen3.5-4B · Qwen3.5-9B
✨ More intelligence, less compute.
— Qwen (@Alibaba_Qwen) March 2, 2026
The models span from 0.8B (runs on a phone) to 9B (a single consumer GPU), and every one of them is natively multimodal - text, images, and video from the same set of weights. No adapter bolted on. No separate VL model. Just one architecture handling everything.
The lineup
| Model | Parameters | Layers | Context | Thinking Mode | VRAM (BF16) |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | 24 | 262K | Non-thinking | ~1.6 GB |
| Qwen3.5-2B | 2B | 24 | 262K | Both | ~4 GB |
| Qwen3.5-4B | 4B | 32 | 262K | Both | ~8 GB |
| Qwen3.5-9B | 9B | 32 | 262K (1M ext.) | Both | ~18 GB |
All four use the same Gated DeltaNet hybrid architecture that powers the entire Qwen 3.5 family: a 3:1 ratio of linear attention (Gated DeltaNet) to full softmax attention blocks. The linear attention layers keep a fixed-size recurrent state instead of a KV cache that grows with sequence length - which is what makes 262K native context feasible even on tiny models - while the full attention blocks provide the precision needed for complex reasoning.
Every model also includes multi-token prediction (MTP) for faster inference and a 248K-token vocabulary covering 201 languages and dialects.
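The 3:1 interleaving can be sketched in a few lines. The exact layer placement isn't published, so the every-fourth-layer pattern below is an assumption for illustration:

```python
def layer_types(num_layers: int, ratio: int = 3) -> list[str]:
    """Hypothetical 3:1 interleaving: every (ratio + 1)-th layer is a full
    softmax attention block; the rest are linear Gated DeltaNet blocks."""
    return [
        "full_attention" if (i + 1) % (ratio + 1) == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]

# A 24-layer model (the 0.8B and 2B) would then have 18 DeltaNet blocks
# and 6 full attention blocks.
pattern = layer_types(24)
```

Only the full attention blocks accumulate a KV cache, so the memory cost of long contexts scales with 6 layers instead of 24.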
The numbers that break expectations
The 9B is the headliner. In thinking mode, it outperforms the previous generation's Qwen3-30B - a model more than 3x its size - on most language benchmarks:
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | Qwen3-30B | Qwen3-80B |
|---|---|---|---|---|
| MMLU-Pro | 82.5 | 79.1 | 80.9 | 82.7 |
| C-Eval | 88.2 | 85.1 | 87.4 | 89.7 |
| GPQA Diamond | 81.7 | 76.2 | 73.4 | 77.2 |
| IFEval | 91.5 | 89.8 | 88.9 | 88.9 |
| LongBench v2 | 55.2 | 50.0 | 44.8 | 48.0 |
The 9B beats the 80B on GPQA Diamond (81.7 vs 77.2), instruction following (IFEval 91.5 vs 88.9), and long-context tasks (LongBench v2 55.2 vs 48.0), and it comes within 0.2 points of the 80B on MMLU-Pro. That is the Qwen 3.5 architecture at work: better attention mechanisms, stronger RL training, and native multimodal pretraining compounding into a model that punches three weight classes above its size.
Vision benchmarks tell the real story
All four models handle vision natively, and even the smallest ones post competitive scores:
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | GPT-5-Nano | Gemini-2.5-Flash-Lite |
|---|---|---|---|---|
| MMMU | 78.4 | 77.6 | 75.8 | - |
| MMMU-Pro | 70.1 | 66.3 | 57.2 | 59.7 |
| MathVision | 78.9 | 74.6 | 62.2 | 52.1 |
| MathVista (mini) | 85.7 | 85.1 | 71.5 | 72.8 |
| RealWorldQA | 80.3 | 79.5 | 71.8 | 72.2 |
| OmniDocBench1.5 | 87.7 | 86.2 | 55.9 | 79.4 |
| VideoMME (w/sub) | 84.5 | - | - | 74.6 |
The 9B beats GPT-5-Nano by nearly 13 points on MMMU-Pro, almost 17 points on MathVision, and over 30 points on document understanding (OmniDocBench1.5). These aren't marginal gains - they're blowouts, from an open-weights model running on a consumer GPU against a paid API.
The smallest models hold up
Even the 0.8B and 2B models post numbers that would have been impressive from much larger models a year ago:
| Benchmark | Qwen3.5-2B | Qwen3.5-0.8B |
|---|---|---|
| MMLU-Pro (thinking) | 66.5 | - |
| IFEval (thinking) | 78.6 | - |
| MMMU (vision) | 64.2 | 49.0 |
| MathVista (vision) | 76.7 | 62.2 |
| OCRBench (vision) | 84.5 | 74.5 |
| VideoMME (w/sub) | 75.6 | 63.8 |
The 2B model scores 84.5 on OCRBench and 75.6 on VideoMME - numbers that put it ahead of many 7B-class models from the previous generation. The 0.8B is truly useful for edge deployment: 62.2 on MathVista and 74.5 on OCRBench from under a billion parameters.
Architecture: what makes these different
These are not shrunken versions of a big model with the same architecture. They're purpose-built using the Qwen 3.5 innovations:
Gated DeltaNet hybrid attention - Based on the Gated Delta Networks paper, this combines Mamba2's gated decay mechanism with the delta rule for hidden state updates. The linear attention layers handle routine computation with constant memory complexity, while full attention layers handle precision-critical reasoning. The 3:1 ratio (three DeltaNet blocks per one full attention block) keeps memory bounded while preserving quality.
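A minimal NumPy sketch of one recurrent step, simplified from the update rule in the Gated Delta Networks paper (scalar gates, single head; the real implementation is a parallelized chunked scan, not a per-token loop):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of a simplified gated delta rule.

    S     : (d_k, d_v) fixed-size state matrix -- memory does not grow
            with sequence length, unlike a softmax KV cache
    k, v  : key/value vectors for the current token
    alpha : gated decay in (0, 1), the Mamba2-style forgetting term
    beta  : write strength in (0, 1), the delta-rule learning rate
    """
    S = alpha * S                          # decay old associations
    pred = S.T @ k                         # what the state recalls for key k
    S = S + beta * np.outer(k, v - pred)   # correct the memory toward v
    return S
```

The state `S` is the layer's entire memory; reading out for a query `q` is just `S.T @ q`, so per-token cost and memory stay constant no matter how long the sequence gets.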
DeepStack Vision Transformer - The vision encoder uses Conv3d patch embeddings to capture temporal dynamics in video, and merges features from multiple encoder layers rather than only the final one. The temporal patch embedding is what lets even the 0.8B handle video understanding.
Strong-to-weak distillation - The small models were trained using knowledge distillation from larger teacher models (the 397B and medium-series models), using both off-policy and on-policy transfer. For models at this scale, distillation beats direct RL according to the team's ablations.
Multi-token prediction (MTP) - All four models predict multiple tokens per decoding step, giving a direct inference speedup; because the extra tokens can be verified against the model's own next-token predictions before being accepted, output quality is unchanged.
What you can run where
| Model | BF16 VRAM | 4-bit VRAM | Runs On |
|---|---|---|---|
| Qwen3.5-0.8B | ~1.6 GB | ~0.5 GB | Phone, Raspberry Pi, any GPU |
| Qwen3.5-2B | ~4 GB | ~1.5 GB | Laptop GPU, mobile SoC |
| Qwen3.5-4B | ~8 GB | ~3 GB | RTX 3060 12GB, M1/M2 Mac |
| Qwen3.5-9B | ~18 GB | ~5 GB | RTX 3090/4090, M2 Pro+ Mac |
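The BF16 column above is essentially weights-only arithmetic - parameters times two bytes per weight - which makes it easy to sanity-check. Real usage adds activations and cache state on top, so treat these as lower bounds:

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate in GB (1 GB = 1e9 bytes here).

    Ignores activations and KV/recurrent state, which add more at long
    context -- a rough lower bound, not a measurement.
    """
    return params_billions * bits_per_weight / 8

weight_vram_gb(9, 16)    # 18.0 GB -> matches the ~18 GB BF16 figure
weight_vram_gb(0.8, 16)  # 1.6 GB
weight_vram_gb(9, 4)     # 4.5 GB  -> ~5 GB in practice, since embeddings
                         #           and norms usually stay in higher precision
```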
The 9B also supports YaRN extension to approximately 1 million tokens. All models are compatible with vLLM, SGLang, llama.cpp, MLX, and Hugging Face Transformers. Quantized variants (GGUF and others) are available on HuggingFace for all four models.
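In the Qwen3 generation, YaRN extension was enabled through a `rope_scaling` entry in the model's `config.json`. Assuming the same convention carries over to 3.5 (unverified - check the model card before relying on this), extending the 9B from its 262K native window toward 1M would look roughly like:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }
}
```

A factor of 4.0 over 262,144 native positions gives 1,048,576 tokens, matching the "approximately 1 million" figure.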
The complete Qwen 3.5 family
With today's release, the full lineup now stands at:
| Model | Release | Type | Active Params |
|---|---|---|---|
| Qwen3.5-397B-A17B | Feb 16 | MoE (flagship) | 17B |
| Qwen3.5-122B-A10B | Feb 24 | MoE | 10B |
| Qwen3.5-35B-A3B | Feb 24 | MoE | 3B |
| Qwen3.5-27B | Feb 24 | Dense | 27B |
| Qwen3.5-Flash | Feb 24 | API | N/A |
| Qwen3.5-9B | Mar 2 | Dense | 9B |
| Qwen3.5-4B | Mar 2 | Dense | 4B |
| Qwen3.5-2B | Mar 2 | Dense | 2B |
| Qwen3.5-0.8B | Mar 2 | Dense | 0.8B |
Nine models in two weeks, spanning from a 0.8B edge model to a 397B frontier flagship. All sharing the same architecture, all natively multimodal, all Apache 2.0 (except the API-only Flash). That's a complete product line where most labs have shipped one or two models.
What this means
The "small model" space is getting aggressive. Google's Gemma 3 offers 1B and 4B models, but its smallest size is text-only. Meta's Llama 3.2 small models are likewise text-only at the smallest sizes. Microsoft's Phi-4 at 14B is strong but roughly 50% larger than the 9B and text-focused.
Qwen 3.5 is the first family where a 0.8B model can process video, a 4B model can act as a lightweight multimodal agent, and a 9B model beats previous-generation 30B models across the board. The architecture innovations - Gated DeltaNet, DeepStack vision, MTP - are doing real work at every scale.
For developers building on-device AI, the calculus just changed. A year ago, running a multimodal model locally meant a 13B+ parameter model and a serious GPU. Now a 4B model with 262K context handles text, images, and video from 8GB of VRAM.
All models are available now on HuggingFace and ModelScope, with base models included for research and fine-tuning.
Sources:
- Qwen3.5 Small Model Series - HuggingFace Collection
- Qwen3.5 Small Model Series - ModelScope
- Qwen3.5-9B Model Card (HuggingFace)
- Qwen3.5-4B Model Card (HuggingFace)
- Qwen3.5-2B Model Card (HuggingFace)
- Qwen3.5-0.8B Model Card (HuggingFace)
- QwenLM/Qwen3.5 GitHub Repository
- Qwen3.5: Towards Native Multimodal Agents (Blog)
- Alibaba Cloud Blog: Qwen3.5
- Gated Delta Networks Paper (arXiv)
