Qwen 3.5 Small Series Ships Four Models From 0.8B to 9B
Alibaba completes the Qwen 3.5 lineup with four small models - 0.8B, 2B, 4B, and 9B - all natively multimodal, 262K context, Apache 2.0. The 9B outperforms last-gen Qwen3-30B and beats GPT-5-Nano on vision benchmarks.

TL;DR
- Alibaba releases four Qwen 3.5 small models - 0.8B, 2B, 4B, 9B - all Apache 2.0, natively multimodal, 262K context
- Qwen3.5-9B beats the previous-gen Qwen3-30B on most benchmarks and beats GPT-5-Nano on MMMU-Pro (70.1 vs 57.2) and MathVision (78.9 vs 62.2)
- Same Gated DeltaNet hybrid architecture from the 397B flagship - 3:1 linear-to-full attention ratio, multi-token prediction, 201 languages
- All four models handle text, images, and video natively - no separate vision line needed
- Base models also released for research and fine-tuning
A week after the medium series and two weeks after the 397B flagship, Alibaba's Qwen team has shipped the final piece: four small models that bring the Qwen 3.5 architecture to edge devices, consumer hardware, and resource-constrained environments.
🚀 Introducing the Qwen 3.5 Small Model Series
Qwen3.5-0.8B · Qwen3.5-2B · Qwen3.5-4B · Qwen3.5-9B
✨ More intelligence, less compute.
— Qwen (@Alibaba_Qwen) March 2, 2026
The models span from 0.8B (runs on a phone) to 9B (a single consumer GPU), and every one of them is natively multimodal - text, images, and video from the same set of weights. No adapter bolted on. No separate VL model. Just one architecture handling everything.
The lineup
| Model | Parameters | Layers | Context | Thinking Mode | VRAM (BF16) |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | 24 | 262K | Non-thinking | ~1.6 GB |
| Qwen3.5-2B | 2B | 24 | 262K | Both | ~4 GB |
| Qwen3.5-4B | 4B | 32 | 262K | Both | ~8 GB |
| Qwen3.5-9B | 9B | 32 | 262K (1M ext.) | Both | ~18 GB |
All four use the same Gated DeltaNet hybrid architecture that powers the entire Qwen 3.5 family: a 3:1 ratio of linear attention (Gated DeltaNet) to full softmax attention blocks. The linear attention layers keep a fixed-size recurrent state instead of a KV cache that grows with sequence length - which is what makes 262K native context feasible even on tiny models - while the full attention blocks provide the precision needed for complex reasoning.
Every model also includes multi-token prediction (MTP) for faster inference and a 248K-token vocabulary covering 201 languages and dialects.
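The 3:1 interleaving can be sketched in a few lines. The exact layer placement isn't published, so the every-fourth-layer pattern below is an assumption for illustration:

```python
def layer_types(num_layers: int, ratio: int = 3) -> list[str]:
    """Hypothetical 3:1 interleaving: every (ratio + 1)-th layer is a full
    softmax attention block; the rest are linear Gated DeltaNet blocks."""
    return [
        "full_attention" if (i + 1) % (ratio + 1) == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]

# A 24-layer model (the 0.8B and 2B) would then have 18 DeltaNet blocks
# and 6 full attention blocks.
pattern = layer_types(24)
```

Only the full attention blocks accumulate a KV cache, so the memory cost of long contexts scales with 6 layers instead of 24.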
The numbers that break expectations
The 9B is the headliner. In thinking mode, it outperforms the previous generation's Qwen3-30B - a model more than 3x its size - on most language benchmarks:
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | Qwen3-30B | Qwen3-80B |
|---|---|---|---|---|
| MMLU-Pro | 82.5 | 79.1 | 80.9 | 82.7 |
| C-Eval | 88.2 | 85.1 | 87.4 | 89.7 |
| GPQA Diamond | 81.7 | 76.2 | 73.4 | 77.2 |
| IFEval | 91.5 | 89.8 | 88.9 | 88.9 |
| LongBench v2 | 55.2 | 50.0 | 44.8 | 48.0 |
The 9B beats the 80B on GPQA Diamond (81.7 vs 77.2), instruction following (IFEval 91.5 vs 88.9), and long-context tasks (LongBench v2 55.2 vs 48.0), and it comes within 0.2 points of the 80B on MMLU-Pro. That is the Qwen 3.5 architecture at work: better attention mechanisms, stronger RL training, and native multimodal pretraining compounding into a model that punches three weight classes above its size.
Vision benchmarks tell the real story
All four models handle vision natively, and even the smallest ones post competitive scores:
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | GPT-5-Nano | Gemini-2.5-Flash-Lite |
|---|---|---|---|---|
| MMMU | 78.4 | 77.6 | 75.8 | - |
| MMMU-Pro | 70.1 | 66.3 | 57.2 | 59.7 |
| MathVision | 78.9 | 74.6 | 62.2 | 52.1 |
| MathVista (mini) | 85.7 | 85.1 | 71.5 | 72.8 |
| RealWorldQA | 80.3 | 79.5 | 71.8 | 72.2 |
| OmniDocBench1.5 | 87.7 | 86.2 | 55.9 | 79.4 |
| VideoMME (w/sub) | 84.5 | - | - | 74.6 |
The 9B beats GPT-5-Nano by nearly 13 points on MMMU-Pro, almost 17 points on MathVision, and over 30 points on document understanding (OmniDocBench1.5). These aren't marginal gains - they're blowouts, from an open-weights model running on a consumer GPU against a paid API.
The smallest models hold up
Even the 0.8B and 2B models post numbers that would have been impressive from much larger models a year ago:
| Benchmark | Qwen3.5-2B | Qwen3.5-0.8B |
|---|---|---|
| MMLU-Pro (thinking) | 66.5 | - |
| IFEval (thinking) | 78.6 | - |
| MMMU (vision) | 64.2 | 49.0 |
| MathVista (vision) | 76.7 | 62.2 |
| OCRBench (vision) | 84.5 | 74.5 |
| VideoMME (w/sub) | 75.6 | 63.8 |
The 2B model scores 84.5 on OCRBench and 75.6 on VideoMME - numbers that put it ahead of many 7B-class models from the previous generation. The 0.8B is truly useful for edge deployment: 62.2 on MathVista and 74.5 on OCRBench from under a billion parameters.
Architecture: what makes these different
These are not shrunken versions of a big model with the same architecture. They're purpose-built using the Qwen 3.5 innovations:
Gated DeltaNet hybrid attention - Based on the Gated Delta Networks paper, this combines Mamba2's gated decay mechanism with the delta rule for hidden state updates. The linear attention layers handle routine computation with constant memory complexity, while full attention layers handle precision-critical reasoning. The 3:1 ratio (three DeltaNet blocks per one full attention block) keeps memory bounded while preserving quality.
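A minimal NumPy sketch of one recurrent step, simplified from the update rule in the Gated Delta Networks paper (scalar gates, single head; the real implementation is a parallelized chunked scan, not a per-token loop):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of a simplified gated delta rule.

    S     : (d_k, d_v) fixed-size state matrix -- memory does not grow
            with sequence length, unlike a softmax KV cache
    k, v  : key/value vectors for the current token
    alpha : gated decay in (0, 1), the Mamba2-style forgetting term
    beta  : write strength in (0, 1), the delta-rule learning rate
    """
    S = alpha * S                          # decay old associations
    pred = S.T @ k                         # what the state recalls for key k
    S = S + beta * np.outer(k, v - pred)   # correct the memory toward v
    return S
```

The state `S` is the layer's entire memory; reading out for a query `q` is just `S.T @ q`, so per-token cost and memory stay constant no matter how long the sequence gets.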
DeepStack Vision Transformer - The vision encoder uses Conv3d patch embeddings to capture temporal dynamics in video, and merges features from multiple encoder layers rather than only the final one. The temporal patch embedding is what lets even the 0.8B handle video understanding.
Strong-to-weak distillation - The small models were trained using knowledge distillation from larger teacher models (the 397B and medium-series models), using both off-policy and on-policy transfer. For models at this scale, distillation beats direct RL according to the team's ablations.
Multi-token prediction (MTP) - All four models predict multiple tokens per decoding step, giving a direct inference speedup; because the extra tokens can be verified against the model's own next-token predictions before being accepted, output quality is unchanged.
What you can run where
| Model | BF16 VRAM | 4-bit VRAM | Runs On |
|---|---|---|---|
| Qwen3.5-0.8B | ~1.6 GB | ~0.5 GB | Phone, Raspberry Pi, any GPU |
| Qwen3.5-2B | ~4 GB | ~1.5 GB | Laptop GPU, mobile SoC |
| Qwen3.5-4B | ~8 GB | ~3 GB | RTX 3060 12GB, M1/M2 Mac |
| Qwen3.5-9B | ~18 GB | ~5 GB | RTX 3090/4090, M2 Pro+ Mac |
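The BF16 column above is essentially weights-only arithmetic - parameters times two bytes per weight - which makes it easy to sanity-check. Real usage adds activations and cache state on top, so treat these as lower bounds:

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate in GB (1 GB = 1e9 bytes here).

    Ignores activations and KV/recurrent state, which add more at long
    context -- a rough lower bound, not a measurement.
    """
    return params_billions * bits_per_weight / 8

weight_vram_gb(9, 16)    # 18.0 GB -> matches the ~18 GB BF16 figure
weight_vram_gb(0.8, 16)  # 1.6 GB
weight_vram_gb(9, 4)     # 4.5 GB  -> ~5 GB in practice, since embeddings
                         #           and norms usually stay in higher precision
```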
The 9B also supports YaRN extension to approximately 1 million tokens. All models are compatible with vLLM, SGLang, llama.cpp, MLX, and Hugging Face Transformers. Quantized variants (GGUF and others) are available on HuggingFace for all four models.
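In the Qwen3 generation, YaRN extension was enabled through a `rope_scaling` entry in the model's `config.json`. Assuming the same convention carries over to 3.5 (unverified - check the model card before relying on this), extending the 9B from its 262K native window toward 1M would look roughly like:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }
}
```

A factor of 4.0 over 262,144 native positions gives 1,048,576 tokens, matching the "approximately 1 million" figure.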
The complete Qwen 3.5 family
With today's release, the full lineup now stands at:
| Model | Release | Type | Active Params |
|---|---|---|---|
| Qwen3.5-397B-A17B | Feb 16 | MoE (flagship) | 17B |
| Qwen3.5-122B-A10B | Feb 24 | MoE | 10B |
| Qwen3.5-35B-A3B | Feb 24 | MoE | 3B |
| Qwen3.5-27B | Feb 24 | Dense | 27B |
| Qwen3.5-Flash | Feb 24 | API | N/A |
| Qwen3.5-9B | Mar 2 | Dense | 9B |
| Qwen3.5-4B | Mar 2 | Dense | 4B |
| Qwen3.5-2B | Mar 2 | Dense | 2B |
| Qwen3.5-0.8B | Mar 2 | Dense | 0.8B |
Nine models in two weeks, spanning from a 0.8B edge model to a 397B frontier flagship. All sharing the same architecture, all natively multimodal, all Apache 2.0 (except the API-only Flash). That's a complete product line where most labs have shipped one or two models.
What this means
The "small model" space is getting aggressive. Google's Gemma 3 offers 1B and 4B models, but its smallest size is text-only. Meta's Llama 3.2 small models are likewise text-only at the smallest sizes. Microsoft's Phi-4 at 14B is strong but roughly 50% larger than the 9B and text-focused.
Qwen 3.5 is the first family where a 0.8B model can process video, a 4B model can act as a lightweight multimodal agent, and a 9B model beats previous-generation 30B models across the board. The architecture innovations - Gated DeltaNet, DeepStack vision, MTP - are doing real work at every scale.
For developers building on-device AI, the calculus just changed. A year ago, running a multimodal model locally meant a 13B+ parameter model and a serious GPU. Now a 4B model with 262K context handles text, images, and video from 8GB of VRAM.
All models are available now on HuggingFace and ModelScope, with base models included for research and fine-tuning.
Sources:
- Qwen3.5 Small Model Series - HuggingFace Collection
- Qwen3.5 Small Model Series - ModelScope
- Qwen3.5-9B Model Card (HuggingFace)
- Qwen3.5-4B Model Card (HuggingFace)
- Qwen3.5-2B Model Card (HuggingFace)
- Qwen3.5-0.8B Model Card (HuggingFace)
- QwenLM/Qwen3.5 GitHub Repository
- Qwen3.5: Towards Native Multimodal Agents (Blog)
- Alibaba Cloud Blog: Qwen3.5
- Gated Delta Networks Paper (arXiv)
