Qwen3.5-9B
Qwen3.5-9B is a 9B dense model that outperforms Qwen3-30B on most benchmarks and beats GPT-5-Nano on vision tasks. Natively multimodal with 262K-1M context, Apache 2.0 licensed.

Qwen3.5-9B is the largest model in the Qwen 3.5 Small Series and arguably the most impressive relative to its size. At 9 billion parameters, it beats the previous generation's Qwen3-30B on most language benchmarks and outscores GPT-5-Nano and Gemini-2.5-Flash-Lite on vision tasks by double-digit margins.
TL;DR
- 9B dense model with Gated DeltaNet hybrid attention - beats Qwen3-30B (3x its size) on GPQA, IFEval, LongBench
- Beats GPT-5-Nano on MMMU-Pro (70.1 vs 57.2), MathVision (78.9 vs 62.2), OmniDocBench (87.7 vs 55.9)
- 262K native context, extendable to ~1M with YaRN - 201 languages, multi-token prediction
- Natively multimodal: text, images, and video from the same weights
- Apache 2.0 - runs at 4-bit quantization (~5GB VRAM) on an RTX 3060 or comparable consumer GPU
The 9B strikes a rare sweet spot: it's small enough to run on consumer hardware but capable enough to compete with models 3-9x its size on serious benchmarks. At BF16, it fits on a single RTX 3090 or any 24GB GPU. With 4-bit quantization, it drops to roughly 5GB - viable on an RTX 3060 12GB or an M1 Mac with room to spare.
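The VRAM figures above follow directly from parameter count times bytes per weight. A quick back-of-envelope sketch (weights only; the KV/state cache and activations add some overhead on top, which is why the 4-bit figure lands closer to 5GB in practice):

```python
# Approximate weight memory for a 9B-parameter dense model at
# several precisions. Weights only - runtime overhead not included.

def weight_vram_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

PARAMS = 9e9  # 9 billion parameters, all active (dense)

for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_vram_gb(PARAMS, bits):.1f} GB")
# BF16: ~18.0 GB, INT8: ~9.0 GB, INT4: ~4.5 GB
```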
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 9B |
| Active Parameters | 9B (dense; all active) |
| Layers | 32 |
| Hidden Dimension | 4,096 |
| FFN Intermediate Dimension | 12,288 |
| Attention Pattern | 8 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 32 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 16 Q, 4 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native), ~1M (YaRN extended) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Enabled (toggleable via enable_thinking parameter) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |
The 9B uses the deeper 32-layer stack (shared with the 4B) and the widest hidden dimension (4,096) among the small models. Its FFN intermediate dimension of 12,288 gives it meaningful capacity for world knowledge and reasoning. The vision encoder is a DeepStack Vision Transformer with Conv3d patch embeddings for temporal video understanding.
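The 3:1 attention pattern from the table is easier to see generated than read. A sketch of the 32-layer layout (layer names here are illustrative placeholders, not Qwen internals):

```python
# Sketch of the 3:1 hybrid layout: 8 blocks, each consisting of three
# Gated DeltaNet layers followed by one Gated Attention layer, with an
# FFN after every layer (omitted here for clarity).

def hybrid_layout(num_blocks: int = 8) -> list[str]:
    layers = []
    for _ in range(num_blocks):
        layers += ["gated_deltanet"] * 3 + ["gated_attention"]
    return layers

layout = hybrid_layout()
assert len(layout) == 32                     # matches the 32-layer stack
assert layout.count("gated_attention") == 8  # one full-attention layer per block
```

Only a quarter of the layers pay quadratic attention cost; the rest run linear-time Gated DeltaNet, which is what keeps long-context memory flat.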
Benchmark Performance
Language (Thinking Mode)
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | Qwen3-30B | Qwen3-80B | Qwen3.5-27B |
|---|---|---|---|---|---|
| MMLU-Pro | 82.5 | 79.1 | 80.9 | 82.7 | 86.1 |
| C-Eval | 88.2 | 85.1 | 87.4 | 89.7 | - |
| SuperGPQA | 58.2 | 52.9 | 56.8 | 60.8 | - |
| GPQA Diamond | 81.7 | 76.2 | 73.4 | 77.2 | 85.5 |
| IFEval | 91.5 | 89.8 | 88.9 | 88.9 | 95.0 |
| AA-LCR (Long Context) | 63.0 | 57.0 | 49.0 | 51.7 | - |
| LongBench v2 | 55.2 | 50.0 | 44.8 | 48.0 | - |
The 9B beats the previous Qwen3-30B on GPQA Diamond by 8 points, IFEval by 3 points, and LongBench v2 by 10 points. It even edges past Qwen3-80B on GPQA Diamond and IFEval - a model that required an A100 cluster to run.
Vision-Language
| Benchmark | Qwen3.5-9B | GPT-5-Nano | Gemini-2.5-Flash-Lite | Qwen3.5-27B |
|---|---|---|---|---|
| MMMU | 78.4 | 75.8 | - | 82.3 |
| MMMU-Pro | 70.1 | 57.2 | 59.7 | - |
| MathVision | 78.9 | 62.2 | 52.1 | 86.0 |
| MathVista (mini) | 85.7 | 71.5 | 72.8 | 87.8 |
| RealWorldQA | 80.3 | 71.8 | 72.2 | - |
| MMBench | 90.1 | - | 82.7 | - |
| OmniDocBench1.5 | 87.7 | 55.9 | 79.4 | - |
| VideoMME (w/sub) | 84.5 | - | 74.6 | - |
The vision results are the 9B's strongest selling point. It beats GPT-5-Nano by 13 points on MMMU-Pro, 17 points on MathVision, and 32 points on document understanding. Against the 27B (3x its size), it trails by only 4 points on MMMU and 7 points on MathVision - a narrow gap for a 3x reduction in compute.
Key Capabilities
Native multimodal at every level - Unlike Qwen 3 which needed separate VL model variants, the 9B handles text, images, and video from the same weights. The DeepStack Vision Transformer uses Conv3d patch embeddings to capture temporal dynamics in video, and merges features from multiple encoder layers rather than just the final one. VideoMME at 84.5 with subtitles shows genuine video comprehension, not just frame sampling.
Long context without the cost - The Gated DeltaNet linear attention maintains constant memory complexity, enabling 262K native context with extension to roughly 1M tokens via YaRN. LongBench v2 at 55.2 and AA-LCR at 63.0 confirm the model actually uses this context effectively, not just theoretically supports it. For comparison, the previous Qwen3-80B scored 48.0 on LongBench v2 despite being nearly 9x larger.
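Extending from the 262K native window to roughly 1M tokens via YaRN typically comes down to a rope-scaling factor of (target length / native length) in the model config. A sketch under that assumption - the exact key names follow the common Hugging Face convention and should be checked against the model card:

```python
# Hypothetical rope_scaling config for YaRN context extension.
# Key names follow the usual Hugging Face convention; verify against
# the model card and your inference framework before use.

NATIVE_CTX = 262_144
TARGET_CTX = 1_048_576  # 4x the native window, i.e. ~1M tokens

factor = TARGET_CTX / NATIVE_CTX  # 4.0
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CTX,
}
print(rope_scaling)
```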
Thinking and non-thinking modes - The model supports both explicit chain-of-thought reasoning (enclosed in <think>...</think> tags) and direct response generation. Thinking mode is enabled by default and can be toggled via the enable_thinking API parameter or /think and /no_think tags. Thinking mode pushes GPQA Diamond to 81.7; non-thinking mode still delivers competitive results for latency-sensitive applications.
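When thinking mode is on, the reasoning arrives wrapped in `<think>...</think>` ahead of the final answer, so downstream code usually wants to separate the two. A minimal helper, assuming at most one well-formed think block per response:

```python
import re

# Split a model response into (reasoning, answer), assuming at most
# one well-formed <think>...</think> block precedes the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    m = THINK_RE.search(text)
    if not m:
        return "", text.strip()  # non-thinking mode: no reasoning block
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_thinking("<think>2+2 is 4</think>The answer is 4.")
print(a)  # The answer is 4.
```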
Pricing and Availability
Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-9B-Base) is also available for fine-tuning and research. Multiple quantized variants are available.
| Deployment Option | VRAM Required | Notes |
|---|---|---|
| BF16 (full precision) | ~18 GB | RTX 3090/4090 24GB, A100 |
| 8-bit quantization | ~9 GB | RTX 3060 12GB, M1 Pro Mac |
| 4-bit quantization | ~5 GB | RTX 3060, M1 Mac, most modern GPUs |
Supported inference frameworks: vLLM (recommended), SGLang (recommended), llama.cpp, MLX (Apple Silicon), Hugging Face Transformers, and KTransformers.
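For vLLM, serving comes down to a single command. A hypothetical launch sketch - the model ID and flag values are assumptions to verify against the model card and vLLM docs:

```shell
# Hypothetical launch command; model ID and values are assumptions.
vllm serve Qwen/Qwen3.5-9B \
  --max-model-len 262144 \
  --quantization fp8
```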
Strengths
- Outperforms Qwen3-30B (3x its size) on GPQA, IFEval, and long-context benchmarks
- Beats GPT-5-Nano on vision: MMMU-Pro +13, MathVision +17, OmniDocBench +32
- 262K to 1M context window from just 9B parameters - constant memory via DeltaNet
- Natively multimodal: text, images, and video without adapter overhead
- Runs on a single consumer GPU - RTX 4090 at BF16, RTX 3060 at 4-bit
- 201 languages with 248K vocabulary - strong multilingual coverage
- VideoMME 84.5 rivals models 10x its size
Weaknesses
- Trails the Qwen3.5-27B by 4-8 points on top-end benchmarks (MMLU-Pro 82.5 vs 86.1)
- All benchmarks are self-reported - independent validation pending
- Gated DeltaNet architecture requires compatible inference frameworks to reach full speed
- 9B all-active means higher per-token cost than the 35B-A3B MoE (3B active)
- No managed API specifically aligned with this model - self-hosting only
- Thinking mode adds latency and token cost for simple tasks
Related Coverage
- Qwen 3.5 Small Series Ships Four Models
- Qwen 3.5 Medium Series Drops Four Models
- Qwen3.5: 397B Parameters, 17B Active
- Qwen3.5-27B
- Qwen3.5-4B
- Open Source LLM Leaderboard