Qwen3.5-9B

Qwen3.5-9B is the largest model in the Qwen 3.5 Small Series and arguably the most impressive given what it achieves at its size. At 9 billion parameters, it beats the previous generation's Qwen3-30B on most language benchmarks and dominates GPT-5-Nano and Gemini-2.5-Flash-Lite on vision tasks by double-digit margins.

TL;DR

  • 9B dense model with Gated DeltaNet hybrid attention - beats Qwen3-30B (3x its size) on GPQA, IFEval, LongBench
  • Beats GPT-5-Nano on MMMU-Pro (70.1 vs 57.2), MathVision (78.9 vs 62.2), OmniDocBench (87.7 vs 55.9)
  • 262K native context, extendable to ~1M with YaRN - 201 languages, multi-token prediction
  • Natively multimodal: text, images, and video from the same weights
  • Apache 2.0 - runs on a single consumer GPU (~5GB VRAM at 4-bit quantization)

The 9B strikes a rare sweet spot: it's small enough to run on consumer hardware but capable enough to compete with models 3-9x its size on serious benchmarks. At BF16, it fits on a single RTX 3090 or any 24GB GPU. With 4-bit quantization, it drops to roughly 5GB - viable on an RTX 3060 12GB or an M1 Mac with room to spare.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 9B |
| Active Parameters | 9B (all - dense) |
| Layers | 32 |
| Hidden Dimension | 4,096 |
| FFN Intermediate Dimension | 12,288 |
| Attention Pattern | 8 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 32 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 16 Q, 4 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native), ~1M (YaRN extended) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Enabled (toggleable via enable_thinking parameter) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 9B uses the deeper 32-layer stack (shared with the 4B) and the widest hidden dimension (4,096) among the small models. Its FFN intermediate dimension of 12,288 gives it meaningful capacity for world knowledge and reasoning. The vision encoder is a DeepStack Vision Transformer with Conv3d patch embeddings for temporal video understanding.
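The 3:1 interleaving works out exactly to the 32-layer stack above. A toy sketch of the pattern (layer names here are illustrative labels, not the model's actual module names):

```python
def layer_pattern(blocks: int = 8) -> list[str]:
    """Build the hybrid layer order from the spec table: each of the 8
    blocks is 3 Gated DeltaNet layers followed by 1 Gated Attention
    layer, with every layer trailed by its own FFN (omitted here)."""
    return (["deltanet"] * 3 + ["attention"]) * blocks

stack = layer_pattern()
print(len(stack), stack.count("attention"))  # 32 layers, 8 full-attention
```

Only a quarter of the layers pay the quadratic attention cost, which is where the long-context memory savings come from.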

Benchmark Performance

Language (Thinking Mode)

| Benchmark | Qwen3.5-9B | Qwen3.5-4B | Qwen3-30B | Qwen3-80B | Qwen3.5-27B |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 82.5 | 79.1 | 80.9 | 82.7 | 86.1 |
| C-Eval | 88.2 | 85.1 | 87.4 | 89.7 | - |
| SuperGPQA | 58.2 | 52.9 | 56.8 | 60.8 | - |
| GPQA Diamond | 81.7 | 76.2 | 73.4 | 77.2 | 85.5 |
| IFEval | 91.5 | 89.8 | 88.9 | 88.9 | 95.0 |
| AA-LCR (Long Context) | 63.0 | 57.0 | 49.0 | 51.7 | - |
| LongBench v2 | 55.2 | 50.0 | 44.8 | 48.0 | - |

The 9B beats the previous Qwen3-30B on GPQA Diamond by 8 points, IFEval by 3 points, and LongBench v2 by 10 points. It even edges past Qwen3-80B on GPQA Diamond and IFEval - models that required an A100 cluster to run.

Vision-Language

| Benchmark | Qwen3.5-9B | GPT-5-Nano | Gemini-2.5-Flash-Lite | Qwen3.5-27B |
| --- | --- | --- | --- | --- |
| MMMU | 78.4 | 75.8 | - | 82.3 |
| MMMU-Pro | 70.1 | 57.2 | 59.7 | - |
| MathVision | 78.9 | 62.2 | 52.1 | 86.0 |
| MathVista (mini) | 85.7 | 71.5 | 72.8 | 87.8 |
| RealWorldQA | 80.3 | 71.8 | 72.2 | - |
| MMBench | 90.1 | - | 82.7 | - |
| OmniDocBench 1.5 | 87.7 | 55.9 | 79.4 | - |
| VideoMME (w/ sub) | 84.5 | - | 74.6 | - |

The vision results are the 9B's strongest selling point. It beats GPT-5-Nano by 13 points on MMMU-Pro, 17 points on MathVision, and 32 points on document understanding. Against the 27B (3x its size), it trails by only 4 points on MMMU and 7 points on MathVision - a narrow gap for a 3x reduction in compute.

Key Capabilities

Native multimodal at every level - Unlike Qwen 3 which needed separate VL model variants, the 9B handles text, images, and video from the same weights. The DeepStack Vision Transformer uses Conv3d patch embeddings to capture temporal dynamics in video, and merges features from multiple encoder layers rather than just the final one. VideoMME at 84.5 with subtitles shows genuine video comprehension, not just frame sampling.
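Since the same weights take all modalities, a request only differs in its message payload. A sketch of the content-list chat format Qwen's VL processors use (the `type`/`image` field names follow the Qwen2.5-VL convention and are an assumption here; verify against the Qwen3.5 model card):

```python
def vision_message(question: str, image_url: str) -> list[dict]:
    """Build a single-turn multimodal chat message. Content is a list of
    typed parts; the processor interleaves image tokens with the text.
    Field names are assumed from Qwen-VL documentation."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": question},
        ],
    }]

msgs = vision_message("What does this chart show?", "https://example.com/chart.png")
```

Video follows the same shape with a video-typed part instead of an image part.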

Long context without the cost - The Gated DeltaNet linear attention maintains constant memory complexity, enabling 262K native context with extension to roughly 1M tokens via YaRN. LongBench v2 at 55.2 and AA-LCR at 63.0 confirm the model actually uses this context effectively, not just theoretically supports it. For comparison, the previous Qwen3-80B scored 48.0 on LongBench v2 despite being nearly 9x larger.
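The ~1M figure follows from a YaRN scaling factor of 4 over the native window. A minimal sketch of the arithmetic and the factor-based rope-scaling override Qwen documents for earlier releases (the exact config keys for 3.5 are an assumption; check the model card):

```python
NATIVE_CTX = 262_144  # native window from the spec table

# A factor of 4 stretches 262,144 positions to ~1M tokens.
factor = 4.0
extended_ctx = int(NATIVE_CTX * factor)
print(extended_ctx)  # 1048576

rope_scaling = {  # hypothetical config fragment, keys assumed from Qwen3 docs
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CTX,
}
```

Qwen's earlier guidance was to apply such an override only when prompts actually exceed the native window, since static scaling can slightly degrade short-context quality.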

Thinking and non-thinking modes - The model supports both explicit chain-of-thought reasoning (enclosed in <think>...</think> tags) and direct response generation. Thinking mode is enabled by default and can be toggled via the enable_thinking API parameter or /think and /no_think tags. Thinking mode pushes GPQA Diamond to 81.7; non-thinking mode still delivers competitive results for latency-sensitive applications.
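Downstream code usually strips the reasoning before showing the answer. A minimal helper for the <think>...</think> convention described above (the tag format is from the docs; the helper itself is illustrative):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Split a thinking-mode completion into (reasoning, answer).
    Non-thinking responses come back with empty reasoning."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if m is None:
        return "", response.strip()
    return m.group(1).strip(), response[m.end():].strip()

reasoning, answer = split_thinking("<think>17 * 24 = 408</think>The answer is 408.")
print(answer)  # The answer is 408.
```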

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-9B-Base) is also available for fine-tuning and research. Multiple quantized variants are available.

| Deployment Option | VRAM Required | Notes |
| --- | --- | --- |
| BF16 (full precision) | ~18 GB | RTX 3090/4090 24GB, A100 |
| 8-bit quantization | ~9 GB | RTX 3060 12GB, M1 Pro Mac |
| 4-bit quantization | ~5 GB | RTX 3060, M1 Mac, most modern GPUs |
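These figures follow from simple weight-size arithmetic; a back-of-envelope sketch (the flat ~0.5 GB overhead constant is an assumption, and real usage adds activation and KV/recurrent-state memory that grows with context length):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int,
                   overhead_gb: float = 0.5) -> float:
    """Estimate VRAM for model weights alone: params x bits / 8 bytes,
    plus a flat overhead (assumed) for buffers and runtime state."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9 + overhead_gb, 1)

for bits in (16, 8, 4):
    print(bits, weight_vram_gb(9, bits))
# 16 -> 18.5, 8 -> 9.5, 4 -> 5.0 (close to the table's ~18/~9/~5 GB)
```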

Supported inference frameworks: vLLM (recommended), SGLang (recommended), llama.cpp, MLX (Apple Silicon), Hugging Face Transformers, and KTransformers.

Strengths

  • Outperforms Qwen3-30B (3x its size) on GPQA, IFEval, and long-context benchmarks
  • Beats GPT-5-Nano on vision: MMMU-Pro +13, MathVision +17, OmniDocBench +32
  • 262K to 1M context window from just 9B parameters - constant memory via DeltaNet
  • Natively multimodal: text, images, and video without adapter overhead
  • Runs on a single consumer GPU - RTX 4090 at BF16, RTX 3060 at 4-bit
  • 201 languages with 248K vocabulary - strong multilingual coverage
  • VideoMME 84.5 rivals models 10x its size

Weaknesses

  • Trails the Qwen3.5-27B by 4-8 points on top-end benchmarks (MMLU-Pro 82.5 vs 86.1)
  • All benchmarks are self-reported - independent validation pending
  • Gated DeltaNet architecture needs framework-level support (vLLM, SGLang, etc.) to reach full inference speed
  • 9B all-active means higher per-token cost than the 35B-A3B MoE (3B active)
  • No managed API specifically aligned with this model - self-hosting only
  • Thinking mode adds latency and token cost for simple tasks

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.