Qwen3.5-4B

Qwen3.5-4B is the middleweight of the Qwen 3.5 Small Series and the model that makes the strongest case for what 4 billion parameters can do in 2026. It approaches the previous generation's Qwen3-30B on MMLU-Pro, beats GPT-5-Nano across vision benchmarks, and handles text, images, and video natively - all from roughly 8GB of VRAM at full precision.

TL;DR

  • 4B dense multimodal model - approaches Qwen3-30B on MMLU-Pro (79.1 vs 80.9) at 1/7th the parameter count
  • Beats GPT-5-Nano on MMMU-Pro (66.3 vs 57.2), MathVision (74.6 vs 62.2), OmniDocBench (86.2 vs 55.9)
  • 262K native context (1M extended), 201 languages, multi-token prediction
  • Runs on 8GB VRAM at BF16, ~3GB at 4-bit - ideal for lightweight multimodal agents
  • Apache 2.0, base model also available for fine-tuning

The 4B occupies what may be the optimal size for edge-deployed multimodal agents. It's large enough to handle complex reasoning (GPQA Diamond 76.2 in thinking mode) but small enough to run on laptop GPUs or mobile SoCs with aggressive quantization. For developers building applications that need vision, long context, and decent reasoning without datacenter hardware, this is the new default.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 4B |
| Active Parameters | 4B (all - dense) |
| Layers | 32 |
| Hidden Dimension | 2,560 |
| FFN Intermediate Dimension | 9,216 |
| Attention Pattern | 8 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 32 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 16 Q, 4 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native), ~1M (YaRN extended) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Enabled (toggleable via enable_thinking parameter) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 4B shares the 32-layer stack with the 9B but uses a narrower hidden dimension (2,560 vs 4,096) and smaller FFN (9,216 vs 12,288). The same 3:1 DeltaNet-to-Attention ratio is maintained.
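The 3:1 interleaving from the spec table can be made concrete with a short sketch. The function below just enumerates the mixer type of each of the 32 layers under the stated 8 x (3 Gated DeltaNet + 1 Gated Attention) layout; the string labels are illustrative, not module names from the released code.

```python
# Sketch of the 3:1 hybrid layout: 8 groups, each with 3 Gated DeltaNet
# blocks followed by 1 Gated Attention block (every block has its own FFN).

def hybrid_layer_pattern(num_groups: int = 8,
                         deltanet_per_group: int = 3) -> list[str]:
    """Return the ordered list of mixer types for the full stack."""
    layers = []
    for _ in range(num_groups):
        layers.extend(["gated_deltanet"] * deltanet_per_group)
        layers.append("gated_attention")
    return layers

pattern = hybrid_layer_pattern()
print(len(pattern))                      # 32 layers total
print(pattern.count("gated_deltanet"))   # 24 linear-attention layers
print(pattern.count("gated_attention"))  # 8 full-attention layers
```

The practical consequence of this ratio is that only 8 of 32 layers keep a growing KV cache, which is what makes the long-context figures below feasible at this size.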

Benchmark Performance

Language (Thinking Mode)

| Benchmark | Qwen3.5-4B | Qwen3.5-9B | Qwen3-30B | Qwen3-80B |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 79.1 | 82.5 | 80.9 | 82.7 |
| C-Eval | 85.1 | 88.2 | 87.4 | 89.7 |
| SuperGPQA | 52.9 | 58.2 | 56.8 | 60.8 |
| GPQA Diamond | 76.2 | 81.7 | 73.4 | 77.2 |
| IFEval | 89.8 | 91.5 | 88.9 | 88.9 |
| AA-LCR (Long Context) | 57.0 | 63.0 | 49.0 | 51.7 |
| LongBench v2 | 50.0 | 55.2 | 44.8 | 48.0 |

The 4B beats Qwen3-30B on GPQA Diamond (76.2 vs 73.4), IFEval (89.8 vs 88.9), and long-context tasks (AA-LCR 57.0 vs 49.0, LongBench v2 50.0 vs 44.8). On MMLU-Pro it trails by less than 2 points despite being 7.5x smaller. It even beats Qwen3-80B on long-context benchmarks.

Vision-Language

| Benchmark | Qwen3.5-4B | GPT-5-Nano | Gemini-2.5-Flash-Lite | Qwen3.5-9B |
| --- | --- | --- | --- | --- |
| MMMU | 77.6 | 75.8 | - | 78.4 |
| MMMU-Pro | 66.3 | 57.2 | 59.7 | 70.1 |
| MathVision | 74.6 | 62.2 | 52.1 | 78.9 |
| MathVista (mini) | 85.1 | 71.5 | 72.8 | 85.7 |
| RealWorldQA | 79.5 | 71.8 | 72.2 | 80.3 |
| OmniDocBench 1.5 | 86.2 | 55.9 | 79.4 | 87.7 |

The 4B's vision performance is remarkably close to the 9B - trailing by less than 2 points on most benchmarks. Against GPT-5-Nano, it leads on MMMU-Pro by 9 points, MathVision by 12 points, and OmniDocBench by 30 points.

Key Capabilities

Lightweight multimodal agent base - At 4B parameters, this model is small enough for edge deployment but multimodal enough to serve as an agent that processes documents, images, and video. OmniDocBench at 86.2 means it handles receipts, invoices, screenshots, and technical diagrams with high accuracy.
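A document-processing request to a model like this is typically assembled as a multimodal chat message. The sketch below uses the OpenAI-style content-part schema that recent Qwen chat templates accept; treat the exact field names as an assumption until the official model card confirms them.

```python
# Illustrative only: build a document-QA message with one image part and
# one text part, in the content-part schema common to Qwen chat templates.
# Field names ("type", "image", "text") are assumptions, not confirmed API.

def build_doc_request(image_ref: str, question: str) -> list[dict]:
    """Assemble a single-turn multimodal chat message list."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_ref},
            {"type": "text", "text": question},
        ],
    }]

messages = build_doc_request(
    "file:///tmp/invoice.png",
    "Extract the invoice total and the due date.",
)
```

The same structure extends to multi-image or video inputs by appending further content parts to the same message.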

Long context from a small model - 262K native context with YaRN extension to 1M tokens. LongBench v2 at 50.0 and AA-LCR at 57.0 outperform the previous Qwen3-80B on long-context tasks, showing that the Gated DeltaNet architecture maintains quality across the full context window.
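Going from the 262,144-token native window to ~1M tokens via YaRN amounts to a RoPE scaling factor of 4. The config fragment below follows the `rope_scaling` format earlier Qwen releases used in their Hugging Face configs; the exact keys for this release are an assumption.

```python
# Back-of-envelope YaRN setup: scale factor = target context / native context.
# The rope_scaling dict mirrors the format prior Qwen models used in
# Hugging Face configs; confirm keys against the released config.json.

NATIVE_CONTEXT = 262_144
TARGET_CONTEXT = 1_048_576  # ~1M tokens

factor = TARGET_CONTEXT / NATIVE_CONTEXT
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
print(factor)  # 4.0
```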

Thinking mode for complex reasoning - When enabled, thinking mode pushes GPQA Diamond to 76.2 and MMLU-Pro to 79.1 - competitive with the previous generation's 30B models. For latency-sensitive tasks, non-thinking mode provides direct responses without the chain-of-thought overhead.
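The spec table's `enable_thinking` parameter is normally passed to the chat template at request time, so the mode can be chosen per request. The helper below only assembles those arguments; the `apply_chat_template` call itself is left as a comment since it needs the released tokenizer, and this request-time toggle is an assumption based on how prior Qwen thinking models work.

```python
# Per-request thinking toggle: latency-sensitive calls skip chain-of-thought,
# hard reasoning calls enable it. Only the kwargs dict is built here.

def template_kwargs(latency_sensitive: bool) -> dict:
    """Pick thinking vs. direct-answer mode for one request."""
    return {
        "add_generation_prompt": True,
        "enable_thinking": not latency_sensitive,
    }

# text = tokenizer.apply_chat_template(messages, tokenize=False,
#                                      **template_kwargs(latency_sensitive=False))
print(template_kwargs(latency_sensitive=True))   # thinking disabled
print(template_kwargs(latency_sensitive=False))  # thinking enabled
```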

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-4B-Base) and eight quantized variants are available.

| Deployment Option | VRAM Required | Notes |
| --- | --- | --- |
| BF16 (full precision) | ~8 GB | RTX 3060 12GB, M1 Mac |
| 8-bit quantization | ~4 GB | Most modern GPUs, M1 Mac |
| 4-bit quantization | ~3 GB | Integrated GPUs, mobile SoCs |
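These figures are consistent with a simple weights-only estimate plus headroom for the KV cache and activations. A quick sanity check, assuming exactly 4e9 parameters:

```python
# Weights-only memory at different precisions: params * bits / 8 bytes.
# Real VRAM use is higher (KV cache, activations), which is why the table
# quotes ~8/4/3 GB rather than these raw figures.

def weight_gib(params: float, bits: int) -> float:
    """Gibibytes needed just for the model weights at a given precision."""
    return params * bits / 8 / 2**30

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gib(4e9, bits):.1f} GiB")
# 16-bit: 7.5 GiB, 8-bit: 3.7 GiB, 4-bit: 1.9 GiB of weights alone
```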

Strengths

  • Approaches Qwen3-30B benchmarks at 1/7th the parameters
  • Beats GPT-5-Nano on every reported vision benchmark, by double-digit margins on MMMU-Pro, MathVision, and OmniDocBench
  • 262K-1M context from just 4B params - beats Qwen3-80B on long-context tasks
  • ~8GB VRAM at BF16 makes it ideal for edge deployment and lightweight agents
  • Natively multimodal - no separate VL model needed
  • 8 quantized variants available for maximum deployment flexibility
  • Apache 2.0 with base model for fine-tuning

Weaknesses

  • Trails the 9B by 3-5 points on most benchmarks - the 9B is the better choice if hardware allows
  • No video benchmarks reported (VideoMME only available for 9B, 2B, and 0.8B)
  • Self-reported benchmarks - independent validation pending
  • Thinking mode adds latency; non-thinking mode has lower scores
  • Self-hosting only - no managed API at this parameter count
  • Gated DeltaNet requires compatible inference frameworks
