Qwen3.5-4B

Qwen3.5-4B is the middleweight of the Qwen 3.5 Small Series and the model that makes the strongest case for what 4 billion parameters can do in 2026. It approaches the previous generation's Qwen3-30B on MMLU-Pro, beats GPT-5-Nano across vision benchmarks, and handles text, images, and video natively - all from roughly 8GB of VRAM at full precision.

TL;DR

  • 4B dense multimodal model - approaches Qwen3-30B on MMLU-Pro (79.1 vs 80.9) at 1/7th the parameter count
  • Beats GPT-5-Nano on MMMU-Pro (66.3 vs 57.2), MathVision (74.6 vs 62.2), OmniDocBench (86.2 vs 55.9)
  • 262K native context (1M extended), 201 languages, multi-token prediction
  • Runs on 8GB VRAM at BF16, ~3GB at 4-bit - ideal for lightweight multimodal agents
  • Apache 2.0, base model also available for fine-tuning

The 4B occupies what may be the optimal size for edge-deployed multimodal agents. It's large enough to handle complex reasoning (GPQA Diamond 76.2 in thinking mode) but small enough to run on laptop GPUs or mobile SoCs with aggressive quantization. For developers building applications that need vision, long context, and decent reasoning without datacenter hardware, this is the new default.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 4B |
| Active Parameters | 4B (all - dense) |
| Layers | 32 |
| Hidden Dimension | 2,560 |
| FFN Intermediate Dimension | 9,216 |
| Attention Pattern | 8 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 32 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 16 Q, 4 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native), ~1M (YaRN extended) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Enabled (toggleable via enable_thinking parameter) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 4B shares the 32-layer stack with the 9B but uses a narrower hidden dimension (2,560 vs 4,096) and smaller FFN (9,216 vs 12,288). The same 3:1 DeltaNet-to-Attention ratio is maintained.
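The 3:1 interleaving from the spec table can be made concrete with a short sketch. The function below just enumerates the mixer type of each of the 32 layers under the stated 8 x (3 Gated DeltaNet + 1 Gated Attention) layout; the string labels are illustrative, not module names from the released code.

```python
# Sketch of the 3:1 hybrid layout: 8 groups, each with 3 Gated DeltaNet
# blocks followed by 1 Gated Attention block (every block has its own FFN).

def hybrid_layer_pattern(num_groups: int = 8,
                         deltanet_per_group: int = 3) -> list[str]:
    """Return the ordered list of mixer types for the full stack."""
    layers = []
    for _ in range(num_groups):
        layers.extend(["gated_deltanet"] * deltanet_per_group)
        layers.append("gated_attention")
    return layers

pattern = hybrid_layer_pattern()
print(len(pattern))                      # 32 layers total
print(pattern.count("gated_deltanet"))   # 24 linear-attention layers
print(pattern.count("gated_attention"))  # 8 full-attention layers
```

The practical consequence of this ratio is that only 8 of 32 layers keep a growing KV cache, which is what makes the long-context figures below feasible at this size.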

Benchmark Performance

Language (Thinking Mode)

| Benchmark | Qwen3.5-4B | Qwen3.5-9B | Qwen3-30B | Qwen3-80B |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 79.1 | 82.5 | 80.9 | 82.7 |
| C-Eval | 85.1 | 88.2 | 87.4 | 89.7 |
| SuperGPQA | 52.9 | 58.2 | 56.8 | 60.8 |
| GPQA Diamond | 76.2 | 81.7 | 73.4 | 77.2 |
| IFEval | 89.8 | 91.5 | 88.9 | 88.9 |
| AA-LCR (Long Context) | 57.0 | 63.0 | 49.0 | 51.7 |
| LongBench v2 | 50.0 | 55.2 | 44.8 | 48.0 |

The 4B beats Qwen3-30B on GPQA Diamond (76.2 vs 73.4), IFEval (89.8 vs 88.9), and long-context tasks (AA-LCR 57.0 vs 49.0, LongBench v2 50.0 vs 44.8). On MMLU-Pro it trails by less than 2 points despite being 7.5x smaller. It even beats Qwen3-80B on long-context benchmarks.

Vision-Language

| Benchmark | Qwen3.5-4B | GPT-5-Nano | Gemini-2.5-Flash-Lite | Qwen3.5-9B |
| --- | --- | --- | --- | --- |
| MMMU | 77.6 | 75.8 | - | 78.4 |
| MMMU-Pro | 66.3 | 57.2 | 59.7 | 70.1 |
| MathVision | 74.6 | 62.2 | 52.1 | 78.9 |
| MathVista (mini) | 85.1 | 71.5 | 72.8 | 85.7 |
| RealWorldQA | 79.5 | 71.8 | 72.2 | 80.3 |
| OmniDocBench 1.5 | 86.2 | 55.9 | 79.4 | 87.7 |

The 4B's vision performance is remarkably close to the 9B - trailing by less than 2 points on most benchmarks. Against GPT-5-Nano, it leads on MMMU-Pro by 9 points, MathVision by 12 points, and OmniDocBench by 30 points.

Key Capabilities

Lightweight multimodal agent base - At 4B parameters, this model is small enough for edge deployment but multimodal enough to serve as an agent that processes documents, images, and video. OmniDocBench at 86.2 means it handles receipts, invoices, screenshots, and technical diagrams with high accuracy.
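A document-processing request to a model like this is typically assembled as a multimodal chat message. The sketch below uses the OpenAI-style content-part schema that recent Qwen chat templates accept; treat the exact field names as an assumption until the official model card confirms them.

```python
# Illustrative only: build a document-QA message with one image part and
# one text part, in the content-part schema common to Qwen chat templates.
# Field names ("type", "image", "text") are assumptions, not confirmed API.

def build_doc_request(image_ref: str, question: str) -> list[dict]:
    """Assemble a single-turn multimodal chat message list."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_ref},
            {"type": "text", "text": question},
        ],
    }]

messages = build_doc_request(
    "file:///tmp/invoice.png",
    "Extract the invoice total and the due date.",
)
```

The same structure extends to multi-image or video inputs by appending further content parts to the same message.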

Long context from a small model - 262K native context with YaRN extension to 1M tokens. LongBench v2 at 50.0 and AA-LCR at 57.0 outperform the previous Qwen3-80B on long-context tasks, showing that the Gated DeltaNet architecture maintains quality across the full context window.
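Going from the 262,144-token native window to ~1M tokens via YaRN amounts to a RoPE scaling factor of 4. The config fragment below follows the `rope_scaling` format earlier Qwen releases used in their Hugging Face configs; the exact keys for this release are an assumption.

```python
# Back-of-envelope YaRN setup: scale factor = target context / native context.
# The rope_scaling dict mirrors the format prior Qwen models used in
# Hugging Face configs; confirm keys against the released config.json.

NATIVE_CONTEXT = 262_144
TARGET_CONTEXT = 1_048_576  # ~1M tokens

factor = TARGET_CONTEXT / NATIVE_CONTEXT
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
print(factor)  # 4.0
```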

Thinking mode for complex reasoning - When enabled, thinking mode pushes GPQA Diamond to 76.2 and MMLU-Pro to 79.1 - competitive with the previous generation's 30B models. For latency-sensitive tasks, non-thinking mode provides direct responses without the chain-of-thought overhead.
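The spec table's `enable_thinking` parameter is normally passed to the chat template at request time, so the mode can be chosen per request. The helper below only assembles those arguments; the `apply_chat_template` call itself is left as a comment since it needs the released tokenizer, and this request-time toggle is an assumption based on how prior Qwen thinking models work.

```python
# Per-request thinking toggle: latency-sensitive calls skip chain-of-thought,
# hard reasoning calls enable it. Only the kwargs dict is built here.

def template_kwargs(latency_sensitive: bool) -> dict:
    """Pick thinking vs. direct-answer mode for one request."""
    return {
        "add_generation_prompt": True,
        "enable_thinking": not latency_sensitive,
    }

# text = tokenizer.apply_chat_template(messages, tokenize=False,
#                                      **template_kwargs(latency_sensitive=False))
print(template_kwargs(latency_sensitive=True))   # thinking disabled
print(template_kwargs(latency_sensitive=False))  # thinking enabled
```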

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-4B-Base) and eight quantized variants are available.

| Deployment Option | VRAM Required | Notes |
| --- | --- | --- |
| BF16 (full precision) | ~8 GB | RTX 3060 12GB, M1 Mac |
| 8-bit quantization | ~4 GB | Most modern GPUs, M1 Mac |
| 4-bit quantization | ~3 GB | Integrated GPUs, mobile SoCs |
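These figures are consistent with a simple weights-only estimate plus headroom for the KV cache and activations. A quick sanity check, assuming exactly 4e9 parameters:

```python
# Weights-only memory at different precisions: params * bits / 8 bytes.
# Real VRAM use is higher (KV cache, activations), which is why the table
# quotes ~8/4/3 GB rather than these raw figures.

def weight_gib(params: float, bits: int) -> float:
    """Gibibytes needed just for the model weights at a given precision."""
    return params * bits / 8 / 2**30

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gib(4e9, bits):.1f} GiB")
# 16-bit: 7.5 GiB, 8-bit: 3.7 GiB, 4-bit: 1.9 GiB of weights alone
```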

Strengths

  • Approaches Qwen3-30B benchmarks at 1/7th the parameters
  • Beats GPT-5-Nano on every reported vision benchmark, by double-digit margins on MMMU-Pro, MathVision, and OmniDocBench
  • 262K-1M context from just 4B params - beats Qwen3-80B on long-context tasks
  • ~8GB VRAM at BF16 makes it ideal for edge deployment and lightweight agents
  • Natively multimodal - no separate VL model needed
  • 8 quantized variants available for maximum deployment flexibility
  • Apache 2.0 with base model for fine-tuning

Weaknesses

  • Trails the 9B by 3-5 points on most benchmarks - the 9B is the better choice if hardware allows
  • No video benchmarks reported (VideoMME only available for 9B, 2B, and 0.8B)
  • Self-reported benchmarks - independent validation pending
  • Thinking mode adds latency; non-thinking mode has lower scores
  • Self-hosting only - no managed API at this parameter count
  • Gated DeltaNet requires compatible inference frameworks
