Qwen3.5-2B

Qwen3.5-2B is the compact reasoner in the Qwen 3.5 Small Series. At just 2 billion parameters, it scores 84.5 on OCRBench, 75.6 on VideoMME, and 66.5 on MMLU-Pro in thinking mode - numbers that would have headlined a 7B model launch a year ago. It's natively multimodal, supports both thinking and non-thinking modes, and fits in roughly 4GB of VRAM.

TL;DR

  • 2B dense multimodal model - text, images, and video natively, no adapter needed
  • OCRBench 84.5 and VideoMME 75.6 outperform many previous-gen 7B models
  • Thinking mode (MMLU-Pro 66.5, IFEval 78.6) and non-thinking mode (MMLU-Pro 55.3) both available
  • 262K native context, 201 languages, multi-token prediction
  • Apache 2.0 - runs on ~4GB VRAM at BF16, ~1.5GB at 4-bit

The 2B is the sweet spot for applications that need multimodal understanding on seriously constrained hardware. Laptop GPUs, Raspberry Pi-class devices with quantization, and mobile processors can all run this model. Having thinking mode at 2B means the model can tackle complex reasoning when needed and switch to fast non-thinking mode for simple tasks.
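In practice, switching modes is just a request-level decision. A minimal sketch of a per-request router, assuming an `enable_thinking`-style chat-template flag as in earlier Qwen releases (the exact flag name for Qwen 3.5 is an assumption here):

```python
# Sketch: route requests between thinking and non-thinking mode.
# ASSUMPTION: the chat template accepts an `enable_thinking` flag,
# as in earlier Qwen releases; the flag name is illustrative.

def build_request(prompt: str, complex_task: bool) -> dict:
    """Return chat-template kwargs for a single request."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": complex_task,  # reasoning trace only when needed
    }

# Latency-sensitive lookup: skip the reasoning trace.
fast = build_request("What's the capital of France?", complex_task=False)

# Multi-step reasoning: let the model think first.
slow = build_request("Prove that sqrt(2) is irrational.", complex_task=True)
```

The routing predicate itself (here a boolean passed by the caller) could just as well be a classifier or a length heuristic; the point is that the quality/latency trade-off is made per request, not per deployment.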

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 2B |
| Active Parameters | 2B (all - dense) |
| Layers | 24 |
| Hidden Dimension | 2,048 |
| FFN Intermediate Dimension | 6,144 |
| Attention Pattern | 6 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 16 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 8 Q, 2 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Supported (both thinking and non-thinking) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 2B uses the shallower 24-layer stack (shared with the 0.8B) with the 6x repeating block pattern. It has half the attention heads of the 9B/4B models (8 Q, 2 KV vs 16 Q, 4 KV) but maintains the same head dimensions and DeltaNet architecture.
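The attention pattern from the spec table can be sanity-checked in a few lines: six repeats of three Gated DeltaNet layers followed by one Gated Attention layer yield the 24-layer stack and the 3:1 hybrid ratio (layer names here are labels for illustration, not actual module names):

```python
# Sketch: enumerate the 2B's 24-layer hybrid stack from the spec table.
# Pattern: 6 x (3 x Gated DeltaNet -> 1 x Gated Attention),
# with an FFN after every layer (omitted from the labels below).

BLOCKS = 6
layers = []
for _ in range(BLOCKS):
    layers += ["gated_deltanet"] * 3 + ["gated_attention"]

# 18 DeltaNet layers and 6 attention layers: the 3:1 hybrid ratio.
assert len(layers) == 24
```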

Benchmark Performance

Language

| Benchmark | Qwen3.5-2B (Thinking) | Qwen3.5-2B (Non-Thinking) |
| --- | --- | --- |
| MMLU-Pro | 66.5 | 55.3 |
| MMLU-Redux | 79.6 | 69.2 |
| C-Eval | 73.2 | 65.2 |
| SuperGPQA | 37.5 | 30.4 |
| IFEval | 78.6 | 61.2 |

Thinking mode provides a substantial boost across all benchmarks - 11 points on MMLU-Pro, 17 points on IFEval. This makes the thinking/non-thinking toggle a genuine architectural advantage: use thinking mode for complex tasks, non-thinking for latency-sensitive applications.

Vision-Language

| Benchmark | Qwen3.5-2B | Qwen3.5-0.8B | Qwen3.5-9B |
| --- | --- | --- | --- |
| MMMU | 64.2 | 49.0 | 78.4 |
| MathVista (mini) | 76.7 | 62.2 | 85.7 |
| RealWorldQA | 74.5 | 63.4 | 80.3 |
| MMBench (EN-DEV) | 83.3 | 69.9 | 90.1 |
| OCRBench | 84.5 | 74.5 | - |
| RefCOCO (avg) | 84.8 | 79.3 | - |
| VideoMME (w/ sub) | 75.6 | 63.8 | 84.5 |

OCRBench at 84.5 stands out - the 2B handles document text extraction with accuracy that makes it viable for production OCR pipelines. RefCOCO at 84.8 (visual grounding) and VideoMME at 75.6 demonstrate genuine multimodal reasoning capability, not just image classification.

Key Capabilities

Thinking mode at 2B scale - The 2B is the smallest model in the series that supports thinking mode by default. This means a model running on 4GB of VRAM can perform step-by-step reasoning, pushing IFEval from 61.2 to 78.6. The thinking/non-thinking toggle allows developers to trade latency for quality on a per-request basis.

Document understanding - OCRBench 84.5 and MMBench 83.3 make the 2B a strong candidate for on-device document processing. Extracting text from screenshots, receipts, forms, and technical diagrams doesn't require a datacenter model anymore.
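For an OCR pipeline, the model would consume a standard multimodal chat message. A minimal sketch of the message construction, assuming the content-part schema (`"image"` / `"text"` typed parts) carries over from Qwen2.5-VL - the schema is an assumption, not confirmed for Qwen 3.5:

```python
# Sketch: build a Qwen-VL-style multimodal chat message for OCR.
# ASSUMPTION: content parts typed "image" / "text", as in Qwen2.5-VL;
# the image path and prompt are illustrative.

def build_ocr_messages(image_path: str) -> list[dict]:
    """Construct a single-turn OCR request over one document image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": "Extract all text from this document."},
            ],
        }
    ]

messages = build_ocr_messages("receipt.jpg")
```

These messages would then be passed through the model's chat template and processor; the structure stays the same for receipts, forms, or diagrams - only the prompt text changes.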

Video comprehension - VideoMME at 75.6 (with subtitles) from just 2 billion parameters. The DeepStack Vision Transformer with Conv3d patch embeddings captures temporal dynamics, and the model can answer questions about video content with surprising accuracy for its size.
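Video models like this typically see a fixed budget of sampled frames rather than every frame. A minimal uniform-sampling sketch (the frame budget and timestamps are illustrative, not Qwen 3.5's actual preprocessing):

```python
# Sketch: pick evenly spaced timestamps to sample frames from a clip.
# The sampled frames - not the raw video stream - are what a
# vision-language model's processor would actually encode.

def sample_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Return n_frames timestamps evenly spaced across the clip."""
    if n_frames == 1:
        return [duration_s / 2]
    step = duration_s / (n_frames - 1)
    return [round(i * step, 3) for i in range(n_frames)]

# A 10-second clip sampled at 5 frames: 0s, 2.5s, 5s, 7.5s, 10s.
ts = sample_timestamps(10.0, 5)
```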

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-2B-Base) and four quantized variants are available.

| Deployment Option | VRAM Required | Notes |
| --- | --- | --- |
| BF16 (full precision) | ~4 GB | Most modern GPUs, M1 Mac |
| 8-bit quantization | ~2 GB | Integrated GPUs, mobile |
| 4-bit quantization | ~1.5 GB | Raspberry Pi (with swap), old GPUs |
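The VRAM figures follow from a back-of-envelope weight-size calculation - parameter count times bytes per weight - with quantization overhead and activations accounting for the rest:

```python
# Back-of-envelope VRAM estimate for the weights alone.
# Actual usage adds activations, KV/state cache, and quantization
# metadata, which is why the table's 4-bit figure (~1.5 GB) sits
# above the raw 1.0 GB weight footprint.

def weight_vram_gb(params_b: float, bits_per_weight: int) -> float:
    """Weight memory in GB for params_b billion parameters."""
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9, 1)

print(weight_vram_gb(2.0, 16))  # BF16  -> 4.0
print(weight_vram_gb(2.0, 8))   # int8  -> 2.0
print(weight_vram_gb(2.0, 4))   # 4-bit -> 1.0
```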

Strengths

  • OCRBench 84.5 and VideoMME 75.6 from just 2B parameters
  • Thinking mode provides 11-17 point boosts on reasoning benchmarks
  • 262K context window with constant memory via Gated DeltaNet
  • Natively multimodal - text, images, video from the same weights
  • ~4GB VRAM at BF16, ~1.5GB at 4-bit - runs on almost anything
  • 201 languages with 248K vocabulary
  • Apache 2.0 with base model for fine-tuning

Weaknesses

  • Significant gap to the 4B on knowledge benchmarks (MMLU-Pro 66.5 vs 79.1)
  • Non-thinking mode scores drop substantially (MMLU-Pro 55.3, IFEval 61.2)
  • No YaRN extension to 1M mentioned - may be limited to 262K native context
  • Self-reported benchmarks - independent validation pending
  • Self-hosting only - no managed API
  • SuperGPQA at 37.5 shows limits on graduate-level reasoning
