Qwen3.5-2B
Qwen3.5-2B is a 2B dense multimodal model with 262K context, thinking mode, and native vision including video understanding. OCRBench 84.5, VideoMME 75.6. Apache 2.0 licensed.

Qwen3.5-2B is the compact reasoner in the Qwen 3.5 Small Series. At just 2 billion parameters, it scores 84.5 on OCRBench, 75.6 on VideoMME, and 66.5 on MMLU-Pro in thinking mode - numbers that would have headlined a 7B model launch a year ago. It's natively multimodal, supports both thinking and non-thinking modes, and fits in roughly 4GB of VRAM at BF16.
TL;DR
- 2B dense multimodal model - text, images, and video natively, no adapter needed
- OCRBench 84.5 and VideoMME 75.6 outperform many previous-gen 7B models
- Thinking mode (MMLU-Pro 66.5, IFEval 78.6) and non-thinking mode (MMLU-Pro 55.3) both available
- 262K native context, 201 languages, multi-token prediction
- Apache 2.0 - runs on ~4GB VRAM at BF16, ~1.5GB at 4-bit
The 2B is the sweet spot for applications that need multimodal understanding on seriously constrained hardware. Laptop GPUs, Raspberry Pi-class devices with quantization, and mobile processors can all run this model. Having thinking mode at 2B means the model can tackle complex reasoning when needed and switch to fast non-thinking mode for simple tasks.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 2B |
| Active Parameters | 2B (dense; all parameters active) |
| Layers | 24 |
| Hidden Dimension | 2,048 |
| FFN Intermediate Dimension | 6,144 |
| Attention Pattern | 6 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 16 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 8 Q, 2 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Supported (both thinking and non-thinking) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |
The 2B uses the shallower 24-layer stack (shared with the 0.8B) with the 6x repeating block pattern. It has half the attention heads of the 9B/4B models (8 Q, 2 KV vs 16 Q, 4 KV) but maintains the same head dimensions and DeltaNet architecture.
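The repeating block pattern from the table above can be sketched as a per-layer schedule. A minimal illustration - the layer names here are ours, not the module names in the actual release:

```python
# Sketch of the 24-layer stack described above: 6 repetitions of
# (3 Gated DeltaNet layers -> 1 Gated Attention layer).
# "gated_deltanet" / "gated_attention" are illustrative labels, not
# identifiers from the released code.

def layer_schedule(blocks: int = 6, deltanet_per_block: int = 3) -> list[str]:
    """Return the mixer type for each layer in order."""
    block = ["gated_deltanet"] * deltanet_per_block + ["gated_attention"]
    return block * blocks

schedule = layer_schedule()
assert len(schedule) == 24                     # 24 layers total
assert schedule.count("gated_attention") == 6  # one full-attention layer per block
```

The 3:1 ratio means only 6 of the 24 layers pay the quadratic attention cost; the rest use DeltaNet's constant-size state.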
Benchmark Performance
Language
| Benchmark | Qwen3.5-2B (Thinking) | Qwen3.5-2B (Non-Thinking) |
|---|---|---|
| MMLU-Pro | 66.5 | 55.3 |
| MMLU-Redux | 79.6 | 69.2 |
| C-Eval | 73.2 | 65.2 |
| SuperGPQA | 37.5 | 30.4 |
| IFEval | 78.6 | 61.2 |
Thinking mode provides a substantial boost across all benchmarks - 11 points on MMLU-Pro and 17 on IFEval. That makes the thinking/non-thinking toggle a genuine practical advantage: use thinking mode for complex tasks and non-thinking mode for latency-sensitive applications.
Vision-Language
| Benchmark | Qwen3.5-2B | Qwen3.5-0.8B | Qwen3.5-9B |
|---|---|---|---|
| MMMU | 64.2 | 49.0 | 78.4 |
| MathVista (mini) | 76.7 | 62.2 | 85.7 |
| RealWorldQA | 74.5 | 63.4 | 80.3 |
| MMBench (EN-DEV) | 83.3 | 69.9 | 90.1 |
| OCRBench | 84.5 | 74.5 | - |
| RefCOCO (avg) | 84.8 | 79.3 | - |
| VideoMME (w/sub) | 75.6 | 63.8 | 84.5 |
OCRBench at 84.5 stands out - the 2B handles document text extraction with accuracy that makes it viable for production OCR pipelines. RefCOCO at 84.8 (visual grounding) and VideoMME at 75.6 demonstrate genuine multimodal reasoning capability, not just image classification.
Key Capabilities
Thinking mode at 2B scale - The 2B is the smallest model in the series that supports thinking mode. A model running on roughly 4GB of VRAM can perform step-by-step reasoning, pushing IFEval from 61.2 to 78.6. The thinking/non-thinking toggle lets developers trade latency for quality on a per-request basis.
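Earlier Qwen3 releases expose this toggle as an `enable_thinking` chat-template flag that OpenAI-compatible servers such as vLLM pass through via `chat_template_kwargs`; assuming Qwen3.5 keeps the same convention, a per-request toggle might look like this (model name and key names are assumptions, not confirmed for this release):

```python
# Sketch: per-request thinking toggle, assuming Qwen3.5 reuses the
# `enable_thinking` chat-template flag from earlier Qwen3 models.
# The model name and the pass-through key are assumptions.

def build_request(prompt: str, thinking: bool) -> dict:
    """Build an OpenAI-style chat completion payload with the thinking flag."""
    return {
        "model": "Qwen3.5-2B",
        "messages": [{"role": "user", "content": prompt}],
        # Forwarded to the chat template by servers like vLLM.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

fast = build_request("Summarize this receipt.", thinking=False)
deep = build_request("Prove the result step by step.", thinking=True)
```

The same conversation can mix both modes: quick lookups stay in non-thinking mode, and only the hard requests pay the latency cost of a reasoning trace.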
Document understanding - OCRBench 84.5 and MMBench 83.3 make the 2B a strong candidate for on-device document processing. Extracting text from screenshots, receipts, forms, and technical diagrams doesn't require a datacenter model anymore.
Video comprehension - The 2B scores 75.6 on VideoMME (with subtitles) from just 2 billion parameters. The DeepStack Vision Transformer with Conv3d patch embeddings captures temporal dynamics, and the model answers questions about video content with surprising accuracy for its size.
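Qwen's VL models accept mixed media through OpenAI-style content parts; assuming Qwen3.5 follows the same message format, a video question could be shaped like this (the URL is a placeholder and the field names are assumptions for this release):

```python
# Sketch: a multimodal chat message combining video and text in the
# OpenAI-style content-parts format used by Qwen's VL models.
# The video URL is a placeholder; field names are assumptions for Qwen3.5.

def video_question(video_url: str, question: str) -> list[dict]:
    """Build a single-turn message list pairing a video with a text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": question},
        ],
    }]

messages = video_question(
    "https://example.com/clip.mp4",
    "What happens in the final five seconds?",
)
```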
Pricing and Availability
Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-2B-Base) and four quantized variants are available.
| Deployment Option | VRAM Required | Notes |
|---|---|---|
| BF16 (full precision) | ~4 GB | Most modern GPUs, M1 Mac |
| 8-bit quantization | ~2 GB | Integrated GPUs, mobile |
| 4-bit quantization | ~1.5 GB | Raspberry Pi (with swap), old GPUs |
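The VRAM figures above can be sanity-checked with a back-of-the-envelope weight estimate: memory is roughly parameter count times bytes per parameter. This ignores activations, the recurrent/KV state, and runtime overhead, which is why the 4-bit row lands above the pure-weight figure:

```python
# Rough weight-memory estimate: params * bits / 8, converted to GiB.
# Excludes activations, cache state, and framework overhead, so real
# usage (as in the table above) runs somewhat higher.

def weight_gib(params: float, bits: int) -> float:
    """Approximate weight memory in GiB at the given precision."""
    return params * bits / 8 / 2**30

two_b = 2e9
bf16 = weight_gib(two_b, 16)   # ~3.7 GiB -> the table's "~4 GB"
int8 = weight_gib(two_b, 8)    # ~1.9 GiB -> "~2 GB"
int4 = weight_gib(two_b, 4)    # ~0.9 GiB; overhead brings it to "~1.5 GB"
```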
Strengths
- OCRBench 84.5 and VideoMME 75.6 from just 2B parameters
- Thinking mode provides 11-17 point boosts on reasoning benchmarks
- 262K context window with constant memory via Gated DeltaNet
- Natively multimodal - text, images, video from the same weights
- ~4GB VRAM at BF16, ~1.5GB at 4-bit - runs on almost anything
- 201 languages with 248K vocabulary
- Apache 2.0 with base model for fine-tuning
Weaknesses
- Significant gap to the 4B on knowledge benchmarks (MMLU-Pro 66.5 vs 79.1)
- Non-thinking mode scores drop substantially (MMLU-Pro 55.3, IFEval 61.2)
- No YaRN extension to 1M mentioned - may be limited to 262K native context
- Self-reported benchmarks - independent validation pending
- Self-hosting only - no managed API
- SuperGPQA at 37.5 shows limits on graduate-level reasoning
Related Coverage
- Qwen 3.5 Small Series Ships Four Models
- Qwen3.5-9B
- Qwen3.5-4B
- Qwen3.5-0.8B
- Qwen 3.5 Medium Series Drops Four Models
- Open Source LLM Leaderboard