Qwen3.5-2B

Qwen3.5-2B is the compact reasoner in the Qwen 3.5 Small Series. At just 2 billion parameters, it scores 84.5 on OCRBench, 75.6 on VideoMME, and 66.5 on MMLU-Pro in thinking mode - numbers that would have headlined a 7B model launch a year ago. It's natively multimodal, supports both thinking and non-thinking modes, and fits in roughly 4GB of VRAM.

TL;DR

  • 2B dense multimodal model - text, images, and video natively, no adapter needed
  • OCRBench 84.5 and VideoMME 75.6 outperform many previous-gen 7B models
  • Thinking mode (MMLU-Pro 66.5, IFEval 78.6) and non-thinking mode (MMLU-Pro 55.3) both available
  • 262K native context, 201 languages, multi-token prediction
  • Apache 2.0 - runs on ~4GB VRAM at BF16, ~1.5GB at 4-bit

The 2B is the sweet spot for applications that need multimodal understanding on seriously constrained hardware. Laptop GPUs, Raspberry Pi-class devices with quantization, and mobile processors can all run this model. Having thinking mode at 2B means the model can tackle complex reasoning when needed and switch to fast non-thinking mode for simple tasks.
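In practice, switching modes is just a request-level decision. A minimal sketch of a per-request router, assuming an `enable_thinking`-style chat-template flag as in earlier Qwen releases (the exact flag name for Qwen 3.5 is an assumption here):

```python
# Sketch: route requests between thinking and non-thinking mode.
# ASSUMPTION: the chat template accepts an `enable_thinking` flag,
# as in earlier Qwen releases; the flag name is illustrative.

def build_request(prompt: str, complex_task: bool) -> dict:
    """Return chat-template kwargs for a single request."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": complex_task,  # reasoning trace only when needed
    }

# Latency-sensitive lookup: skip the reasoning trace.
fast = build_request("What's the capital of France?", complex_task=False)

# Multi-step reasoning: let the model think first.
slow = build_request("Prove that sqrt(2) is irrational.", complex_task=True)
```

The routing predicate itself (here a boolean passed by the caller) could just as well be a classifier or a length heuristic; the point is that the quality/latency trade-off is made per request, not per deployment.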

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 2B |
| Active Parameters | 2B (all - dense) |
| Layers | 24 |
| Hidden Dimension | 2,048 |
| FFN Intermediate Dimension | 6,144 |
| Attention Pattern | 6 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 16 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 8 Q, 2 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Supported (both thinking and non-thinking) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 2B uses the shallower 24-layer stack (shared with the 0.8B) with the 6x repeating block pattern. It has half the attention heads of the 9B/4B models (8 Q, 2 KV vs 16 Q, 4 KV) but maintains the same head dimensions and DeltaNet architecture.
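The attention pattern from the spec table can be sanity-checked in a few lines: six repeats of three Gated DeltaNet layers followed by one Gated Attention layer yield the 24-layer stack and the 3:1 hybrid ratio (layer names here are labels for illustration, not actual module names):

```python
# Sketch: enumerate the 2B's 24-layer hybrid stack from the spec table.
# Pattern: 6 x (3 x Gated DeltaNet -> 1 x Gated Attention),
# with an FFN after every layer (omitted from the labels below).

BLOCKS = 6
layers = []
for _ in range(BLOCKS):
    layers += ["gated_deltanet"] * 3 + ["gated_attention"]

# 18 DeltaNet layers and 6 attention layers: the 3:1 hybrid ratio.
assert len(layers) == 24
```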

Benchmark Performance

Language

| Benchmark | Qwen3.5-2B (Thinking) | Qwen3.5-2B (Non-Thinking) |
| --- | --- | --- |
| MMLU-Pro | 66.5 | 55.3 |
| MMLU-Redux | 79.6 | 69.2 |
| C-Eval | 73.2 | 65.2 |
| SuperGPQA | 37.5 | 30.4 |
| IFEval | 78.6 | 61.2 |

Thinking mode provides a substantial boost across all benchmarks - 11 points on MMLU-Pro, 17 points on IFEval. This makes the thinking/non-thinking toggle a genuine architectural advantage: use thinking mode for complex tasks, non-thinking for latency-sensitive applications.

Vision-Language

| Benchmark | Qwen3.5-2B | Qwen3.5-0.8B | Qwen3.5-9B |
| --- | --- | --- | --- |
| MMMU | 64.2 | 49.0 | 78.4 |
| MathVista (mini) | 76.7 | 62.2 | 85.7 |
| RealWorldQA | 74.5 | 63.4 | 80.3 |
| MMBench (EN-DEV) | 83.3 | 69.9 | 90.1 |
| OCRBench | 84.5 | 74.5 | - |
| RefCOCO (avg) | 84.8 | 79.3 | - |
| VideoMME (w/ sub) | 75.6 | 63.8 | 84.5 |

OCRBench at 84.5 stands out - the 2B handles document text extraction with accuracy that makes it viable for production OCR pipelines. RefCOCO at 84.8 (visual grounding) and VideoMME at 75.6 demonstrate genuine multimodal reasoning capability, not just image classification.

Key Capabilities

Thinking mode at 2B scale - The 2B is the smallest model in the series that supports thinking mode by default. This means a model running on 4GB of VRAM can perform step-by-step reasoning, pushing IFEval from 61.2 to 78.6. The thinking/non-thinking toggle allows developers to trade latency for quality on a per-request basis.

Document understanding - OCRBench 84.5 and MMBench 83.3 make the 2B a strong candidate for on-device document processing. Extracting text from screenshots, receipts, forms, and technical diagrams doesn't require a datacenter model anymore.
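For an OCR pipeline, the model would consume a standard multimodal chat message. A minimal sketch of the message construction, assuming the content-part schema (`"image"` / `"text"` typed parts) carries over from Qwen2.5-VL - the schema is an assumption, not confirmed for Qwen 3.5:

```python
# Sketch: build a Qwen-VL-style multimodal chat message for OCR.
# ASSUMPTION: content parts typed "image" / "text", as in Qwen2.5-VL;
# the image path and prompt are illustrative.

def build_ocr_messages(image_path: str) -> list[dict]:
    """Construct a single-turn OCR request over one document image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": "Extract all text from this document."},
            ],
        }
    ]

messages = build_ocr_messages("receipt.jpg")
```

These messages would then be passed through the model's chat template and processor; the structure stays the same for receipts, forms, or diagrams - only the prompt text changes.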

Video comprehension - VideoMME at 75.6 (with subtitles) from just 2 billion parameters. The DeepStack Vision Transformer with Conv3d patch embeddings captures temporal dynamics, and the model can answer questions about video content with surprising accuracy for its size.
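Video models like this typically see a fixed budget of sampled frames rather than every frame. A minimal uniform-sampling sketch (the frame budget and timestamps are illustrative, not Qwen 3.5's actual preprocessing):

```python
# Sketch: pick evenly spaced timestamps to sample frames from a clip.
# The sampled frames - not the raw video stream - are what a
# vision-language model's processor would actually encode.

def sample_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Return n_frames timestamps evenly spaced across the clip."""
    if n_frames == 1:
        return [duration_s / 2]
    step = duration_s / (n_frames - 1)
    return [round(i * step, 3) for i in range(n_frames)]

# A 10-second clip sampled at 5 frames: 0s, 2.5s, 5s, 7.5s, 10s.
ts = sample_timestamps(10.0, 5)
```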

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-2B-Base) and four quantized variants are available.

| Deployment Option | VRAM Required | Notes |
| --- | --- | --- |
| BF16 (full precision) | ~4 GB | Most modern GPUs, M1 Mac |
| 8-bit quantization | ~2 GB | Integrated GPUs, mobile |
| 4-bit quantization | ~1.5 GB | Raspberry Pi (with swap), old GPUs |
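The VRAM figures follow from a back-of-envelope weight-size calculation - parameter count times bytes per weight - with quantization overhead and activations accounting for the rest:

```python
# Back-of-envelope VRAM estimate for the weights alone.
# Actual usage adds activations, KV/state cache, and quantization
# metadata, which is why the table's 4-bit figure (~1.5 GB) sits
# above the raw 1.0 GB weight footprint.

def weight_vram_gb(params_b: float, bits_per_weight: int) -> float:
    """Weight memory in GB for params_b billion parameters."""
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9, 1)

print(weight_vram_gb(2.0, 16))  # BF16  -> 4.0
print(weight_vram_gb(2.0, 8))   # int8  -> 2.0
print(weight_vram_gb(2.0, 4))   # 4-bit -> 1.0
```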

Strengths

  • OCRBench 84.5 and VideoMME 75.6 from just 2B parameters
  • Thinking mode provides 11-17 point boosts on reasoning benchmarks
  • 262K context window with constant memory via Gated DeltaNet
  • Natively multimodal - text, images, video from the same weights
  • ~4GB VRAM at BF16, ~1.5GB at 4-bit - runs on almost anything
  • 201 languages with 248K vocabulary
  • Apache 2.0 with base model for fine-tuning

Weaknesses

  • Significant gap to the 4B on knowledge benchmarks (MMLU-Pro 66.5 vs 79.1)
  • Non-thinking mode scores drop substantially (MMLU-Pro 55.3, IFEval 61.2)
  • No YaRN extension to 1M mentioned - may be limited to 262K native context
  • Self-reported benchmarks - independent validation pending
  • Self-hosting only - no managed API
  • SuperGPQA at 37.5 shows limits on graduate-level reasoning
