Qwen3.5-0.8B

Qwen3.5-0.8B is the floor of the Qwen 3.5 Small Series - and it's a remarkably high floor. Under a billion parameters, this model natively processes text, images, and video, supports 262K context, covers 201 languages, and posts MathVista 62.2 and OCRBench 74.5. It runs on around 1.6GB of VRAM at full precision and fits on a phone with 4-bit quantization.

TL;DR

  • 0.8B dense model - the smallest natively multimodal model in the Qwen 3.5 family
  • Handles text, images, and video from under a billion parameters
  • MathVista 62.2, OCRBench 74.5, VideoMME 63.8 (w/subtitles)
  • 262K native context, 201 languages, multi-token prediction
  • Apache 2.0 - ~1.6GB VRAM at BF16, ~0.5GB at 4-bit

The 0.8B exists for devices where every megabyte counts. Phone apps, embedded systems, IoT devices, and Raspberry Pi deployments can run a model that processes screenshots, reads documents, and even handles basic video understanding. A year ago, that required a 7B model and a serious GPU.

Key Specifications

| Specification | Details |
|---|---|
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 0.8B |
| Active Parameters | 0.8B (all - dense) |
| Layers | 24 |
| Hidden Dimension | 1,024 |
| FFN Intermediate Dimension | 3,584 |
| Attention Pattern | 6 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 16 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 8 Q, 2 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Non-thinking (default) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 0.8B shares the 24-layer stack with the 2B but halves the hidden dimension (1,024 vs 2,048) and uses a smaller FFN (3,584 vs 6,144). The attention architecture is identical - same head dimensions, same 3:1 DeltaNet-to-Attention ratio.
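The 3:1 interleaving can be sketched as a simple layer-type schedule. This is an illustrative sketch only - the layer names are placeholders, not the actual module names in Qwen's implementation:

```python
# Sketch of the 24-layer hybrid stack: 6 blocks, each with three
# Gated DeltaNet layers followed by one Gated Attention layer.
# Layer names are illustrative placeholders, not real module names.
def layer_schedule(num_blocks: int = 6) -> list[str]:
    schedule = []
    for _ in range(num_blocks):
        schedule += ["gated_deltanet"] * 3  # linear-attention layers
        schedule += ["gated_attention"]     # one full-attention layer
    return schedule

stack = layer_schedule()
print(len(stack), stack.count("gated_attention"))  # 24 layers, 6 full-attention
```

Only 6 of the 24 layers use full attention; the rest use the constant-memory DeltaNet recurrence, which is what keeps long-context memory costs down.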

Benchmark Performance

Language (Non-Thinking Mode)

| Benchmark | Qwen3.5-0.8B |
|---|---|
| MMLU-Pro | 29.7 |
| MMLU-Redux | 48.5 |
| C-Eval | 46.4 |
| SuperGPQA | 16.9 |
| IFEval | 52.1 |
| MMMLU | 34.1 |

Language benchmarks are honest about the 0.8B's limits. This isn't a model for graduate-level reasoning or complex instruction following. MMLU-Pro at 29.7 and SuperGPQA at 16.9 show the constraints of sub-billion parameter language modeling. IFEval at 52.1 is functional but not reliable for production instruction following.

Vision-Language

| Benchmark | Qwen3.5-0.8B | Qwen3.5-2B | Qwen3.5-9B |
|---|---|---|---|
| MMMU | 49.0 | 64.2 | 78.4 |
| MathVista (mini) | 62.2 | 76.7 | 85.7 |
| RealWorldQA | 63.4 | 74.5 | 80.3 |
| MMStar | 58.3 | - | 79.7 |
| MMBench (EN-DEV) | 69.9 | 83.3 | 90.1 |
| OCRBench | 74.5 | 84.5 | - |
| VideoMME (w/sub) | 63.8 | 75.6 | 84.5 |
| RefCOCO (avg) | 79.3 | 84.8 | - |

Vision is where the 0.8B punches above its weight. OCRBench at 74.5 means useful document text extraction from a phone-sized model. MathVista at 62.2 and VideoMME at 63.8 demonstrate genuine multimodal reasoning. RefCOCO at 79.3 (visual grounding - pointing to objects in images) is surprisingly strong for a model this small.

Key Capabilities

Multimodal from under a billion parameters - This is the smallest model in the Qwen 3.5 family and one of the smallest natively multimodal models publicly available. It processes images, video, and text through the same architecture and weights. The DeepStack Vision Transformer with Conv3d handles temporal video understanding even at this scale.

Edge and mobile deployment - At ~1.6GB VRAM (BF16) or ~0.5GB (4-bit quantization), the 0.8B runs on phone-class hardware. Seven quantized variants are available on HuggingFace, including GGUF formats compatible with llama.cpp for CPU-only deployment. This opens multimodal AI to IoT devices, embedded systems, and offline mobile applications.

262K context window - The Gated DeltaNet architecture provides constant-memory attention, enabling 262K token context even at 0.8B parameters. This means the model can process long documents, multi-page PDFs, or extended conversation histories without running out of memory on edge devices.
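A back-of-envelope estimate shows why the hybrid layout matters at 262K tokens. The sketch below counts BF16 KV-cache memory for only the full-attention layers (using the published 2 KV heads x 256 head dim) and ignores the DeltaNet layers' fixed-size recurrent state - a simplification, not a measurement:

```python
# Rough BF16 KV-cache size for the full-attention layers only.
# DeltaNet layers keep a constant-size recurrent state instead of a
# growing cache, so they are excluded. Back-of-envelope, not measured.
def attn_kv_cache_gb(seq_len, attn_layers, kv_heads=2, head_dim=256, bytes_per=2):
    # factor of 2 covers both K and V tensors per layer
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

hybrid = attn_kv_cache_gb(262_144, attn_layers=6)   # the 3:1 hybrid's 6 layers
full = attn_kv_cache_gb(262_144, attn_layers=24)    # if every layer cached KV
print(round(hybrid, 1), round(full, 1))  # ~3.2 GB vs ~12.9 GB
```

If all 24 layers used standard attention, the 262K cache alone would dwarf the model weights; caching KV for only 6 layers cuts that cost by 4x.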

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-0.8B-Base) and seven quantized variants are available.

| Deployment Option | VRAM Required | Notes |
|---|---|---|
| BF16 (full precision) | ~1.6 GB | Any modern GPU, M1 Mac |
| 8-bit quantization | ~0.8 GB | Mobile GPUs, integrated graphics |
| 4-bit quantization | ~0.5 GB | Phones, Raspberry Pi, CPU-only |
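These figures follow from simple weight-storage arithmetic (parameters x bits per weight); quantized runtimes add overhead for activations, embeddings, and buffers, which is why the 4-bit row lands nearer 0.5 GB than the raw 0.4 GB. A quick sketch:

```python
# Weight-only memory estimate: params x bits-per-weight / 8 bytes.
# Real deployments add overhead for activations, KV/recurrent state,
# and runtime buffers, so treat these as lower bounds.
def weight_memory_gb(params_billion=0.8, bits_per_weight=16):
    return params_billion * bits_per_weight / 8

print(weight_memory_gb(bits_per_weight=16))  # 1.6 (BF16)
print(weight_memory_gb(bits_per_weight=8))   # 0.8 (8-bit)
print(weight_memory_gb(bits_per_weight=4))   # 0.4 (4-bit, before overhead)
```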

Strengths

  • Smallest natively multimodal model in Qwen 3.5 - text, images, video from 0.8B params
  • OCRBench 74.5 and RefCOCO 79.3 enable practical vision tasks on edge devices
  • 262K context from constant-memory DeltaNet attention
  • ~0.5GB at 4-bit quantization - runs on phones and embedded hardware
  • 201 languages with 248K vocabulary
  • 7 quantized variants including GGUF for llama.cpp
  • Apache 2.0 with base model for fine-tuning

Weaknesses

  • Language benchmarks are limited - MMLU-Pro 29.7, IFEval 52.1
  • Non-thinking mode only - no chain-of-thought reasoning available
  • Not suitable for complex reasoning, coding, or multi-step instruction following
  • Knowledge capacity is constrained by the small parameter count
  • Self-reported benchmarks - independent validation pending
  • Vision quality drops noticeably compared with the 2B (OCRBench 74.5 vs 84.5)

About the author

James is an AI benchmarks and tools analyst - a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.