Qwen3.5-0.8B

Qwen3.5-0.8B is the floor of the Qwen 3.5 Small Series - and it's a remarkably high floor. Under a billion parameters, this model natively processes text, images, and video, supports 262K context, covers 201 languages, and posts MathVista 62.2 and OCRBench 74.5. It runs on around 1.6GB of VRAM at full precision and fits on a phone with 4-bit quantization.

TL;DR

  • 0.8B dense model - the smallest natively multimodal model in the Qwen 3.5 family
  • Handles text, images, and video from under a billion parameters
  • MathVista 62.2, OCRBench 74.5, VideoMME 63.8 (w/subtitles)
  • 262K native context, 201 languages, multi-token prediction
  • Apache 2.0 - ~1.6GB VRAM at BF16, ~0.5GB at 4-bit

The 0.8B exists for devices where every megabyte counts. Phone apps, embedded systems, IoT devices, and Raspberry Pi deployments can run a model that processes screenshots, reads documents, and even handles basic video understanding. A year ago, that required a 7B model and a serious GPU.

Key Specifications

| Specification | Details |
|---|---|
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 0.8B |
| Active Parameters | 0.8B (all - dense) |
| Layers | 24 |
| Hidden Dimension | 1,024 |
| FFN Intermediate Dimension | 3,584 |
| Attention Pattern | 6 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 16 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 8 Q, 2 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Non-thinking (default) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 0.8B shares the 24-layer stack with the 2B but halves the hidden dimension (1,024 vs 2,048) and uses a smaller FFN (3,584 vs 6,144). The attention architecture is identical - same head dimensions, same 3:1 DeltaNet-to-Attention ratio.
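The 3:1 interleaving can be sketched as a simple layer-type schedule. This is an illustrative sketch only - the layer names are placeholders, not the actual module names in Qwen's implementation:

```python
# Sketch of the 24-layer hybrid stack: 6 blocks, each with three
# Gated DeltaNet layers followed by one Gated Attention layer.
# Layer names are illustrative placeholders, not real module names.
def layer_schedule(num_blocks: int = 6) -> list[str]:
    schedule = []
    for _ in range(num_blocks):
        schedule += ["gated_deltanet"] * 3  # linear-attention layers
        schedule += ["gated_attention"]     # one full-attention layer
    return schedule

stack = layer_schedule()
print(len(stack), stack.count("gated_attention"))  # 24 layers, 6 full-attention
```

Only 6 of the 24 layers use full attention; the rest use the constant-memory DeltaNet recurrence, which is what keeps long-context memory costs down.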

Benchmark Performance

Language (Non-Thinking Mode)

| Benchmark | Qwen3.5-0.8B |
|---|---|
| MMLU-Pro | 29.7 |
| MMLU-Redux | 48.5 |
| C-Eval | 46.4 |
| SuperGPQA | 16.9 |
| IFEval | 52.1 |
| MMMLU | 34.1 |

Language benchmarks are honest about the 0.8B's limits. This isn't a model for graduate-level reasoning or complex instruction following. MMLU-Pro at 29.7 and SuperGPQA at 16.9 show the constraints of sub-billion parameter language modeling. IFEval at 52.1 is functional but not reliable for production instruction following.

Vision-Language

| Benchmark | Qwen3.5-0.8B | Qwen3.5-2B | Qwen3.5-9B |
|---|---|---|---|
| MMMU | 49.0 | 64.2 | 78.4 |
| MathVista (mini) | 62.2 | 76.7 | 85.7 |
| RealWorldQA | 63.4 | 74.5 | 80.3 |
| MMStar | 58.3 | - | 79.7 |
| MMBench (EN-DEV) | 69.9 | 83.3 | 90.1 |
| OCRBench | 74.5 | 84.5 | - |
| VideoMME (w/sub) | 63.8 | 75.6 | 84.5 |
| RefCOCO (avg) | 79.3 | 84.8 | - |

Vision is where the 0.8B punches above its weight. OCRBench at 74.5 means useful document text extraction from a phone-sized model. MathVista at 62.2 and VideoMME at 63.8 demonstrate genuine multimodal reasoning. RefCOCO at 79.3 (visual grounding - pointing to objects in images) is surprisingly strong for a model this small.

Key Capabilities

Multimodal from under a billion parameters - This is the smallest model in the Qwen 3.5 family and one of the smallest natively multimodal models publicly available. It processes images, video, and text through the same architecture and weights. The DeepStack Vision Transformer with Conv3d handles temporal video understanding even at this scale.

Edge and mobile deployment - At ~1.6GB VRAM (BF16) or ~0.5GB (4-bit quantization), the 0.8B runs on phone-class hardware. Seven quantized variants are available on HuggingFace, including GGUF formats compatible with llama.cpp for CPU-only deployment. This opens multimodal AI to IoT devices, embedded systems, and offline mobile applications.

262K context window - The Gated DeltaNet architecture provides constant-memory attention, enabling 262K token context even at 0.8B parameters. This means the model can process long documents, multi-page PDFs, or extended conversation histories without running out of memory on edge devices.
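A back-of-envelope estimate shows why the hybrid layout matters at 262K tokens. The sketch below counts BF16 KV-cache memory for only the full-attention layers (using the published 2 KV heads x 256 head dim) and ignores the DeltaNet layers' fixed-size recurrent state - a simplification, not a measurement:

```python
# Rough BF16 KV-cache size for the full-attention layers only.
# DeltaNet layers keep a constant-size recurrent state instead of a
# growing cache, so they are excluded. Back-of-envelope, not measured.
def attn_kv_cache_gb(seq_len, attn_layers, kv_heads=2, head_dim=256, bytes_per=2):
    # factor of 2 covers both K and V tensors per layer
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

hybrid = attn_kv_cache_gb(262_144, attn_layers=6)   # the 3:1 hybrid's 6 layers
full = attn_kv_cache_gb(262_144, attn_layers=24)    # if every layer cached KV
print(round(hybrid, 1), round(full, 1))  # ~3.2 GB vs ~12.9 GB
```

If all 24 layers used standard attention, the 262K cache alone would dwarf the model weights; caching KV for only 6 layers cuts that cost by 4x.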

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-0.8B-Base) and seven quantized variants are available.

| Deployment Option | VRAM Required | Notes |
|---|---|---|
| BF16 (full precision) | ~1.6 GB | Any modern GPU, M1 Mac |
| 8-bit quantization | ~0.8 GB | Mobile GPUs, integrated graphics |
| 4-bit quantization | ~0.5 GB | Phones, Raspberry Pi, CPU-only |
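These figures follow from simple weight-storage arithmetic (parameters x bits per weight); quantized runtimes add overhead for activations, embeddings, and buffers, which is why the 4-bit row lands nearer 0.5 GB than the raw 0.4 GB. A quick sketch:

```python
# Weight-only memory estimate: params x bits-per-weight / 8 bytes.
# Real deployments add overhead for activations, KV/recurrent state,
# and runtime buffers, so treat these as lower bounds.
def weight_memory_gb(params_billion=0.8, bits_per_weight=16):
    return params_billion * bits_per_weight / 8

print(weight_memory_gb(bits_per_weight=16))  # 1.6 (BF16)
print(weight_memory_gb(bits_per_weight=8))   # 0.8 (8-bit)
print(weight_memory_gb(bits_per_weight=4))   # 0.4 (4-bit, before overhead)
```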

Strengths

  • Smallest natively multimodal model in Qwen 3.5 - text, images, video from 0.8B params
  • OCRBench 74.5 and RefCOCO 79.3 enable practical vision tasks on edge devices
  • 262K context from constant-memory DeltaNet attention
  • ~0.5GB at 4-bit quantization - runs on phones and embedded hardware
  • 201 languages with 248K vocabulary
  • 7 quantized variants including GGUF for llama.cpp
  • Apache 2.0 with base model for fine-tuning

Weaknesses

  • Language benchmarks are limited - MMLU-Pro 29.7, IFEval 52.1
  • Non-thinking mode only - no chain-of-thought reasoning available
  • Not suitable for complex reasoning, coding, or multi-step instruction following
  • Knowledge capacity is constrained by the small parameter count
  • Self-reported benchmarks - independent validation pending
  • Vision quality drops noticeably compared with the 2B (OCRBench 74.5 vs 84.5)

About the author

James is an AI benchmarks and tools analyst - a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.