Qwen3.5-Omni - Native Audio, Vision and Voice Model

Alibaba's Qwen3.5-Omni takes text, images, audio, and video as input and streams both text and speech output in a single end-to-end model with a 256K context window.

Qwen3.5-Omni is Alibaba's natively multimodal model. It takes text, images, audio, and video as input and produces text plus streaming speech in a single forward pass. Shipped on March 30, 2026 in three variants - Plus, Flash, and Light - it sits at the crossroads of frontier voice AI and open-weight development.

TL;DR

  • One model handles text, image, audio, and video input while producing text and speech output - no bolt-on TTS
  • Plus hits SOTA on 215 subtasks and edges out Gemini 3.1 Pro on MMAU and LibriSpeech
  • 256K context covers 10+ hours of audio or 400 seconds of 720p video in one call
  • Pricing starts at $0.10/M input tokens for Flash; Light weights on Hugging Face for self-hosting

The previous generation, Qwen3-Omni, supported speech recognition in 19 languages and speech generation in 10. Qwen3.5 jumps to 113 and 36 respectively, which changes the calculus for voice agents outside the English-Chinese axis. The architecture evolves the Thinker-Talker design onto the Hybrid-Attention MoE backbone shared with the rest of the Qwen 3.5 series.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 Omni |
| Architecture | Thinker-Talker with Hybrid-Attention MoE |
| Total Parameters (Plus) | ~30B total, ~3B active per token |
| Variants | Plus, Flash, Light |
| Context Window | 256K tokens (native) |
| Max Audio Input | 10+ hours continuous |
| Max Video Input | 400+ seconds of 720p at 1 FPS |
| Input Modalities | Text, Image, Audio, Video |
| Output Modalities | Text, Speech (streaming) |
| Speech Recognition | 113 languages and dialects |
| Speech Generation | 36 languages |
| Input Price (Plus) | $0.40/M tokens |
| Output Price (Plus) | $4.80/M tokens |
| Input Price (Flash) | $0.10/M tokens |
| Output Price (Flash) | $0.80/M tokens |
| Release Date | March 30, 2026 |
| License | Apache 2.0 for open weights (Light); Plus and Flash via API |

The 256K context maps to roughly 10 hours of audio or 400 seconds of 720p video at 1 FPS. For meeting transcription or long-form podcast editing, that's enough to process entire sessions in one call.
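
To sanity-check that mapping: at the roughly 427 audio tokens per minute cited in the pricing section below, the arithmetic works out almost exactly. A quick sketch, using only figures from this article:

```python
# Back-of-envelope context budgeting, using this article's figures.
# Actual audio tokenization rates may vary with codec and sample rate.
AUDIO_TOKENS_PER_MIN = 427      # approximate audio tokenization rate
CONTEXT_WINDOW = 256_000        # native context window in tokens

max_minutes = CONTEXT_WINDOW / AUDIO_TOKENS_PER_MIN
print(f"Max continuous audio per call: ~{max_minutes / 60:.1f} hours")  # ~10.0 hours
```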

[Architecture diagram] Qwen3.5-Omni's Thinker-Talker architecture on the Hybrid-Attention MoE backbone. The Thinker ingests text, audio, images, and video; the Talker produces streaming speech tokens concurrently with generation. Source: apidog.com
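
The division of labor is easier to see in code. Below is a deliberately toy sketch of the pattern - every name is hypothetical and the "model" is a stub, not Qwen's implementation - but it shows the property that matters: speech tokens are emitted alongside text tokens rather than after generation finishes.

```python
# Toy illustration of the Thinker-Talker pattern. All names are hypothetical
# stand-ins; the real model fuses modalities inside a Hybrid-Attention MoE.
from typing import Iterator

def thinker_step(context: list[str]) -> tuple[str, int]:
    """Stub: produce the next text token plus a hidden-state handle."""
    token = f"tok{len(context)}"
    return token, hash(token) % 1024

def talker_step(hidden: int) -> list[int]:
    """Stub: map a Thinker hidden state to a few speech-codec tokens."""
    return [hidden, hidden + 1]

def generate(prompt: list[str], max_steps: int = 3) -> Iterator[tuple[str, list[int]]]:
    context = list(prompt)
    for _ in range(max_steps):
        text_token, hidden = thinker_step(context)
        speech = talker_step(hidden)   # runs concurrently in the real model
        context.append(text_token)
        yield text_token, speech       # both streams arrive incrementally

for text, speech in generate(["hello"]):
    print(text, speech)
```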

Benchmark Performance

Alibaba reports 215 SOTA results for Plus across audio, audio-visual, visual, text, and speech-generation benchmarks. The directly comparable numbers against Gemini 3.1 Pro are where the claim carries the most weight.

| Benchmark | Qwen3.5-Omni-Plus | Gemini 3.1 Pro | GPT-Audio | ElevenLabs |
| --- | --- | --- | --- | --- |
| MMAU (audio understanding) | 82.2 | 81.1 | - | - |
| MMSU (audio understanding) | 82.8 | 81.3 | - | - |
| RUL-MuchoMusic | 72.4 | 59.6 | - | - |
| VoiceBench (dialogue) | 93.1 | 88.9 | - | - |
| LibriSpeech clean WER (lower is better) | 1.11 | 3.36 | - | - |
| LibriSpeech other WER (lower is better) | 2.23 | 4.41 | - | - |
| CV15 (en) WER (lower is better) | 4.83 | 8.73 | - | - |
| MMLU-Redux (text) | 94.2 | - | - | - |
| MMMU-Pro (visual reasoning) | 73.9 | - | - | - |
| Seed-zh voice stability (lower is better) | 1.07 | 2.42 (2.5 Pro) | 1.11 | 13.08 |
| Seed-hard WER (lower is better) | 6.24 | - | 8.19 | 27.70 |

Word error rate on LibriSpeech clean drops from 3.36 (Gemini 3.1 Pro) to 1.11 - roughly a two-thirds reduction on a benchmark where gains have been fractional for years. Music understanding on RUL-MuchoMusic swings from 59.6 to 72.4, pointing to substantially more musical audio in training than competitors had.

Text performance is the quiet story. MMLU-Redux 94.2 for the Omni variant sits within a point of the non-multimodal Qwen3.5-Plus at 94.3. Adding native audio and video didn't cost the model on text reasoning, which has historically been a tradeoff in unified architectures.

The Artificial Analysis Intelligence Index places Plus at 39 (rank 8 of 67) and Flash at 26 (rank 13 of 78). Both are non-reasoning models.

Key Capabilities

The headline capability is Audio-Visual Vibe Coding: point a camera at a UI or share a screen, describe what you want built aloud, and the model produces functional code from the combined audio-visual stream with no text prompt. The Decoder frames the capability as emergent rather than an explicit training target, which makes it more interesting than the marketing positioning suggests.

Semantic interruption matters for production voice agents. Earlier real-time models struggled to separate a user wanting to cut in from background noise. Qwen3.5-Omni's turn-taking intent recognition distinguishes backchannels ("uh-huh", "right") from actual interruptions. Whether it holds up in a noisy open-plan office is something builders will need to test, but the architecture handles it at the model level rather than via an external VAD pipeline.
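
For builders who want to test that themselves, a minimal connection sketch against the Qwen-Omni-Realtime WebSocket might look like the following. The endpoint path and event names here are assumptions modeled on OpenAI-style realtime protocols, not confirmed API surface - verify against the DashScope docs linked below.

```python
# Minimal realtime connection sketch. The URL, query parameter, and event
# names are ASSUMPTIONS modeled on OpenAI-style realtime APIs; confirm them
# against the DashScope Qwen-Omni-Realtime documentation.
import asyncio, base64, json, os
import websockets  # pip install websockets

URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-omni-flash"

async def main():
    headers = {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"}
    # websockets >= 14 takes additional_headers; older versions use extra_headers
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Ask the server to handle turn-taking (semantic interruption) itself
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"turn_detection": {"type": "server_vad"}},
        }))
        with open("user_turn.pcm", "rb") as f:  # raw 16-bit PCM audio chunk
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(f.read()).decode(),
            }))
        async for message in ws:
            print(json.loads(message).get("type"))  # watch for interruption events

asyncio.run(main())
```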

Voice cloning works from 10-30 second samples and exposes controls for speed, volume, and emotion. On Seed-zh stability (lower better), Plus scores 1.07 against ElevenLabs 13.08, GPT-Audio 1.11, and Minimax 1.19. On Seed-hard, cloned-voice WER is 6.24 versus ElevenLabs 27.70. That gap matters for customer-facing products where voice drift or word errors kill the experience.
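
The cloning API itself isn't documented here, so the following is a purely hypothetical sketch of the control surface the article describes - none of these names are confirmed DashScope parameters. It only pins down the knobs (reference sample, speed, volume, emotion) in concrete form.

```python
# Purely hypothetical request shape for voice cloning; nothing here is a
# confirmed DashScope API. It illustrates the controls the article lists.
from dataclasses import dataclass

@dataclass
class CloneRequest:
    reference_audio: bytes    # 10-30 second sample of the target voice
    text: str                 # text to synthesize in the cloned voice
    speed: float = 1.0        # playback-rate control
    volume: float = 1.0       # loudness control
    emotion: str = "neutral"  # emotional style tag

req = CloneRequest(b"<10-30s of PCM audio>", "Hello from a cloned voice.")
print(req.speed, req.emotion)
```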

Pricing and Availability

DashScope pricing differs between international (Singapore) and mainland China regions. Headline rates via Artificial Analysis:

| Variant | Input | Output | Blended (3:1) | Context |
| --- | --- | --- | --- | --- |
| Qwen3.5-Omni-Plus | $0.40/M | $4.80/M | $1.50/M | 256K |
| Qwen3.5-Omni-Flash | $0.10/M | $0.80/M | - | 256K |
| Qwen3.5-Omni-Light | Free (self-host) | Free (self-host) | - | 256K |
| Gemini 3.1 Pro | $2.00/M | $12.00/M | - | 1M |
| GPT-5.2 | $10.00/M | $30.00/M | - | 400K |

At $0.10/M input, Flash is one of the cheapest frontier-tier multimodal APIs - matching text-only Qwen3.5-Flash pricing but with audio, video, and speech output folded in. Plus comes in at a fifth of Gemini 3.1 Pro's input rate and 40% of its output rate.

Access paths:

  1. DashScope API - Plus and Flash via qwen3.5-omni-plus and qwen3.5-omni-flash model IDs, OpenAI-compatible interface (see the sketch after this list)
  2. Qwen-Omni-Realtime WebSocket - live audio and video streams with full-duplex conversation and semantic interruption (docs)
  3. Hugging Face - Light variant weights plus demo spaces (online, offline)
  4. Qwen Chat - consumer-facing test interface at chat.qwen.ai
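
Here's the sketch referenced in option 1: a multimodal call through the OpenAI-compatible path, roughly in the shape of the vibe-coding workflow above. The endpoint, message format, and extra parameters are assumptions based on DashScope's compatible mode - omni models have typically required streaming there - so treat the details as unverified.

```python
# Sketch of a multimodal call via DashScope's OpenAI-compatible endpoint.
# Endpoint URL, message shape, and parameters are assumptions; confirm
# against the current DashScope documentation.
import base64, os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64('ui_screenshot.png')}"}},
            {"type": "input_audio",
             "input_audio": {"data": b64("spoken_request.wav"), "format": "wav"}},
        ],
    }],
    modalities=["text"],  # add "audio" to also get speech output
    stream=True,          # omni models typically require streaming
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```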

New DashScope accounts in Singapore get a free quota of 1M input and 1M output tokens for 90 days. Audio tokenizes at roughly 427 tokens per minute, so a 10-minute voice call burns around 4,300 input tokens before any response.
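
Those two numbers make per-call costs easy to estimate. A quick check, using this article's rates:

```python
# Rough input-cost estimate for a voice call, using this article's figures.
AUDIO_TOKENS_PER_MIN = 427
INPUT_PRICE_PER_M = {"flash": 0.10, "plus": 0.40}  # USD per million tokens

minutes = 10
tokens = minutes * AUDIO_TOKENS_PER_MIN  # ~4,270 input tokens
for variant, price in INPUT_PRICE_PER_M.items():
    print(f"{variant}: {tokens} tokens -> ${tokens / 1e6 * price:.6f}")
# flash: 4270 tokens -> $0.000427
# plus: 4270 tokens -> $0.001708
```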

Strengths

  • SOTA on directly comparable audio benchmarks - LibriSpeech WER and MMAU ahead of Gemini 3.1 Pro
  • Native multimodal single forward pass with streaming speech output concurrent with generation
  • 256K context for 10+ hour audio or 400-second 720p video in one call
  • 113-language speech recognition and 36-language generation
  • Plus pricing undercuts Gemini 3.1 Pro 5x on input and 2.5x on output
  • Voice cloning stability far ahead of ElevenLabs on Seed-zh and Seed-hard
  • Light variant ships with open weights, preserving Qwen's Apache 2.0 tradition

Weaknesses

  • The 215 SOTA claim is thin - many entries are per-language ASR and S2TT subtasks with little competition
  • Self-hosting a Plus-scale model (~30B parameters) would demand well over 40GB of VRAM at FP16, pricing individual developers out even if the weights were released
  • Parameter counts for Flash and Light aren't formally disclosed
  • DashScope tiered pricing can get complex on long prompts - chunking may be cheaper
  • Plus and Flash are API-only; only Light has published weights
  • Non-reasoning model, trails reasoning frontier on chain-of-thought benchmarks
  • Verbose outputs - Artificial Analysis measured 16M tokens against a 7.2M average

Sources

Last verified April 21, 2026

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.