Qwen3.5-Omni - Native Audio, Vision and Voice Model

Alibaba's Qwen3.5-Omni takes text, images, audio, and video as input and streams both text and speech output in a single end-to-end model with a 256K context window.

Qwen3.5-Omni is Alibaba's natively multimodal model. It takes text, images, audio, and video as input and produces text plus streaming speech in a single forward pass. Shipped on March 30, 2026 in three variants - Plus, Flash, and Light - it sits at the crossroads of frontier voice AI and open-weight development.

TL;DR

  • One model handles text, image, audio, and video input while producing text and speech output - no bolt-on TTS
  • Plus hits SOTA on 215 subtasks and edges out Gemini 3.1 Pro on MMAU and LibriSpeech
  • 256K context covers 10+ hours of audio or 400 seconds of 720p video in one call
  • Pricing starts at $0.10/M input tokens for Flash; Light weights on Hugging Face for self-hosting

The previous generation, Qwen3-Omni, supported speech recognition in 19 languages and speech generation in 10. Qwen3.5 jumps to 113 and 36 respectively, which changes the calculus for voice agents outside the English-Chinese axis. The architecture evolves the Thinker-Talker design onto the Hybrid-Attention MoE backbone shared with the rest of the Qwen 3.5 series.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 Omni |
| Architecture | Thinker-Talker with Hybrid-Attention MoE |
| Total Parameters (Plus) | ~30B total, ~3B active per token |
| Variants | Plus, Flash, Light |
| Context Window | 256K tokens (native) |
| Max Audio Input | 10+ hours continuous |
| Max Video Input | 400+ seconds of 720p at 1 FPS |
| Input Modalities | Text, Image, Audio, Video |
| Output Modalities | Text, Speech (streaming) |
| Speech Recognition | 113 languages and dialects |
| Speech Generation | 36 languages |
| Input Price (Plus) | $0.40/M tokens |
| Output Price (Plus) | $4.80/M tokens |
| Input Price (Flash) | $0.10/M tokens |
| Output Price (Flash) | $0.80/M tokens |
| Release Date | March 30, 2026 |
| License | Apache 2.0 for open weights (Light); Plus and Flash via API |

The 256K context maps to roughly 10 hours of audio or 400 seconds of 720p video at 1 FPS. For meeting transcription or long-form podcast editing, that's enough to process entire sessions in one call.
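
To sanity-check that mapping: at the roughly 427 audio tokens per minute cited in the pricing section below, the arithmetic works out almost exactly. A quick sketch, using only figures from this article:

```python
# Back-of-envelope context budgeting, using this article's figures.
# Actual audio tokenization rates may vary with codec and sample rate.
AUDIO_TOKENS_PER_MIN = 427      # approximate audio tokenization rate
CONTEXT_WINDOW = 256_000        # native context window in tokens

max_minutes = CONTEXT_WINDOW / AUDIO_TOKENS_PER_MIN
print(f"Max continuous audio per call: ~{max_minutes / 60:.1f} hours")  # ~10.0 hours
```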

[Architecture diagram] Qwen3.5-Omni's Thinker-Talker architecture on the Hybrid-Attention MoE backbone. The Thinker ingests text, audio, images, and video; the Talker produces streaming speech tokens concurrently with generation. Source: apidog.com
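
The division of labor is easier to see in code. Below is a deliberately toy sketch of the pattern - every name is hypothetical and the "model" is a stub, not Qwen's implementation - but it shows the property that matters: speech tokens are emitted alongside text tokens rather than after generation finishes.

```python
# Toy illustration of the Thinker-Talker pattern. All names are hypothetical
# stand-ins; the real model fuses modalities inside a Hybrid-Attention MoE.
from typing import Iterator

def thinker_step(context: list[str]) -> tuple[str, int]:
    """Stub: produce the next text token plus a hidden-state handle."""
    token = f"tok{len(context)}"
    return token, hash(token) % 1024

def talker_step(hidden: int) -> list[int]:
    """Stub: map a Thinker hidden state to a few speech-codec tokens."""
    return [hidden, hidden + 1]

def generate(prompt: list[str], max_steps: int = 3) -> Iterator[tuple[str, list[int]]]:
    context = list(prompt)
    for _ in range(max_steps):
        text_token, hidden = thinker_step(context)
        speech = talker_step(hidden)   # runs concurrently in the real model
        context.append(text_token)
        yield text_token, speech       # both streams arrive incrementally

for text, speech in generate(["hello"]):
    print(text, speech)
```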

Benchmark Performance

Alibaba reports 215 SOTA results for Plus across audio, audio-visual, visual, text, and speech-generation benchmarks. The directly comparable numbers against Gemini 3.1 Pro are where the claim carries the most weight.

| Benchmark | Qwen3.5-Omni-Plus | Gemini 3.1 Pro | GPT-Audio | ElevenLabs |
| --- | --- | --- | --- | --- |
| MMAU (audio understanding) | 82.2 | 81.1 | - | - |
| MMSU (audio understanding) | 82.8 | 81.3 | - | - |
| RUL-MuchoMusic | 72.4 | 59.6 | - | - |
| VoiceBench (dialogue) | 93.1 | 88.9 | - | - |
| LibriSpeech clean WER (lower is better) | 1.11 | 3.36 | - | - |
| LibriSpeech other WER (lower is better) | 2.23 | 4.41 | - | - |
| CV15 (en) WER (lower is better) | 4.83 | 8.73 | - | - |
| MMLU-Redux (text) | 94.2 | - | - | - |
| MMMU-Pro (visual reasoning) | 73.9 | - | - | - |
| Seed-zh voice stability (lower is better) | 1.07 | 2.42 (2.5 Pro) | 1.11 | 13.08 |
| Seed-hard WER (lower is better) | 6.24 | - | 8.19 | 27.70 |

Word error rate on LibriSpeech clean drops from 3.36 (Gemini 3.1 Pro) to 1.11 - roughly a two-thirds reduction on a benchmark where gains have been fractional for years. Music understanding on RUL-MuchoMusic swings from 59.6 to 72.4, pointing to substantially more musical audio in training than competitors had.

Text performance is the quiet story. MMLU-Redux 94.2 for the Omni variant sits within a point of the non-multimodal Qwen3.5-Plus at 94.3. Adding native audio and video didn't cost the model on text reasoning, which has historically been a tradeoff in unified architectures.

The Artificial Analysis Intelligence Index places Plus at 39 (rank 8 of 67) and Flash at 26 (rank 13 of 78). Both are non-reasoning models.

Key Capabilities

The headline capability is Audio-Visual Vibe Coding: point a camera at a UI or share a screen, describe what you want built aloud, and the model produces functional code from the combined audio-visual stream with no text prompt. The Decoder frames the capability as emergent rather than an explicit training target, which makes it more interesting than the marketing positioning suggests.

Semantic interruption matters for production voice agents. Earlier real-time models struggled to separate a user wanting to cut in from background noise. Qwen3.5-Omni's turn-taking intent recognition distinguishes backchannels ("uh-huh", "right") from actual interruptions. Whether it holds up in a noisy open-plan office is something builders will need to test, but the architecture handles it at the model level rather than via an external VAD pipeline.
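
For builders who want to test that themselves, a minimal connection sketch against the Qwen-Omni-Realtime WebSocket might look like the following. The endpoint path and event names here are assumptions modeled on OpenAI-style realtime protocols, not confirmed API surface - verify against the DashScope docs linked below.

```python
# Minimal realtime connection sketch. The URL, query parameter, and event
# names are ASSUMPTIONS modeled on OpenAI-style realtime APIs; confirm them
# against the DashScope Qwen-Omni-Realtime documentation.
import asyncio, base64, json, os
import websockets  # pip install websockets

URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-omni-flash"

async def main():
    headers = {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"}
    # websockets >= 14 takes additional_headers; older versions use extra_headers
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Ask the server to handle turn-taking (semantic interruption) itself
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"turn_detection": {"type": "server_vad"}},
        }))
        with open("user_turn.pcm", "rb") as f:  # raw 16-bit PCM audio chunk
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(f.read()).decode(),
            }))
        async for message in ws:
            print(json.loads(message).get("type"))  # watch for interruption events

asyncio.run(main())
```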

Voice cloning works from 10-30 second samples and exposes controls for speed, volume, and emotion. On Seed-zh stability (lower better), Plus scores 1.07 against ElevenLabs 13.08, GPT-Audio 1.11, and Minimax 1.19. On Seed-hard, cloned-voice WER is 6.24 versus ElevenLabs 27.70. That gap matters for customer-facing products where voice drift or word errors kill the experience.
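
The cloning API itself isn't documented here, so the following is a purely hypothetical sketch of the control surface the article describes - none of these names are confirmed DashScope parameters. It only pins down the knobs (reference sample, speed, volume, emotion) in concrete form.

```python
# Purely hypothetical request shape for voice cloning; nothing here is a
# confirmed DashScope API. It illustrates the controls the article lists.
from dataclasses import dataclass

@dataclass
class CloneRequest:
    reference_audio: bytes    # 10-30 second sample of the target voice
    text: str                 # text to synthesize in the cloned voice
    speed: float = 1.0        # playback-rate control
    volume: float = 1.0       # loudness control
    emotion: str = "neutral"  # emotional style tag

req = CloneRequest(b"<10-30s of PCM audio>", "Hello from a cloned voice.")
print(req.speed, req.emotion)
```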

Pricing and Availability

DashScope pricing differs between international (Singapore) and mainland China regions. Headline rates via Artificial Analysis:

| Variant | Input | Output | Blended (3:1) | Context |
| --- | --- | --- | --- | --- |
| Qwen3.5-Omni-Plus | $0.40/M | $4.80/M | $1.50/M | 256K |
| Qwen3.5-Omni-Flash | $0.10/M | $0.80/M | - | 256K |
| Qwen3.5-Omni-Light | Free (self-host) | Free (self-host) | - | 256K |
| Gemini 3.1 Pro | $2.00/M | $12.00/M | - | 1M |
| GPT-5.2 | $10.00/M | $30.00/M | - | 400K |

At $0.10/M input, Flash is one of the cheapest frontier-tier multimodal APIs - matching text-only Qwen3.5-Flash pricing but with audio, video, and speech output folded in. Plus comes in at a fifth of Gemini 3.1 Pro's input rate and 40% of its output rate.

Access paths:

  1. DashScope API - Plus and Flash via qwen3.5-omni-plus and qwen3.5-omni-flash model IDs, OpenAI-compatible interface (see the sketch after this list)
  2. Qwen-Omni-Realtime WebSocket - live audio and video streams with full-duplex conversation and semantic interruption (docs)
  3. Hugging Face - Light variant weights plus demo spaces (online, offline)
  4. Qwen Chat - consumer-facing test interface at chat.qwen.ai
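
Here's the sketch referenced in option 1: a multimodal call through the OpenAI-compatible path, roughly in the shape of the vibe-coding workflow above. The endpoint, message format, and extra parameters are assumptions based on DashScope's compatible mode - omni models have typically required streaming there - so treat the details as unverified.

```python
# Sketch of a multimodal call via DashScope's OpenAI-compatible endpoint.
# Endpoint URL, message shape, and parameters are assumptions; confirm
# against the current DashScope documentation.
import base64, os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64('ui_screenshot.png')}"}},
            {"type": "input_audio",
             "input_audio": {"data": b64("spoken_request.wav"), "format": "wav"}},
        ],
    }],
    modalities=["text"],  # add "audio" to also get speech output
    stream=True,          # omni models typically require streaming
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```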

New DashScope accounts in Singapore get a free quota of 1M input and 1M output tokens for 90 days. Audio tokenizes at roughly 427 tokens per minute, so a 10-minute voice call burns around 4,300 input tokens before any response.
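
Those two numbers make per-call costs easy to estimate. A quick check, using this article's rates:

```python
# Rough input-cost estimate for a voice call, using this article's figures.
AUDIO_TOKENS_PER_MIN = 427
INPUT_PRICE_PER_M = {"flash": 0.10, "plus": 0.40}  # USD per million tokens

minutes = 10
tokens = minutes * AUDIO_TOKENS_PER_MIN  # ~4,270 input tokens
for variant, price in INPUT_PRICE_PER_M.items():
    print(f"{variant}: {tokens} tokens -> ${tokens / 1e6 * price:.6f}")
# flash: 4270 tokens -> $0.000427
# plus: 4270 tokens -> $0.001708
```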

Strengths

  • SOTA on directly comparable audio benchmarks - LibriSpeech WER and MMAU ahead of Gemini 3.1 Pro
  • Native multimodal single forward pass with streaming speech output concurrent with generation
  • 256K context for 10+ hour audio or 400-second 720p video in one call
  • 113-language speech recognition and 36-language generation
  • Plus pricing undercuts Gemini 3.1 Pro 5x on input and 2.5x on output
  • Voice cloning stability far ahead of ElevenLabs on Seed-zh and Seed-hard
  • Light variant ships with open weights, preserving Qwen's Apache 2.0 tradition

Weaknesses

  • The 215 SOTA claim is thin - many entries are per-language ASR and S2TT subtasks with little competition
  • Self-hosting a Plus-scale model (~30B parameters) would demand well over 40GB of VRAM at FP16, pricing individual developers out even if the weights were released
  • Parameter counts for Flash and Light aren't formally disclosed
  • DashScope tiered pricing can get complex on long prompts - chunking may be cheaper
  • Plus and Flash are API-only; only Light has published weights
  • Non-reasoning model, trails reasoning frontier on chain-of-thought benchmarks
  • Verbose outputs - Artificial Analysis measured 16M tokens against a 7.2M average

Sources

Last verified April 21, 2026

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.