Qwen-RobotManip

Alibaba's generalist VLA model for robotic manipulation, built on Qwen3.5-4B with a DiT action decoder, trained on 38,100+ hours of open-source data, and ranked first on the RoboChallenge generalist track.

Qwen-RobotManip is a vision-language-action (VLA) model from Alibaba's Qwen team, designed for dexterous robotic arm control. Released June 16, 2026 as part of the Qwen-Robot Suite alongside Qwen-RobotNav and Qwen-RobotWorld, it combines a Qwen3.5-4B vision-language backbone with a 1.15-billion-parameter Diffusion Transformer (DiT) action decoder trained through flow matching.

TL;DR

  • Generalist VLA for dexterous arm control: ranked 1st on the RoboChallenge generalist track with a 45% task success rate and a process score of 59.83
  • Built on Qwen3.5-4B + 1.15B DiT action decoder; trained on 38,100+ hours of open-source manipulation data
  • Beats specialist per-platform models by enabling cross-embodiment transfer with a single policy checkpoint

The model's core design challenge was making large-scale training on heterogeneous robot data coherent rather than conflicting. Different robotic platforms encode state and action in incompatible formats. Qwen-RobotManip addresses this with a canonical alignment scheme: all states and actions are mapped to a unified 80-dimensional vector covering single-arm, dual-arm, dexterous hand, and mobile base configurations. Per-dimension binary masks ensure gradients flow only through slots relevant to each embodiment, so different platforms share representation space without interfering with each other.
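The masking idea can be illustrated with a minimal sketch. The 80-dimensional width comes from the description above; the slot layout, the embodiment names, and the choice of a mean-squared-error loss are assumptions made purely for illustration, since the actual slot assignment and training loss are not described here.

```python
from typing import Dict, List, Tuple

CANONICAL_DIM = 80  # unified state/action width stated in the text

# Hypothetical slot layout: which dimensions belong to which body part is assumed.
SLOTS: Dict[str, Tuple[int, int]] = {
    "left_arm": (0, 7),
    "right_arm": (7, 14),
    "left_hand": (14, 36),
    "right_hand": (36, 58),
    "mobile_base": (58, 61),
}

# Hypothetical embodiments, each listing the slots it actually drives.
EMBODIMENTS: Dict[str, List[str]] = {
    "single_arm": ["left_arm"],
    "dual_arm": ["left_arm", "right_arm"],
    "mobile_dual_arm": ["left_arm", "right_arm", "mobile_base"],
}


def embodiment_mask(name: str) -> List[float]:
    """Binary mask: 1.0 for dimensions this embodiment uses, 0.0 elsewhere."""
    mask = [0.0] * CANONICAL_DIM
    for slot in EMBODIMENTS[name]:
        start, stop = SLOTS[slot]
        for i in range(start, stop):
            mask[i] = 1.0
    return mask


def to_canonical(name: str, native: List[float]) -> List[float]:
    """Scatter a platform-native vector into its canonical slots, zero-padding the rest."""
    expected = sum(SLOTS[s][1] - SLOTS[s][0] for s in EMBODIMENTS[name])
    if len(native) != expected:
        raise ValueError(f"{name} expects {expected} values, got {len(native)}")
    canonical = [0.0] * CANONICAL_DIM
    cursor = 0
    for slot in EMBODIMENTS[name]:
        start, stop = SLOTS[slot]
        width = stop - start
        canonical[start:stop] = native[cursor:cursor + width]
        cursor += width
    return canonical


def masked_mse(pred: List[float], target: List[float], mask: List[float]) -> float:
    """Mean squared error over active dimensions only; masked slots contribute nothing."""
    active = sum(mask)
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / active
```

Because masked dimensions are multiplied by zero before summing, whatever the decoder predicts in slots a platform does not have cannot affect the loss, which is one way to keep a single-arm dataset from pushing on the dual-arm or mobile-base slots.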

End-effector actions are expressed as deltas in camera coordinate space rather than each robot's native base frame. This single choice makes visually similar actions - grasping, picking, placing - numerically proximate across mechanically different arms, which is what makes the pretraining corpus cohere at scale.
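A small worked example shows why the camera frame helps. The sketch below assumes each robot's base-to-camera rotation is known from calibration and handles only the translational part of a delta; the function names and the two example robots are hypothetical, and rotational deltas would need additional handling.

```python
import math
from typing import List

Vec3 = List[float]
Mat3 = List[List[float]]


def rot_z(theta: float) -> Mat3:
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotate(rotation: Mat3, vector: Vec3) -> Vec3:
    """Matrix-vector product for a 3x3 rotation."""
    return [sum(rotation[i][j] * vector[j] for j in range(3)) for i in range(3)]


def base_delta_to_camera(r_cam_from_base: Mat3, delta_base: Vec3) -> Vec3:
    """Re-express a translational end-effector delta in the camera frame.

    A delta is a displacement, so only the rotation between frames applies;
    the translation between the frame origins cancels out.
    """
    return rotate(r_cam_from_base, delta_base)


# Robot A: base frame aligned with the camera (assumed for illustration).
r_cam_from_a = rot_z(0.0)
# Robot B: base frame rotated 90 degrees about z relative to the camera (assumed).
r_cam_from_b = rot_z(math.pi / 2)

# The same physical 5 cm motion, written in each robot's native base frame.
delta_a = [0.05, 0.0, 0.0]
delta_b = [0.0, -0.05, 0.0]

cam_a = base_delta_to_camera(r_cam_from_a, delta_a)
cam_b = base_delta_to_camera(r_cam_from_b, delta_b)
```

In their native frames the two commands look unrelated, but after conversion `cam_a` and `cam_b` agree up to floating-point error, which is the sense in which visually similar motions become numerically close.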

Key Specifications

Specification | Details
Provider | Alibaba (Qwen Team)
Model Family | Qwen-Robot
Architecture | Qwen3.5-4B VL backbone + 1.15B DiT flow-matching action decoder
Parameters | ~5.15B total (not officially disclosed as a single number)
Context Window | Not disclosed
Input Price | Not disclosed - enterprise pilot via Alibaba Cloud
Output Price | Not disclosed - enterprise pilot via Alibaba Cloud
Release Date | June 16, 2026
License | Proprietary (weights not publicly released as of launch)
Training Data | 38,100+ hours open-source robotic and human demonstration data

Benchmark Performance

Qwen-RobotManip topped the RoboChallenge Table30 v1 Generalist Track, which spans 30 tasks across four robot platforms. The Qwen-VLA unified model (which Qwen-RobotManip powers for manipulation) also shows strong numbers across standard simulation suites.

Benchmark | Qwen-RobotManip / Qwen-VLA | Notes
RoboChallenge Generalist (Task Success) | 45% | 1st place
RoboChallenge Generalist (Process Score) | 59.83 | ~20 points above runner-up
LIBERO | 97.9% | Simulation suite
RoboTwin-Easy | 86.1% | Simulation suite
RoboTwin-Hard | 87.2% | Simulation suite
Simpler-WidowX | 73.7% | Sim-to-real transfer
ALOHA dual-arm in-domain | 83.6% | Real-world bimanual
ALOHA dual-arm OOD | 76.9% | Real-world bimanual, out-of-distribution
DOMINO zero-shot dynamic | 26.6% | Dynamic manipulation, zero-shot

The LIBERO score of 97.9% nearly saturates that benchmark, so the more informative numbers are the real-world ALOHA figures and DOMINO. A 76.9% OOD success rate on ALOHA bimanual tasks is meaningful, since out-of-distribution generalization is where most robot policies fall apart. The DOMINO zero-shot result of 26.6% is lower but reflects a harder, deliberately unseen dynamic manipulation regime.

Cross-embodiment transfer shows the biggest payoff. A single policy jointly fine-tuned on 6,000 CobotMagic and 130 ARX demonstrations was tested on four novel ARX tasks for which it had seen no ARX demonstrations. The unified canonical framework hit 55.0% on those tasks, versus under 14% for the best ablation that dropped the canonical alignment step. That gap supports the design decision more directly than any single benchmark number.

[Image: Robotic arm performing precision manipulation tasks. Caption: Precision manipulation requires sub-centimeter control - the task class Qwen-RobotManip targets with its DiT flow-matching action decoder. Source: unsplash.com]

Key Capabilities

The DiT action decoder produces continuous action chunks through iterative denoising rather than tokenizing actions as discrete tokens. This matters for dexterous tasks. Discrete token-based VLAs (like the original RT-2 and early OpenVLA) produce one action step at a time, which creates jerky motion. Flow-matching creates smooth trajectories and handles multi-modal action distributions - situations where there are multiple valid ways to reach the same outcome.
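To make "iterative denoising" concrete, here is a minimal sketch of flow-matching sampling with fixed-step Euler integration. Everything in it is an assumption for illustration: the step count, the chunk shape, and especially the velocity function, which is a toy stand-in that already knows the target chunk, whereas the real decoder would be a learned DiT conditioned on images and language.

```python
import random
from typing import Callable, List

Chunk = List[List[float]]  # horizon x action_dim
VelocityFn = Callable[[Chunk, float], Chunk]


def sample_action_chunk(
    velocity_fn: VelocityFn,
    horizon: int,
    action_dim: int,
    num_steps: int = 10,
    seed: int = 0,
) -> Chunk:
    """Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 (actions)."""
    rng = random.Random(seed)
    # Start from noise covering the whole chunk, not a single timestep.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        # Euler update applied to every action in the chunk at once.
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


def make_toy_velocity(target: Chunk) -> VelocityFn:
    """Toy stand-in for the learned network: the velocity of a straight
    noise-to-target path, v = (target - x) / (1 - t)."""
    def velocity(x: Chunk, t: float) -> Chunk:
        return [
            [(ti - xi) / (1.0 - t) for xi, ti in zip(xr, tr)]
            for xr, tr in zip(x, target)
        ]
    return velocity


# Hypothetical target: an 8-step, 2-dimensional ramp standing in for a smooth reach.
target_chunk = [[0.1 * step, -0.05 * step] for step in range(8)]
chunk = sample_action_chunk(make_toy_velocity(target_chunk), horizon=8, action_dim=2)
```

The point of the sketch is structural: the whole chunk is refined jointly over a handful of integration steps, and different noise seeds can land on different valid trajectories when the learned velocity field is multi-modal.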

Embodiment-aware prompt conditioning lets a single set of weights control different robot platforms. The operator specifies the robot type, arm configuration, and control convention through a text description. No separate policy heads, no fine-tuning per platform. This is conceptually similar to how π0.7 handles generalization, though the architectures differ: Qwen-RobotManip pairs its backbone with a DiT decoder trained through flow matching, while Physical Intelligence's system is built on a separate architecture.
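As a rough illustration of what such a text description might carry, the sketch below assembles a conditioning prompt from the three fields named above. The prompt wording, the `Embodiment` class, and `build_prompt` are all hypothetical; the actual prompt format used by Qwen-RobotManip is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """The three fields the text says an operator specifies (names are assumed)."""
    robot_type: str
    arm_config: str
    control_convention: str


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix a task instruction with an embodiment description (hypothetical format)."""
    return (
        f"Robot: {embodiment.robot_type}. "
        f"Arms: {embodiment.arm_config}. "
        f"Control: {embodiment.control_convention}. "
        f"Task: {instruction}"
    )


# Two platforms, one task: only the text prefix changes, not the weights.
arx = Embodiment("ARX", "dual-arm", "end-effector delta, camera frame")
cobot = Embodiment("CobotMagic", "dual-arm", "end-effector delta, camera frame")

prompt_arx = build_prompt(arx, "place the cup on the shelf")
prompt_cobot = build_prompt(cobot, "place the cup on the shelf")
```

The design choice this illustrates is that platform identity becomes an input rather than a branch in the model, so supporting a new robot is, in principle, a matter of describing it.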

The training corpus covers four manipulation regimes: single-arm, dual-arm bimanual, dexterous hand manipulation, and mobile manipulation with base movement. The 38,100-hour dataset was assembled entirely from open sources: standard robotic datasets (DROID, RT-1, and similar), egocentric human demonstration videos from Ego4D and EPIC-KITCHENS, and synthetic trajectories generated across 15 robot morphologies. No proprietary data collection was needed.

The four-stage training recipe starts with large-scale action pretraining, adds multimodal continued pretraining, follows with supervised fine-tuning across manipulation task families, then closes with reinforcement learning. This mirrors what you'd expect from a model trying to bridge broad coverage (pretraining) with task-specific performance (fine-tuning and RL).

Pricing and Availability

Qwen-RobotManip was in pilot testing with selected Alibaba Cloud enterprise clients at launch on June 16, 2026. No API pricing, public weights, or open-source release had been announced as of that date.

The underlying language backbone - Qwen3.5-4B - is open-source and available on Hugging Face under Apache 2.0. The manipulation-specific DiT action decoder and the trained VLA weights aren't yet publicly released. Alibaba's typical playbook for Qwen models has been to release weights after an enterprise preview period, though there's no stated timeline for Qwen-RobotManip specifically.

For comparison, the closest open alternative is OpenVLA, released under the MIT license, which fine-tunes on consumer-grade GPUs and has an active open-source community. Our Robotics Embodied AI Leaderboard tracks the current competitive field including OpenVLA, π0, Octo, and others.

Strengths

  • Canonical alignment works at scale. The 80-dimensional unified state-action space with camera-frame end-effector deltas solves the heterogeneous data problem that prevents most labs from training on diverse robot datasets simultaneously.
  • Cross-embodiment transfer with a single checkpoint. 55% success on zero-shot cross-platform tasks is a real number from an ablation, not a marketing claim.
  • Strong real-world OOD results. 76.9% on ALOHA bimanual OOD tasks is one of the better published numbers in this space.
  • No proprietary data. The 38,100-hour corpus is reproducible from public sources, which matters for independent verification.
  • DiT flow-matching decoder. Smooth path generation is more suitable for dexterous tasks than discrete-token action prediction.

Weaknesses

  • Weights are not public at launch. Without open weights, independent researchers can't verify the benchmark numbers or build on the architecture.
  • Pricing and API access undisclosed. Enterprise pilot only - no way to evaluate cost structure.
  • DOMINO zero-shot at 26.6% shows limits. Dynamic manipulation tasks with unseen object interactions remain hard; the model doesn't solve the problem, it reduces it.
  • No disclosed context window. It's unclear how much observation history the model uses at inference, which matters for tasks requiring memory of prior steps.
  • Part of a suite, not standalone. Navigation and world modeling tasks require the separate Qwen-RobotNav and Qwen-RobotWorld models, so a deployment that needs all three capabilities has to run the full suite.

✓ Last verified June 16, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.