Qwen-RobotManip
Alibaba's generalist VLA model for robotic manipulation, built on Qwen3.5-4B with a DiT action decoder, trained on 38,100+ hours of open-source data, and ranked first on the RoboChallenge generalist track.

Qwen-RobotManip is a vision-language-action (VLA) model from Alibaba's Qwen team, designed to control dexterous robotic arm manipulation. Released June 16, 2026 as part of the Qwen-Robot Suite alongside Qwen-RobotNav and Qwen-RobotWorld, it combines a Qwen3.5-4B vision-language backbone with a 1.15-billion-parameter Diffusion Transformer (DiT) action decoder trained with flow matching.
TL;DR
- Generalist VLA for dexterous arm control: ranked 1st on the RoboChallenge generalist track with a 45% task success rate and a process score of 59.83
- Built on Qwen3.5-4B + 1.15B DiT action decoder; trained on 38,100+ hours of open-source manipulation data
- Beats specialist per-platform models by enabling cross-embodiment transfer with a single policy checkpoint
The model's core design challenge was making large-scale training on heterogeneous robot data coherent rather than conflicting. Different robotic platforms encode state and action in incompatible formats. Qwen-RobotManip addresses this with a canonical alignment scheme: all states and actions are mapped to a unified 80-dimensional vector covering single-arm, dual-arm, dexterous hand, and mobile base configurations. Per-dimension binary masks ensure gradients flow only through slots relevant to each embodiment, so different platforms share representation space without interfering with each other.
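The canonical alignment idea can be sketched in a few lines of plain Python. This is a minimal illustration, not Qwen's implementation: the slot layout in `SLOTS`, the helper names `to_canonical` and `masked_mse`, and the choice of a squared-error loss are all assumptions, since the source states only the 80-dimensional width and the use of per-dimension binary masks.

```python
CANONICAL_DIM = 80  # unified vector width stated in the article

# ASSUMPTION: hypothetical slot layout (name -> [start, stop)); the article
# gives only the total width, not how the 80 dimensions are allocated.
SLOTS = {
    "right_arm": (0, 10),
    "left_arm": (10, 20),
    "right_hand": (20, 45),
    "left_hand": (45, 70),
    "mobile_base": (70, 80),
}


def to_canonical(parts):
    """Pack per-part values into an 80-dim vector plus a binary mask."""
    vector = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    for name, values in parts.items():
        start, stop = SLOTS[name]
        if len(values) > stop - start:
            raise ValueError(f"{name}: {len(values)} values exceed slot size {stop - start}")
        for offset, value in enumerate(values):
            vector[start + offset] = float(value)
            mask[start + offset] = 1
    return vector, mask


def masked_mse(pred, target, mask):
    """Mean squared error over the active (mask == 1) dimensions only."""
    active = sum(mask)
    if active == 0:
        return 0.0
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / active


# A 7-DoF single arm fills only part of the (hypothetical) right-arm slot.
target, mask = to_canonical({"right_arm": [0.1] * 7})
pred = [0.0] * CANONICAL_DIM
pred[75] = 5.0  # large error in a mobile-base dimension this robot does not have
loss = masked_mse(pred, target, mask)  # the masked-out error does not contribute
```

Because masked-out dimensions are multiplied by zero, they add nothing to the loss, and in an autodiff framework they would receive no gradient, which is the property the paragraph above describes.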
End-effector actions are expressed as deltas in camera coordinate space rather than each robot's native base frame. This single choice makes visually similar actions - grasping, picking, placing - numerically proximate across mechanically different arms, which is what makes the pretraining corpus cohere at scale.
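A small worked example shows why the camera frame helps. It is a sketch under stated assumptions: the two camera-from-base rotations are hypothetical extrinsics invented for illustration, only translation deltas are handled, and the function names are not from the source.

```python
def rotate(rotation, vector):
    """Apply a 3x3 rotation matrix (list of rows) to a 3-vector."""
    return [sum(rotation[i][j] * vector[j] for j in range(3)) for i in range(3)]


def base_delta_to_camera(delta_base, cam_from_base):
    """Express a base-frame translation delta in the camera frame.

    A delta is a displacement, so only the rotation part of the
    camera-from-base extrinsic applies; the translation offset cancels.
    """
    return rotate(cam_from_base, delta_base)


# ASSUMPTION: hypothetical extrinsics for two arms observing the same scene.
CAM_FROM_BASE_A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # base aligned with camera
CAM_FROM_BASE_B = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # base yawed 90 degrees

# The same 2 cm motion as seen by the camera, logged in each native base frame.
delta_a = [0.02, 0.0, 0.0]
delta_b = [0.0, -0.02, 0.0]

cam_a = base_delta_to_camera(delta_a, CAM_FROM_BASE_A)
cam_b = base_delta_to_camera(delta_b, CAM_FROM_BASE_B)
```

In their native base frames the two deltas point along different axes, but after the change of frame `cam_a` and `cam_b` coincide, so a policy trained on both robots sees one action rather than two.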
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba (Qwen Team) |
| Model Family | Qwen-Robot |
| Architecture | Qwen3.5-4B VL backbone + 1.15B DiT flow-matching action decoder |
| Parameters | ~5.15B total (not officially disclosed as a single number) |
| Context Window | Not disclosed |
| Input Price | Not disclosed - enterprise pilot via Alibaba Cloud |
| Output Price | Not disclosed - enterprise pilot via Alibaba Cloud |
| Release Date | June 16, 2026 |
| License | Proprietary (weights not publicly released as of launch) |
| Training Data | 38,100+ hours open-source robotic and human demonstration data |
Benchmark Performance
Qwen-RobotManip topped the RoboChallenge Table30 v1 Generalist Track, which spans 30 tasks across four robot platforms. The Qwen-VLA unified model (which Qwen-RobotManip powers for manipulation) also shows strong numbers across standard simulation suites.
| Benchmark | Qwen-RobotManip / Qwen-VLA | Notes |
|---|---|---|
| RoboChallenge Generalist (Task Success) | 45% | 1st place |
| RoboChallenge Generalist (Process Score) | 59.83 | ~20 points above runner-up |
| LIBERO | 97.9% | Simulation suite |
| RoboTwin-Easy | 86.1% | Simulation suite |
| RoboTwin-Hard | 87.2% | Simulation suite |
| Simpler-WidowX | 73.7% | Sim-to-real transfer |
| ALOHA dual-arm in-domain | 83.6% | Real-world bimanual |
| ALOHA dual-arm OOD | 76.9% | Real-world bimanual, out-of-distribution |
| DOMINO zero-shot dynamic | 26.6% | Dynamic manipulation, zero-shot |
The LIBERO score of 97.9% is close to saturating that benchmark, so the more informative numbers are the real-world ALOHA figures and DOMINO. A 76.9% OOD success rate on ALOHA bimanual tasks is meaningful, since out-of-distribution generalization is where most robot policies fall apart. The DOMINO zero-shot result of 26.6% is lower but reflects a harder, deliberately unseen dynamic manipulation regime.
Cross-embodiment transfer shows the biggest payoff. A single policy jointly fine-tuned on 6,000 CobotMagic and 130 ARX demonstrations was tested on four novel ARX tasks for which it had seen no ARX demonstrations. The unified canonical framework reached 55.0% on those tasks, versus under 14% for the best ablation that dropped the canonical alignment step. That gap supports the design decision more directly than any single benchmark number.
Image: Precision manipulation requires sub-centimeter control, the task class Qwen-RobotManip targets with its DiT flow-matching action decoder. (Source: unsplash.com)
Key Capabilities
The DiT action decoder produces continuous action chunks through iterative denoising rather than discretizing actions into tokens. This matters for dexterous tasks: discrete token-based VLAs (like the original RT-2 and early OpenVLA) predict one action step at a time, which can produce jerky motion. Flow matching generates smooth trajectories and handles multi-modal action distributions, that is, situations where there are multiple valid ways to reach the same outcome.
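The sampling side of flow matching can be illustrated with a toy example. This is a hedged sketch, not the model's decoder: `toy_velocity` is a hand-written stand-in for the learned DiT (it assumes the target chunk is known), and the Euler integrator, step count, and chunk shape are assumptions chosen for readability.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned decoder: velocity of the straight-line
    path that carries the current sample `x` to `target` at t = 1."""
    return [[(g - v) / (1.0 - t) for v, g in zip(row, goal)]
            for row, goal in zip(x, target)]


def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from Gaussian noise (t = 0) to an action
    chunk (t = 1) with fixed-step Euler updates."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # never reaches 1.0, so toy_velocity does not divide by zero
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x


# ASSUMPTION: a made-up 4-step, 2-dimensional target trajectory.
target = [[0.01 * step, 0.0] for step in range(4)]
chunk = sample_action_chunk(lambda x, t: toy_velocity(x, t, target),
                            horizon=4, action_dim=2)
```

The point of the sketch is that every step of the chunk is refined jointly at each denoising iteration, in contrast to emitting one discretized action at a time. In a real decoder the velocity would come from a network conditioned on images and the instruction, and different noise seeds could land on different valid trajectories.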
Embodiment-aware prompt conditioning lets a single set of weights control different robot platforms. The operator specifies the robot type, arm configuration, and control convention through a text description. No separate policy heads, no fine-tuning per platform. This is conceptually similar to how π0.7 handles generalization, though the architectural approach differs: Qwen-RobotManip pairs its backbone with a DiT decoder, while Physical Intelligence's system is built on a separate architecture.
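As an illustration of what such a text description might look like, the sketch below assembles one from the three fields the paragraph names. The prompt wording, the `Embodiment` class, and `build_prompt` are hypothetical; the source does not publish the actual prompt template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """The three properties the operator specifies, per the article."""
    robot_type: str
    arm_configuration: str
    control_convention: str


def build_prompt(embodiment, instruction):
    """ASSUMPTION: hypothetical template; the real prompt format is not public."""
    return (
        f"Robot: {embodiment.robot_type}. "
        f"Arms: {embodiment.arm_configuration}. "
        f"Control: {embodiment.control_convention}. "
        f"Task: {instruction}"
    )


aloha = Embodiment(
    robot_type="ALOHA",
    arm_configuration="dual-arm with parallel grippers",
    control_convention="end-effector deltas in the camera frame",
)
prompt = build_prompt(aloha, "put the cup on the shelf")
```

Switching platforms under this scheme means changing the `Embodiment` fields rather than loading a different checkpoint or policy head.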
The training corpus covers four manipulation regimes: single-arm, dual-arm bimanual, dexterous hand manipulation, and mobile manipulation with base movement. The 38,100-hour dataset was assembled entirely from open sources: standard robotic datasets (DROID, RT-1, and similar), egocentric human demonstration videos from Ego4D and EPIC-KITCHENS, and synthetic trajectories generated across 15 robot morphologies. No proprietary data collection was needed.
The four-stage training recipe starts with large-scale action pretraining, adds multimodal continued pretraining, follows with supervised fine-tuning across manipulation task families, then closes with reinforcement learning. This mirrors what you'd expect from a model trying to bridge broad coverage (pretraining) with task-specific performance (fine-tuning and RL).
Pricing and Availability
Qwen-RobotManip was in pilot testing with selected Alibaba Cloud enterprise clients at launch on June 16, 2026. No API pricing, public weights, or open-source release had been announced as of that date.
The underlying language backbone - Qwen3.5-4B - is open-source and available on Hugging Face under Apache 2.0. The manipulation-specific DiT action decoder and the trained VLA weights aren't yet publicly released. Alibaba's typical playbook for Qwen models has been to release weights after an enterprise preview period, though there's no stated timeline for Qwen-RobotManip specifically.
For comparison, the closest open alternative is OpenVLA under Apache 2.0, which fine-tunes on consumer-grade GPUs and has an active open-source community. Our Robotics Embodied AI Leaderboard tracks the current competitive field including OpenVLA, π0, Octo, and others.
Strengths
- Canonical alignment works at scale. The 80-dimensional unified state-action space with camera-frame end-effector deltas solves the heterogeneous data problem that prevents most labs from training on diverse robot datasets simultaneously.
- Cross-embodiment transfer with a single checkpoint. 55% success on zero-shot cross-platform tasks is a real number from an ablation, not a marketing claim.
- Strong real-world OOD results. 76.9% on ALOHA bimanual OOD tasks is one of the better published numbers in this space.
- No proprietary data. The 38,100-hour corpus is reproducible from public sources, which matters for independent verification.
- DiT flow-matching decoder. Smooth trajectory generation is better suited to dexterous tasks than discrete-token action prediction.
Weaknesses
- Weights are not public at launch. Without open weights, independent researchers can't verify the benchmark numbers or build on the architecture.
- Pricing and API access undisclosed. Enterprise pilot only - no way to evaluate cost structure.
- DOMINO zero-shot at 26.6% shows limits. Dynamic manipulation tasks with unseen object interactions remain hard; the model doesn't solve the problem, it reduces it.
- No disclosed context window. It's unclear how much observation history the model uses at inference, which matters for tasks requiring memory of prior steps.
- Part of a suite, not standalone. Navigation and world modeling tasks require the separate Qwen-RobotNav and Qwen-RobotWorld models, so a deployment that needs all three capabilities needs the full suite.
Related Coverage
- Robotics Embodied AI Leaderboard - where VLA models rank on LIBERO, SimplerEnv, RoboTwin, and real-robot tasks as of April 2026
- Physical Intelligence Launches π0.7 for Untrained Tasks - the closest competitor benchmark-wise
- Qwen3.5-4B - the underlying language backbone powering Qwen-RobotManip
- Gemini Robotics-ER 1.6 Can Now Read Analog Gauges - Google's competing embodied AI approach
Sources:
- Qwen-RobotManip: Alignment Unlocks Scale for Robotic Manipulation (Qwen Blog)
- Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (arXiv:2605.30280)
- Qwen-VLA: From Understanding the World to Acting in It (Qwen Blog)
- GitHub - QwenLM/Qwen-VLA
- Alibaba Unveils Qwen Robot Suite, Moving AI From Chatbots Into the Physical World (Technology.org)
- Alibaba eyes physical world with its first suite of AI models for robots (South China Morning Post)
- Unified Embodied AI with Qwen-VLA (StartupHub.ai)
- Alibaba Unveils Qwen Robot Suite for Embodied AI (Let's Data Science)
✓ Last verified June 16, 2026
