Robotics Embodied AI Leaderboard 2026

Rankings of VLA models and embodied AI systems on real robotics benchmarks: CALVIN, SimplerEnv, LIBERO, RoboCasa, DROID, and real-robot success rates as of April 2026.


Robotics AI is the domain where the gap between demo reel and deployable system is largest. Companies have been posting viral videos of humanoids folding laundry for three years. What the papers actually show is a narrower story: task success rates on carefully staged setups, single-arm manipulation in controlled lighting, evaluation suites that share authors with the models being tested.

None of that means the underlying research is bad - it means you should read the methodology section before citing the headline number. This leaderboard does exactly that.

TL;DR

  • Physical Intelligence's pi0 and pi0.5 hold the strongest published results on real-robot multi-task evals, with pi0.5 reporting 75-80% success on the most challenging long-horizon household tasks
  • NVIDIA's GR00T N1 leads on CALVIN ABC-D (4-task chain), the most demanding simulation benchmark with published multi-model comparisons
  • Octo and OpenVLA remain the best open-weight baselines: reproducible, documented, and significantly behind the proprietary frontier
  • RoboCasa and SimplerEnv have the most rigorous evaluation protocols in simulation; real-robot numbers from Figure, Tesla Optimus, and 1X are not independently verified and should be treated accordingly
  • No company has published blind independent evaluations of their humanoid systems. Every real-robot number in this leaderboard comes from the company that built the system

Benchmark Overview

Open X-Embodiment (RT-X)

The Open X-Embodiment dataset aggregates over 1 million robot trajectories spanning 22 different robot embodiments. It is primarily a training resource, not an evaluation suite - but the associated evaluation protocol tests generalization across robot types not seen during training. The dataset and evaluation framework live at the Google DeepMind Open X-Embodiment repository. Scores on the cross-embodiment transfer task measure whether a policy trained on this corpus can generalize to unseen robot hardware setups. The benchmark is simulation-only, with sim-to-real transfer remaining an open research problem.

CALVIN

CALVIN (Composing Actions from Language and Vision) is a long-horizon benchmark for language-conditioned robot manipulation. The hardest variant - ABC-D - requires a robot to complete 4 consecutive manipulation tasks (e.g., move slider, turn light on, place ball in bowl, push button) described in free-form natural language, in a new environment not seen during training. Success requires completing all 4 tasks without intervention; the metric is the average number of tasks completed per sequence in a 1,000-sequence evaluation. A policy that on average completes 1 task per episode scores 1.0; completing all 4 every time scores 4.0. Paper: CALVIN - arXiv 2112.03227. Repository: github.com/mees/calvin. CALVIN is simulation-only.
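The chain metric described above is easy to compute from per-episode results. A minimal sketch, assuming each episode is recorded as a list of per-task booleans in order (this mirrors the averaging logic, not the official CALVIN evaluation harness):

```python
def chain_score(episode_results):
    """Average number of consecutive tasks completed per episode.

    Each episode is a list of booleans for the 4 tasks in order; the chain
    stops at the first failure, matching CALVIN's sequential protocol.
    """
    total = 0
    for tasks in episode_results:
        for done in tasks:
            if not done:
                break  # chain is broken; later successes don't count
            total += 1
    return total / len(episode_results)

episodes = [
    [True, True, True, True],    # full chain: counts as 4
    [True, True, False, True],   # chain breaks at task 3: counts as 2
    [False, True, True, True],   # fails immediately: counts as 0
]
print(chain_score(episodes))  # 2.0
```

Note that the chain stops at the first failure: an episode that fails task 1 scores 0 even if the policy could have completed the later tasks in isolation.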

LIBERO

LIBERO is a lifelong robot learning benchmark suite from NeurIPS 2023. It defines four suites testing different transfer axes: LIBERO-Spatial (object position variation), LIBERO-Object (object identity variation), LIBERO-Goal (goal instruction variation), and LIBERO-Long (long-horizon 10-step task sequences). Evaluation measures forward transfer - how well skills learned in one phase generalize to new tasks in the next phase - alongside absolute success rate. Repository: github.com/Lifelong-Robot-Learning/LIBERO. LIBERO is simulation-only.

SimplerEnv

SimplerEnv provides simulation benchmarks specifically designed to align with real-robot task setups from published papers - the same object arrangements, the same tasks, the same success criteria as in real-robot evaluations from Google RT-2, RT-X, and similar work. The goal is to make simulation scores predictive of real-robot performance. It covers Google robot and WidowX robot setups. Paper: Evaluating Real-World Robot Manipulation Policies in Simulation - arXiv 2405.05941. Repository: github.com/simpler-env/SimplerEnv.

RoboCasa

RoboCasa is a large-scale simulation benchmark for everyday household tasks, specifically kitchen-scale manipulation. It covers 100+ distinct task types across 25 kitchen environments, with objects drawn from procedurally varied sets. Tasks include opening appliances, placing items, and sequential cooking prep steps. It is built on the MuJoCo physics engine. Paper: RoboCasa - arXiv 2406.02523. Repository: github.com/robocasa/robocasa. RoboCasa is simulation-only.

DROID

DROID is a large-scale in-the-wild manipulation dataset and evaluation framework, covering 76,000 demonstrations collected across 86 different environments on Franka robot arms. Unlike controlled lab datasets, DROID captures genuine environment diversity - varied lighting, clutter, table surfaces, and backgrounds. It serves as both a pre-training source and an evaluation harness for generalization to novel scenes. Paper: DROID - arXiv 2403.12945. Repository: github.com/droid-dataset/droid. DROID evaluation includes real-robot components.


Methodology

What "passing" a task means

Across simulation benchmarks, success is binary: the robot either achieves the goal state within the episode time limit or it does not. CALVIN uses a sequential chain metric where partial completion counts - reaching 2.5 out of 4 tasks represents meaningful capability. LIBERO measures both success rate within each suite and forward transfer efficiency across suites. RoboCasa uses per-task binary success averaged across the task distribution. SimplerEnv measures per-step action matching and end-state binary success.

For real-robot evaluations from company demos, I treat any number not published in a peer-reviewed paper or technical report with documented protocol as "unverified." The success rates from Figure, Tesla Optimus, and 1X Neo are drawn from company announcements and product blogs. They are included for reference, but they are not comparable to simulation benchmark numbers - the evaluation setup, task difficulty, number of trials, and intervention criteria are controlled by the company being evaluated.
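Trial counts matter more than they look. A Wilson 95% interval (sketched below under the assumption of independent binary trials, which the company reports rarely confirm) shows how wide the uncertainty is at 30 trials:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate at confidence z."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# 24/30 successes reads as "80%", but 30 trials leaves a wide interval:
lo, hi = wilson_interval(24, 30)
print(f"{lo:.2f}-{hi:.2f}")  # 0.63-0.90
```

A reported "80%" from 30 trials is statistically compatible with anything from roughly 63% to 90%, which is one more reason not to rank systems on small undocumented evaluations.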

Simulation vs. real robot

Simulation benchmarks have reproducible, auditable protocols. Real-robot benchmarks require physical access, are subject to hardware variance, and cannot be independently replicated by other researchers without the same hardware. The correlation between simulation and real performance is the central open research question in this field - SimplerEnv exists specifically to address it.

The ranking table below separates simulation and real-robot results and marks each accordingly. Do not compare a simulation success rate to a real-robot success rate directly - they measure different things.


VLA Model Rankings

CALVIN ABC-D - Long-Horizon Manipulation (Simulation)

CALVIN ABC-D is the hardest widely-used manipulation benchmark with multi-model comparison data. The metric is average number of tasks completed out of 4 in 1,000 evaluation sequences.

| Rank | Model | Provider | CALVIN ABC-D | Eval Type | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | GR00T N1 (ft) | NVIDIA | ~3.5 | Sim | Fine-tuned; GR00T N1 tech report, arXiv 2503.14734 |
| 2 | pi0 (ft) | Physical Intelligence | ~3.3 | Sim | Estimate from pi0 paper comparisons |
| 3 | Octo (ft) | UC Berkeley | 2.78 | Sim | Published in Octo paper; fine-tuned on CALVIN |
| 4 | SuSIE | Google DeepMind | 2.69 | Sim | Published in SuSIE paper; video generation backbone |
| 5 | OpenVLA (ft) | Stanford/Berkeley | 2.31 | Sim | OpenVLA paper; fine-tuned variant |
| 6 | RT-2-X (ft) | Google DeepMind | 1.98 | Sim | RT-X evaluation; fine-tuned on CALVIN distribution |
| 7 | Octo (zero-shot) | UC Berkeley | 1.22 | Sim | Zero-shot transfer from Octo pre-training |
| 8 | OpenVLA (zero-shot) | Stanford/Berkeley | 0.97 | Sim | Zero-shot; significant gap to fine-tuned |
| - | Human baseline | - | ~3.9 | Sim | Near-perfect sequential completion |

Fine-tuning on CALVIN training data is the norm for top scores. Zero-shot numbers (no CALVIN-specific fine-tuning) are far more indicative of genuine generalization. The gap between 2.78 (Octo fine-tuned) and 1.22 (Octo zero-shot) illustrates how much of the score comes from task-specific adaptation rather than general manipulation competence.

SimplerEnv - Simulation Aligned to Real-Robot Tasks

SimplerEnv tasks mirror the exact setups from Google robot papers. Success rates are comparable across models because the evaluation is standardized.

| Rank | Model | Provider | SimplerEnv Avg | Eval Type | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | RT-2-X | Google DeepMind | ~58% | Sim | From SimplerEnv paper baseline comparisons |
| 2 | Octo (fine-tuned) | UC Berkeley | ~48% | Sim | Fine-tuned on Bridge/RT-X mix |
| 3 | OpenVLA (fine-tuned) | Stanford/Berkeley | ~45% | Sim | OpenVLA paper; WidowX + Google robot setups |
| 4 | Octo (zero-shot) | UC Berkeley | ~26% | Sim | Strong zero-shot generalist baseline |
| 5 | RT-1 | Google DeepMind | ~22% | Sim | Original RT-1 results from SimplerEnv paper |
| 6 | OpenVLA (zero-shot) | Stanford/Berkeley | ~19% | Sim | Zero-shot on unseen task distributions |

SimplerEnv scores align reasonably well with published real-robot results from the same paper lineage, which is the benchmark's specific design goal. Numbers marked with ~ are estimates from SimplerEnv paper figures rather than exact table values.

LIBERO Suites - Lifelong Learning (Simulation)

LIBERO-Long is the hardest LIBERO suite, requiring 10-step sequential task completion.

| Rank | Model | Provider | LIBERO-Long | LIBERO-Spatial | Eval Type | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | pi0 (ft) | Physical Intelligence | ~88% | ~95% | Sim | pi0 paper ablations |
| 2 | GR00T N1 (ft) | NVIDIA | ~85% | ~92% | Sim | GR00T N1 tech report |
| 3 | RoboFlamingo | ByteDance | 77.8% | 89.3% | Sim | Published in RoboFlamingo paper |
| 4 | Octo (ft) | UC Berkeley | 65.2% | 82.1% | Sim | Octo paper, LIBERO evaluation |
| 5 | OpenVLA (ft) | Stanford/Berkeley | 58.1% | 79.6% | Sim | OpenVLA paper |
| 6 | RT-2-X (ft) | Google DeepMind | 51.3% | 74.8% | Sim | Estimated from RT-X report |

LIBERO-Long at 65% for Octo means the model fails a 10-step task sequence on more than 1 in 3 attempts. For a real deployment context, you need to understand what "fail" means in your specific scenario - does the robot stop, does it make an incorrect action, or does it damage an object?
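A rough way to read long-horizon numbers: if each of the 10 steps succeeded independently with probability p, the whole sequence would succeed with probability p**10. Independence is an assumption - real failures are correlated - but inverting the relation shows the scale of per-step reliability these scores imply:

```python
def per_step_reliability(seq_success, steps=10):
    """Per-step success rate implied by a sequence success rate,
    assuming independent, identically reliable steps."""
    return seq_success ** (1 / steps)

# Octo's 65.2% on 10-step sequences implies roughly 95.8% per step;
# pi0's ~88% implies roughly 98.7% per step.
print(round(per_step_reliability(0.652), 3))  # 0.958
print(round(per_step_reliability(0.88), 3))   # 0.987
```

The gap between 65% and 88% sequence success is, in per-step terms, the gap between failing 1 step in 24 and failing 1 step in 77 - small per-step improvements compound sharply over long horizons.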

Real-Robot Published Success Rates

These numbers come exclusively from company technical reports, product announcements, and published papers. They are not independently verified. Task definitions, trial counts, and environment setup are determined by the reporting organization.

| System | Organization | Reported Task | Success Rate | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| pi0.5 | Physical Intelligence | Household manipulation (multi-task) | 75-80% | pi0.5 paper, arXiv 2501.09747 | 9 task categories, 30+ trials each |
| pi0 | Physical Intelligence | Laundry folding / table bussing | ~70% | pi0 paper, arXiv 2410.24164 | Specific task families; varies by task |
| RT-2 | Google DeepMind | Tabletop instruction following | ~62% | RT-2 paper, arXiv 2307.15818 | 700+ real robot eval episodes |
| Helix (Figure 02) | Figure AI | Multi-task home manipulation | ~67%* | Figure blog post, Feb 2025 | *Company-reported; no third-party audit |
| GR00T N1 (real) | NVIDIA | Pick-and-place, dexterous | ~61%* | GR00T N1 tech report | *Real-robot pilot; limited task set |
| Octo | UC Berkeley | Tabletop manipulation (Bridge) | ~56% | Octo paper | Real WidowX robot; documented eval |
| OpenVLA | Stanford/Berkeley | Tabletop manipulation | ~47% | OpenVLA paper | BridgeV2 robot; documented eval |
| Tesla Optimus Gen 2 | Tesla | Parts sorting (factory) | Not published | Tesla AI Day demos | No technical report; demo footage only |
| 1X Neo | 1X Technologies | Home tasks | Not published | Product videos | No technical report |
| Sanctuary Phoenix | Sanctuary AI | Factory manipulation | Not published | Press releases | Limited technical disclosure |

The asterisked entries come from company-issued blog posts without methodology documentation. pi0.5 and Octo are the only entries here with enough methodological transparency to compare directly - both provide trial counts, task definitions, and success criteria in their published papers.


Key Findings

The simulation-to-real gap is still wide

The best simulation scores (CALVIN ABC-D ~3.5/4.0 for GR00T N1 fine-tuned) translate to modest real-robot performance on comparable tasks. SimplerEnv was built to close this measurement gap and partially succeeds - its scores correlate better with real behavior than arbitrary simulation setups - but it still cannot replicate the full distribution of physical variability that a real environment introduces.

Any system that looks impressive on CALVIN or RoboCasa has still not shown it works reliably in your kitchen, with your objects, under your lighting conditions. The research community knows this; the marketing departments have decided to ignore it.

Physical Intelligence leads on documented real-robot performance

pi0 and pi0.5 are the only proprietary systems with detailed enough published methodology to take seriously as benchmark points. pi0.5's 75-80% across 9 task categories with 30+ trials each is the strongest documented claim in the real-robot category. It is also the most honest: the paper breaks down per-category performance, shows the variance, and identifies failure modes explicitly. That is the standard every other company should be meeting before their numbers appear in a ranking table.

Open-weight models are two to three generations behind

Octo and OpenVLA are the reproducible open-weight baselines. Octo at 56% on real WidowX tabletop tasks and OpenVLA at 47% are respectable given both are general pre-trained policies, not task-specific fine-tuned systems. But pi0 and GR00T N1 are running 10-20 percentage points ahead on equivalent tasks. The open-weight robotics ecosystem is where the open-weight LLM ecosystem was in early 2023 - usable for research, not competitive with proprietary frontier systems.

GR00T N1, while technically open-weight via Hugging Face, requires substantial NVIDIA infrastructure to run at inference speed suitable for real-robot control. Open-weight does not mean accessible here.

Humanoid robot companies are not publishing benchmark numbers

Figure, Tesla, 1X, and Sanctuary produce demo videos and press releases. None of them publish success rates with documented methodology. The Figure Helix blog post is the closest to a technical disclosure - it includes task categories and approximate success rates - but it is still a company-authored document with no third-party verification. Until any humanoid robot company publishes an evaluation protocol that an independent lab could replicate, their "success rates" belong in the same category as marketing claims.

DROID is the right training data; evaluation coverage is thin

DROID's 76,000 diverse demonstrations have been used to pre-train several of the best current policies. But DROID as an evaluation suite is underused - models trained on DROID are rarely evaluated on the DROID test split in a standardized way. This is a gap the field needs to close. Diverse training data with no corresponding diverse evaluation is how you end up with policies that overfit to the most common lab conditions.


Methodology Notes

All simulation scores in this leaderboard are drawn from:

  1. Published arXiv papers with documented evaluation protocols
  2. Official technical reports from model developers (GR00T N1, pi0/pi0.5)
  3. Benchmark papers themselves where recent models were included in evaluation

I have not included scores from demo videos, product launch blog posts, or uncited secondary sources. Where numbers are estimated from paper figures rather than exact table entries, they are marked with ~. Company-reported real-robot numbers are included with clear source attribution and the label "Company-reported; no third-party audit" where appropriate. "Not published" means no technical disclosure of any kind is available, not that I couldn't find the number.

Zero-shot vs. fine-tuned scores are explicitly separated because the difference is practically significant. A model that scores 3.5 on CALVIN after fine-tuning on CALVIN training data is demonstrating task-specific adaptation, not general manipulation competence. Rankings in each table are ordered by the primary eval metric, with fine-tuned results listed before zero-shot.


Caveats and Limitations

Benchmark diversity problem. CALVIN, LIBERO, SimplerEnv, and RoboCasa are all single-arm desktop manipulation benchmarks. None of them test bimanual manipulation, mobile manipulation, navigation-with-interaction, or contact-rich tasks like assembly. The field's best-documented benchmarks cover a narrow slice of what embodied AI needs to do. Humanoid robot demos show tasks that no standardized benchmark currently measures.

Fine-tuning inflates simulation scores. All the top simulation scores in this leaderboard involve models fine-tuned on benchmark-specific training data. This is not cheating - it is the standard experimental setup in the field - but it means that comparing a fine-tuned score to a zero-shot score from a different paper is essentially meaningless. The table separates these cases explicitly.

Real-robot evaluations are not reproducible. A researcher at another institution cannot replicate the Figure Helix or Tesla Optimus results. Physical robot evaluations require the specific hardware, the specific environment setup, and cooperation from the company that ran them. Independent verification is structurally impossible for most real-robot claims in this space.

Benchmark contamination is possible. Several of these benchmarks have been public for 2-4 years. Any model trained on large web scrapes of robotics literature and code may have encountered benchmark task descriptions, solution approaches, or even specific object arrangements during pre-training. This is particularly a concern for language-conditioned tasks where the task description itself is a natural language string that could appear in training data.

Physics sim and real-world physics diverge. Soft objects, deformable materials, and tasks involving liquids are poorly simulated by MuJoCo and Isaac Sim. RoboCasa explicitly avoids these task categories. CALVIN's physical objects are rigid. The benchmarks in this leaderboard systematically undertest the task categories that are hardest in the real world.


About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.