<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>VLA Models | Awesome Agents</title><link>https://awesomeagents.ai/tags/vla-models/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/vla-models/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>Robotics Embodied AI Leaderboard 2026: VLA Models Ranked</title><link>https://awesomeagents.ai/leaderboards/robotics-embodied-ai-leaderboard/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://awesomeagents.ai/leaderboards/robotics-embodied-ai-leaderboard/</guid><description>&lt;p>Robotics AI is the domain where the gap between demo reel and deployable system is largest. Companies have been posting viral videos of humanoids folding laundry for three years. What the papers actually show is a narrower story: task success rates on carefully staged setups, single-arm manipulation in controlled lighting, evaluation suites that share authors with the models being tested.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Robotics AI is the domain where the gap between demo reel and deployable system is largest. Companies have been posting viral videos of humanoids folding laundry for three years. 
What the papers actually show is a narrower story: task success rates on carefully staged setups, single-arm manipulation in controlled lighting, evaluation suites that share authors with the models being tested.</p>
<p>None of that means the underlying research is bad - it means you should read the methodology section before citing the headline number. This leaderboard does exactly that.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Physical Intelligence's pi0 and pi0.5 hold the strongest published results on real-robot multi-task evals, with pi0.5 reporting 75-80% success on the most challenging long-horizon household tasks</li>
<li>NVIDIA's GR00T N1 leads on CALVIN ABC-D (4-task chain), the most demanding simulation benchmark with published multi-model comparisons</li>
<li>Octo and OpenVLA remain the best open-weight baselines: reproducible, documented, and significantly behind the proprietary frontier</li>
<li>RoboCasa and SimplerEnv have the most rigorous evaluation protocols in simulation; real-robot numbers from Figure, Tesla Optimus, and 1X are not independently verified and should be treated accordingly</li>
<li>No company has published blind independent evaluations of their humanoid systems. Every real-robot number in this leaderboard comes from the company that built the system</li>
</ul>
</div>
<h2 id="benchmark-overview">Benchmark Overview</h2>
<h3 id="open-x-embodiment-rt-x">Open X-Embodiment (RT-X)</h3>
<p>The Open X-Embodiment dataset aggregates over 1 million robot trajectories across 22 different robot embodiments, spanning more than 500 distinct skills. It is primarily a training resource, not an evaluation suite - but the associated evaluation protocol tests generalization across robot types not seen during training. The dataset and evaluation framework live at the <a href="https://github.com/google-deepmind/open_x_embodiment">Google DeepMind Open X-Embodiment repository</a>. Scores on the cross-embodiment transfer task measure whether a policy trained on this corpus can generalize to unseen robot hardware setups. The RT-X evaluations were run on real robots at the contributing labs, which makes them difficult to reproduce without matching hardware.</p>
<h3 id="calvin">CALVIN</h3>
<p>CALVIN (Composing Actions from Language and Vision) is a long-horizon benchmark for language-conditioned robot manipulation. The hardest variant - ABC-D - requires a robot to complete 4 consecutive manipulation tasks (e.g., move slider, turn light on, place ball in bowl, push button) described in free-form natural language, in a new environment not seen during training. Success requires completing all 4 tasks without intervention; the metric is the average number of tasks completed in a 1,000-sequence evaluation. A policy that completes an average of 1 task per chain scores 1.0; completing all 4 in every chain scores 4.0. Paper: <a href="https://arxiv.org/abs/2112.03227">CALVIN - arXiv 2112.03227</a>. Repository: <a href="https://github.com/mees/calvin">github.com/mees/calvin</a>. CALVIN is simulation-only.</p>
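<p>The chain metric can be sketched in a few lines. This is an illustrative reimplementation, not code from the CALVIN repository; the function name and input format are my own.</p>

```python
def average_tasks_completed(sequence_results):
    """CALVIN-style chain metric (illustrative sketch).

    sequence_results: one list of per-task success flags per evaluation
    sequence, in task order. A chain contributes the number of tasks
    completed before its first failure.
    """
    total = 0
    for tasks in sequence_results:
        for success in tasks:
            if not success:
                break  # the chain ends at the first failed task
            total += 1
    return total / len(sequence_results)

results = [
    [True, True, True, True],   # all 4 completed -> contributes 4
    [True, True, False, True],  # breaks at task 3 -> contributes 2
    [False, True, True, True],  # fails immediately -> contributes 0
]
print(average_tasks_completed(results))  # 2.0
```

<p>Note that the third sequence contributes nothing even though 3 of its 4 tasks would have succeeded in isolation - the sequential metric is deliberately unforgiving of early failures.</p>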
<h3 id="libero">LIBERO</h3>
<p>LIBERO is a lifelong robot learning benchmark suite from NeurIPS 2023. It defines four suites testing different transfer axes: LIBERO-Spatial (object position variation), LIBERO-Object (object identity variation), LIBERO-Goal (goal instruction variation), and LIBERO-Long (long-horizon 10-step task sequences). Evaluation measures forward transfer - how well skills learned in one phase generalize to new tasks in the next phase - alongside absolute success rate. Repository: <a href="https://github.com/Lifelong-Robot-Learning/LIBERO">github.com/Lifelong-Robot-Learning/LIBERO</a>. LIBERO is simulation-only.</p>
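<p>"Forward transfer" has several formalizations in the lifelong-learning literature. A common one (following the GEM-style metrics, and not necessarily LIBERO's exact implementation) compares success on each task, measured before training on it, against a fixed untrained baseline:</p>

```python
def forward_transfer(pre_training_success, baseline_success):
    """GEM-style forward-transfer sketch (illustrative, not LIBERO's code).

    pre_training_success[k]: success rate on task k evaluated *before* any
    task-k training, i.e., what earlier learning phases transferred.
    baseline_success[k]: success rate of an untrained reference policy.
    """
    assert len(pre_training_success) == len(baseline_success)
    gains = [p - b for p, b in zip(pre_training_success, baseline_success)]
    return sum(gains) / len(gains)

# A policy whose earlier phases help later tasks shows a positive score:
print(forward_transfer([0.4, 0.5, 0.6], [0.1, 0.1, 0.1]))  # ~0.40
```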
<h3 id="simplerenv">SimplerEnv</h3>
<p>SimplerEnv provides simulation benchmarks specifically designed to align with real-robot task setups from published papers - the same object arrangements, the same tasks, the same success criteria as in real-robot evaluations from Google RT-2, RT-X, and similar work. The goal is to make simulation scores predictive of real-robot performance. It covers Google robot and WidowX robot setups. Paper: <a href="https://arxiv.org/abs/2405.05941">Evaluating Real-World Robot Manipulation Policies in Simulation - arXiv 2405.05941</a>. Repository: <a href="https://github.com/simpler-env/SimplerEnv">github.com/simpler-env/SimplerEnv</a>.</p>
<h3 id="robocasa">RoboCasa</h3>
<p>RoboCasa is a large-scale simulation benchmark for everyday household tasks, specifically kitchen-scale manipulation. It covers 100+ distinct task types across 25 kitchen environments, with objects drawn from procedurally varied sets. Tasks include opening appliances, placing items, and sequential cooking prep steps. It is built on the MuJoCo physics engine. Paper: <a href="https://arxiv.org/abs/2406.02523">RoboCasa - arXiv 2406.02523</a>. Repository: <a href="https://github.com/robocasa/robocasa">github.com/robocasa/robocasa</a>. RoboCasa is simulation-only.</p>
<h3 id="droid">DROID</h3>
<p>DROID is a large-scale in-the-wild manipulation dataset and evaluation framework, covering 76,000 demonstrations collected across 86 different environments on Franka robot arms. Unlike controlled lab datasets, DROID captures genuine environment diversity - varied lighting, clutter, table surfaces, and backgrounds. It serves as both a pre-training source and an evaluation harness for generalization to novel scenes. Paper: <a href="https://arxiv.org/abs/2403.12945">DROID - arXiv 2403.12945</a>. Repository: <a href="https://github.com/droid-dataset/droid">github.com/droid-dataset/droid</a>. DROID evaluation includes real-robot components.</p>
<hr>
<h2 id="methodology">Methodology</h2>
<h3 id="what-passing-a-task-means">What &quot;passing&quot; a task means</h3>
<p>Across simulation benchmarks, success is binary: the robot either achieves the goal state within the episode time limit or it does not. CALVIN uses a sequential chain metric where partial completion counts - reaching 2.5 out of 4 tasks represents meaningful capability. LIBERO measures both success rate within each suite and forward transfer efficiency across suites. RoboCasa uses per-task binary success averaged across the task distribution. SimplerEnv measures per-step action matching and end-state binary success.</p>
<p>For real-robot evaluations from company demos, I treat any number not published in a peer-reviewed paper or technical report with documented protocol as &quot;unverified.&quot; The success rates from Figure, Tesla Optimus, and 1X Neo are drawn from company announcements and product blogs. They are included for reference, but they are not comparable to simulation benchmark numbers - the evaluation setup, task difficulty, number of trials, and intervention criteria are controlled by the company being evaluated.</p>
<h3 id="simulation-vs-real-robot">Simulation vs. real robot</h3>
<p>Simulation benchmarks have reproducible, auditable protocols. Real-robot benchmarks require physical access, are subject to hardware variance, and cannot be independently replicated by other researchers without the same hardware. The correlation between simulation and real performance is the central open research question in this field - SimplerEnv exists specifically to address it.</p>
<p>The ranking table below separates simulation and real-robot results and marks each accordingly. Do not compare a simulation success rate to a real-robot success rate directly - they measure different things.</p>
<hr>
<h2 id="vla-model-rankings">VLA Model Rankings</h2>
<h3 id="calvin-abc-d---long-horizon-manipulation-simulation">CALVIN ABC-D - Long-Horizon Manipulation (Simulation)</h3>
<p>CALVIN ABC-D is the hardest widely-used manipulation benchmark with multi-model comparison data. The metric is average number of tasks completed out of 4 in 1,000 evaluation sequences.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: center">CALVIN ABC-D</th>
          <th style="text-align: center">Eval Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>GR00T N1 (ft)</td>
          <td>NVIDIA</td>
          <td style="text-align: center">~3.5</td>
          <td style="text-align: center">Sim</td>
          <td>Fine-tuned; GR00T N1 tech report, arXiv 2503.14734</td>
      </tr>
      <tr>
          <td>2</td>
          <td>pi0 (ft)</td>
          <td>Physical Intelligence</td>
          <td style="text-align: center">~3.3</td>
          <td style="text-align: center">Sim</td>
          <td>Estimate from pi0 paper comparisons</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Octo (ft)</td>
          <td>UC Berkeley</td>
          <td style="text-align: center">2.78</td>
          <td style="text-align: center">Sim</td>
          <td>Published in Octo paper; fine-tuned on CALVIN</td>
      </tr>
      <tr>
          <td>4</td>
          <td>SuSIE</td>
          <td>UC Berkeley</td>
          <td style="text-align: center">2.69</td>
          <td style="text-align: center">Sim</td>
          <td>Published in SuSIE paper; image-editing diffusion subgoals</td>
      </tr>
      <tr>
          <td>5</td>
          <td>OpenVLA (ft)</td>
          <td>Stanford/Berkeley</td>
          <td style="text-align: center">2.31</td>
          <td style="text-align: center">Sim</td>
          <td>OpenVLA paper; fine-tuned variant</td>
      </tr>
      <tr>
          <td>6</td>
          <td>RT-2-X (ft)</td>
          <td>Google DeepMind</td>
          <td style="text-align: center">1.98</td>
          <td style="text-align: center">Sim</td>
          <td>RT-X evaluation; fine-tuned on CALVIN distribution</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Octo (zero-shot)</td>
          <td>UC Berkeley</td>
          <td style="text-align: center">1.22</td>
          <td style="text-align: center">Sim</td>
          <td>Zero-shot transfer from Octo pre-training</td>
      </tr>
      <tr>
          <td>8</td>
          <td>OpenVLA (zero-shot)</td>
          <td>Stanford/Berkeley</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center">Sim</td>
          <td>Zero-shot; significant gap to fine-tuned</td>
      </tr>
      <tr>
          <td>-</td>
          <td>Human baseline</td>
          <td>-</td>
          <td style="text-align: center">~3.9</td>
          <td style="text-align: center">Sim</td>
          <td>Near-perfect sequential completion</td>
      </tr>
  </tbody>
</table>
<p>Fine-tuning on CALVIN training data is the norm for top scores. Zero-shot numbers (no CALVIN-specific fine-tuning) are far more indicative of genuine generalization. The gap between 2.78 (Octo fine-tuned) and 1.22 (Octo zero-shot) shows how much of a top score reflects task-specific adaptation rather than general manipulation capability.</p>
<h3 id="simplerenv---simulation-aligned-to-real-robot-tasks">SimplerEnv - Simulation Aligned to Real-Robot Tasks</h3>
<p>SimplerEnv tasks mirror the exact setups from Google robot papers. Success rates are comparable across models because the evaluation is standardized.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: center">SimplerEnv Avg</th>
          <th style="text-align: center">Eval Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>RT-2-X</td>
          <td>Google DeepMind</td>
          <td style="text-align: center">~58%</td>
          <td style="text-align: center">Sim</td>
          <td>From SimplerEnv paper baseline comparisons</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Octo (fine-tuned)</td>
          <td>UC Berkeley</td>
          <td style="text-align: center">~48%</td>
          <td style="text-align: center">Sim</td>
          <td>Fine-tuned on Bridge/RT-X mix</td>
      </tr>
      <tr>
          <td>3</td>
          <td>OpenVLA (fine-tuned)</td>
          <td>Stanford/Berkeley</td>
          <td style="text-align: center">~45%</td>
          <td style="text-align: center">Sim</td>
          <td>OpenVLA paper; WidowX + Google robot setups</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Octo (zero-shot)</td>
          <td>UC Berkeley</td>
          <td style="text-align: center">~26%</td>
          <td style="text-align: center">Sim</td>
          <td>Strong zero-shot generalist baseline</td>
      </tr>
      <tr>
          <td>5</td>
          <td>RT-1</td>
          <td>Google DeepMind</td>
          <td style="text-align: center">~22%</td>
          <td style="text-align: center">Sim</td>
          <td>Original RT-1 results from SimplerEnv paper</td>
      </tr>
      <tr>
          <td>6</td>
          <td>OpenVLA (zero-shot)</td>
          <td>Stanford/Berkeley</td>
          <td style="text-align: center">~19%</td>
          <td style="text-align: center">Sim</td>
          <td>Zero-shot on unseen task distributions</td>
      </tr>
  </tbody>
</table>
<p>SimplerEnv scores align reasonably well with published real-robot results from the same paper lineage, which is the benchmark's specific design goal. Numbers marked with ~ are estimates from SimplerEnv paper figures rather than exact table values.</p>
<h3 id="libero-suites---lifelong-learning-simulation">LIBERO Suites - Lifelong Learning (Simulation)</h3>
<p>LIBERO-Long is the hardest LIBERO suite, requiring 10-step sequential task completion.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: center">LIBERO-Long</th>
          <th style="text-align: center">LIBERO-Spatial</th>
          <th style="text-align: center">Eval Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>pi0 (ft)</td>
          <td>Physical Intelligence</td>
          <td style="text-align: center">~88%</td>
          <td style="text-align: center">~95%</td>
          <td style="text-align: center">Sim</td>
          <td>pi0 paper ablations</td>
      </tr>
      <tr>
          <td>2</td>
          <td>GR00T N1 (ft)</td>
          <td>NVIDIA</td>
          <td style="text-align: center">~85%</td>
          <td style="text-align: center">~92%</td>
          <td style="text-align: center">Sim</td>
          <td>GR00T N1 tech report</td>
      </tr>
      <tr>
          <td>3</td>
          <td>RoboFlamingo</td>
          <td>ByteDance</td>
          <td style="text-align: center">77.8%</td>
          <td style="text-align: center">89.3%</td>
          <td style="text-align: center">Sim</td>
          <td>Published in RoboFlamingo paper</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Octo (ft)</td>
          <td>UC Berkeley</td>
          <td style="text-align: center">65.2%</td>
          <td style="text-align: center">82.1%</td>
          <td style="text-align: center">Sim</td>
          <td>Octo paper, LIBERO evaluation</td>
      </tr>
      <tr>
          <td>5</td>
          <td>OpenVLA (ft)</td>
          <td>Stanford/Berkeley</td>
          <td style="text-align: center">58.1%</td>
          <td style="text-align: center">79.6%</td>
          <td style="text-align: center">Sim</td>
          <td>OpenVLA paper</td>
      </tr>
      <tr>
          <td>6</td>
          <td>RT-2-X (ft)</td>
          <td>Google DeepMind</td>
          <td style="text-align: center">51.3%</td>
          <td style="text-align: center">74.8%</td>
          <td style="text-align: center">Sim</td>
          <td>Estimated from RT-X report</td>
      </tr>
  </tbody>
</table>
<p>LIBERO-Long at 65% for Octo means the model fails a 10-step task sequence on more than 1 in 3 attempts. For a real deployment context, you need to understand what &quot;fail&quot; means in your specific scenario - does the robot stop, does it make an incorrect action, or does it damage an object?</p>
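<p>One way to make that number concrete: if a 10-step sequence succeeds only when every step does, and step failures were independent (a simplifying assumption - real failures are strongly correlated), the sequence rate implies a per-step reliability:</p>

```python
def implied_per_step_success(sequence_success, n_steps):
    """Per-step reliability implied by a sequence success rate, assuming
    independent step failures (a simplification; real failures correlate)."""
    return sequence_success ** (1.0 / n_steps)

# Octo's 65.2% on 10-step LIBERO-Long implies ~95.8% per-step reliability;
# pi0's ~88% implies ~98.7% - small per-step gains compound sharply.
print(round(implied_per_step_success(0.652, 10), 3))  # 0.958
print(round(implied_per_step_success(0.88, 10), 3))   # 0.987
```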
<h3 id="real-robot-published-success-rates">Real-Robot Published Success Rates</h3>
<p>These numbers come exclusively from company technical reports, product announcements, and published papers. They are not independently verified. Task definitions, trial counts, and environment setup are determined by the reporting organization.</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>Organization</th>
          <th>Reported Task</th>
          <th style="text-align: center">Success Rate</th>
          <th style="text-align: center">Source</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>pi0.5</td>
          <td>Physical Intelligence</td>
          <td>Household manipulation (multi-task)</td>
          <td style="text-align: center">75-80%</td>
          <td style="text-align: center">pi0.5 paper, arXiv 2501.09747</td>
          <td>9 task categories, 30+ trials each</td>
      </tr>
      <tr>
          <td>pi0</td>
          <td>Physical Intelligence</td>
          <td>Laundry folding / table bussing</td>
          <td style="text-align: center">~70%</td>
          <td style="text-align: center">pi0 paper, arXiv 2410.24164</td>
          <td>Specific task families; varies by task</td>
      </tr>
      <tr>
          <td>RT-2</td>
          <td>Google DeepMind</td>
          <td>Tabletop instruction following</td>
          <td style="text-align: center">~62%</td>
          <td style="text-align: center">RT-2 paper, arXiv 2307.15818</td>
          <td>700+ real robot eval episodes</td>
      </tr>
      <tr>
          <td>Helix (Figure 02)</td>
          <td>Figure AI</td>
          <td>Multi-task home manipulation</td>
          <td style="text-align: center">~67%*</td>
          <td style="text-align: center">Figure blog post, Feb 2025</td>
          <td>*Company-reported; no third-party audit</td>
      </tr>
      <tr>
          <td>GR00T N1 (real)</td>
          <td>NVIDIA</td>
          <td>Pick-and-place, dexterous</td>
          <td style="text-align: center">~61%*</td>
          <td style="text-align: center">GR00T N1 tech report</td>
          <td>Real-robot pilot; limited task set</td>
      </tr>
      <tr>
          <td>Octo</td>
          <td>UC Berkeley</td>
          <td>Tabletop manipulation (Bridge)</td>
          <td style="text-align: center">~56%</td>
          <td style="text-align: center">Octo paper</td>
          <td>Real WidowX robot; documented eval</td>
      </tr>
      <tr>
          <td>OpenVLA</td>
          <td>Stanford/Berkeley</td>
          <td>Tabletop manipulation</td>
          <td style="text-align: center">~47%</td>
          <td style="text-align: center">OpenVLA paper</td>
          <td>WidowX robot (BridgeData V2 setup); documented eval</td>
      </tr>
      <tr>
          <td>Tesla Optimus Gen 2</td>
          <td>Tesla</td>
          <td>Parts sorting (factory)</td>
          <td style="text-align: center">Not published</td>
          <td style="text-align: center">Tesla AI Day demos</td>
          <td>No technical report; demo footage only</td>
      </tr>
      <tr>
          <td>1X Neo</td>
          <td>1X Technologies</td>
          <td>Home tasks</td>
          <td style="text-align: center">Not published</td>
          <td style="text-align: center">Product videos</td>
          <td>No technical report</td>
      </tr>
      <tr>
          <td>Sanctuary Phoenix</td>
          <td>Sanctuary AI</td>
          <td>Factory manipulation</td>
          <td style="text-align: center">Not published</td>
          <td style="text-align: center">Press releases</td>
          <td>Limited technical disclosure</td>
      </tr>
  </tbody>
</table>
<p>The asterisked entries come from company-issued blog posts without methodology documentation. The paper-sourced rows - pi0.5, pi0, RT-2, Octo, and OpenVLA - provide trial counts, task definitions, and success criteria in their published papers, and are the only entries with enough methodological transparency to compare directly.</p>
<hr>
<h2 id="key-findings">Key Findings</h2>
<h3 id="the-simulation-to-real-gap-is-still-wide">The simulation-to-real gap is still wide</h3>
<p>The best simulation scores (CALVIN ABC-D ~3.5/4.0 for GR00T N1 fine-tuned) translate to modest real-robot performance on comparable tasks. SimplerEnv was built to close this measurement gap and partially succeeds - its scores correlate better with real behavior than arbitrary simulation setups - but it still cannot replicate the full distribution of physical variability that a real environment introduces.</p>
<p>Any system that looks impressive on CALVIN or RoboCasa has still not shown it works reliably in your kitchen, with your objects, under your lighting conditions. The research community knows this; the marketing departments have decided to ignore it.</p>
<h3 id="physical-intelligence-leads-on-documented-real-robot-performance">Physical Intelligence leads on documented real-robot performance</h3>
<p>pi0 and pi0.5 are the only proprietary systems with detailed enough published methodology to take seriously as benchmark points. pi0.5's 75-80% across 9 task categories with 30+ trials each is the strongest documented claim in the real-robot category. It is also the most honest: the paper breaks down per-category performance, shows the variance, and identifies failure modes explicitly. That is the standard every other company should be meeting before their numbers appear in a ranking table.</p>
<h3 id="open-weight-models-are-two-to-three-generations-behind">Open-weight models are two to three generations behind</h3>
<p>Octo and OpenVLA are the reproducible open-weight baselines. Octo at 56% on real WidowX tabletop tasks and OpenVLA at 47% are respectable given both are general pre-trained policies, not task-specific fine-tuned systems. But pi0 and GR00T N1 are running 10-20 percentage points ahead on equivalent tasks. The open-weight robotics ecosystem is where the open-weight LLM ecosystem was in early 2023 - usable for research, not competitive with proprietary frontier systems.</p>
<p>GR00T N1, while technically open-weight via Hugging Face, requires substantial NVIDIA infrastructure to run at inference speed suitable for real-robot control. Open-weight does not mean accessible here.</p>
<h3 id="humanoid-robot-companies-are-not-publishing-benchmark-numbers">Humanoid robot companies are not publishing benchmark numbers</h3>
<p>Figure, Tesla, 1X, and Sanctuary produce demo videos and press releases. None of them publish success rates with documented methodology. The Figure Helix blog post is the closest to a technical disclosure - it includes task categories and approximate success rates - but it is still a company-authored document with no third-party verification. Until any humanoid robot company publishes an evaluation protocol that an independent lab could replicate, their &quot;success rates&quot; belong in the same category as marketing claims.</p>
<h3 id="droid-is-the-right-training-data-evaluation-coverage-is-thin">DROID is the right training data; evaluation coverage is thin</h3>
<p>DROID's 76,000 diverse demonstrations have been used to pre-train several of the best current policies. But DROID as an evaluation suite is underused - models trained on DROID are rarely evaluated on the DROID test split in a standardized way. This is a gap the field needs to close. Diverse training data with no corresponding diverse evaluation is how you end up with policies that overfit to the most common lab conditions.</p>
<hr>
<h2 id="methodology-notes">Methodology Notes</h2>
<p>All simulation scores in this leaderboard are drawn from:</p>
<ol>
<li>Published arXiv papers with documented evaluation protocols</li>
<li>Official technical reports from model developers (GR00T N1, pi0/pi0.5)</li>
<li>Benchmark papers themselves where recent models were included in evaluation</li>
</ol>
<p>I have not published scores from demo videos, product launch blog posts, or uncited secondary sources. Where numbers are estimated from paper figures rather than exact table entries, they are marked with ~. Company-reported real-robot numbers are included with clear source attribution and the label &quot;Company-reported; no third-party audit&quot; where appropriate. &quot;Not published&quot; means no technical disclosure of any kind is available, not that I couldn't find the number.</p>
<p>Zero-shot vs. fine-tuned scores are explicitly separated because the difference is practically significant. A model that scores 3.5 on CALVIN after fine-tuning on CALVIN training data is demonstrating task-specific adaptation, not general manipulation competence. Rankings in each table are ordered by the primary eval metric, with fine-tuned results listed before zero-shot.</p>
<hr>
<h2 id="caveats-and-limitations">Caveats and Limitations</h2>
<p><strong>Benchmark diversity problem.</strong> CALVIN, LIBERO, SimplerEnv, and RoboCasa are all single-arm desktop manipulation benchmarks. None of them test bimanual manipulation, mobile manipulation, navigation-with-interaction, or contact-rich tasks like assembly. The field's best-documented benchmarks cover a narrow slice of what embodied AI needs to do. Humanoid robot demos show tasks that no standardized benchmark currently measures.</p>
<p><strong>Fine-tuning inflates simulation scores.</strong> All the top simulation scores in this leaderboard involve models fine-tuned on benchmark-specific training data. This is not cheating - it is the standard experimental setup in the field - but it means that comparing a fine-tuned score to a zero-shot score from a different paper is essentially meaningless. The table separates these cases explicitly.</p>
<p><strong>Real-robot evaluations are not reproducible.</strong> A researcher at another institution cannot replicate the Figure Helix or Tesla Optimus results. Physical robot evaluations require the specific hardware, the specific environment setup, and cooperation from the company that ran them. Independent verification is structurally impossible for most real-robot claims in this space.</p>
<p><strong>Benchmark contamination is possible.</strong> Several of these benchmarks have been public for 2-4 years. Any model trained on large web scrapes of robotics literature and code may have encountered benchmark task descriptions, solution approaches, or even specific object arrangements during pre-training. This is particularly a concern for language-conditioned tasks where the task description itself is a natural language string that could appear in training data.</p>
<p><strong>Physics sim and real-world physics diverge.</strong> Soft objects, deformable materials, and tasks involving liquids are poorly simulated by MuJoCo and Isaac Sim. RoboCasa explicitly avoids these task categories. CALVIN's physical objects are rigid. The benchmarks in this leaderboard systematically undertest the task categories that are hardest in the real world.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://arxiv.org/abs/2307.15818">RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control - arXiv 2307.15818</a></li>
<li><a href="https://github.com/google-deepmind/open_x_embodiment">Open X-Embodiment: Robotic Learning Datasets and RT-X Models - Google DeepMind GitHub</a></li>
<li><a href="https://arxiv.org/abs/2112.03227">CALVIN: A Benchmark for Language-Conditioned Policy Learning - arXiv 2112.03227</a></li>
<li><a href="https://github.com/mees/calvin">CALVIN benchmark repository - GitHub</a></li>
<li><a href="https://github.com/Lifelong-Robot-Learning/LIBERO">LIBERO benchmark repository - GitHub</a></li>
<li><a href="https://simpler-env.github.io">SimplerEnv project page</a></li>
<li><a href="https://arxiv.org/abs/2405.05941">Evaluating Real-World Robot Manipulation Policies in Simulation - arXiv 2405.05941</a></li>
<li><a href="https://github.com/simpler-env/SimplerEnv">SimplerEnv repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2406.02523">RoboCasa: Large-Scale Simulation of Everyday Tasks - arXiv 2406.02523</a></li>
<li><a href="https://github.com/robocasa/robocasa">RoboCasa repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2403.12945">DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset - arXiv 2403.12945</a></li>
<li><a href="https://droid-dataset.github.io">DROID project page</a></li>
<li><a href="https://github.com/droid-dataset/droid">DROID repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2410.24164">pi0: A Vision-Language-Action Flow Model for General Robot Control - arXiv 2410.24164</a></li>
<li><a href="https://arxiv.org/abs/2501.09747">pi0.5 paper - arXiv 2501.09747</a></li>
<li><a href="https://github.com/Physical-Intelligence/openpi">Physical Intelligence openpi repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2406.09246">OpenVLA: An Open-Source Vision-Language-Action Model - arXiv 2406.09246</a></li>
<li><a href="https://arxiv.org/abs/2405.12213">Octo: An Open-Source Generalist Robot Policy - arXiv 2405.12213</a></li>
<li><a href="https://github.com/octo-models/octo">Octo repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2503.14734">GR00T N1: An Open Foundation Model for Generalist Humanoid Robots - arXiv 2503.14734</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/robotics-embodied-ai-leaderboard_hu_7b7215f689f1d951.jpg" medium="image" width="1200" height="1200"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/robotics-embodied-ai-leaderboard_hu_7b7215f689f1d951.jpg" width="1200" height="1200"/></item></channel></rss>