Physical Intelligence Launches π0.7 for Untrained Tasks

Physical Intelligence on April 16 unveiled π0.7, a generalist robot model that can tackle tasks it was never directly trained on and, by the company's account, match the performance of specialist models that were. It's a result the robotics field has been chasing for years, and one that most researchers thought was still a few model generations away.

Key Specs - π0.7

Spec	Value
Model type	Vision-Language-Action (VLA) generalist
Core claim	Compositional generalization across unseen tasks
Conditioning inputs	Language, visual subgoals, metadata, control labels
Training data sources	Robot episodes, human demos, web VLM pretraining, RL policies
Demonstrated tasks	Laundry folding, espresso prep, box assembly, vegetable peeling, appliance operation
Cross-embodiment	UR5e industrial arm, bimanual desktop robots
Specialist parity	Matches RL-tuned π*0.6 across all tested tasks

What π0.7 Actually Does

A generalist that rivals specialists

The headline result is that a single π0.7 model, with no task-specific fine-tuning, can match the success rates of Physical Intelligence's own π*0.6 specialist models. Those specialists were trained with Recap, a dedicated reinforcement learning algorithm designed to maximize performance on individual tasks.

That matters because the standard path in robot learning has always been: train a new model (or fine-tune heavily) for every new task. Swapping out a laundry-folding robot for an espresso-making one isn't just a software update - it's a whole new training pipeline. π0.7 is an attempt to change that.

Co-founder Sergey Levine described the shift this way: "Once it crosses that threshold where it goes from only doing exactly the trained tasks to actually remixing things in new ways, the capabilities are going up more than linearly." That's the same description researchers used for early large language models, where benchmarks didn't just see incremental gains but were effectively solved outright.

The air fryer experiment

The clearest demonstration of what the company means by "compositional generalization" is the air fryer test. π0.7 had exactly two relevant training episodes involving that appliance: one showing a robot pushing an air fryer closed, and one showing a robot placing a bottle inside. Neither episode involved cooking.

Given the instruction "cook a sweet potato," with no additional coaching, the model attempted the task and failed. Given step-by-step verbal instructions ("open the fryer, place the potato, set the timer"), it succeeded - synthesizing fragments from tangential training data into functional behavior. The research team later found that the model probably learned air fryer context from the open-source DROID dataset, which includes footage of humans interacting with appliances. The robot never saw the specific task but absorbed enough about the appliance class to generalize.

Image: π0.7 navigating a kitchen environment during the air fryer demonstration. A fisheye view from inside the robot's perception stack as it navigates the kitchen for the air fryer cooking experiment. Source: pi.website

Architecture: Three Layers Working Together

The policy stack

π0.7 is a Vision-Language-Action model built from three components that operate at different levels of abstraction:

High-level policy: creates language subgoals describing what the robot should achieve next
World model: produces visual subgoals at inference time - literally predicting what the scene should look like after the next action
Action expert: takes the multimodal context (camera frames, language instructions, predicted subgoals) and outputs low-level robot control commands

The world model step is what makes compositional generalization plausible. Instead of mapping raw instructions directly to motor commands, the model first imagines what success looks like, then plans toward that image. This creates a natural bottleneck where learned skills from different domains can be mixed and matched through the visual prediction space.
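The three-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not Physical Intelligence's implementation: every class, function, and field name is hypothetical, and the stubs simply stand in for learned models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    camera_frames: List[str]  # placeholders for image tensors
    instruction: str          # the operator's language instruction


@dataclass
class Context:
    observation: Observation
    language_subgoal: str     # output of the high-level policy
    visual_subgoal: str       # output of the world model


def high_level_policy(obs: Observation) -> str:
    """Stub: choose the next language subgoal for the current instruction."""
    return f"next step toward: {obs.instruction}"


def world_model(obs: Observation, language_subgoal: str) -> str:
    """Stub: predict what the scene should look like once the subgoal is met."""
    return f"predicted image after '{language_subgoal}'"


def action_expert(ctx: Context, action_dim: int = 7) -> List[float]:
    """Stub: map the multimodal context to a low-level command vector."""
    return [0.0] * action_dim


def control_step(obs: Observation) -> List[float]:
    """One pass through the stack: subgoal, imagined outcome, then action."""
    subgoal = high_level_policy(obs)
    goal_image = world_model(obs, subgoal)
    ctx = Context(obs, subgoal, goal_image)
    return action_expert(ctx)


obs = Observation(camera_frames=["wrist_cam", "base_cam"], instruction="open the fryer")
action = control_step(obs)
```

The point of the sketch is the ordering: the action expert never sees the raw instruction alone, only the instruction together with a language subgoal and an imagined target image.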

Training with varied conditioning

The training methodology is built around what the team calls "diverse multimodal conditioning" - as opposed to naive data merging. Rather than throwing all robot episodes into a single training run with a fixed instruction format, π0.7 was trained with varied prompt structures:

Language instructions describing tasks and subtasks at different granularities
Metadata flags specifying execution speed or quality level
Control modality labels showing whether the robot should use joint control or end-effector control
Visual subgoal images showing desired intermediate states

The underlying data combined several sources: teleoperated robot episodes, human demonstrations captured with a different setup, web-scale vision-language pretraining (the same recipe that powers today's VLMs), and synthetic data created by running existing RL policies autonomously. The diversity of conditioning formats is what allows the model to be steered at inference time through plain language rather than requiring task-specific prompts.
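One plausible way to get varied prompt structures is to randomly keep or drop each conditioning field when building a training sample. The sketch below assumes exactly that. The field names, the `keep_prob` parameter, and the fallback rule are all illustrative guesses, not details from Physical Intelligence.

```python
import random
from typing import Dict, Optional

# Hypothetical names for the four conditioning types listed above.
CONDITIONING_FIELDS = ("language", "metadata", "control_mode", "visual_subgoal")


def build_prompt(episode: Dict[str, Optional[str]], rng: random.Random,
                 keep_prob: float = 0.5) -> Dict[str, str]:
    """Keep a random subset of the conditioning fields an episode provides."""
    prompt: Dict[str, str] = {}
    for field in CONDITIONING_FIELDS:
        value = episode.get(field)
        if value is not None and rng.random() < keep_prob:
            prompt[field] = value
    # Fall back to the language instruction if nothing else was kept.
    if not prompt and episode.get("language") is not None:
        prompt["language"] = episode["language"]
    return prompt


episode = {
    "language": "fold the t-shirt",
    "metadata": "speed=fast",
    "control_mode": "end_effector",
    "visual_subgoal": None,  # this episode has no subgoal image
}
rng = random.Random(0)
prompts = [build_prompt(episode, rng) for _ in range(5)]
```

Under this scheme the same episode shows up in training with several different prompt shapes, which is one way a model could learn to accept whatever subset of conditioning it is given at inference time.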

Benchmark Comparison

Physical Intelligence compared π0.7 against its own RL-tuned specialist models across the core task set. External benchmarks for generalist robot policies don't yet exist in any standardized form, so these numbers are self-reported.

Task	π0.7 (generalist)	π*0.6 specialist (RL-tuned)
Laundry folding (t-shirts, shorts)	Matches	Baseline
Espresso preparation	Matches	Baseline
Box assembly	Matches	Baseline
Vegetable peeling	Matches	Baseline
Air fryer (unseen, coached)	Succeeds	N/A
Cross-embodiment (UR5e)	Matches human teleop	Not evaluated

The cross-embodiment result deserves closer reading: for the UR5e laundry-folding transfer, the comparison group was human teleoperators attempting the task on the target robot for the first time, subjects with about 375 hours of teleoperation experience. π0.7 matched their first-attempt success rates. That's a useful baseline precisely because it anchors the claim in observable human performance rather than a competing model's number.

Image: π0.7 in a different physical setup during multi-task evaluation. The robot's on-board perception during a separate task evaluation in a non-kitchen environment. Source: pi.website

Cross-Embodiment Transfer

One of the more surprising results is the cross-embodiment transfer: moving the laundry-folding capability from a small static bimanual desktop robot to a large UR5e industrial arm with markedly different morphology - different joint counts, different workspace geometry, different gripper design.

The model didn't require retraining for this. The cross-embodiment generalization comes from the control modality labels in the training data, which teach the model to reason about robot morphology as a conditioning variable rather than a fixed property. If the model has seen enough variations of arm kinematics during training, a new embodiment becomes one more conditioning input rather than a reason to start over.
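To make "morphology as a conditioning variable" concrete, here is a rough sketch in which one stub policy serves two robots and only the embodiment descriptor changes. All names are hypothetical. The UR5e's six joints are a known property of that arm, but the arm counts and the desktop robot's joint count are placeholder assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Embodiment:
    name: str
    num_arms: int
    dof_per_arm: int
    control_mode: str  # "joint" or "end_effector"


def action_dim(e: Embodiment) -> int:
    """Command vector size: per-arm control values plus one gripper value per arm."""
    if e.control_mode == "joint":
        per_arm = e.dof_per_arm
    elif e.control_mode == "end_effector":
        per_arm = 6  # 3 position + 3 orientation values
    else:
        raise ValueError(f"unknown control mode: {e.control_mode}")
    return e.num_arms * (per_arm + 1)


def policy(instruction: str, e: Embodiment) -> List[float]:
    """Stub policy: the embodiment is just another input, not a separate model."""
    return [0.0] * action_dim(e)


# Placeholder descriptors; only the UR5e joint count is a known fact.
desktop = Embodiment("bimanual desktop", num_arms=2, dof_per_arm=6, control_mode="end_effector")
ur5e = Embodiment("UR5e", num_arms=1, dof_per_arm=6, control_mode="joint")

desktop_action = policy("fold the t-shirt", desktop)
ur5e_action = policy("fold the t-shirt", ur5e)
```

The same `policy` function handles both robots. What differs is the descriptor it is conditioned on and, as a consequence, the shape of the command it emits.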

Physical Intelligence has been documenting this arc for a while. Their earlier investment round at an $11B valuation in March was partly premised on exactly this kind of result - that the jump from task-specific to general-purpose robot policies was closer than the market believed. π0.7 is the first internal proof point they've released publicly.

The timing also puts Physical Intelligence directly in competition with Google DeepMind, whose Gemini Robotics-ER 1.6 launched in April for Boston Dynamics' Spot platform with a focus on embodied visual reasoning. Those are different bets - Google is pushing integration with commercial robot hardware, while Physical Intelligence is still building the foundation model layer. And Hugging Face's open-source LeRobot v0.5.0 remains the accessible alternative for researchers who want to experiment without proprietary APIs.

What To Watch

What the demos don't prove

The key limitation is one Physical Intelligence states clearly: π0.7 can't execute multi-step tasks autonomously from a single high-level instruction. "Make toast" doesn't work. "Open the toaster, place the bread, press the lever" does. The coaching step is essential. The model has generalized skill components, not autonomous task planning.

That's a real gap. The practical value of a generalist robot in most commercial settings depends on reducing the amount of per-task human setup. If every new task still requires someone to decompose it into coached substeps, a lot of the generalization value moves from the robot to the operator.
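The difference between coached and uncoached operation can be sketched as a simple loop in which the operator, not the robot, supplies the decomposition. The stub below is purely illustrative: `run_coached`, `stub_execute`, and `KNOWN_STEPS` are hypothetical names, and the stub "robot" succeeds only on the concrete steps quoted above.

```python
from typing import Callable, List


def run_coached(execute_step: Callable[[str], bool], steps: List[str]) -> List[str]:
    """Issue sub-instructions one at a time and stop at the first failure."""
    completed: List[str] = []
    for step in steps:
        if not execute_step(step):
            break
        completed.append(step)
    return completed


# Hypothetical stand-in for the robot: it only handles short, concrete steps.
KNOWN_STEPS = {"open the toaster", "place the bread", "press the lever"}


def stub_execute(step: str) -> bool:
    return step in KNOWN_STEPS


uncoached = run_coached(stub_execute, ["make toast"])
coached = run_coached(
    stub_execute,
    ["open the toaster", "place the bread", "press the lever"],
)
```

In this toy version the list of steps is the part a human still has to write, which is exactly the per-task setup cost the paragraph above describes.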

Benchmark opacity

The comparison data is self-reported, compared against Physical Intelligence's own prior models, with no external replication. The research community still lacks standardized robotics benchmarks that would let third parties verify claims like "matches specialist performance." Until those exist, every result from any robotics lab - Physical Intelligence, Google DeepMind, NVIDIA - should carry a methodological asterisk. We've seen similar dynamics with NVIDIA's Alpamayo model, where strong internal numbers didn't fully translate to external testing conditions.
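One reason "matches" needs an asterisk is sample size. A minimal sketch of how a third party might check such a claim is to put a Wilson score interval around each success rate and see whether the intervals overlap. The trial counts below are invented purely for illustration; Physical Intelligence has not published figures in this form.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented counts for illustration only: 18/20 versus 19/20 successes.
generalist = wilson_interval(18, 20)
specialist = wilson_interval(19, 20)
indistinguishable = intervals_overlap(generalist, specialist)
```

With only 20 trials per model, both intervals are wide and overlap heavily. Overlap means the data cannot tell the two apart at this sample size, which is weaker than showing they perform the same.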

What's actually novel

Skepticism aside, the compositional generalization result is genuinely new if it holds up. Previous generalist robot models required fine-tuning to match specialist performance. π0.7's claim is that the gap has closed without any task-specific fine-tuning. That's a meaningful threshold to cross.

The real test will come when external researchers get access to the model weights (Physical Intelligence hasn't shared a release timeline) and when the evaluation methodology gets stress-tested against tasks that sit further from the training distribution. The air fryer worked because DROID had relevant tangential data. The next test is something truly outside the training set.

The next public test for Physical Intelligence is whether the model generalizes to tasks for which no tangential data exists at all. The company hasn't answered that question yet, and the broader field will be watching closely.
