Evaluating \( \pi_0 \) in the Wild:
Strengths, Problems, and the Future of Generalist Robot Policies

GRASP Lab, University of Pennsylvania
Robotics, particularly manipulation, has never had trained models that work out of the box on new objects, locations, and tasks. Roboticists have had the unsatisfying experience of going through tedious engineering and data collection to acquire robot policies, only to find that even small environmental changes break those policies. One promising direction is to train generalist models on large datasets, in the hope that they will produce sensible behavior in new situations, reducing the burden on the end user. The last year has been exciting because the first wave of models is beginning to suggest that this dream of generalist robots is possible. So, when Physical Intelligence made their models public, we were keen to try them out ourselves, and we came away largely impressed and excited about the possibilities as these models continue to improve.

\( \pi_0 \) commanded to fold a newspaper, having never seen this particular workspace before.

Our evaluations were conducted using the π₀-FAST-DROID model, which is fine-tuned on the DROID robot setup: a Franka Panda robot with side and wrist cameras. We found it refreshingly easy to set up the platform for policy inference; no camera or controller calibration, or workspace-specific tuning, was required. To output actions, the model requires a prompt from the user describing the task, along with images from the wrist and side cameras (see video above).
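For readers curious what a single policy query looks like, below is a minimal sketch of sending a prompt plus camera images to the inference server. It follows our reading of the openpi remote-inference example; the client class and observation keys are assumptions that may differ across openpi versions, and the zero arrays stand in for real camera frames and robot state.

```python
import numpy as np
from openpi_client import websocket_client_policy

# Connect to the GPU server running pi0-FAST-DROID (host/port are our own setup).
policy = websocket_client_policy.WebsocketClientPolicy(host="10.0.0.1", port=8000)

observation = {
    # 224x224 RGB frames; in practice these come from the ZED side and wrist cameras.
    "observation/exterior_image_1_left": np.zeros((224, 224, 3), dtype=np.uint8),
    "observation/wrist_image_left": np.zeros((224, 224, 3), dtype=np.uint8),
    "observation/joint_position": np.zeros(7),    # Franka joint angles (rad)
    "observation/gripper_position": np.zeros(1),  # gripper opening
    "prompt": "place the yellow fish into the purple box",
}

# The server returns a chunk of future actions; the DROID controller executes it.
action_chunk = policy.infer(observation)["actions"]
print(action_chunk.shape)
```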
Images: GRASP Lab setup, robot workspace, and Levine setup.

Our evaluation setup showing the robot workspace, test objects, and typical manipulation scenes.

Our experiments took place in a kitchen environment (see images above). The kitchen has a diverse selection of objects, backgrounds, and lighting conditions, making it ideal for generating a wide variety of tasks.

Evaluating robot policies is difficult because it is hard to come up with a selection of tasks that cover the wide range of behaviors an arbitrary user would find useful.

We take inspiration from the NLP community by adopting their "vibe-checking" approach. Vibe-checking involves the user directly evaluating the LLM by chatting about whatever topic comes to mind, rather than relying on a standard benchmark. Similarly, we subject \( \pi_0 \) to "vibe checks": unstructured real-world tasks generated by the end user. We improvise tasks, alter camera angles, rearrange objects, and try to think of edge cases to stress-test the model.

Word cloud visualization
Word cloud of the task instructions used in our evaluations.

We conducted over 300 trials of \( \pi_0 \) on various manipulation tasks. It is important to stress that our evaluations were conducted to satisfy our own curiosity about the capabilities of the model (e.g. how does \( \pi_0 \) handle articulated objects, or occluded viewpoints), and do not provide a comprehensive evaluation of the model's capabilities. We summarize our findings below:

  1. Strong Prior for Sensible Behaviors: \( \pi_0 \) produces sensible behaviors across a wide variety of our tasks, although it is important to note that sensible behaviors are often insufficient for task completion.
  2. Prompt Engineering Matters: Although \( \pi_0 \) can produce reasonable actions for many prompts and camera viewpoints, we observed that its success rate on the same task can fluctuate dramatically when the phrasing or viewpoint changes. To achieve consistent performance, use canonical prompts (verb + object) and select camera angles that clearly show the target object.
  3. Unexpected Quirks: \( \pi_0 \) can recover from failures, and handle moving humans in the scene, but it struggles with mid-task freezing, collision avoidance, and fine-grained manipulation.

Below, we elaborate on our findings in detail, going through success and failure behavior of \( \pi_0 \), as well as some additional interesting phenomena we discovered.

3. Where We Were Impressed With \( \pi_0 \)

Below we showcase six cherry-picked rollouts that highlight \( \pi_0 \)'s strengths. We found two impressive points: (1) \( \pi_0 \) has good vision-language understanding, and (2) it can imitate sequential behaviors across scenes. We discuss both in detail in the rest of this section.

Pick and Place

"Place the yellow fish into the purple box"

Precise placement of the camouflaged fish into the box

Articulation

"Open the drawer"

Opens drawer with multiple pulls to make sure it is fully open

Human Robot Interaction

"Hand the pineapple to the programmer"

Safe object handover to the programmer, even with wire occlusion from the side view

Dexterity

"Pour water from the silver cup to the pink bowl"

Pours real water from the latte-art pitcher into the target bowl

Multi-step Task

"Pick up all the objects into the basket"

Sequential placement of all the toys into the basket

Novel Objects

"Close the capsule lid of the coffee machine"

Handles a previously unseen device

3.1 Robust Vision-Language Understanding in Complex Scenes

Built on PaliGemma, Google DeepMind's 3B VLM (Beyer et al., 2024), as its vision-language backbone, \( \pi_0 \) demonstrates robust scene comprehension and adaptability. Despite relying solely on uncalibrated monocular RGB inputs (224x224 pixels after downsampling), it can handle very challenging objects and environments, including transparent or camouflaged items, and items it has not seen during training.

1. It can grasp transparent objects

\( \pi_0 \) is capable of identifying and manipulating transparent objects, as shown below. It picks up the bottle with a stable grasp, aligns it with the small cup, and precisely drops it in. Many traditional grasp detection techniques require an accurate 2D or 3D reconstruction of the scene, and transparent objects degrade reconstruction accuracy. What makes it even more impressive is that the model detects transparent objects solely from uncalibrated, monocular RGB images.

"Place the plastic bottle into the white cup."

"Place the plastic bottle into the bowl."

2. It can grasp an object even when it is camouflaged against a colorful background

\( \pi_0 \) can identify the 'yellow fish' here even when it is placed on top of a colorful board game. The object has an unusual and difficult shape, and it blends in well with the background, but \( \pi_0 \) detects it well enough to pick it up.

"Place the fish into the red box"

"Place the fish into the purple box"

3. It is robust to human activity in the input

During evaluation, there were many times when the side-view camera captured humans moving around in the background. Nevertheless, \( \pi_0 \) consistently stayed focused on its task, keeping the robotic arm's movements directed at object manipulation.

We believe there are two reasons for \( \pi_0 \)'s robustness to human movement. First, the pre-trained VLM backbone of \( \pi_0 \) is trained on images involving humans (Sharma et al., 2018; Changpinyo et al., 2022; Kuznetsova et al., 2020), so humans are in-distribution. Second, as our occlusion experiments in Section 5 (Quirk 2) show, the policy seems to prioritize the wrist camera's images during pick-and-place tasks, so distractors in the side-view camera minimally affect the policy.

Here are two side-view videos involving humans in the scene. Please refer to Appendix B.6 for more experiments on human-robot interaction.

(All videos featuring humans were uploaded with permission from the individuals involved.)

"Pick the pineapple and place it into the basket"

"Hand the pineapple to the outstretched hand"

Many existing works in computer vision and robotics focus specifically on transparent object detection and manipulation. The appealing part here is that an end-to-end, data-driven system handles this without any special logic or care for transparent objects.

"\( \pi_0 \)'s ability to handle transparency, clutter and distractors hints at a future where robots see the world as humans do—through semantics, not just pixels."

3.2 \( \pi_0 \) can imitate behaviors step by step

A human can easily imitate the robot's behavior in the videos above, because the behavior is sequential and each step depends only on the current state. For most of the history of robotics, this has not been true of learned policies. Traditional behavior-cloning models may memorize one exact path, but changing the scene or starting from a different height can cause the model to fail, because variance in the data can lead the robot to learn undesirable behaviors such as collisions. Through our experiments, we observed that \( \pi_0 \) exhibits similar behavior patterns across a wide range of manipulation tasks. Although it is an autoregressive model without any memory or history, \( \pi_0 \) often executes tasks step by step, for example:

Reach → Grasp → Transfer → Release → Reset → Idle

What's remarkable is that this pattern is not hard-coded into the model structure, but emerges naturally from the large volume of demonstration data — suggesting that \( \pi_0 \) learns consistent task-execution priors across environments. For instance, even when \( \pi_0 \) is unfamiliar with an object or task, it often proactively explores near affordance-rich areas, using its wrist camera to decide whether to grasp or not.

In certain trials, we also observed reset-like behaviors: if \( \pi_0 \) perceives the task as complete (e.g., after placing an item into a bowl), it may return to its home configuration and stop. While this often indicates a well-formed task boundary, it can also lead to early stopping/freezing, especially in multi-object scenes — see Section 4.1 for analysis of early stopping failure cases.

While this sequencing might suggest that \( \pi_0 \) has learned an internal understanding of the task, we caution against such framing. These patterns may reflect properties of the data distribution (e.g., Markovian, short-horizon tasks), rather than indicating the policy has acquired explicit task inference or memory.

"Remove the pink bowl from the tray"

Reach bowl's edge → Grasp → Lift away → Release → Reset

"Stack the wooden blocks"

Reach cube's top → Grasp → Transfer → Stack → Release → Repeat

"Fold the cloth from left to right"

Reach a corner → Pinch → Pull → Release → Reset

However, this does not mean we have solved imitation learning. That \( \pi_0 \) follows subtask sequences in a sensible way is perhaps more an observation about the task family than about the algorithm at hand — many of the tested tasks may be sufficiently Markovian that a history-less policy can follow a sensible chain of subtasks. We discuss this further in Section 4.

4. Problems with \( \pi_0 \)

Failure Cases

\( \pi_0 \) demonstrates very impressive robustness across different tasks, locations, and lighting conditions. However, we have also observed some failure cases:

OOD Objects

"Pour water from teapot into bowl"

Cannot manipulate a novel glass teapot (0% success rate)

OOD Background

"Pick the black box on the white box"

Cannot handle unseen background well (0% success rate)

Task Misunderstanding

"Pour water into the pot"

Picks up the pot instead of the water bottle (20% success rate)

Spatial Reasoning

"Place the can into the tray"

Misjudges object position relative to container (30% success rate)

Articulation Task

"Close the right cabinet door"

Fails to operate the toy kitchen cabinet on the table (0% success rate)

Coffee Making

"Pour coffee bean into the grinder"

Cannot work with espresso machine (0% success rate)

4.1 Early stopping

One common failure case is that the policy may freeze unexpectedly during execution, leaving the task incomplete until a human intervenes. This behavior appears to stem from several related factors: semantic ambiguity, the lack of memory, and autoregressive action-decoding edge cases.

"Hand the pineapple to human"

Freezes in the air

"Open the drawer"

Freezes after grasping the handle

"Place the book into the bookshelf"

Holds the book horizontally in the air and moves very slowly

Possible Causes

1. The VLM part doesn't understand the instruction

Unlike large commercial chatbots, \( \pi_0 \) is built on PaliGemma, a comparatively small VLM. It therefore lacks some of the commonsense reasoning that larger LLMs can use to recognize unfamiliar object categories, and when it does not understand a command, it gets stuck. In some experiments, we found that certain objects and instructions are out of distribution (OOD), causing early stopping. To illustrate how the PaliGemma backbone performs on its own, we include a Visual Question Answering (VQA) example in Appendix C.

2. \( \pi_0 \) remembers only the now, but many tasks need a sense of before-and-after

\( \pi_0 \) is a memory-less policy: its next move depends only on the current camera images, and it never "remembers" what it did a moment ago. That works for single, snapshot actions (e.g., picking up a cup), but can fail when a task needs several coordinated steps. For example, here is an articulation task that requires multiple steps and freezes in the middle:

  • Case: "Open the drawer" → stops after grasping the handle.
  • Behavior: \( \pi_0 \) reaches out, grabs the handle... and then freezes.

Why? In the training data, most frames showing a robot holding a handle are idle frames—nothing moves. \( \pi_0 \) always chooses the most common action it has seen for a given image. So when it sees "hand-on-handle," the safest bet in its experience is "do nothing."

How could we fix this? We asked folks from Physical Intelligence, and the answer is: Shake the dice a little. Instead of always taking the single most likely (arg-max) action, we can allow a bit of randomness—called "sampling with temperature." By letting \( \pi_0 \) occasionally pick the second-most likely action, it may start pulling instead of freezing, and the drawer finally slides open.
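To make the suggestion concrete, here is a minimal sketch of temperature sampling over the action-token logits. It illustrates the general technique, not the openpi implementation; the logits and token names are made up for the example.

```python
import numpy as np

def sample_action_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample one action token from decoder logits.

    temperature == 0 reproduces greedy (arg-max) decoding;
    temperature > 0 occasionally picks lower-ranked tokens,
    which can break "do nothing" loops like the drawer freeze.
    """
    if temperature <= 0.0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()            # numerical stability before exponentiation
    probs = np.exp(scaled)
    probs /= probs.sum()              # softmax over action tokens
    return int(np.random.choice(len(probs), p=probs))

# Illustrative logits for three hypothetical tokens: [idle, pull, release].
logits = np.array([2.0, 1.8, -1.0])
print(sample_action_token(logits, temperature=0.0))   # always 0 (idle)
print(sample_action_token(logits, temperature=0.7))   # sometimes 1 (pull)
```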

3. Token Decoding Edge Cases

During inference, \( \pi_0 \) occasionally throws this error: Error decoding tokens: cannot reshape array of size 79 into shape (8)

According to our discussion in GitHub Issue #373, the policy occasionally decodes mis-shaped action arrays during inference. The official \( \pi_0 \)-FAST-DROID implementation defaults to "no-motion" in these cases. Because the robot continues querying the policy, the error usually does not recur on subsequent queries, and the robot quickly recovers and continues decoding correctly shaped outputs.
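The sketch below shows the kind of fallback this describes: if the decoded values cannot be reshaped into an action chunk, return a zero "no-motion" action instead of crashing. It is our own illustration, not the openpi code; the action dimension of 8 (7 joint commands + gripper) and the function name are assumptions.

```python
import numpy as np

ACTION_DIM = 8  # assumed per-step action dimension: 7 joint commands + 1 gripper

def decode_with_fallback(decoded_values: np.ndarray) -> np.ndarray:
    """Reshape decoded values into a (num_steps, ACTION_DIM) action chunk.

    If the decoder emitted a count that is not a multiple of ACTION_DIM
    (e.g. 79 values), fall back to a single zero "no-motion" step so the
    robot holds still; the next policy query usually decodes correctly.
    """
    try:
        return decoded_values.reshape(-1, ACTION_DIM)
    except ValueError as err:
        print(f"Error decoding tokens: {err}")
        return np.zeros((1, ACTION_DIM))
```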

Warning: Don't kill the \( \pi_0 \) Inference Server when early stops occur. Be careful as the robot may continue moving before the server restarts!

Drawer articulation visualization

A visualization of Robot Joint and Gripper states during early stops

4.2 Imprecise spatial reasoning

\( \pi_0 \) often struggles with spatial reasoning about height. For example, when asked to pick up an object and place it into a container, the policy frequently fails to lift the object high enough to clear the rim of the container. This suggests one drawback of image-based policies: the policy has no metrically accurate way to determine the distance between the gripper and the surrounding environment.

As shown here, the robot seems to think the gripper is high enough, and so it pushes the destination container when it tries to place the object into it. Existing methods that use monocular RGB images can estimate an object's size and its height relative to the bowl, so in principle the model should be able to learn that raising the object relative to the container would let it complete the task successfully.

"Place the plastic bottle into the pink bowl" - \( \pi_0 \) fails to lift the bottle high enough and always it collides with the bowl

"Touch the index finger of outstretched hand" - \( \pi_0 \) correctly touches the finger, but it uses the strong force of the gripper

We also tried prompting \( \pi_0 \) to raise the gripper higher (e.g., "raise the bottle high enough / up 10cm to avoid collision..."), but this did not help. When \( \pi_0 \) is asked to operate an articulated object, estimating distances from the side-view camera becomes even harder, causing frequent collisions. This is particularly worth noting when the robot is interacting with humans: because the policy has no safety constraints, it will sometimes accidentally hit or grasp the user's hand, which could hurt the user!

What's more, when \( \pi_0 \) is told to manipulate a household appliance it did not see during training, it tends to collide with the device or stop mid-trial. As shown below, \( \pi_0 \) cannot use the coffee machine in our lab.

"Take out the cup under the coffee machine" → collides with top of machine

"Raise the lever of coffee machine" → didn't understand where the lever is.

One possible solution is to use techniques such as voxel maps and planning constraints. Including depth information from a depth camera could also help implement collision avoidance.
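As a rough illustration of what such a depth-based safety layer could look like (this is not part of \( \pi_0 \) or openpi), the sketch below back-projects a metric depth image into 3D points, builds a coarse voxel occupancy set, and rejects planned gripper waypoints that land in occupied voxels. The camera intrinsics, voxel size, and function names are all assumptions.

```python
import numpy as np

def depth_to_points(depth_m: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project a metric depth image (H, W) into camera-frame 3D points."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)

def build_voxel_grid(points: np.ndarray, voxel_size: float = 0.02) -> set:
    """Quantize 3D points into a set of occupied voxel indices."""
    return set(map(tuple, np.floor(points / voxel_size).astype(int)))

def waypoint_in_collision(waypoint_xyz: np.ndarray, occupied: set,
                          voxel_size: float = 0.02) -> bool:
    """Check whether a planned gripper waypoint falls inside an occupied voxel."""
    return tuple(np.floor(waypoint_xyz / voxel_size).astype(int)) in occupied
```

A planner could then clip or re-plan any action-chunk waypoint flagged by such a check before sending it to the low-level controller.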

Additionally, purely image-based policies lack tactile feedback. In our trials, \( \pi_0 \) sometimes applied too much force to delicate objects like human fingers, or too little to firmly grasp heavier ones like the plastic bottle. Complementing vision with tactile sensors or with low-level force controllers could help overcome these issues.
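In the same spirit, a simple low-level force limit could be wrapped around the gripper command. The sketch below is a hypothetical force-capped closing loop; read_force_n and command_width_m stand in for whatever force-sensing and gripper interfaces a given setup exposes.

```python
from typing import Callable

def close_gripper_with_force_limit(read_force_n: Callable[[], float],
                                   command_width_m: Callable[[float], None],
                                   start_width_m: float,
                                   max_force_n: float = 5.0,
                                   step_m: float = 0.002) -> float:
    """Close the gripper in small steps until a grip-force threshold is reached.

    Stops early once the measured force exceeds max_force_n, so delicate
    objects (or a person's finger) are not squeezed at full gripper force.
    Returns the final commanded width in meters.
    """
    width = start_width_m
    while width > 0.0 and read_force_n() < max_force_n:
        width = max(0.0, width - step_m)  # close a little further
        command_width_m(width)
    return width
```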

5. Quirk: Some interesting behaviors of \( \pi_0 \)

Quirk 1: Prompt Engineering matters

We investigated how variations in the prompt affect the policy's behavior, and found that \( \pi_0 \)'s performance depends heavily on the instructions given by the user, leaving room for prompt engineering.

1.1 You need to tune your prompt carefully to operate the robot
\( \pi_0 \) freezes or fails when instructions contain typos, grammatical errors, or ambiguous phrasing. For example, when we tried to get \( \pi_0 \) to manipulate articulated objects, we often needed to try several different prompts to find 'in-distribution' instructions.
  • "Close the toilet": 0% success rate. Wanders aimlessly, unable to localize the target.
  • "Close the white lid of the toilet": 100% success rate. Always closes the toy toilet.

Close the white lid for the toilet (Success)

Close the toilet (Failure)

1.2 \( \pi_0 \)'s behavior without language goals

When given no specific language instruction, \( \pi_0 \) defaults to interacting with the most familiar objects from its training data:

  • Given nonsense text like "dgbfzjkfhjilawhdfkAWHDKLWHADFiQAWFHqawipfjcasklfmdc", it picks up marker pens
  • Given "xxx", it reaches for blocks repeatedly

In the DROID dataset, marker pens comprise 16.67% of objects, which could influence \( \pi_0 \) to pick up a pen when it is given only visual guidance. Default behaviors are heavily influenced by the training data distribution. Overcoming this ambiguity and rejecting invalid instructions remains an open problem.

dgbfzjkfhjilawhdfkAWHDKLWHADFiQAWFHqawipfjcasklfmdc
(nonsense, always pick up pen)

xxx
(nonsense, the policy moves back and forth)

Quirk 2: How robust is \( \pi_0 \) under partial observability?

One of the most frequently asked questions about \( \pi_0 \) is: how robust is it when its visual inputs are disrupted? We ran several tests blocking the cameras and the target object.

Camera Blocking Experiments

Setup:

  • Task: "Pick up the pink object and place it into the bowl."
  • Cameras: Side-view (primary) + wrist-mounted (secondary).
  • Blocking Scenarios: Partial/full occlusion of one or both cameras.
  • Trials: 4 per scenario, 300 rollout steps per trial.

Block the left camera: \( \pi_0 \) can still find the pink object.

  • No block: 100% success. Baseline: perfect execution. Picks the correct object, then explores others.
  • Side camera blocked mid-trial: 50% success. Relies on the wrist camera to pick up the object; success rate decreases.
  • Wrist camera blocked entirely: 0% success. Freezes with no recovery.
  • Both cameras blocked initially, unblocked mid-trial: 75% success. Chaotic exploration, then executes correctly after unblocking.

No Block: 100% execution.

Side Camera Blocked Mid-Trial: Robot is interrupted but can still complete the task.

Wrist Camera Blocked Entirely: Frozen.

Both Blocked, then Unblocked: Recovers.

Object Blocking Experiment

Setup:

  • Task: "Pick up the pineapple."
  • Occlusion Levels: None (fully visible), 50% occluded, 100% occluded.
  • Trials: 12 per scenario, 300 rollout steps per trial.

From three boxes: 66.67% success. \( \pi_0 \) can easily find the target object among the different boxes on the table.

Inside drawer: 25% success. The Franka arm's collisions with the drawer make it hard for \( \pi_0 \) to pick out the pineapple.

Hidden under cloth: 0% success. \( \pi_0 \) is unable to handle scenarios that require interactive exploration.

Our Observations
  • Dependence on Wrist Camera:
    • In our pick-and-place tasks, \( \pi_0 \) relies heavily on the wrist camera. It still works even if the side camera is blocked.
    • \( \pi_0 \) performs worse in the opposite case, where the wrist camera is occluded but the side camera is not.
  • Viewpoint Robustness:
    • \( \pi_0 \) can tolerate changes to the side-view camera's position and orientation during the task.
    • If the camera is blocked and then unblocked, \( \pi_0 \) can recover.
  • Common Failure Modes under Partial Observability:
    • Total occlusion of wrist camera results in the robot freezing.
    • \( \pi_0 \) is memoryless, predicting actions frame by frame autoregressively, so it can resume executing the task as soon as observations become available again.
    • However, the exploration behavior of \( \pi_0 \) is not efficient and is limited to certain areas of the scene. This makes it difficult to actively search the environment.

6. Conclusion

Our evaluations show that \( \pi_0 \) is a promising generalist policy: it demonstrates intelligent behavior in unseen manipulation scenes. However, many challenges remain, and saying we are "very impressed" is true only with the right context. Remember, we've had roughly 50 years of robotics research. Until now, you couldn't just download someone else's controller, load it onto your own robot, and expect it to do even simple things. If \( \pi_0 \) can do that, even at 20-50% success on simple tasks straight out of the box, that alone marks a major leap forward.

As discussed in the Problems and Quirks sections, our experiments reveal that performance is sensitive to prompt phrasing, and the policy still struggles with instruction following, fine-grained manipulation, and partial observability. We don't expect \( \pi_0 \) to be installed in everyone's home tomorrow, but we hope to see further advances. Let the robot go out and run; let it do reasonable things step by step. It may not finish the task every time, but it moves in the right direction. We are optimistic that continued research will address these issues and bring truly generalist robot policies closer to practical reality.

References

[1] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. (2024). π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164. [Paper] [Project Page]
[2] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, & S. Levine. (2025). FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747. [Paper] [Project Page]
[3] Physical-Intelligence. (2025). OpenPI. GitHub repository. [GitHub]
[4] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. (2024). DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945. [Paper]
[5] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. I. Jordan, J. E. Gonzalez, & I. Stoica. (2024). Chatbot Arena: An open platform for evaluating LLMs by human preference. Proceedings of the 41st International Conference on Machine Learning (ICML 2024). [Paper] [Platform]
[6] N. Lambert. (2024). ChatBotArena: The peoples' LLM evaluation, the future of evaluation, the incentives of evaluation, and gpt2chatbot. Interconnects.ai blog. [Article]
[7] P. Sharma, N. Ding, S. Goodman, & R. Soricut. (2018). Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), 2556 - 2565. [Paper]
[8] S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, & R. Soricut. (2022). All you may need for VQA are image captions. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022), 1947 - 1963. [Paper]
[9] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, & V. Ferrari. (2020). The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128, 1956 - 1981. [Paper] [Dataset]
[10] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726. [Paper]

Acknowledgments

We are grateful to Will Liang, Hungju Wang, and Sam Wang for their assistance in setting up the π0 environment. We further thank Kaustubh Sridhar, Tianyou Wang, Ian Pedroza, Ethan Yu, Tim Song, and Yiqian Li for their help doing the experiments.

We also thank Junyao Shi, Aurora Qian, Leon Kim, and Jason Ma for their insightful suggestions on evaluating a generalist manipulation policy.

We extend our appreciation to Karl Pertsch from Physical Intelligence for his constructive feedback on the early blog draft.

This project was supported in part by the National Science Foundation Graduate Research Fellowship Program under Grant No DGE-2236662. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Citation

If you find this evaluation useful for your research, please consider citing our repository:

@misc{pi0-experiment-wild,
  author = {J. Wang and M. Leonard and K. Daniilidis and D. Jayaraman and E. S. Hu},
  title = {Evaluating pi0 in the Wild: Strengths, Problems, and the Future of Generalist Robot Policies},
  year = {2025},
  publisher = {GRASP Lab, University of Pennsylvania},
  url = {https://penn-pal-lab.github.io/pi0-Experiment-in-the-Wild}
}
Robot Workspace
Our robot workspace setup used for evaluating \( \pi_0 \).

Appendix A: Our Robot & Model Setup

The following are the details of our experimental setup.

Hardware:

  • Franka Research 3 Arm: 7-DOF force-sensitive robot with a 3 kg payload.
  • Robotiq 2F-85 gripper: two-finger gripper with an 85 mm stroke and adjustable force control.
  • Cameras:
    • Side-view: ZED 2 stereo camera for global scene understanding
    • Wrist-mounted: ZED Mini for close-range object manipulation
    • Perception Mode: Pure RGB (no depth calibration)

Compute

GPU Server:

  • GPUs: 1x NVIDIA RTX A6000 (48GB VRAM)
  • CUDA Version: 12.3
  • Usage: \( \pi_0 \) model inference.

Workstation

  • GPU: NVIDIA GeForce RTX 3080 (16GB VRAM)
  • CUDA Version: 12.6
  • Usage: DROID low level control.

Model: \( \pi_0 \)-FAST-DROID

  • Vision-Language Model: PaliGemma 3B for spatial and semantic understanding.
  • FAST+: Frequency-space Action Sequence Tokenization (FAST), a universal robot action tokenizer trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences with diverse action spaces and control frequencies (see the sketch after this list for the frequency-space idea).
  • Training Data: Pretrained on the π cross-embodiment robot dataset and Open X-Embodiment, fine-tuned on the DROID dataset.
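To give a feel for the frequency-space idea behind FAST, here is a toy round-trip sketch: take the DCT of an action chunk, quantize the coefficients into integer tokens, and invert. This is only an illustration; the real tokenizer also applies per-dimension scaling and byte-pair encoding over the quantized coefficients, and the scale factor here is an arbitrary choice.

```python
import numpy as np
from scipy.fft import dct, idct

def tokenize_chunk(actions: np.ndarray, scale: float = 100.0) -> np.ndarray:
    """Quantize the DCT coefficients of an action chunk (T, D) into integer tokens."""
    coeffs = dct(actions, axis=0, norm="ortho")   # frequency-space representation
    return np.round(coeffs * scale).astype(int)   # coarse quantization -> integer tokens

def detokenize_chunk(tokens: np.ndarray, scale: float = 100.0) -> np.ndarray:
    """Invert the quantized DCT back to an approximate action chunk."""
    return idct(tokens.astype(float) / scale, axis=0, norm="ortho")

# Smooth trajectories concentrate energy in a few low-frequency coefficients,
# so most tokens are zero and the chunk compresses well.
chunk = np.linspace(0.0, 1.0, 16).reshape(-1, 1)   # toy (T=16, D=1) action chunk
tokens = tokenize_chunk(chunk)
recon = detokenize_chunk(tokens)
print(np.abs(recon - chunk).max())                 # reconstruction error from quantization
```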

Appendix B: Detailed Results for Each Task

Overall Performance

Chart showing overall performance statistics

Across our 300+ test trials, \( \pi_0 \) achieved varying degrees of progress.

For each category, we list some example rollouts and instructions below.

We observed that performance varied significantly based on task type, environmental conditions, and most importantly, the task instructions.

Left: Performance across 300+ trials, average progress 42.3%.

Task-Specific Performance

  • We divide the tasks into 7 categories:
    • Pick-and-Place
    • Pour
    • Articulated Objects
    • Fabric Manipulation
    • YCB Benchmark
    • Human Robot Interaction
    • Coffee Machine Challenge
  • For each task:
    • We list the progress score and success rate for a group of similar trials.
    • Example rollouts and instructions are provided for each category.
  • Note: This is a "vibe-check" style evaluation, not a rigorous benchmark. It can indicate some strengths and weaknesses of \( \pi_0 \), but does not reflect the full capabilities of the model.
  • If you want to see more policy rollouts, please check our recent CoRL paper: RoboArena!

1. Pick-and-Place (38.4% Progress, 24% Success)

Pick-and-place is a simple but essential task for a robot to learn.

We let \( \pi_0 \) play with over 50 different objects, most of them not toys but everyday items.

We don't expect \( \pi_0 \) to pick up every object, because some of them are simply out of scope for our Franka Panda arm, but it should show some intuition about which object to pick up. Here are some observations about \( \pi_0 \)'s performance.

Strengths:

  • Small objects (pineapple toy, pen markers) - 90% progress
  • Clear spatial targets ("in the pink bowl") - 85% progress

Weaknesses:

  • Large objects (black cube) - 35% progress
  • Vague targets (yellow fish, Med bottle) - 25% progress
  • Multi-object tasks ("Put all objects into the basket") - 42% progress

image: part of the evaluation objects

Test objects

2. Pour (52.3% Progress, 24% Success)

Pouring is an interesting task for robotics researchers, as it requires the robot to "keep the container upright while transporting it, align the spout with the target container, and then tilt the cup at the correct angle to pour the liquid" (ReKep).

Here are some observations about \( \pi_0 \)'s performance.

Pour Toy Items (73.3% Progress)

  • Pour the saucepot (empty) - 78.3% progress
  • Pour coffee bean into the bowl - 63% progress

Pour Real Items (20% Progress)

  • Pour water from teapot into the bowl (with water) - 0% progress
  • Pour water from the silver cup to the pink bowl (no liquid) - 70% progress

\( \pi_0 \) can execute pouring behavior with toy, empty, and lightweight containers, but it fails when asked to pour real liquid.

However, this could also be limited by the robot's physical capabilities: the robot's gripper is not well suited to grasping the teapot firmly.

"Pour coffee bean into the bowl"

"Pour the tea into cup."

3. Articulated Objects (37.8% Progress, 28.5% Success)

Articulated objects are common in daily life: drawers, cabinets, toilets, and so on.

We evaluated \( \pi_0 \) in our mock kitchen, which contains many full-size articulated objects in everyday use.

Drawer Manipulation

  • Open the drawer - 63% progress
  • Close the drawer - 75% progress
  • Close the drawer on table - 0% progress (toy drawer)

Cabinet Manipulation

  • Open the cabinet - 14% progress
  • Close the cabinet - 62.5% progress
  • Open the fridge - 25% progress

Toy Toilet Manipulation

  • Open the toilet lid - N/A, not feasible for robot
  • Close the toilet - 0% progress
  • Close the white lid for the toilet - 100% progress

"Open the fridge"

"Close the cabinet door."

4. Fabric Manipulation (47.0% Progress, 19.4% Success)

People have long wanted robots to do housework. Folding clothes, towels, and T-shirts demonstrates not only the dexterity of a policy but also its ability to handle an object's shape and texture.

Because \( \pi_0 \)-FAST-DROID is trained on a single Franka arm, folding fabric zero-shot is very challenging for it. We evaluated \( \pi_0 \) on some simple fabric manipulation tasks, where the robot needs to fold the fabric into a specific shape.

Fold Newspaper

  • fold the newspaper from right to left - 0% progress
  • fold up the newspaper - 62% progress (always tries to fold from left to right)

Fold Cloth

  • fold the green, square cloth diagonally - 63% progress
  • fold the cloth in hamburger / hotdog style - 0% progress
  • fold the cloth from left to right - 50% progress

Fold T-shirt (35% Progress)

  • fold up the T-shirt from right to left - 80% progress
  • finish the task of folding up the T-shirt - 0% progress

"Fold up the T-shirt in half."

"Finish the task of folding up the T-shirt"

5. YCB Benchmark (53.5% Progress, 24% Success)

We also evaluated \( \pi_0 \) on the YCB benchmark, which includes many everyday kitchen objects.

We selected the Spam can, Cheez-It box, sugar box, and mustard bottle, performing pick-and-place under similar but different initial conditions.

For each task, we performed three trials, to determine how consistent the model is. We evaluated whether the robot succeeded in placing the object in the receptacle, and wrote short notes analyzing behavior and failure reasons.

Variation: vertical position, horizontal position, different color box, different trays.

  • "Place the can into the purple box": 16.7% success rate
  • "Place the bottle into the purple box": 16.7% success rate
  • "Place the red box into the purple box": 0% success rate
  • "Place the can into the red tray": 58.3% success rate
  • "Place the can into the tray": 0% success rate

We found that \( \pi_0 \) cannot follow brand names, so we used color to identify objects.

These results show that \( \pi_0 \) on the Franka arm struggled with the YCB benchmark.


YCB Success and Failure Orientation

YCB Behavior Distribution

YCB behavior distribution: in most trials, the robot attempted a grasp but did not hold the object firmly

YCB All Objects

All objects in YCB Benchmark

6. Human Interaction (53.5% Progress, 24% Success)

We evaluated \( \pi_0 \) on some simple human interaction scenarios, where the robot needs to work with a human without hurting them.

We evaluated the robot's ability to hand over objects to a human, pick up objects, and perform precision interactions.

  • Pick up the pineapple and give it to the programmer - 62.5% progress
  • Give the whiteboard eraser to the programmer - 25% progress
  • touch the index finger of the outstretched hand - 76.7% progress
  • Shake hands with the human - 0% progress

How should we collect robotics data with humans involved? How can tasks be executed without hurting a human, and how can humans be made comfortable around the robot? These remain open questions. We tested these scenarios to see how \( \pi_0 \) performs and whether it can work safely with humans in the loop.

If we want robots to work in people's homes, we need to design policies in a way that is safe for both humans and robots.

"Give the pineapple to the programmer"

"Give the whiteboard eraser to the programmer"

7. Coffee Machine (8.00% Progress)

Operating a coffee machine is a very challenging task for robot learning: it requires common-sense reasoning to understand how to use home appliances, active perception to find the manipulable knobs and buttons, and precise control to work around a heavy object without collision.

Capsule Coffee Machine:

The capsule machine is easier because its mechanism is simpler, and \( \pi_0 \) can complete some simple tasks like closing the lid.

However, it cannot open the machine by pressing the button, and it cannot put the capsule into the machine.

  • Close the capsule lid of coffee machine - 50% progress
  • Pick up the capsule from the coffee machine - 0% progress
  • Place the capsule into the coffee machine - 0% progress

Espresso Coffee Machine:

The espresso machine is more challenging because it is more complex; even people unfamiliar with coffee cannot use it zero-shot.

We expect a model would need expert-level knowledge and tactile sensing to operate it.

  • Pick up the coffee portafilter - 0% progress
  • Pour the coffee into the cup - 0% progress
  • Pick up the silver milk frothing pitcher - 33% progress

"Place the capsule into the coffee machine"

"Close the capsule lid of the coffee machine"

"Pour the coffee bean into the grinder"