TL;DR: We propose a simple robot learning recipe leveraging privileged information to train active perception policies on real robots.
Abstract
Our method, Asymmetric Advantage Weighted Regression (AAWR), uses privileged information to learn high-quality value functions. The value functions are used to compute advantage estimates, which are used as weights in the weighted BC loss, leading to better policy extraction in POMDPs compared to BC and unprivileged AWR. AAWR inherits the versatility of AWR, which can be applied to both offline data (demonstrations, play data) and online data using the same update.
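For concreteness, here is a minimal sketch of how such an asymmetric update could look. The network interfaces (`actor.log_prob`, `critic`), the return targets, and the temperature are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: privileged critic V(s_priv) + partially observed actor pi(a | o).
import torch
import torch.nn.functional as F

def aawr_update(actor, critic, batch, beta=1.0):
    o, s_priv = batch["obs"], batch["priv_state"]
    a, ret = batch["action"], batch["return"]

    # 1) Fit the value function on the privileged state (regression to returns).
    v = critic(s_priv).squeeze(-1)
    critic_loss = F.mse_loss(v, ret)

    # 2) Advantage estimates from the privileged critic, turned into
    #    exponentiated weights (clamped for numerical stability).
    with torch.no_grad():
        adv = ret - v.detach()
        w = torch.exp(adv / beta).clamp(max=20.0)

    # 3) Advantage-weighted behavior cloning on *unprivileged* observations,
    #    so the deployed policy never depends on privileged inputs.
    actor_loss = -(w * actor.log_prob(o, a)).mean()
    return critic_loss, actor_loss
```

The same weighted-BC update can be applied to offline demonstrations or to freshly collected online rollouts, only the data source changes.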
Goal: Search tasks still pose a challenge for VLAs, so we propose to learn an active perception “helper” policy that searches for a target object under clutter/occlusion, then hands off to a generalist VLA policy (π0) for grasping.
Bookshelf-Pineapple/Duck: Find the target pineapple or duck, placed on a bookshelf with three shelves.
Shelf-Cabinet: adds cabinet/drawer hiding spots.
Complex: adds an extra bottom shelf that's heavily occluded from side views.
Deployment Sensors: Wrist Camera RGB + Proprioception + Occupancy Grid.
Privileged Sensors: Object detector bounding boxes and segmentation masks (a rough actor/critic split is sketched below).
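As an illustration of this split (key names are hypothetical), the actor could consume only the deployment sensors, while the critic additionally receives the detector outputs:

```python
# Hypothetical observation split for the search task: the actor sees only
# deployment sensors; the critic additionally receives privileged detector outputs.
def split_observations(raw: dict):
    policy_obs = {
        "wrist_rgb": raw["wrist_rgb"],       # wrist camera RGB
        "proprio": raw["joint_positions"],   # proprioception
        "occupancy": raw["occupancy_grid"],  # accumulated occupancy grid
    }
    privileged_obs = {
        **policy_obs,
        "target_bbox": raw["detector_bbox"],  # object detector bounding box
        "target_mask": raw["detector_mask"],  # target segmentation mask
    }
    return policy_obs, privileged_obs
```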
AAWR scans the scenes more efficiently than the baselines. We notice that baselines tend to drift into out-of-distribution joint configurations or perform inefficient motions (e.g., looking up toward the ceiling or straight down at the ground), which decreases their long-horizon performance.
Here we show the left camera (not used by the robot) to illustrate how AAWR smoothly scans the scene without drifting or colliding with the environment.
An important characteristic of active perception policies is fixation: the camera moves while keeping the target clearly in view. This maximizes the success rate of the generalist policy, which benefits from a clear viewpoint of the object.
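One simple way to quantify fixation (our illustration, not necessarily the paper's metric) is the fraction of frames in which the target's segmentation-mask centroid stays near the image center:

```python
import numpy as np

def fixation_rate(target_masks, margin=0.25):
    """Fraction of frames whose target mask centroid lies within a central window
    of the image. `target_masks` is a list of HxW boolean arrays (an empty mask
    means the target is not visible). Purely illustrative."""
    hits = 0
    for mask in target_masks:
        if not mask.any():
            continue
        h, w = mask.shape
        ys, xs = np.nonzero(mask)
        cy, cx = ys.mean() / h, xs.mean() / w
        if abs(cy - 0.5) < margin and abs(cx - 0.5) < margin:
            hits += 1
    return hits / max(len(target_masks), 1)
```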
As shown below, the privileged information from object segmentation helps the policy recognize the target pineapple toy without drifting away.
See the difference between AAWR and the baselines in the VGGT Reconstruction section.
To visualize the exploration capabilities of different wrist camera policies, we use the Visual Geometry Grounded Transformer (VGGT) to reconstruct the 3D scene from their trajectories. The visualizations show that AAWR covers the scene more quickly and extensively than the baselines, and fixates on the target objects.
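As a rough way to quantify this difference (an illustration on our side, not the paper's metric), one could voxelize each reconstructed point cloud and compare how many voxels each policy covers over the rollout:

```python
import numpy as np

def voxel_coverage(points: np.ndarray, voxel_size: float = 0.05) -> int:
    """Number of occupied voxels in a reconstructed (N, 3) point cloud.
    Higher coverage earlier in the rollout indicates faster, more extensive scanning."""
    voxels = np.unique(np.floor(points / voxel_size).astype(np.int64), axis=0)
    return len(voxels)
```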
Long horizon • AAWR
We visualize the offline demonstration data and online robot rollouts. Use the controls to switch between offline/online rollouts, scenes, and objects. Offline: select scene & object. Online: select algorithm; Exhaustive also uses the scene. Greyed-out options indicate that no trajectory is available for that combination.
Here, we explore the real-world online RL capabilities of AAWR. The task is to pick up objects given just an initial estimate of the object position. Normally, manipulation policies are given the object pose at every timestep, but here the policy only receives a single estimate at the beginning of the episode.
Task: Real-world “Blind Pick” on Koch robot: pick objects using just initial object position.
Sensors / data regime:
- Partial obs (policy input): joint positions + initial object position
- Privileged obs (training only): object position estimate at every timestep
- Reward: dense
- Demos: 100 suboptimal demonstrations
- Offline steps: 20K; Online steps: 1.2K
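Summarized as a config (field names are ours, for illustration only):

```python
# Illustrative summary of the Koch blind-pick regime; field names are ours.
blind_pick_config = {
    "policy_obs": ["joint_positions", "initial_object_position"],  # partial obs
    "privileged_obs": ["object_position_per_step"],                # critic only, training time
    "reward": "dense",
    "num_demos": 100,        # suboptimal demonstrations
    "offline_steps": 20_000,
    "online_steps": 1_200,   # ~50 episodes
}
```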
AAWR achieves substantially higher pick success rates than both BC and AWR, in both the offline and online training regimes on the Koch blind pick task. With just 1200 online transitions (~50 episodes), online AAWR improves by nearly 20% over offline AAWR.
We find that online RL reduces jerkiness and increases grasping accuracy compared to the offline policy.
| Method | Pick % |
|---|---|
| BC | 41 |
| Off. AWR | 62 |
| On. AWR | 55 |
| Off. AAWR (ours) | 71 |
| On. AAWR (ours) | 89 |
We selected 3 different tasks with varied sensor setups and degrees of partial observability to benchmark AAWR.
Hard because the object is barely visible; the privileged true object position helps the critics.
Even in fully observable scenarios, AAWR can still help because privileged critics don't need to learn object localization from pixels.
Only AAWR reaches 100% performance through online RL by learning to scan the workspace effectively, outperforming baselines such as distillation.
Please read the paper for more details, and see full rollout videos on our appendix page.