Real World Reinforcement Learning of Active Perception Behaviors

* Equal contribution 1University of Pennsylvania 2University of Liège 3UC Berkeley

TL;DR: We propose a simple robot learning recipe leveraging privileged information to train active perception policies on real robots.


Method

Top row: The policy receives the partial observation. Bottom row: Privileged observations or state, available only during training, are given to the critic networks to estimate the advantage. The advantage estimates are used as weights in the loss, providing privileged supervision to the policy.

Our method, Asymmetric Advantage Weighted Regression (AAWR), uses privileged information to learn high-quality value functions. The value functions are used to compute advantage estimates, which are used as weights in the weighted BC loss, leading to better policy extraction in POMDPs compared to BC and unprivileged AWR. AAWR inherits the versatility of AWR, which can be applied to both offline data (demonstrations, play data) and online data using the same update.
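
To make the update concrete, below is a minimal sketch of the AAWR policy-extraction step in Python, assuming a torch-style policy with a log_prob method and privileged Q/V critics that have already been trained; the names (critic_q, critic_v, beta, max_weight) are illustrative rather than the exact interface used in the paper.

```python
import torch

def aawr_policy_loss(policy, critic_q, critic_v, obs_partial, obs_priv, actions,
                     beta=1.0, max_weight=20.0):
    """Advantage-weighted BC loss with an asymmetric (privileged) critic.

    The critics see the privileged observation; the policy only ever sees the
    partial observation, so the same loss applies at deployment time.
    """
    with torch.no_grad():
        # Privileged advantage estimate A(s, a) = Q(s_priv, a) - V(s_priv).
        advantage = critic_q(obs_priv, actions) - critic_v(obs_priv)
        # Exponentiated, clipped advantages become per-sample BC weights.
        weights = torch.clamp(torch.exp(advantage / beta), max=max_weight)

    # Weighted behavior cloning on the partial (deployable) observation.
    log_prob = policy.log_prob(obs_partial, actions)
    return -(weights * log_prob).mean()
```

Because the update is just weighted behavior cloning, the same function can be applied to offline demonstrations and to online rollouts without modification.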

Experiment 1: VLA Active Perception

Task Setup


Goal: Search tasks still pose a challenge for VLAs, so we propose learning an active perception “helper” policy that searches for a target object under clutter and occlusion, then hands off to a generalist VLA policy (π0) for grasping.

Active Perception Scenes


Bookshelf-Pineapple/Duck: Find the target pineapple or duck, placed on a bookshelf with three shelves.

Shelf-Cabinet: adds cabinet/drawer hiding spots.

Complex: adds an extra bottom shelf that's heavily occluded from side views.

Observation Space

Deployment Sensors: Wrist Camera RGB + Proprioception + Occupancy Grid.

Privileged Sensors: Object detector bounding boxes and segmentation masks.
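
For concreteness, the sketch below lays out the two observation streams as dictionaries; the key names and array shapes are hypothetical, since the page only lists the sensor types.

```python
import numpy as np

# Partial observation: everything available at deployment time.
partial_obs = {
    "wrist_rgb": np.zeros((224, 224, 3), dtype=np.uint8),   # wrist-camera RGB frame
    "proprio":   np.zeros(7, dtype=np.float32),             # joint proprioception
    "occupancy": np.zeros((32, 32, 32), dtype=np.float32),  # occupancy grid of the scene
}

# Privileged observation: partial observation plus detector outputs,
# available only during training and only to the critics.
privileged_obs = {
    **partial_obs,
    "target_bbox": np.zeros(4, dtype=np.float32),            # detector bounding box
    "target_mask": np.zeros((224, 224), dtype=np.uint8),     # target segmentation mask
}
```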

Reward Curve
Example reward trajectory during a search episode. Note the sharp peak at episode termination when the object enters the target region; this shape provides a high, sparse reward for advantage weighting.
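
The reward itself can be sketched as a sparse, terminal bonus; the function below is illustrative only, assuming the detector bounding box is given in normalized image coordinates and that a central image region counts as the handoff target.

```python
def in_target_region(bbox, region=(0.3, 0.3, 0.7, 0.7)):
    """Check whether the detected bounding-box center lies inside a central image region."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]

def search_reward(bbox, done, bonus=10.0):
    """High, sparse bonus at termination once the target enters the handoff region."""
    if done and bbox is not None and in_target_region(bbox):
        return bonus
    return 0.0
```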

AAWR learns long-horizon search behaviors

AAWR (3x)

AAWR scans the scenes more efficiently than the baselines. We observe that the baselines tend to drift into out-of-distribution joint configurations or execute inefficient motions (e.g. looking up toward the ceiling or straight down at the ground), which hurts their long-horizon performance.

Here we show footage from the left camera (not used by the robot) to illustrate how AAWR smoothly scans the scene without drifting or colliding with the environment.

AWR: Drifts away, but π0 is luckily still able to grasp the target.
BC: Collides with the cabinet door.
π0: Immediately collides with the environment and cannot recover.

AAWR learns to fixate

An important characteristic of active perception policies is fixation: the camera keeps moving while holding the target clearly in view. This maximizes the success rate of the generalist policy, which benefits from a clear viewpoint of the object.

As shown below, the privileged information from object segmentation helps the policy recognize the target pineapple toy and stay on it without drifting away.

Fixates when it discovers the target off to the side, and reorients toward it.
Fixates when it discovers the target object in the drawer.
With handover disabled, AAWR fixates on the target and keeps moving closer.

See the difference between AAWR and the baselines in the VGGT reconstruction section below.

Handover to Generalist Policy

Here we show the handover to the generalist policy after the active perception policy has found the target.
Complex Search
Shelf Search
Drawer Search
Duck Search

3D Scene Reconstructions from Wrist Videos

To visualize the exploration capabilities of different wrist-camera policies, we use the Visual Geometry Grounded Transformer (VGGT) to reconstruct the 3D scene from their trajectories. The visualizations show that AAWR covers the scene more quickly and extensively than the baselines, and fixates on the target objects.

Long horizon • AAWR

Offline / Online Data Analysis

We visualize the offline demonstration data and online robot rollouts. Use the controls to switch between offline and online rollouts, scenes, and objects: offline rollouts are selected by scene and object, online rollouts by algorithm (Exhaustive also uses the scene). Options are greyed out when no trajectory is available for that combination.

3D trajectory visualization (image)

Experiment 2: Learning Blind Pick through Online RL

Here, we explore the real-world online RL capabilities of AAWR. The task is to pick up objects given just the initial object position estimate. Normally, manipulation policies are given the object pose every timestep, but here we only have a single estimate at the beginning of the episode.

At the start of the episode, the policy is given the initial object pose. As a result, the policy must learn to be precise in its initial grasp and avoid perturbing the object.

Training Details

Koch task illustration

Task: Real-world “Blind Pick” on a Koch robot: pick up objects using only the initial object position.

Sensors / data regime:
Partial obs (policy input): joint positions + initial object position
Privileged obs (training only): object position estimate at every timestep
Reward: dense
Demos: 100 suboptimal demonstrations
Offline steps: 20K; Online steps: 1.2K
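
For reference, this regime could be captured by a small configuration object like the sketch below; the field names are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass

@dataclass
class BlindPickConfig:
    # Policy input (partial observation).
    partial_obs: tuple = ("joint_positions", "initial_object_position")
    # Critic-only input during training (privileged observation).
    privileged_obs: tuple = ("object_position_every_timestep",)
    reward: str = "dense"
    num_demos: int = 100          # suboptimal demonstrations
    offline_steps: int = 20_000   # offline AAWR updates
    online_steps: int = 1_200     # online transitions (~50 episodes)
```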

Results

AAWR achieves substantially higher pick success rates than both BC and AWR, in both the offline and online training regimes, on the Koch blind-pick task. With just 1,200 online transitions (~50 episodes), online AAWR improves over offline AAWR by nearly 20 percentage points.

We find that online RL reduces jerkiness and improves grasping accuracy compared to the offline policy.

Method            | Pick %
BC                | 41
Off. AWR          | 62
On. AWR           | 55
Off. AAWR (ours)  | 71
On. AAWR (ours)   | 89
AAWR
AWR
BC

Simulation Experiments

We selected 3 different tasks with varied sensor setups and degrees of partial observability to benchmark AAWR.

Camouflage Pick (Sim)

Hard because the object is barely visible; the privileged true object position helps the critics.

Fully Obs Pick (Sim)

Even in fully observable scenarios, AAWR can still help because the privileged critics don't need to learn object localization from pixels.

Active Perception Koch (Sim)

Only AAWR reaches 100% success through online RL, by learning to scan the workspace effectively; baselines such as distillation fall short.

Q&A

Please read the paper for more details, and see full rollout videos on our appendix page.