Real World Reinforcement Learning of Active Perception Behaviors

* Equal contribution 1University of Pennsylvania 2University of Liège 3UC Berkeley

TL;DR: We propose a simple robot learning recipe leveraging privileged information to train active perception policies on real robots.


Method

Top row: The policy receives the partial observation. Bottom row: Privileged observations or state, available only during training, are given to the critic networks to estimate the advantage. The advantage estimates are used as weights in the loss, providing privileged supervision to the policy.

Our method, Asymmetric Advantage Weighted Regression (AAWR), uses privileged information to learn high-quality value functions. The value functions are used to compute advantage estimates, which are used as weights in the weighted BC loss, leading to better policy extraction in POMDPs compared to BC and unprivileged AWR. AAWR inherits the versatility of AWR, which can be applied to both offline data (demonstrations, play data) and online data using the same update.
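
To make the update concrete, below is a minimal sketch of the AAWR policy-extraction step in Python, assuming a torch-style policy with a log_prob method and privileged Q/V critics that have already been trained; the names (critic_q, critic_v, beta, max_weight) are illustrative rather than the exact interface used in the paper.

```python
import torch

def aawr_policy_loss(policy, critic_q, critic_v, obs_partial, obs_priv, actions,
                     beta=1.0, max_weight=20.0):
    """Advantage-weighted BC loss with an asymmetric (privileged) critic.

    The critics see the privileged observation; the policy only ever sees the
    partial observation, so the same loss applies at deployment time.
    """
    with torch.no_grad():
        # Privileged advantage estimate A(s, a) = Q(s_priv, a) - V(s_priv).
        advantage = critic_q(obs_priv, actions) - critic_v(obs_priv)
        # Exponentiated, clipped advantages become per-sample BC weights.
        weights = torch.clamp(torch.exp(advantage / beta), max=max_weight)

    # Weighted behavior cloning on the partial (deployable) observation.
    log_prob = policy.log_prob(obs_partial, actions)
    return -(weights * log_prob).mean()
```

Because the update is just weighted behavior cloning, the same function can be applied to offline demonstrations and to online rollouts without modification.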

Experiment 1: VLA Active Perception

Task Setup


Goal: Search tasks still pose a challenge for VLAs, so we propose learning an active perception “helper” policy that searches for a target object under clutter and occlusion, then hands off to a generalist VLA policy (π0) for grasping.

Active Perception Scenes


Bookshelf-Pineapple/Duck: Find the target pineapple or duck, placed on a bookshelf with three shelves.

Shelf-Cabinet: adds cabinet/drawer hiding spots.

Complex: adds an extra bottom shelf that's heavily occluded from side views.

Observation Space

Deployment Sensors: Wrist Camera RGB + Proprioception + Occupancy Grid.

Privileged Sensors: Object detector bounding boxes and segmentation masks.
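
For concreteness, the sketch below lays out the two observation streams as dictionaries; the key names and array shapes are hypothetical, since the page only lists the sensor types.

```python
import numpy as np

# Partial observation: everything available at deployment time.
partial_obs = {
    "wrist_rgb": np.zeros((224, 224, 3), dtype=np.uint8),   # wrist-camera RGB frame
    "proprio":   np.zeros(7, dtype=np.float32),             # joint proprioception
    "occupancy": np.zeros((32, 32, 32), dtype=np.float32),  # occupancy grid of the scene
}

# Privileged observation: partial observation plus detector outputs,
# available only during training and only to the critics.
privileged_obs = {
    **partial_obs,
    "target_bbox": np.zeros(4, dtype=np.float32),            # detector bounding box
    "target_mask": np.zeros((224, 224), dtype=np.uint8),     # target segmentation mask
}
```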

Reward Curve
Example reward trajectory during a search episode. Note the sharp peak at episode termination when the object enters the target region; this shape provides a high, sparse reward for advantage weighting.
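
The reward itself can be sketched as a sparse, terminal bonus; the function below is illustrative only, assuming the detector bounding box is given in normalized image coordinates and that a central image region counts as the handoff target.

```python
def in_target_region(bbox, region=(0.3, 0.3, 0.7, 0.7)):
    """Check whether the detected bounding-box center lies inside a central image region."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]

def search_reward(bbox, done, bonus=10.0):
    """High, sparse bonus at termination once the target enters the handoff region."""
    if done and bbox is not None and in_target_region(bbox):
        return bonus
    return 0.0
```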

AAWR learns long-horizon search behaviors

AAWR (3x)

AAWR scans the scenes more efficiently than the baselines. We observe that the baselines tend to drift into out-of-distribution joint configurations or execute inefficient motions (e.g. looking up toward the ceiling or straight down at the ground), which hurts their long-horizon performance.

Here we show footage from the left camera (not used by the robot) to illustrate how AAWR smoothly scans the scene without drifting or colliding with the environment.

AWR: Drifts away, but π0 is luckily still able to grasp the target.
BC: Collides with the cabinet door.
π0: Immediately collides with the environment and cannot recover.

AAWR learns to fixate

An important characteristic of active perception policies is fixation: the camera keeps moving while holding the target clearly in view. This maximizes the success rate of the generalist policy, which benefits from a clear viewpoint of the object.

As shown below, the privileged information from object segmentation helps the policy recognize the target pineapple toy and stay on it without drifting away.

Fixates when it discovers the target off to the side, and reorients toward it.
Fixates when it discovers the target object in the drawer.
With handover disabled, AAWR fixates on the target and keeps moving closer.

See the difference between AAWR and the baselines in the VGGT reconstruction section below.

Handover to Generalist Policy

Here we show the handover to the generalist policy after the active perception policy has found the target.
Complex Search
Shelf Search
Drawer Search
Duck Search

3D Scene Reconstructions from Wrist Videos

To visualize the exploration capabilities of different wrist-camera policies, we use the Visual Geometry Grounded Transformer (VGGT) to reconstruct the 3D scene from their trajectories. The visualizations show that AAWR covers the scene more quickly and extensively than the baselines, and fixates on the target objects.

Long horizon • AAWR

Offline / Online Data Analysis

We visualize the offline demonstration data and online robot rollouts. Use the controls to switch between offline and online rollouts, scenes, and objects: offline rollouts are selected by scene and object, online rollouts by algorithm (Exhaustive also uses the scene). Options are greyed out when no trajectory is available for that combination.

3D trajectory visualization (image)

Experiment 2: Learning Blind Pick through Online RL

Here, we explore the real-world online RL capabilities of AAWR. The task is to pick up objects given just the initial object position estimate. Normally, manipulation policies are given the object pose every timestep, but here we only have a single estimate at the beginning of the episode.

At the start of the episode, the policy is given the initial object pose. As a result, the policy must learn to be precise in its initial grasp and avoid perturbing the object.

Training Details

Koch task illustration

Task: Real-world “Blind Pick” on a Koch robot: pick up objects using only the initial object position.

Sensors / data regime:
Partial obs (policy input): joint positions + initial object position
Privileged obs (training only): object position estimate at every timestep
Reward: dense
Demos: 100 suboptimal demonstrations
Offline steps: 20K; Online steps: 1.2K
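
For reference, this regime could be captured by a small configuration object like the sketch below; the field names are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass

@dataclass
class BlindPickConfig:
    # Policy input (partial observation).
    partial_obs: tuple = ("joint_positions", "initial_object_position")
    # Critic-only input during training (privileged observation).
    privileged_obs: tuple = ("object_position_every_timestep",)
    reward: str = "dense"
    num_demos: int = 100          # suboptimal demonstrations
    offline_steps: int = 20_000   # offline AAWR updates
    online_steps: int = 1_200     # online transitions (~50 episodes)
```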

Results

AAWR achieves substantially higher pick success rates than both BC and AWR, in both the offline and online training regimes, on the Koch blind-pick task. With just 1,200 online transitions (~50 episodes), online AAWR improves over offline AAWR by nearly 20 percentage points.

We find that online RL reduces jerkiness and improves grasping accuracy compared to the offline policy.

Method            | Pick %
BC                | 41
Off. AWR          | 62
On. AWR           | 55
Off. AAWR (ours)  | 71
On. AAWR (ours)   | 89
AAWR
AWR
BC

Simulation Experiments

We selected 3 different tasks with varied sensor setups and degrees of partial observability to benchmark AAWR.

Camouflage Pick (Sim)

Hard because the object is barely visible; the privileged true object position helps the critics.

Fully Obs Pick (Sim)

Even in fully observable scenarios, AAWR can still help because the privileged critics don't need to learn object localization from pixels.

Active Perception Koch (Sim)

Only AAWR reaches 100% success through online RL, by learning to scan the workspace effectively; baselines such as distillation fall short.

Q&A

Please read the paper for more details, and see full rollout videos on our appendix page.