Privileged Sensing Scaffolds Reinforcement Learning

University of Pennsylvania

TLDR: We use privileged sensors to improve RL.

Given a task and a target observation space, RL trains a policy to solve the task from those observations. Our method, Scaffolder, enhances RL training by utilizing privileged sensors available only during training, and it trains performant policies across 10 challenging tasks (shown below).

Summary

We need to look at our shoelaces when we first learn to tie them, but having mastered the skill, we can do it by touch alone. We call this phenomenon "sensory scaffolding": observation streams that are not needed by a master might yet aid a novice learner. We consider such sensory scaffolding setups for training artificial agents.

For these settings, we propose Scaffolder (shown above), a reinforcement learning approach that effectively exploits privileged sensing in critics, world models, reward estimators, and other auxiliary components that are used only at training time, to improve the target policy.
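To make the recipe concrete, here is a minimal asymmetric actor-critic sketch in PyTorch (our own simplification, not the paper's implementation; all dimensions and network sizes are hypothetical). The critic consumes privileged observations during training, while the actor receives only the target observations, so the privileged sensors can be unplugged at deployment:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
TARGET_DIM, PRIV_DIM, ACT_DIM = 32, 64, 8

# The actor sees only the target observations available at test time.
actor = nn.Sequential(
    nn.Linear(TARGET_DIM, 256), nn.Tanh(),
    nn.Linear(256, ACT_DIM), nn.Tanh(),
)

# The critic additionally consumes privileged observations. It is used
# only during training, so deployment never needs the privileged sensors.
critic = nn.Sequential(
    nn.Linear(TARGET_DIM + PRIV_DIM + ACT_DIM, 256), nn.Tanh(),
    nn.Linear(256, 1),
)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def update(obs, priv, act, rew, next_obs, next_priv, gamma=0.99):
    # Critic: one-step TD regression, conditioned on privileged inputs.
    with torch.no_grad():
        next_q = critic(torch.cat([next_obs, next_priv, actor(next_obs)], -1))
        target = rew + gamma * next_q
    q = critic(torch.cat([obs, priv, act], -1))
    critic_loss = (q - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize the privileged critic's value while receiving
    # only the target observations as input.
    actor_loss = -critic(torch.cat([obs, priv, actor(obs)], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

Scaffolder applies this asymmetry well beyond the critic (world models, reward estimators, exploration, and representation learning), but the same principle holds: privileged inputs shape training signals, never the deployed policy's inputs.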

To evaluate sensory scaffolding agents, we design a new "S3" suite of ten diverse simulated robotic tasks that explore a wide range of practical sensor setups (shown above). Agents must use privileged camera sensing to train blind hurdlers, privileged active visual perception to help robot arms overcome visual occlusions, privileged touch sensors to train robot hands, and more. Scaffolder easily outperforms relevant prior baselines and frequently performs comparably even to policies that have test-time access to the privileged sensors.

Experiments

The Sensory Scaffolding Suite (S3) consists of 10 distinct tasks and sensor configurations, described below.

Blind Pick
Task: The Fetch robot arm must pick up a randomly initialized block on the table with only joint and touch sensing.
Privileged Sensors: Camera that sees the table.
Target Sensors: Joints, Touch

Blind Locomotion
Task: The Half Cheetah must run while overcoming randomly placed, randomly sized hurdles.
Privileged Sensors: Camera that sees the cheetah and nearby hurdles.
Target Sensors: Joints

Blind Deaf Pianist
Task: The agent must play "Twinkle Twinkle Little Star" with two 30-DoF Shadowhands and only joint sensors.
Privileged Sensors: Future notes (simulating sheet music) and piano key presses (simulating audio).
Target Sensors: Joints

Blind Numb Pen
Task: The Shadowhand must dexterously manipulate a randomly initialized pen into a random goal orientation with only proprioception and initial conditions.
Privileged Sensors: Pen Pose, Touch sensors (contacts visualized in red).
Target Sensors: Joints, Initial/Goal Pen Pose

Blind Numb Cube
Task: The Shadowhand must dexterously manipulate a randomly initialized cube into a random goal orientation with only proprioception and initial conditions.
Privileged Sensors: Cube Pose, Touch sensors (contacts visualized in red).
Target Sensors: Joints, Initial/Goal Cube Pose

Noisy Monkey
Task: The monkey must swing from tree branch to tree branch using noisy estimates of joint and branch positions.
Privileged Sensors: Ground-truth Joint and Branch Position sensors.
Target Sensors: Noisy Joint and Branch Position sensors

Wrist Pick Place
Task: The Fetch robot arm must pick up a randomly initialized block and place it into a randomly initialized bin using a wrist camera with a limited field of view. This studies whether active perception policies benefit from privileged optimal viewpoints.
Privileged Sensors: Two fixed cameras, one that sees the block and one that sees the bin.
Target Sensors: Wrist Camera (limited FoV), Touch, Joints

Occluded Pick Place
Task: The Fetch robot arm must pick up a randomly initialized block and place it into a randomly initialized bin using a camera suffering from occlusion. This studies whether occluded policies benefit from privileged active-perception cameras.
Privileged Sensors: Wrist camera that can move to see occluded objects.
Target Sensors: Occluded Camera (the block is hidden behind a shelf; see GIF), Touch, Joints

RGB Pen
Task: The Shadowhand must dexterously manipulate a randomly initialized pen into a random goal orientation using a camera.
Privileged Sensors: Object Pose, Touch
Target Sensors: Camera, Joints, Initial/Goal Pen Pose

RGB Cube
Task: The Shadowhand must dexterously manipulate a randomly initialized cube into a random goal orientation using a camera.
Privileged Sensors: Object Pose, Touch
Target Sensors: Camera, Joints, Initial/Goal Cube Pose

Finding 1: Privileged sensors support skill learning.

We evaluate Scaffolder and other baselines that exploit privileged information as follows: we use the final score of DreamerV3 trained on the target observations as the lower bound (0.0), and the final score of DreamerV3 trained on the privileged observations as the upper bound (1.0). Scaffolder has the highest aggregate median performance across all 10 tasks.
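In code, this normalization is just a linear rescaling (a sketch; the names are ours):

```python
def normalized_score(score, target_score, privileged_score):
    """0.0 = final score of DreamerV3 with target observations only;
    1.0 = final score of DreamerV3 trained with privileged observations."""
    return (score - target_score) / (privileged_score - target_score)
```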

Scaffolder bridges 79% of the gap between target and privileged observations, just by having temporary access to privileged sensors at training time. In other words, much of the gap between privileged and target observations may lie not in whether they support performing the same behaviors, but in whether they support learning them.

This result suggests a more nuanced approach to RL training and sensor design. When setting up a new RL environment, users should consider how sensors affect learning, not just final performance. Sensors once thought necessary for execution may be required only for learning. Conversely, sensors that are clearly not required for execution may still be useful for learning.

Finding 2: Privileged sensors improve learning through multiple routes.

We study the various routes through which privileged sensing influences learning in Scaffolder by replacing each privileged component with a non-privileged counterpart to assess component-wise contributions. Dropping components generally hurts performance, and the contribution of each component is task dependent. For example, Blind Pick poses a difficult exploration problem, so removing privileged exploration ("No Scaff. Explore") hurts the most, while in RGB Cube, privileged representation learning is important for encoding high-dimensional images, so removing it ("No Scaff. Repr.") hurts the most.
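A rough sketch of how such an ablation sweep can be organized (the component names are ours, loosely matching the "No Scaff. X" labels above; this is not the authors' code):

```python
# Each privileged training component can independently be swapped for a
# counterpart that sees only the target observations.
FULL_CONFIG = {
    "critic": "privileged",          # value estimation
    "world_model": "privileged",     # dynamics learning
    "reward_estimator": "privileged",
    "exploration": "privileged",     # "No Scaff. Explore" sets this to "target"
    "representation": "privileged",  # "No Scaff. Repr." sets this to "target"
}

def ablation(component):
    config = dict(FULL_CONFIG)
    config[component] = "target"
    return config

# e.g., ablation("exploration") isolates the contribution of privileged
# exploration by removing only that route and retraining.
```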

In all these cases, the full Scaffolder benefits from cohesively integrating these many routes through which privileged sensing influences policy learning, and it performs best.

Finding 3: Scaffolder discovers interesting behaviors.

We highlight several interesting behaviors learned by Scaffolder and the baselines. Depending on the task, Scaffolder's behaviors broadly fall into two categories: information-gathering strategies and robust strategies.

Information Gathering Strategies

Scaffolder Episode

DreamerV3+BC Episode

Guided Observability Episode

In Blind Pick, the robot must pick up randomly initialized blocks with only touch and proprioception sensors. Only Scaffolder solves this task. Visualizing the 3D position of the gripper over an episode reveals a spiral pattern in the trajectory. Spiral search is a well-known strategy in robotics for efficiently finding points in a space without prior knowledge, so it is interesting to see this strategy emerge naturally through RL. All other baselines fail: the DreamerV3+BC and Guided Observability episodes visualized above fail to find and pick up the block.
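For intuition, here is a minimal sketch of the classic spiral-search pattern that the learned gripper trajectory resembles (an Archimedean spiral over table-plane coordinates; purely illustrative, not the learned policy):

```python
import numpy as np

def spiral_waypoints(center, spacing=0.02, n_turns=5, pts_per_turn=40):
    """Archimedean spiral r = a * theta: each full turn moves `spacing`
    meters farther from the center, sweeping the plane without any
    prior knowledge of the block's location."""
    theta = np.linspace(0.0, 2 * np.pi * n_turns, n_turns * pts_per_turn)
    r = spacing * theta / (2 * np.pi)
    return np.stack([center[0] + r * np.cos(theta),
                     center[1] + r * np.sin(theta)], axis=-1)

waypoints = spiral_waypoints(center=(0.0, 0.0))  # sweep outward from a guess
```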

Robust Strategies

At other times, Scaffolder acquires robust behaviors that are invariant to unobserved factors. In Blind Locomotion, it discovers run-and-jump maneuvers that minimize collisions with the unseen, randomized hurdles, and it recovers quickly after collisions. Baselines move slowly and are less robust to collisions.

Scaffolder Episode

Scaffolder moves quickly, frequently high-jumping to minimize collisions, and recovers quickly after collisions.

Guided Observability Episode

The baseline Guided Observability moves quickly in unobstructed areas, but is frequently stopped by hurdles.

DreamerV3 Episode

DreamerV3 moves slowly throughout and frequently gets stuck.

Dexterous Manipulation

Now, we showcase behavior for the remaining tasks. Five of the ten S3 tasks involve dexterous manipulation: Blind Deaf Pianist, Blind Numb Pen, Blind Numb Cube, RGB Pen, and RGB Cube.

Blind Deaf Pianist

Scaffolder Episode

Scaffolder plays an imperfect but recognizable rendition of "Twinkle Twinkle Little Star".

DreamerV3 Episode

DreamerV3's rendition is unrecognizable.

Expert pianists can play songs from memory while blindfolded and unable to hear. Here, the agent must play “Twinkle Twinkle Little Star” given only proprioception during deployment. During training, the policy has access to future notes, piano key presses, and suggested fingerings, emulating vision to read sheet music and hearing to determine which keys were pressed.

Blind Numb Object Manipulation

Next, we evaluate whether agents can rotate cubes and pens to arbitrary goal poses given only proprioception and the initial and goal object poses. During training, we grant access to privileged touch sensors and object pose.

Scaffolder

Scaffolder

Informed Dreamer

Dreamer

Scaffolder quickly achieves the desired object orientation and maintains stability. Baselines fare worse; they display instability: the cube and pen tend to slip and slide on the hand.

Visual Object Manipulation

Next, the Visual Object Manipulation tasks (RGB Pen and RGB Cube) extend the Blind Numb Object Manipulation setup by adding a top-down camera to the target sensor set. The visual target policy must rotate the object to a goal pose.

Scaffolder

Scaffolder

Guided Observability

Informed Dreamer

In the visual setting, we find that Scaffolder policies adeptly rotate the objects, while baselines are more unstable or completely fail.

Active Perception

Occluded Pick Place

We examine the impact of active perception as privileged sensing at training time. The target policy must use an RGB camera with an occluded viewpoint, alongside proprioception and touch sensing, to pick up a block behind a shelf and place it into a bin. Both the block and bin locations are randomly initialized.

Scaffolder

Scaffolder quickly locates the block with touch sensing and places it on the goal.

DreamerV3+BC

DreamerV3+BC locates the block but fails to fully pick it up and even attempts to go to the goal location long after dropping it.

Guided Observability

Guided Observability locates the block but fails to fully place it on the goal on its first attempt, as its picking is less robust.

Wrist Pick Place

Next, we investigate whether privileged, fixed, well-placed cameras can improve the training of an active perception policy. Here, a proprioceptive policy with a wrist camera must pick up a randomly positioned block and place it into a randomly positioned bin. Because the wrist camera has a limited field of view, the robot must perform active perception to find the block and complete the task.

Scaffolder (Wrist Cam)

From a noisy wrist camera with a limited field of view, Scaffolder learns to find the block, pick it up, and place it into the bin.

Scaffolder (Privileged Cam)

The privileged camera shows where the block and bin are randomly initialized.

DreamerV3 (Wrist Cam)

DreamerV3, with access only to the wrist camera, learns only to pick up the block and fails to place it.