LIV: Language-Image Representations and Rewards for Robotic Control

ICML 2023

University of Pennsylvania, FAIR (Meta AI)

Abstract

How can we pre-train and fine-tune a multi-modal representation suitable for language-conditioned robot manipulation and reward specification?

We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. Exploiting a novel connection between dual reinforcement learning and mutual information contrastive learning, the LIV objective is simple and effective: it learns a multi-modal representation that implicitly encodes a goal-conditioned value function, with goals expressible in either modality. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen, using no action information. This pre-trained LIV model can perform zero-shot language-conditioned reward specification on unseen human and robot videos alike. Then, with access to target-domain data, the very same objective consistently improves both the pre-trained LIV model and other pre-existing vision-language representations, yielding better language-conditioned reward specification and robotic control. On two simulated and one real-world robot environment that evaluate vision-language representations and rewards, LIV pre-trained and fine-tuned models consistently outperform the best prior approaches, establishing the advantages of joint vision-language representation and reward learning within LIV's unified, compact framework.

Video (Short Teaser)

Video

LIV Zero-Shot Multi-Modal Reward

After pre-training on text-annotated human videos from EpicKitchen, LIV can produce multi-modal dense rewards with respect to both image and language goals on unseen human and robot videos.

Unseen Human Videos

We first visualize pre-trained LIV's multi-modal values on unseen EpicKitchen videos; for each video, we use its last frame as the image goal and the provided text annotation as the language goal. The y-axis plots the negative value, so the more negative the curve, the higher the value computed between the current frame and the provided goal.
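To make the plotted quantity concrete, here is a rough sketch assuming a CLIP-style similarity-based value (the precise definition is given in the paper), with \phi and \psi denoting the image and text encoders:

V(o_t; g) \;\approx\; \mathrm{sim}\big(\phi(o_t),\, e_g\big),
\qquad
e_g \;=\;
\begin{cases}
\phi(g) & \text{image goal (last frame)} \\
\psi(g) & \text{language goal (text annotation)}
\end{cases}

Each curve then plots -V(o_t; g) over the frames o_t of the video.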

Unseen Robot Videos

We also test LIV's zero-shot rewards on unseen robot videos from WHIRL. LIV rewards are able to generalize to unseen robot domains and embodiments without ever being trained on them.

LIV Reward for Action Recognition

As a corollary of LIV's zero-shot language-reward capability, when a video contains actions that induce opposite object state changes (e.g., opening and closing a dishwasher), LIV's rewards can accurately identify when these actions have occurred: the opposing action appears as an inverted progression in the middle of the reward curve.
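As an illustrative sketch (not part of the released code), the turning point of a reward curve computed against a single language goal gives a rough boundary between the two opposing actions; the rewards array and the smoothing window below are hypothetical.

import numpy as np

def action_boundary(rewards: np.ndarray, smooth: int = 5) -> int:
    """Return the frame index where a per-frame LIV reward curve for a single
    language goal (e.g., "close dishwasher") turns around. Before that frame the
    curve trends toward the goal; after it, the opposing action (e.g., opening)
    reverses the trend. `rewards` and the smoothing window are hypothetical."""
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(rewards, kernel, mode="same")
    # The extremum marks the transition; use argmin instead if the plotted
    # curve follows the inverted sign convention (negative values, as above).
    return int(np.argmax(smoothed))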

How can I generate LIV reward curves on my own videos?

This is quite easy! Try out our example code to generate LIV multi-modal reward curves on your own videos!
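For reference, below is a minimal sketch of what such a script might look like. The loader name load_liv, the modality keyword, the cosine-similarity value, and the preprocessing are assumptions about the released API; the repository's example code is the authoritative reference.

# Minimal sketch: compute LIV reward curves for one video against an image goal
# (its last frame) and a language goal. The loader name `load_liv`, the
# `modality` keyword, and the preprocessing used here are assumptions about the
# released API; please follow the repository's example code for exact usage.
import clip                       # used here only for text tokenization
import imageio.v3 as iio
import numpy as np
import torch
import torchvision.transforms as T

from liv import load_liv          # assumed entry point of the released package

device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_liv()
model.eval()

preprocess = T.Compose([T.ToTensor(), T.Resize((224, 224), antialias=True)])

frames = iio.imread("my_video.mp4")                        # (T, H, W, 3) uint8
images = torch.stack([preprocess(f) for f in frames]).to(device)
text = clip.tokenize(["open the dishwasher"]).to(device)   # your language goal

with torch.no_grad():
    img_emb = model(input=images, modality="vision")        # (T, D) frame embeddings
    goal_img_emb = img_emb[-1:]                              # last frame as the image goal
    goal_txt_emb = model(input=text, modality="text")        # (1, D) language-goal embedding

    # Per-frame values w.r.t. each goal; cosine similarity is assumed here.
    value_img = torch.nn.functional.cosine_similarity(img_emb, goal_img_emb)
    value_txt = torch.nn.functional.cosine_similarity(img_emb, goal_txt_emb)

# The curves shown above plot the negative value per frame.
np.savez(
    "reward_curves.npz",
    image_goal=(-value_img).cpu().numpy(),
    language_goal=(-value_txt).cpu().numpy(),
)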

Failure Cases

Due to inherent language-grounding and visual domain gaps, LIV's zero-shot rewards do not always work on unseen robot domains and tasks. Later, we show how the LIV objective itself is a very effective fine-tuner, enabling pre-trained LIV as well as other VLMs (e.g., CLIP) to overcome the distribution shift and learn structured representations that preserve both semantic alignment and temporal coherence.

LIV as Representation for Language-Conditioned BC

We use LIV's frozen multi-modal representation as the backbone for language-conditioned behavior cloning (LCBC) and achieve impressive performance (46% success rate, ~30% better in absolute terms than the second-best baseline) on a challenging real-world multi-task environment with a wide initial state distribution (visualized below).
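To make "frozen multi-modal backbone" concrete, here is an illustrative sketch of a language-conditioned BC head on top of frozen LIV embeddings; the embedding size, MLP widths, and action dimension are placeholders, not the paper's exact architecture.

import torch
import torch.nn as nn

class LCBCPolicy(nn.Module):
    """Illustrative LCBC head: concatenate frozen LIV image and text embeddings
    and regress continuous robot actions with an MLP. All sizes are placeholders."""

    def __init__(self, emb_dim: int = 1024, act_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

# Training-step sketch: the LIV encoders stay frozen; only the policy head is updated.
# `liv_encode_image` / `liv_encode_text` stand in for the frozen LIV model.
def bc_loss(policy, liv_encode_image, liv_encode_text, obs, instr, actions):
    with torch.no_grad():
        img_emb = liv_encode_image(obs)    # (B, emb_dim)
        txt_emb = liv_encode_text(instr)   # (B, emb_dim)
    pred = policy(img_emb, txt_emb)
    return nn.functional.mse_loss(pred, actions)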

Environment Description

The real-world environment consists of a tabletop toy kitchen setup, in which a Franka robot is tasked with placing various fruits (apple, pear, pineapple) into various containers (tray, black pot, green pot) in the scene, given a sentence task description (e.g., "apple in black pot"). For each collected demonstration and evaluation episode, the fruits are randomly placed in the center of the workspace.
Below, we visualize all successful LIV-LCBC episodes by task.

Pineapple in Black Pot (8/10)

Pineapple in Green Pot (6/10)

Pineapple in Tray (5/10)

Apple in Black Pot (4/10)

Apple in Green Pot (3/10)

Apple in Tray (5/10)

Pear in Black Pot (4/10)

Pear in Green Pot (3/10)

Pear in Tray (3/10)

Zero-Shot Generalization to Long-Horizon Composite Tasks

LIV-LCBC is able to consistently achieve partial successes (i.e., solving at least 2 out of 3 tasks in the specified sequence) on long-horizon, composite tasks that require solving the atomic tasks in a specified order. We note that the RealRobot dataset contains only demonstrations for the short-horizon, atomic tasks, and the demonstrations are never collected in configurations where some fruits have already been placed into containers. As such, solving more than one task strictly requires the policy to generalize to unseen tabletop configurations, since success on an earlier task changes the scene into a novel configuration for the later tasks.

LIV Few-Shot Fine-Tuning

The pre-trained LIV necessarily faces language-grounding and visual domain gaps on out-of-distribution target domains, so its rewards do not always transfer. We show that the LIV objective is also an effective fine-tuning algorithm that can improve both the semantic alignment and the temporal smoothness of pre-trained vision-language representations in the downstream robot domain. Below, we demonstrate how LIV fine-tuning improves both the quality of LIV rewards and the LCBC policy trained with the LIV backbone.
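At a high level, fine-tuning simply re-applies the same objective to text-annotated videos from the target robot domain. Below is a rough sketch of that loop; the data loader and the liv_objective helper (standing in for the LIV loss, whose exact form is given in the paper) are placeholders.

import torch

def finetune_liv(model, loader, liv_objective, epochs=10, lr=1e-5):
    """Sketch of LIV fine-tuning on in-domain data.

    `loader` yields batches of (frames, text) from the target robot domain:
    `frames` is a (B, T, C, H, W) clip tensor and `text` a list of task strings.
    `liv_objective` is a placeholder for the LIV loss, which combines a
    VIP-style temporal value term with an image-text alignment term."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, text in loader:
            loss = liv_objective(model, frames, text)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model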

LIV Multi-Modal Reward Comparison (Pre-Trained vs. Fine-Tuned)

In the first row, we present the average reward curves on all FrankaKitchen tasks for the pre-trained LIV model (left) and successive LIV checkpoints during fine-tuning. As fine-tuning progresses, the model gradually improves in temporal smoothness while better aligning the two modalities. The next few rows qualitatively compare the reward curves on task demonstrations in our real-world dataset.

LIV (Pre-Trained)

LIV (LIV Fine-Tuned)

LIV LCBC Comparison (Pre-Trained vs. Fine-Tuned)

LIV (Pre-Trained)

LIV (LIV Fine-Tuned)

BibTeX


      @article{ma2023liv,
        title={LIV: Language-Image Representations and Rewards for Robotic Control},
        author={Ma, Yecheng Jason and Liang, William and Som, Vaidehi and Kumar, Vikash and Zhang, Amy and Bastani, Osbert and Jayaraman, Dinesh},
        journal={arXiv preprint arXiv:2306.00958},
        year={2023}
      }