SOAR: Autonomous Improvement of Instruction Following Skills via Foundation Models

UC Berkeley
*Equal contribution

Overview

The standard paradigm for improving instruction following policies involves a human manually collecting additional robot data, labelling it with language instructions, and then finetuning the policy on this data. Can we instead leverage the policy's pre-existing capabilities to bootstrap a self-improvement process?

We propose a particular formulation of an autonomous improvement loop, which we call SOAR, that enables self-improvement of a multi-task language-conditioned policy. The idea is to:

  1. Decouple language understanding from robotic control
  2. Use VLMs to help instantiate a complete autonomous improvement loop


SOAR first decouples a language-conditioned policy into an image-goal-conditioned policy and a language-conditioned image subgoal generator. With this formulation, any autonomously collected data can be used for learning via an entirely self-supervised algorithm: hindsight-relabeled goal-conditioned learning.
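To make hindsight relabeling concrete, here is a minimal sketch of relabeling an unlabeled autonomous trajectory into goal-conditioned training examples. The `GoalConditionedExample` container and `relabel_with_hindsight` function are illustrative names and assumptions, not the released implementation.

```python
import random
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GoalConditionedExample:
    observation: np.ndarray  # image observation at time t
    action: np.ndarray       # action the robot took at time t
    goal: np.ndarray         # image from a future timestep of the same trajectory


def relabel_with_hindsight(observations: List[np.ndarray],
                           actions: List[np.ndarray]) -> List[GoalConditionedExample]:
    """Turn an unlabeled autonomous trajectory into goal-conditioned supervision.

    Each transition is paired with an image the robot actually reached later in
    the same trajectory, so no human labels or reward functions are needed:
    whatever the policy did becomes valid data for reaching the states it
    ended up in.
    """
    examples = []
    for t in range(len(actions)):
        # Sample a hindsight goal uniformly from the future of the trajectory.
        goal_index = random.randint(t + 1, len(observations) - 1)
        examples.append(GoalConditionedExample(
            observation=observations[t],
            action=actions[t],
            goal=observations[goal_index],
        ))
    return examples


# Example: a dummy 20-step trajectory of 64x64 RGB images and 7-DoF actions.
obs = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(21)]
acts = [np.zeros(7, dtype=np.float32) for _ in range(20)]
print(len(relabel_with_hindsight(obs, acts)), "goal-conditioned examples")
```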

Instruction following can then build off of the Internet-scale knowledge stored in VLMs. VLMs can be used as task proposers to bias the policy to learn to reach semantically interesting goals, and the same VLMs can automate the success detection of autonomously collected trajectories.
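As an illustration of these two roles, the sketch below phrases task proposal and success detection as VLM queries. The `query_vlm` callable is a placeholder for whatever vision-language model API is available; the prompts and function names are our own illustrative choices, not SOAR's exact prompts.

```python
from typing import Any, Callable, List

# Placeholder for an arbitrary vision-language model endpoint: takes a list of
# images and a text prompt, returns a text completion. Not a real API.
VLMQuery = Callable[[List[Any], str], str]


def propose_tasks(query_vlm: VLMQuery, scene_image: Any, num_tasks: int = 5) -> List[str]:
    """Ask the VLM for semantically meaningful manipulation tasks in this scene."""
    prompt = (
        f"You are looking at a robot workspace. List {num_tasks} short manipulation "
        "instructions (e.g. 'put the spoon on the towel') that are feasible with "
        "the objects visible in the image, one per line."
    )
    response = query_vlm([scene_image], prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]


def detect_success(query_vlm: VLMQuery, first_image: Any, last_image: Any,
                   instruction: str) -> bool:
    """Ask the VLM whether a collected trajectory accomplished its instruction."""
    prompt = (
        f"The robot was instructed to: '{instruction}'. The first image shows the "
        "scene before the attempt and the second shows it afterwards. Answer "
        "'yes' if the instruction was accomplished, otherwise answer 'no'."
    )
    response = query_vlm([first_image, last_image], prompt)
    return response.strip().lower().startswith("yes")
```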

System Diagram

When these components are put together, SOAR becomes an end-to-end system for autonomous improvement. SOAR can successfully be deployed on a fleet of 5 WidowX robots to improve a language-conditioned policy across 9 different environments, collecting over 30,000 autonomous trajectories in just a few weeks.


After running SOAR, the policy succeeds at tasks where it once failed



Time lapse videos of autonomous data collection. Generated subgoal images during data collection are depicted in the top right



Reformulating Language-Conditioned Control


We instantiate our instruction following policy with SuSIE, a language-conditioned policy decomposed into a goal-conditioned policy and an InstructPix2Pix-style language-conditioned image-editing model. Language task commands from the VLM are converted into subgoal images with the diffusion model, after which the goal-conditioned policy is rolled out for a fixed number of timesteps. A new subgoal is then generated from the same language instruction, and the process repeats until the end of the trajectory.
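The rollout loop described above might look roughly like the sketch below, where `subgoal_model` stands in for the SuSIE image-editing diffusion model and `gc_policy` for the image-goal-conditioned policy; the environment interface and the step counts are illustrative assumptions.

```python
def rollout_decomposed_policy(env, subgoal_model, gc_policy, instruction,
                              num_subgoals=5, steps_per_subgoal=20):
    """Schematic rollout of the decomposed language-conditioned policy.

    subgoal_model(image, instruction) -> subgoal image (diffusion-based editing)
    gc_policy(image, goal_image)      -> low-level robot action
    env.reset() / env.step(action)    -> image observations (assumed interface)
    """
    obs = env.reset()
    trajectory = []
    for _ in range(num_subgoals):
        # High level: synthesize a subgoal image consistent with the instruction.
        goal_image = subgoal_model(obs, instruction)
        # Low level: chase the subgoal image for a fixed number of timesteps.
        for _ in range(steps_per_subgoal):
            action = gc_policy(obs, goal_image)
            trajectory.append((obs, action, goal_image))
            obs = env.step(action)
    return trajectory
```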

In the context of autonomous improvement, this formulation is very useful. Semantics are separated from motor skills, allowing the former to leverage cheap Internet data and the latter cheap autonomously collected data. Goal-conditioned learning provides a denser learning signal than language-conditioned learning and can better exploit suboptimal data. And the goal-conditioned policy can be trained with purely self-supervised objectives, in contrast to a direct language-conditioned policy, which would require a separate model to hindsight-relabel autonomous trajectories with language.
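For completeness, here is a minimal sketch of such a self-supervised objective: goal-conditioned behavior cloning on hindsight-relabeled data (reusing the `GoalConditionedExample` container from the earlier sketch). The actual training code will differ; this only illustrates that every quantity comes from the autonomous data itself.

```python
import numpy as np


def gcbc_loss(policy, batch):
    """Self-supervised goal-conditioned behavior cloning loss.

    policy(observation, goal) -> predicted action (numpy array)
    batch: GoalConditionedExample objects produced by hindsight relabeling.
    No human labels or reward functions are required.
    """
    errors = []
    for example in batch:
        predicted_action = policy(example.observation, example.goal)
        errors.append(np.mean((predicted_action - example.action) ** 2))
    return float(np.mean(errors))
```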


Improvement Results


To test the improvement capabilities of SOAR, we deploy the system on nine different scenes and evaluate, on each one, its ability to collect semantically useful autonomous data and subsequently learn from that data. The ability to improve tests both the learning procedure and the data collection procedure: if data for skills semantically relevant to the downstream evaluated skills has not been collected, improvement will not be evident. We find that SOAR indeed enables improvement for multiple language skills in each of the nine scenes, with an average success rate improvement of 115%.


We even see positive transfer: when we train a single generalist policy on all of the autonomous data collected across the nine scenes, the average success rate is 7% higher. We also ask whether the decomposed language-conditioned policy, GCBC+SuSIE, is really needed. To answer this, we train a direct language-conditioned behavior cloning (LCBC) policy on the autonomous data collected by GCBC+SuSIE. While LCBC does improve, the final performance of the decomposed policy in SOAR is considerably better. We attribute this to the richer supervision a goal image provides compared to a language instruction, and to the ability of goal-conditioned learning to turn suboptimal autonomous trajectories into optimal examples of reaching the goals that were actually achieved.

Autonomous Data Quality

Left: LCBC data collection. Right: SOAR's GCBC + SuSIE data collection


The quality of the collected autonomous data bounds the magnitude of improvement achievable. The side-by-side video above depicts a data collection run with LCBC as the instruction following policy, compared against the GCBC + SuSIE policy used in SOAR. In this slightly out-of-distribution environment, some of the objects present, such as the purple eggplant and the lemon, were not seen in the pre-training dataset Bridge V2, so LCBC is unable to ground language instructions containing these objects, leading to minimal interaction with them. In contrast, the data collected by GCBC + SuSIE exhibits significantly more meaningful interactions. This can be attributed to the high degree of generalizability of the decomposed language-conditioned policy: for the low-level policy, generalizing to novel goal images is more robust than generalizing to ungrounded language instructions, and the high-level SuSIE model generalizes very well thanks to its Internet pre-training.

SOAR-Data

We also release, as a secondary contribution, the autonomous dataset collected by our deployment of SOAR. This dataset, SOAR-Data, consists of more than 30,000 trajectories (3M transitions) collected with over 50 different sets of objects across 5 different tabletop setups. Each trajectory in SOAR-Data comes with language annotations (from a VLM), the 5 commanded subgoal images generated by SuSIE during the episode, and a task success label predicted by the VLM. SOAR-Data is comparable in size to other current robotic datasets, but was collected in a much shorter time frame (a matter of weeks) with minimal human effort in the loop. Since SOAR-Data contains both failure and success trajectories, covers diverse scenes and objects, includes language instructions and subgoal images, and was collected with a publicly available low-cost robotic arm, we hope it will be a useful resource for offline reinforcement learning research. See https://github.com/rail-berkeley/soar for instructions on how to download the data.
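To give a sense of what one trajectory contains, below is a hypothetical record layout assembled from the description above; the field names are illustrative assumptions, and the actual schema and download instructions are documented in the repository.

```python
# Hypothetical layout of one SOAR-Data trajectory, assembled from the text
# above. Field names are illustrative; see the repository for the real schema.
example_trajectory = {
    "observations": [...],          # sequence of RGB image observations
    "actions": [...],               # corresponding robot actions
    "language_instruction": "...",  # task proposed/annotated by the VLM
    "commanded_subgoals": [...],    # the 5 subgoal images generated by SuSIE
    "success": False,               # success label predicted by the VLM
}
```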

BibTeX

@article{zhou2024autonomous,
    title={Autonomous Improvement of Instruction Following Skills via Foundation Models},
    author={Zhiyuan Zhou and Pranav Atreya and Abraham Lee and Homer Walke and Oier Mees and Sergey Levine},
    journal={arXiv preprint arXiv:2407.20635},
    year={2024},
}