The standard paradigm for improving instruction-following policies involves manually collecting additional robot data, labeling it with language instructions, and then finetuning the policy on this data. This process is expensive and hard to scale without substantial human effort. Can we instead deploy robot fleets to collect large-scale datasets without human supervision, and use that data to self-improve the policy?
We propose a particular formulation of an autonomous improvement loop, which we call SOAR, that enables self-improvement of a multi-task language-conditioned policy. The key ideas are:
SOAR decouples a language-conditioned policy into an image-goal-conditioned policy and a language-conditioned image subgoal generator. The benefit of such a formulation is that any autonomously collected data can be used to improve the policy with an entirely self-supervised learning algorithm, namely hindsight-relabeled goal-conditioned learning.
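As a minimal sketch of what hindsight-relabeled goal-conditioned learning looks like on autonomous data (the function and field names below are illustrative assumptions, not SOAR's actual implementation), each unlabeled trajectory can be converted into supervised goal-reaching examples by treating its own future observations as goals:

```python
import random

def hindsight_relabel(trajectory, num_samples=4):
    """Turn an unlabeled trajectory into goal-conditioned training examples.

    `trajectory` is assumed to be a list of {"observation", "action"} dicts.
    Each sampled timestep is paired with a future observation from the same
    trajectory, treated in hindsight as the goal the policy was reaching,
    so autonomous data yields supervised examples with no human labels.
    """
    examples = []
    T = len(trajectory)
    for _ in range(num_samples):
        t = random.randrange(T - 1)           # current timestep
        g = random.randrange(t + 1, T)        # hindsight goal index
        examples.append({
            "observation": trajectory[t]["observation"],
            "goal": trajectory[g]["observation"],
            "action": trajectory[t]["action"],  # supervised action target
        })
    return examples
```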
Autonomous data collection can then build off of the Internet-scale knowledge stored in VLMs and diffusion models. VLMs can be used as task proposers to guide the policy towards semantically interesting goals, and can automate the success detection of autonomously collected trajectories. Diffusion models can be used to synthesize interesting and diverse image goals from the semantic goals generated by the VLM. Both of these components help guide data collection towards interesting yet diverse tasks.
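To make the data-collection loop concrete, here is a rough sketch of a single autonomous episode. The objects and method names (`vlm.propose_task`, `vlm.detect_success`, `rollout_fn`, and so on) are placeholder assumptions standing in for SOAR's actual interfaces; a sketch of the rollout itself appears further below.

```python
def collect_episode(robot, vlm, subgoal_model, policy, rollout_fn, dataset):
    """One autonomous data-collection episode (illustrative sketch).

    vlm, subgoal_model, policy, and dataset are placeholders for the VLM
    task proposer / success detector, the image-editing diffusion model,
    the image-goal-conditioned policy, and the growing autonomous dataset.
    """
    obs = robot.reset()
    # 1. The VLM looks at the scene and proposes a semantically
    #    interesting language task to attempt.
    instruction = vlm.propose_task(obs)
    # 2. Roll out the decomposed policy on the proposed task
    #    (rollout_fn regenerates diffusion subgoals as it goes).
    trajectory = rollout_fn(robot, policy, subgoal_model, instruction)
    # 3. The VLM labels success from the final observation (an
    #    illustrative choice of success-detection interface).
    success = vlm.detect_success(trajectory[-1]["observation"], instruction)
    # 4. Store everything: even failed trajectories provide useful
    #    hindsight goals for self-supervised learning.
    dataset.add(trajectory, instruction, success)
    return success
```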
Putting these components together, SOAR becomes an end-to-end system for autonomous improvement. We deployed SOAR on a fleet of 5 WidowX robots, collecting over 30,000 autonomous trajectories across more than 50 scenes within just a few weeks, and found that training on this data yields a 2x improvement in multiple language skills across 10 test scenes.
After running SOAR, the policy succeeds at tasks where it once failed.
Time-lapse videos of autonomous data collection. Generated subgoal images during data collection are shown in the top right.
We decompose SOAR's instruction-following policy into an image-goal-conditioned policy and SuSIE, an InstructPix2Pix-style language-conditioned image-editing diffusion model. Language task commands from the VLM are converted into subgoal images with the diffusion model, after which the goal-conditioned policy is rolled out for a fixed number of timesteps. Then, a new subgoal is generated with the same language instruction, and the process repeats until the end of the trajectory.
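A minimal sketch of this rollout loop, assuming placeholder interfaces for the diffusion model and the goal-conditioned policy (the 20-step subgoal regeneration interval is an arbitrary illustrative choice, not the value used in SOAR):

```python
def run_policy(robot, gc_policy, subgoal_model, instruction,
               max_steps=100, steps_per_subgoal=20):
    """Roll out the decomposed policy: diffusion subgoals + goal reaching.

    subgoal_model and gc_policy are placeholders for the SuSIE-style
    image-editing diffusion model and the image-goal-conditioned policy.
    """
    trajectory = []
    obs = robot.get_observation()
    for step in range(max_steps):
        # Every steps_per_subgoal steps, synthesize a fresh subgoal image
        # by "editing" the current observation with the language command.
        if step % steps_per_subgoal == 0:
            subgoal_image = subgoal_model.generate(obs["image"], instruction)
        # The goal-conditioned policy chases the current subgoal image.
        action = gc_policy.act(obs, subgoal_image)
        next_obs = robot.step(action)
        trajectory.append({"observation": obs, "action": action,
                           "subgoal": subgoal_image})
        obs = next_obs
    return trajectory
```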
In the context of autonomous improvement, such a formulation is very useful. Semantics are separated from motor skills, allowing the former to leverage cheap Internet data and the latter cheap autonomously collected robot data. Goal-conditioned learning utilizes a denser learning signal than language-conditioned learning, and can better leverage suboptimal autonomous data. Finally, the goal-conditioned policy can be trained with a purely self-supervised objective, in contrast to a direct language-conditioned policy, which would require a separate model to hindsight-relabel autonomous trajectories with ground-truth language.
To evaluate the improvement capabilities of SOAR, we test the system on 10 different scenes, evaluating its ability to autonomously collect semantically relevant data and then learn from it. Improvement reflects the effectiveness of both the learning and data collection procedures; if data relevant to the skills being tested has not been collected, improvement will not occur. We find that SOAR enables, on average, a 2x improvement in multiple language skills across each of the 10 scenes.
We observe that more autonomous data leads to better improvement. Training a single generalist policy on all the autonomously collected data across the 10 test scenes boosts the average success rate from 58% to 65%. We then investigate whether the decomposed language-conditioned policy, GCBC+SuSIE, is truly necessary. To explore this, we compare it with a direct language-conditioned behavior cloning (LCBC) policy, trained on the same autonomous data collected by GCBC+SuSIE. While LCBC also improves performance, the decomposed policy in SOAR achieves much better results (65% vs. 28%). We attribute this to goal-conditioned learning's ability to turn suboptimal autonomous trajectories into optimal supervision for reaching hindsight goals.
@article{zhou2024autonomous,
title={Autonomous Improvement of Instruction Following Skills via Foundation Models},
author={Zhiyuan Zhou and Pranav Atreya and Abraham Lee and Homer Walke and Oier Mees and Sergey Levine},
journal={arXiv preprint arXiv:2407.20635},
year={2024},
}