The standard paradigm for improving instruction following policies involves a human manually collecting additional robot data, labelling it with language instructions, and then finetuning the policy on this data. Can we instead leverage the policy's pre-existing capabilities to bootstrap a self-improvement process?
We propose a particular formulation of an autonomous improvement loop, which we call SOAR, that enables self-improvement of a multi-task language-conditioned policy. The key ideas are as follows.
SOAR first decouples the language-conditioned policy into an image-goal-conditioned policy and a language-conditioned image subgoal generator. With this formulation, any autonomously collected data can be used for learning with an entirely self-supervised algorithm, namely hindsight-relabeled goal-conditioned learning (sketched below).
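To make the relabeling step concrete, here is a minimal sketch of hindsight goal relabeling. The trajectory layout (a dict of frame and action arrays) and the helper itself are illustrative assumptions, not SOAR's actual data pipeline.

```python
import numpy as np

def hindsight_relabel(trajectory, num_samples=64, max_horizon=40, rng=None):
    """Turn one unlabeled trajectory into goal-conditioned training tuples.

    Assumes `trajectory` is a dict with T+1 "observations" frames and
    T "actions", one action per frame transition (illustrative layout).
    """
    rng = rng or np.random.default_rng()
    obs, actions = trajectory["observations"], trajectory["actions"]
    T = len(actions)
    tuples = []
    for _ in range(num_samples):
        t = int(rng.integers(0, T))
        # Relabel with a frame the robot actually reached later on, so
        # every transition is a success for *some* goal -- no reward
        # model or human annotation is needed.
        g = int(rng.integers(t + 1, min(t + max_horizon, T) + 1))
        tuples.append((obs[t], obs[g], actions[t]))
    return tuples
```

Because each goal is a frame the robot actually reached, every sampled tuple is a valid demonstration of goal reaching, even when the original trajectory failed at its commanded task.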
Instruction following can then build on the Internet-scale knowledge stored in VLMs. VLMs can be used as task proposers to bias the policy toward learning to reach semantically interesting goals, and the same VLMs can automate success detection for the autonomously collected trajectories (both roles are sketched below).
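Both VLM roles reduce to prompting. The sketch below assumes a generic, hypothetical `query_vlm` helper that sends one or more images plus a text prompt to a vision-language model and returns its text answer; the prompts themselves are illustrative, not the ones used in SOAR.

```python
def propose_tasks(query_vlm, scene_image, num_tasks=5):
    """Ask the VLM for semantically interesting manipulation tasks."""
    prompt = (
        f"List {num_tasks} short manipulation tasks a robot arm could "
        "attempt with the objects in this image, one per line."
    )
    answer = query_vlm(images=[scene_image], prompt=prompt)
    return [line.strip() for line in answer.splitlines() if line.strip()]

def detect_success(query_vlm, first_frame, last_frame, instruction):
    """Ask the VLM whether the trajectory completed the instruction."""
    prompt = (
        f"The robot was instructed to: '{instruction}'. Comparing the "
        "first and last images, did it succeed? Answer yes or no."
    )
    answer = query_vlm(images=[first_frame, last_frame], prompt=prompt)
    return answer.strip().lower().startswith("yes")
```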
When these components are put together, SOAR becomes an end-to-end system for autonomous improvement. We successfully deploy SOAR on a fleet of five WidowX robots to improve a language-conditioned policy on nine different environments, collecting over 30,000 autonomous trajectories in just a few weeks.
After running SOAR, the policy succeeds at tasks where it once failed.
Time-lapse videos of autonomous data collection. Generated subgoal images during data collection are shown in the top right.
We instantiate our instruction-following policy with SuSIE, a language-conditioned policy decomposed into a goal-conditioned policy and an InstructPix2Pix-style language-conditioned image-editing model. Language task commands from the VLM are converted into subgoal images with the diffusion model, after which the goal-conditioned policy is rolled out for a fixed number of timesteps. Then a new subgoal is generated with the same language instruction, and the process repeats until the end of the trajectory.
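The rollout loop described above might look like the following. The `subgoal_model.generate` and `policy.act` interfaces, and the `env` whose `step` returns the next image observation, are illustrative placeholders rather than SuSIE's actual API.

```python
def rollout(env, subgoal_model, policy, instruction,
            steps_per_subgoal=20, max_steps=100):
    """Execute one language-commanded episode via image subgoals."""
    obs = env.reset()
    frames = [obs]
    for step in range(max_steps):
        # Periodically "edit" the current observation with the language
        # command to produce a fresh subgoal image.
        if step % steps_per_subgoal == 0:
            subgoal = subgoal_model.generate(image=obs, prompt=instruction)
        # The goal-conditioned policy only ever sees images, never text.
        action = policy.act(observation=obs, goal=subgoal)
        obs = env.step(action)
        frames.append(obs)
    return frames
```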
In the context of autonomous improvement, this formulation is very useful. Semantics are separated from motor skills, allowing the former to leverage cheap Internet data and the latter to leverage cheap autonomously collected data. Goal-conditioned learning provides a denser learning signal than language-conditioned learning and can better leverage suboptimal data. And the goal-conditioned policy can be trained with purely self-supervised objectives, in contrast to a direct language-conditioned policy, which would require a separate model to hindsight-relabel autonomous trajectories with language.
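Concretely, the self-supervised objective can be as simple as behavior cloning conditioned on the relabeled goal. The PyTorch snippet below is a sketch under that assumption; the paper's actual architecture and action loss may differ.

```python
import torch
import torch.nn.functional as F

def gcbc_training_step(policy, optimizer, batch):
    """One goal-conditioned behavior-cloning (GCBC) update.

    `batch` holds (observations, goals, actions) tensors produced by
    hindsight relabeling, so no reward function or language annotation
    is needed. Assumes `policy(obs, goal)` predicts a continuous action.
    """
    pred_actions = policy(batch["observations"], batch["goals"])
    # Plain supervised regression onto the actions the robot took.
    loss = F.mse_loss(pred_actions, batch["actions"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```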
To test the improvement capabilities of SOAR, we deploy the system on nine different scenes, and on each scene evaluate its ability to collect semantically useful autonomous data and subsequently learn from it. Improvement tests both the learning procedure and the data collection procedure: if the collected data is not semantically relevant to the downstream evaluation skills, no improvement will be evident. We find that SOAR indeed enables improvement for multiple language skills on each of the nine scenes, with an average success rate improvement of 115%.
We even see positive transfer: when we train a single generalist policy on all of the autonomous data collected across the nine scenes, the average success rate is 7% higher than that of the per-scene policies. We also ask whether the decomposed language-conditioned policy, GCBC+SuSIE, is really needed. To answer this, we train a direct language-conditioned behavior cloning (LCBC) policy on the autonomous data collected by GCBC+SuSIE. While LCBC does improve, the final performance of the decomposed policy in SOAR is considerably better. We attribute this to the richer supervision a goal image provides compared to a language instruction, and to the greater ability of goal-conditioned learning to turn suboptimality in an autonomous trajectory into optimal supervision for reaching the goal that was actually achieved.
@article{zhou2024autonomous,
title={Autonomous Improvement of Instruction Following Skills via Foundation Models},
author={Zhiyuan Zhou and Pranav Atreya and Abraham Lee and Homer Walke and Oier Mees and Sergey Levine},
journal={arXiv preprint arXiv:2407.20635},
year={2024},
}