Interactive Robot Learning from Verbal Correction

Huihan Liu1,2    Alice Chen1    Yuke Zhu2    Adith Swaminathan1    Andrey Kolobov1    Ching-An Cheng1   

1Microsoft Research    2The University of Texas at Austin   

Paper | Code | Bibtex


Abstract

The ability for robots to learn and refine behavior after deployment has become ever more important as we design them to operate in unstructured environments like households. In this work, we design a new learning system based on a large language model (LLM), OLAF ☃️, that allows everyday users to teach a robot using verbal corrections when the robot makes mistakes, e.g., by saying "Stop what you're doing. You should move closer to the cup." A key feature of OLAF is its ability to update the robot's visuomotor neural policy based on the verbal feedback so that it avoids repeating mistakes in the future. This is in contrast to existing LLM-based robotic systems, which only follow verbal commands or corrections but do not learn from them. We demonstrate the efficacy of our design in experiments where a user teaches a robot to perform long-horizon manipulation tasks both in simulation and on physical hardware, achieving on average a 20.0% improvement in policy success rate.



Overview

OLAF is an LLM-based learning system designed for updating a robot's visuomotor neural-network policy using verbal corrections given by regular, non-expert users. To train the robot, the user simply needs to watch the robot perform a task, stop the robot when they believe it will not be able to finish the task, and then provide a natural-language instruction on how the robot can do better.
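As a rough, hypothetical sketch of this interaction loop, one teaching episode might be collected as follows. The policy, env, and stop_button interfaces below are illustrative placeholders, not OLAF's actual code.

def collect_correction_episode(policy, env, stop_button, max_steps=500):
    """Run the policy until the task ends or the user presses the stop button;
    if stopped, record the user's verbal correction alongside the trajectory."""
    trajectory = []                       # list of (observation, action) pairs
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs)          # visuomotor policy proposes an action
        trajectory.append((obs, action))
        obs, done = env.step(action)
        if stop_button.pressed():         # user interrupts the robot
            correction = input("Verbal correction: ")   # e.g. "move closer to the cup"
            return {"trajectory": trajectory, "correction": correction}
        if done:                          # task finished without intervention
            break
    return {"trajectory": trajectory, "correction": None}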




Motivating Example

We depict a use case of OLAF for updating a robot manipulator's policy. Here the robot is tasked with placing the tomato sauce in the basket. The robot opens its gripper and moves forward, but instead of going to the tomato sauce, it moves to the right. Upon seeing this surprising behavior, the user stops the robot by pressing a stop button and says, "Stop. To pick up the tomato sauce, you should move to your left."
Given a robot trajectory interrupted by the user, OLAF relabels the suboptimal actions in the trajectory segment leading up to the termination with good actions inferred from the user's verbal correction.
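A minimal sketch of this relabeling step is shown below, assuming the corrected action has already been inferred from the user's utterance and that a fixed window of k steps before the stop is relabeled; both assumptions are illustrative rather than OLAF's exact procedure.

def relabel_segment(trajectory, corrected_action, k=10):
    """Overwrite the last k (suboptimal) actions before the user's stop with the
    action implied by the verbal correction, keeping the observations unchanged."""
    relabeled = list(trajectory)
    start = max(0, len(relabeled) - k)
    for i in range(start, len(relabeled)):
        obs, _bad_action = relabeled[i]
        relabeled[i] = (obs, corrected_action)   # keep observation, swap in good action
    return relabeled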




OLAF System

The OLAF pipeline consists of three steps: User Interaction, Data Synthesis, and Policy Update. In User Interaction, OLAF collects {robot trajectory, verbal correction} pairs for trajectories stopped by the user. In Data Synthesis, it uses the LLM as a critic to select the action (from a pool of action candidates) that best matches the user's verbal correction and relabels the pre-intervention trajectory segments (shown in red). In Policy Update, it updates the policy by performing behavior cloning on the newly synthesized data together with the previously collected data.
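The sketch below condenses the Data Synthesis and Policy Update steps, reusing the relabel_segment helper from the earlier sketch. The candidate list, the prompt wording, the to_action mapping from text to a motor command, and the llm and policy interfaces are all assumptions made for illustration, not the actual OLAF implementation.

CANDIDATES = ["move left", "move right", "move forward", "move backward",
              "move up", "move down"]          # coarse action proposals (assumed)

def synthesize_data(pair, llm, to_action, k=10):
    """Ask the LLM critic which candidate best matches the user's verbal
    correction, then relabel the pre-intervention segment with that action."""
    prompt = (f"The user stopped the robot and said: \"{pair['correction']}\".\n"
              f"Which of these action proposals best matches the correction? {CANDIDATES}\n"
              "Answer with exactly one proposal.")
    choice = llm.complete(prompt).strip()                 # critic's selection
    return relabel_segment(pair["trajectory"], to_action(choice), k)

def update_policy(policy, old_data, correction_pairs, llm, to_action):
    """Behavior cloning on the previously collected data plus the relabeled data."""
    new_data = [step for pair in correction_pairs
                for step in synthesize_data(pair, llm, to_action)]
    policy.behavior_clone(old_data + new_data)            # supervised imitation on combined data
    return policy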




Tasks

We evaluate OLAF on four tasks in simulation and two tasks on a real robot. The simulation tasks are fine-grained manipulation tasks, while the real-robot tasks are long-horizon, multi-stage tasks.

Simulation

Pick Place Can

Threading

Square

Coffee Machine


Real Robot

PickPlace-Bin:
Pick up the pea can and popcorn and place them into the bin in sequence

PickPlace-Drawer-Basket:
Pick up the peach can and place it in the drawer, and pick up the chocolate box and place it into the basket

(played at 2X speed)



OLAF produces better-quality policy behaviors

We show that OLAF helps the robot policy produce better-quality behaviors. Consider the case when human intervention is available: while the BC baseline learns from data that corrects its behavior after it commits a mistake, OLAF with action relabeling learns what the robot could have done to avoid the mistake in the first place. Overriding erroneous actions teaches the policy the actions that prevent mistakes from happening, rather than merely how to recover from them.

BC baseline: BC learns the mistaken actions and then how to recover from them. It can sometimes still repeat the same mistakes.

OLAF: OLAF directly learns the optimal behavior, since action relabeling overrides the wrong actions.



Real Robot Evaluation Rollouts

We provide the real-robot evaluation rollouts of OLAF and the BC baseline to visualize the trajectory behaviors and failure modes, as well as to present the entire evaluation process. All videos are played at 1X speed.

OLAF: 73.5% Success Rate



BC: 35.3% Success Rate



Example Prompts

Prompts for using an LLM as a critic for action relabeling: the system prompt (top) specifies the desired system-level behavior, the context prompt (middle) describes the task-level instruction, and the action relabeling prompt (bottom) includes the trajectory-level information and the verbal correction. Black text denotes the template and blue text denotes user- or sensor-dependent information. Action proposals are highlighted with a blue background.
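For concreteness, here is a hypothetical assembly of these three prompt pieces into a single critic query; the exact wording, field names, and the build_critic_prompt helper are illustrative assumptions, while the real templates are the ones shown above.

SYSTEM_PROMPT = ("You are assisting a robot manipulator. Given the task, the robot's "
                 "recent trajectory, and the user's verbal correction, choose the action "
                 "proposal that best matches the correction.")

def build_critic_prompt(task_instruction, trajectory_summary, correction, proposals):
    """Combine system, context, and action-relabeling prompts into one LLM query."""
    context = f"Task: {task_instruction}"                      # task-level instruction
    relabel = (f"Recent trajectory: {trajectory_summary}\n"    # trajectory-level information
               f"User correction: \"{correction}\"\n"
               f"Action proposals: {', '.join(proposals)}\n"
               "Answer with exactly one proposal.")
    return "\n\n".join([SYSTEM_PROMPT, context, relabel])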




Citation

@misc{liu2023interactive,
      title={Interactive Robot Learning from Verbal Correction}, 
      author={Huihan Liu and Alice Chen and Yuke Zhu and Adith Swaminathan and Andrey Kolobov and Ching-An Cheng},
      year={2023},
      eprint={2310.17555},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}