Multi-Task Interactive Robot Fleet Learning with Visual World Models


The University of Texas at Austin

8th Conference on Robot Learning (CoRL 2024)

Abstract

Recent advancements in large-scale multi-task robot learning offer the potential for deploying robot fleets in household and industrial settings, enabling them to perform diverse tasks across various environments. However, AI-enabled robots often face challenges with generalization and robustness when exposed to real-world variability and uncertainty. We introduce Sirius-Fleet, a multi-task interactive robot fleet learning framework to address these challenges. Sirius-Fleet monitors robot performance during deployment and involves humans to correct the robot's actions when necessary. We employ a visual world model to predict the outcomes of future actions and build anomaly predictors to assess whether those outcomes are likely to result in anomalies. As robot autonomy improves, the anomaly predictors automatically adapt their prediction criteria, leading to fewer requests for human intervention and gradually reducing human workload over time. Evaluations on large-scale benchmarks demonstrate Sirius-Fleet's effectiveness in improving multi-task policy performance and monitoring accuracy. We evaluate Sirius-Fleet on two diverse, large-scale multi-task benchmarks: RoboCasa in simulation and Mutex in the real world.



Overview

We introduce Sirius-Fleet, a multi-task interactive robot fleet learning framework. The framework consists of two stages: 1) Visual World Model Training and Inference, where we pre-train a visual world model on diverse datasets to predict future latent embeddings from past video frames, and 2) Multi-Task Interactive Fleet Learning, where the pre-trained model is used to supervise multi-task robot fleet deployment. During deployment, anomaly predictors monitor task performance in real time and hand control to a human when necessary. The policy and anomaly predictors are continuously fine-tuned on deployment data, improving task performance over time.
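To make the deployment stage concrete, here is a minimal sketch of the monitor-intervene-log control loop. All interfaces (env, policy, world_model, anomaly_predictor, human, buffer) are hypothetical names for illustration, not the paper's released code:

def run_deployment(env, policy, world_model, anomaly_predictor, human, buffer, horizon=500):
    """Illustrative deployment loop: monitor the robot, hand over control on
    predicted anomalies, and log all experience for later fine-tuning."""
    obs = env.reset()
    history = [obs]
    for _ in range(horizon):
        z_future = world_model.predict(history)      # predicted future latent embedding
        if anomaly_predictor.flags(z_future):        # likely failure or OOD state?
            action, source = human.intervene(history), "human"
        else:
            action, source = policy.act(history), "robot"
        obs = env.step(action)
        history.append(obs)
        buffer.add(obs, action, source)              # deployment data for fine-tuning
    # Between deployment rounds: fine-tune the policy and anomaly predictors on
    # `buffer`, and recalibrate the anomaly thresholds as autonomy improves.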



Visual World Model

A key challenge in runtime monitoring is to effectively predict future task scenarios, as this allows the system to preempt failures before they occur. We develop a visual world model trained on diverse robot trajectories spanning a large variety of tasks, enabling it to predict future task outcomes. The model is trained by reconstructing image frames from input observations, which allows it to capture the fine-grained visual details necessary for precise manipulation. Architecturally, it comprises a UNet-based encoder and decoder combined with a cVAE- and Transformer-based prediction model, which together predict future embeddings from the current state.
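The sketch below illustrates one way to realize this architecture in PyTorch. It is a minimal stand-in, assuming 64x64 RGB inputs and a 256-dimensional latent space; a plain convolutional encoder/decoder replaces the paper's UNet (skip connections omitted for brevity), and all layer sizes are illustrative:

import torch
import torch.nn as nn

class VisualWorldModelSketch(nn.Module):
    """Conv encoder/decoder (stand-in for a UNet) plus a Transformer over past
    latents; a cVAE-style head samples the predicted future embedding."""

    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                # 64x64 RGB frame -> latent vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(                # latent vector -> reconstructed frame
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mu = nn.Linear(latent_dim, latent_dim)      # cVAE posterior parameters
        self.to_logvar = nn.Linear(latent_dim, latent_dim)

    def forward(self, frames):                       # frames: (B, T, 3, 64, 64)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        h = self.temporal(z)[:, -1]                  # summary of the observed past
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z_next = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decoder(z.flatten(0, 1)).view_as(frames)      # reconstruction target
        return z_next, recon, mu, logvar

Training would combine a frame reconstruction loss on recon with a prediction loss on z_next and the usual KL regularizer on (mu, logvar).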



Anomaly Predictors

The learned embeddings from the world model are then shared across downstream anomaly prediction tasks. We train two distinct types of anomaly predictors: one for failure detection and one for out-of-distribution (OOD) detection. The two predictors complement each other in practice: failure prediction detects failures similar to those identified by humans previously, and OOD detection captures cases when the robot is in novel, unfamiliar scenarios.
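Below is a minimal sketch of how two such predictors might share the world model's embeddings. The structure is our illustration, not the paper's released code: we assume a binary failure classifier trained on past human interventions and a k-nearest-neighbor distance in embedding space as the OOD score; the adaptive threshold mirrors how the prediction criteria can tighten or relax as autonomy improves:

import torch
import torch.nn as nn

class AnomalyPredictorsSketch(nn.Module):
    """Two heads over the world model's shared embeddings (assumed structure):
    a failure classifier supervised by past human interventions, and an OOD
    score based on distance to embeddings seen during training."""

    def __init__(self, latent_dim=256):
        super().__init__()
        self.failure_head = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )
        self.register_buffer("train_bank", torch.zeros(1, latent_dim))  # filled by fit_bank

    def fit_bank(self, z_train):
        self.train_bank = z_train                    # embeddings of training states

    def failure_score(self, z_future):               # high => resembles past failures
        return torch.sigmoid(self.failure_head(z_future)).squeeze(-1)

    def ood_score(self, z_future):                   # high => novel, unfamiliar state
        d = torch.cdist(z_future, self.train_bank)
        k = min(5, self.train_bank.shape[0])
        return d.topk(k, largest=False, dim=-1).values.mean(-1)

def adaptive_threshold(recent_scores, intervention_rate=0.05):
    """Recalibrate so only the top fraction of recent scores triggers a human
    request; as the policy improves and scores fall, interventions become rarer."""
    return torch.quantile(recent_scores, 1.0 - intervention_rate)

if __name__ == "__main__":
    preds = AnomalyPredictorsSketch()
    preds.fit_bank(torch.randn(1000, 256))
    z = torch.randn(4, 256)                          # predicted future embeddings
    tau = adaptive_threshold(preds.ood_score(torch.randn(200, 256)))
    request_human = (preds.failure_score(z) > 0.5) | (preds.ood_score(z) > tau)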



Multi-Task Policy

The multi-task policy is a Transformer-based model that processes images, proprioceptive data, and task language embeddings, and outputs robot actions through a Gaussian Mixture Model (GMM) head.
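A compact sketch of such a policy follows, assuming pre-computed visual and language features and illustrative dimensions; the token layout, fusion scheme, and mixture size are our assumptions, not the paper's settings:

import torch
import torch.nn as nn
import torch.distributions as D

class MultiTaskPolicySketch(nn.Module):
    """Transformer trunk over image/proprioception/language tokens with a
    Gaussian Mixture Model (GMM) head over actions."""

    def __init__(self, d_model=256, act_dim=7, n_modes=5):
        super().__init__()
        self.img_proj = nn.Linear(512, d_model)      # assumed visual feature size
        self.prop_proj = nn.Linear(9, d_model)       # assumed proprioception size
        self.lang_proj = nn.Linear(768, d_model)     # assumed language embedding size
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_modes * (1 + 2 * act_dim))
        self.n_modes, self.act_dim = n_modes, act_dim

    def forward(self, img_feat, proprio, lang_emb):
        tokens = torch.stack(
            [self.img_proj(img_feat), self.prop_proj(proprio), self.lang_proj(lang_emb)],
            dim=1,
        )                                            # (B, 3, d_model)
        h = self.trunk(tokens).mean(dim=1)
        p = self.head(h)
        logits = p[:, : self.n_modes]                # mixture weights
        mu, log_std = p[:, self.n_modes :].chunk(2, dim=-1)
        mu = mu.reshape(-1, self.n_modes, self.act_dim)
        std = log_std.reshape(-1, self.n_modes, self.act_dim).exp().clamp(1e-4, 1.0)
        comp = D.Independent(D.Normal(mu, std), 1)
        return D.MixtureSameFamily(D.Categorical(logits=logits), comp)

Training minimizes the negative log-likelihood of demonstrated actions, e.g. loss = -policy(img, prop, lang).log_prob(actions).mean(); at deployment, actions are drawn with dist.sample().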



Experiments

We evaluate policy learning and runtime monitoring using 12 tasks from the RoboCasa benchmark in simulation and 10 tasks from the Mutex benchmark in real-world environments.



Sirius-Fleet continues to improve on the following three metrics over the course of deployment:

  • Combined Policy Performance
  • Autonomous Policy Performance
  • Return on Human Effort (RoHE; a hedged formalization follows this list)
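The paper's precise definition of RoHE is not reproduced here. Under the assumption that RoHE captures task performance earned per unit of human effort, a plausible formalization is

\[
\text{RoHE} \;=\; \frac{\text{combined policy success rate}}{\text{fraction of deployment time under human control}},
\]

which rises both when the policy succeeds more often and when it demands less human supervision.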



Sirius-Fleet surpasses the baselines in both combined policy performance and RoHE.



Comparison of Failure vs. OOD Predictions

Failure and OOD predictions complement each other in predicting anomalies, each capturing cases the other misses.



Importance of Predicting Future Outcomes

It is important to predict future outcomes before catastrophic errors happen, so that the system can intervene preemptively rather than recover after the fact.



Limitation: When Robot + Human Policy Fails

What failure modes remain under the >95% combined policy success rate? We show where our system still fails even with runtime monitoring in place.



BibTeX

@inproceedings{liumulti,
  title={Multi-Task Interactive Robot Fleet Learning with Visual World Models},
  author={Liu, Huihan and Zhang, Yu and Betala, Vaarij and Zhang, Evan and Liu, James and Ding, Crystal and Zhu, Yuke},
  booktitle={8th Annual Conference on Robot Learning},
  year={2024}
}