VIOLA: Imitation Learning for Vision-Based Manipulation
with Object Proposal Priors

Yifeng Zhu¹, Abhishek Joshi¹, Peter Stone¹,², Yuke Zhu¹

¹The University of Texas at Austin, ²Sony AI


6th Conference on Robot Learning, Auckland, New Zealand

We introduce VIOLA, an object-centric imitation learning approach to learning closed-loop visuomotor policies for robot manipulation. Our approach constructs object-centric representations from general object proposals produced by a pre-trained vision model. It uses a transformer-based policy to reason over these representations and attend to the task-relevant visual factors for action prediction. Such object-based structural priors improve the robustness of deep imitation learning algorithms against object variations and environmental perturbations. We quantitatively evaluate VIOLA in simulation and on real robots. VIOLA outperforms state-of-the-art imitation learning methods by 45.8% in success rate. It has also been deployed successfully on a physical robot to solve challenging long-horizon tasks, such as dining table arrangement and coffee making.
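To make "attending to the task-relevant visual factors" concrete, here is a minimal sketch of scaled dot-product attention of a single query over a handful of region tokens. It is an illustration only, not the VIOLA architecture: the names `attend`, `region_tokens`, and `action_query` are hypothetical, the vectors are toy values, and the tokens serve directly as keys and values, whereas a real transformer would apply learned projections first.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Scaled dot-product attention of one query over a list of token vectors.

    Assumption: tokens act as both keys and values (no learned projections).
    """
    d = len(query)
    logits = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens
    ]
    weights = softmax(logits)
    # Weighted average of the tokens, one output value per feature dimension.
    pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return weights, pooled


# Toy region tokens (hypothetical 2-D features) and a toy query.
region_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
action_query = [4.0, 0.0]
weights, pooled = attend(action_query, region_tokens)
```

The first token is the best match for the query, so it receives the largest weight; the pooled vector is what a downstream action head would consume in this simplified picture.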

Method Overview

Overview of VIOLA. We use a pre-trained region proposal network (RPN) to obtain general object proposals, which allow us to learn object-centric visuomotor skills.
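As a rough sketch of how proposals might be turned into policy inputs, the snippet below keeps the highest-scoring boxes and normalizes their coordinates by the image size. Everything here is an assumption for illustration: the function names, the box format `(x1, y1, x2, y2)`, the value of `k`, and the sample boxes and scores are hypothetical, and the page does not specify how VIOLA selects or encodes proposals.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def top_k_proposals(boxes: List[Box], scores: List[float], k: int) -> List[Box]:
    """Return the k highest-scoring boxes, best first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:k]]


def box_position_feature(box: Box, width: float, height: float) -> Box:
    """Normalize box corners to [0, 1] so a policy can see where a region is."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)


# Hypothetical proposals and objectness scores for a 128x128 image.
boxes = [(0, 0, 64, 64), (32, 16, 96, 80), (10, 10, 20, 20)]
scores = [0.2, 0.9, 0.6]

selected = top_k_proposals(boxes, scores, k=2)
features = [box_position_feature(b, width=128, height=128) for b in selected]
```

In a full pipeline, each selected box would also be paired with visual features cropped from that region; the positional feature above covers only the "where" part.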

Real Robot Experiment

Our evaluation on real-robot tasks is shown in the following table. With behavioral cloning, VIOLA learns manipulation policies that substantially outperform the state-of-the-art baseline, BC-RNN. Notably, on the Make-Coffee task, the baseline fails in every attempt, while VIOLA achieves a 60% success rate. This empirical result further supports the effectiveness of VIOLA.

[Results: real-robot success rates (%) for BC-RNN and VIOLA on Dining-PlateFork, Dining-Bowl, and Make-Coffee. Values as extracted, in order: 36.7, 76.7, 20.0, 60.0, 60.0, 0.0.]
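For readers unfamiliar with the metric, a success rate of this kind is simply the fraction of evaluation rollouts that complete the task, expressed as a percentage. The sketch below shows that arithmetic; the helper name `success_rate` and the list of outcomes are made up for illustration and are not the recorded trial data.

```python
def success_rate(outcomes):
    """Percentage of successful rollouts, rounded to one decimal place."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return round(100.0 * sum(outcomes) / len(outcomes), 1)


# Hypothetical outcomes for ten rollouts (True = task completed).
rollouts = [True, True, False, True, False, True, True, False, True, False]
rate = success_rate(rollouts)
```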

Qualitative Real Robot Demo

We can sequentially execute the Dining-PlateFork and Dining-Bowl policies. This video shows the learned policies making two coffees in a row.

Our policies are robust to scenarios where unseen distracting objects are present.

(The cup and the strawberry in the bowl were never present in the demonstrations.)

A no-cut video of 10 Make-Coffee rollouts


The authors would like to specially thank Yue Zhao for the great discussion on the project and the insightful feedback on the manuscript. This work has taken place in the Robot Perception and Learning Group (RPL) and Learning Agents Research Group (LARG) at UT Austin. RPL research has been partially supported by the National Science Foundation (CNS-1955523, FRR-2145283), the Office of Naval Research (N00014-22-1-2204), and the Amazon Research Awards. LARG research is supported in part by NSF (CPS-1739964, IIS-1724157, NRI-1925082), ONR (N00014-18-2243), FLI (RFP2-000), ARO (W911NF-19-2-0333), DARPA, Lockheed Martin, GM, and Bosch. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.