VIOLA: Imitation Learning for Vision-Based Manipulation
with Object Proposal Priors

6th Conference on Robot Learning, Auckland, New Zealand

We introduce VIOLA, an object-centric imitation learning approach to learning closed-loop visuomotor policies for robot manipulation. Our approach constructs object-centric representations based on general object proposals from a pre-trained vision model. It uses a transformer-based policy to reason over these representations and attend to the task-relevant visual factors for action prediction. Such object-based structural priors improve the robustness of deep imitation learning algorithms against object variations and environmental perturbations. We quantitatively evaluate VIOLA in simulation and on real robots. VIOLA outperforms state-of-the-art imitation learning methods by 45.8% in success rate. It has also been deployed successfully on a physical robot to solve challenging long-horizon tasks, such as dining table arrangement and coffee making.

Method Overview

Overview of VIOLA. We use a pre-trained Region Proposal Network (RPN) to obtain general object proposals, which allow us to learn object-centric visuomotor skills.
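
To make the idea of reasoning over object proposals concrete, here is a minimal, dependency-free sketch. It is not the VIOLA implementation. It assumes that proposals arrive as boxes with scores, that only the highest-scoring ones are kept, and that each kept region has already been turned into a feature vector (a "token"). The names `top_k_proposals` and `attend`, and the single-head attention with keys equal to values, are illustrative assumptions standing in for a full transformer policy.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical pixel coordinates


def top_k_proposals(boxes: Sequence[Box], scores: Sequence[float], k: int) -> List[Box]:
    """Keep the k highest-scoring proposals (ties keep their original order)."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])[:k]
    return [boxes[i] for i in order]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float], tokens: Sequence[Sequence[float]]):
    """Single-head scaled dot-product attention where keys and values are the tokens.

    Returns the attention weights (one per token) and the weighted sum of tokens.
    """
    d = len(query)
    logits = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(logits)
    pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return weights, pooled


# Usage with made-up proposals and made-up two-dimensional region features.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]
scores = [0.2, 0.9, 0.5]
kept = top_k_proposals(boxes, scores, k=2)

region_tokens = [[1.0, 0.0], [0.0, 1.0]]  # one hypothetical feature vector per kept box
weights, pooled = attend(query=[1.0, 0.0], tokens=region_tokens)
```

In this toy setting the query plays the role of a learned token that pools information from the regions: the region whose feature aligns best with the query receives the largest weight, which is one simple way to picture a policy attending to task-relevant objects.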

Real Robot Experiment

Our evaluation on real-robot tasks is shown in the following table. VIOLA learns manipulation policies through behavioral cloning considerably better than the state-of-the-art baseline, BC-RNN. Notably, in the Make-Coffee task, the baseline fails in every attempt, while VIOLA achieves a 60% success rate. This empirical result further supports the effectiveness of VIOLA.

Qualitative Real Robot Demo

We can sequentially execute the Dining-PlateFork and Dining-Bowl policies. This video shows the learned policy making two coffees in a row.

Our policies are robust to scenarios where unseen distracting objects are present

(The cup and the strawberry in the bowl were never present in the demonstrations)

A no-cut video of 10 `Make-Coffee` rollouts

Acknowledgements

The authors would like to specially thank Yue Zhao for the great discussion on the project and the insightful feedback on the manuscript. This work has taken place in the Robot Perception and Learning Group (RPL) and Learning Agents Research Group (LARG) at UT Austin. RPL research has been partially supported by the National Science Foundation (CNS-1955523, FRR-2145283), the Office of Naval Research (N00014-22-1-2204), and the Amazon Research Awards. LARG research is supported in part by NSF (CPS-1739964, IIS-1724157, NRI-1925082), ONR (N00014-18-2243), FLI (RFP2-000), ARO (W911NF-19-2-0333), DARPA, Lockheed Martin, GM, and Bosch. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

Yifeng Zhu¹ Abhishek Joshi¹ Peter Stone¹,² Yuke Zhu¹

¹The University of Texas at Austin ²Sony AI
