ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs

The University of Texas at Austin · Sony AI

Abstract

We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos. We investigate the problem of imitating robot manipulation from a single human video in the open-world setting, where a robot must learn to manipulate novel objects from one video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices such as an iPhone and generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, demonstrating the efficacy of ORION in learning from a single human video in the open world.



Overview

In this animation, we explain at a high level how ORION extracts object-centric information that enables a robot to imitate the human.

Method - Plan Generation from Video

ORION first tracks the objects and keypoints across the video frames. Keyframes are then identified based on the velocity statistics of the keypoint trajectories. Finally, ORION generates an Open-world Object Graph (OOG) for every keyframe, resulting in a sequence of OOGs that serves as the spatiotemporal abstraction of the video. The figure is best viewed in color.
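To make the keyframe step concrete, below is a minimal Python sketch that flags frames where the tracked keypoints come (nearly) to rest, using slow local minima of the smoothed keypoint speed. The function name, smoothing window, and velocity threshold are our own illustrative assumptions, not ORION's actual implementation.

```python
import numpy as np

def identify_keyframes(keypoints, smooth_window=5, velocity_thresh=0.5):
    """Pick keyframes where tracked keypoints come (nearly) to rest.

    keypoints: (T, K, 2) array of K tracked 2D keypoints over T frames.
    smooth_window and velocity_thresh (pixels/frame) are illustrative
    heuristics, not ORION's actual parameters.
    """
    # Per-frame speed of each keypoint, averaged over all K keypoints.
    speed = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1).mean(axis=-1)

    # Smooth the speed curve to suppress tracking jitter.
    kernel = np.ones(smooth_window) / smooth_window
    speed = np.convolve(speed, kernel, mode="same")

    # Keyframes: slow local minima of the speed curve, plus the first
    # and last frames as the boundaries of the manipulation plan.
    keyframes = [0, keypoints.shape[0] - 1]
    for t in range(1, len(speed) - 1):
        if speed[t] <= speed[t - 1] and speed[t] <= speed[t + 1] \
                and speed[t] < velocity_thresh:
            keyframes.append(t + 1)
    return sorted(set(keyframes))
```

Detecting low-velocity local minima is a natural proxy for moments when an object is grasped, released, or brought into contact, which is why velocity statistics make a reasonable basis for keyframe selection.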


Method - Action Synthesis of ORION Policy

At test time, ORION first localizes task-relevant objects and retrieves the matching OOGs from the generated manipulation plan. ORION then uses the retrieved OOGs to predict the object motions, first computing a global registration of the object point clouds and then warping the observed keypoint trajectories from the video into the workspace. The predicted trajectories are then used to optimize the SE(3) action sequence of the robot end effector, which is subsequently used to command the robot.
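As a simplified illustration of the warping and action-synthesis steps, the sketch below assumes a rigid transform (R, t) has already been recovered by global point-cloud registration; it warps the demonstrated trajectory into the workspace and attaches a fixed grasp orientation to each waypoint. All function names here are hypothetical, and ORION's actual SE(3) optimization is more involved.

```python
import numpy as np

def warp_trajectory(demo_traj, R, t):
    """Warp a demonstrated 3D keypoint trajectory into the robot workspace.

    Assumes global registration (e.g., RANSAC + ICP on the object point
    clouds) has already produced a rigid transform (R, t) aligning the
    video scene to the current workspace. demo_traj: (T, 3).
    """
    return demo_traj @ R.T + t  # x' = R x + t applied to every waypoint

def trajectory_to_ee_poses(warped_traj, grasp_rotation):
    """Turn a warped trajectory into a sequence of SE(3) end-effector poses.

    Simplification: we attach one fixed grasp orientation to every
    waypoint, whereas ORION optimizes the full SE(3) action sequence.
    """
    poses = []
    for p in warped_traj:
        T = np.eye(4)
        T[:3, :3] = grasp_rotation  # assumed constant grasp orientation
        T[:3, 3] = p
        poses.append(T)
    return poses
```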



Human Video Demonstrations

Tasks: Mug-on-coaster, Chip-on-plate, Simple-boat-assembly, Succulents-in-llama-vase, Rearrange-mug-box, Complex-boat-assembly, Prepare-breakfast.

Here we show all the human demonstration videos. All videos were recorded with an iPad placed statically on the table.




Real Robot Rollouts

Here we provide rollout videos of ORION policies on real robots. All videos are played back at 4x speed.

Typical failure modes of ORION policies include missed grasps, failures to complete the goal due to misalignment during insertion, and failures to place the object on the correct region of the plate.