ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs

*Equal Contribution
1 The University of Texas at Austin 2 Sony AI

Abstract

This work presents an object-centric approach to learning vision-based manipulation skills from human videos. We investigate the problem of robot manipulation via imitation in the open-world setting, where a robot learns to manipulate novel objects from a single video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB or RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices and to generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, using RGB-D and RGB-only demonstration videos. We observe an average success rate of 74.4%, demonstrating the efficacy of ORION in learning from a single human video in the open world.



Overview

In this animation, we explain at a high level how ORION extracts object-centric information that allows a robot to imitate the human.

Method - Plan Generation from Video

ORION first tracks the objects and keypoints across the video frames. Keyframes are then identified based on the velocity statistics of the keypoint trajectories. Finally, ORION generates an Open-world Object Graph (OOG) for each keyframe, resulting in a sequence of OOGs that serves as the spatiotemporal abstraction of the video. The figure is best viewed in color.
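To make the keyframe-selection step concrete, here is a minimal sketch in Python. It is a hypothetical illustration rather than the released ORION code: the function name select_keyframes, the thresholding rule, and the parameters speed_thresh and min_gap are all assumptions, standing in for one simple way of segmenting a video by the velocity statistics of tracked keypoints.

```python
import numpy as np

def select_keyframes(keypoint_tracks, speed_thresh=0.5, min_gap=10):
    """Pick candidate keyframes where the tracked keypoints slow down.

    keypoint_tracks: (T, N, 2) array of N keypoint positions over T frames.
    Returns a list of frame indices (hypothetical criterion, for illustration).
    """
    # Per-frame keypoint displacements, shape (T-1, N, 2).
    displacements = np.diff(keypoint_tracks, axis=0)
    # Mean keypoint speed at each frame, shape (T-1,).
    mean_speed = np.linalg.norm(displacements, axis=-1).mean(axis=1)

    keyframes = []
    for t, speed in enumerate(mean_speed):
        # Low motion often coincides with contact changes (grasp, release),
        # which is where an OOG would be constructed.
        if speed < speed_thresh and (not keyframes or t - keyframes[-1] >= min_gap):
            keyframes.append(t)
    return keyframes
```

In this simplified view, each selected keyframe would then be turned into an OOG that records the detected objects and their associated keypoints, and the resulting sequence of OOGs abstracts the video.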


Method - Action Synthesis of ORION Policy

At test time, ORION first localizes the task-relevant objects and retrieves the matched OOGs from the generated manipulation plan. ORION then uses the retrieved OOGs to predict the object motions by warping the object-centric feature trajectories from the video to match the test-time observation. The predicted trajectories are used to optimize the SE(3) action sequence of the robot end effector, which is then used to command the robot.
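The trajectory warping and action synthesis can likewise be sketched in simplified form. The snippet below is an assumption-laden illustration, not ORION's actual optimization: warp_trajectory, object_to_ee_actions, and the fixed grasp_in_object transform are hypothetical names, and the rigid re-anchoring plus grasp-transform composition stands in for optimizing an SE(3) action sequence against warped object-centric feature trajectories.

```python
import numpy as np

def warp_trajectory(demo_object_poses, demo_start, test_start):
    """Re-anchor a demonstrated object-pose trajectory at the test-time start pose.

    demo_object_poses: list of 4x4 SE(3) object poses from the demonstration.
    demo_start, test_start: 4x4 SE(3) object poses at the start of the
    demonstration and at test time, both in the robot base frame.
    """
    # Left-multiplying by this offset maps demo_start onto test_start while
    # preserving the relative motion of the demonstrated trajectory.
    offset = test_start @ np.linalg.inv(demo_start)
    return [offset @ pose for pose in demo_object_poses]

def object_to_ee_actions(object_poses, grasp_in_object):
    """Convert an object-pose trajectory into end-effector targets.

    grasp_in_object: 4x4 SE(3) pose of the end effector in the object frame,
    assumed to stay fixed once the object is grasped.
    """
    return [pose @ grasp_in_object for pose in object_poses]
```

Commanding the robot through the resulting end-effector targets would reproduce the predicted object motion; the sketch assumes a fixed grasp transform, whereas the actual method optimizes the full SE(3) action sequence against the predicted trajectories.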



Task:
Mug-on-coaster

Task:
Chip-on-plate

Task:
Simple-boat-assembly

Task:
Succulents-in-llama-vase

Task:
Rearrange-mug-box

Task:
Complex-boat-assembly

Task:
Prepare-breakfast

Here we show all the RGB-D human demonstration videos. All videos were captured with an iPad placed statically on the table.




Task:
Pour-juice

Task:
Mug-on-coaster

Task:
Peas-on-plate

Task:
Succulents-in-llama-vase

Task:
Cheese-on-plate

Here we show all the RGB-only demonstration videos. Pour-juice is retrieved from an in-the-wild YouTube video, while Peas-on-plate and Mug-on-coaster are generated with Veo 2. Finally, the demo videos for the Succulents-in-llama-vase and Pour-juice tasks are the same as the RGB-D versions, but without depth.




Real Robot Rollouts

Here we provide rollout videos of ORION policies on real robots. All videos are played back at 4x speed.

Typical failure modes of ORION policies include missed grasps, failures to complete the goal due to misalignment during insertion, and failures to place the object on the correct region of the plate.