OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation

1 The University of Texas at Austin 2 Nvidia Research

*: Equal Contributions

Conference on Robot Learning (CoRL), 2024

Abstract

We study the problem of teaching humanoid robots to imitate manipulation skills by watching single human videos. To tackle this problem, we investigate an object-aware retargeting approach, where humanoid robots mimic the human motions in the video while adapting to the object locations during deployment. We introduce OKAMI, an algorithm that generates a reference plan from a single RGB-D video and derives a policy that follows the plan to complete the task. OKAMI sheds light on deploying humanoid robots in everyday environments, where the humanoid robot can quickly adapt to a new task given a single human video. Our experiments show that OKAMI outperforms the baseline by 58.33%, while showcasing systematic generalization across varying visual and spatial conditions.



Method

OKAMI is a two-stage method that enables a humanoid robot to imitate a manipulation task from a single human video. The first stage helps the humanoid understand what is happening in the action-free video, and the second stage allows the humanoid to execute the task in various scenarios.
In the first stage, OKAMI processes the human video to generate a reference manipulation plan, a spatiotemporal abstraction of the video that captures which objects are moved, how they are moved, and how the human moves between adjacent subgoals. OKAMI first identifies the task-relevant objects using a vision-language model (VLM), then tracks the object motions throughout the video. It uses a human reconstruction model to obtain SMPL-H trajectories. Subgoals are identified based on the velocities of object keypoints, and all of this information is combined to form the reference plan.
In the second stage, the humanoid motion is synthesized through object-aware retargeting, which retargets the human motion to the humanoid while adapting to the object locations during deployment. OKAMI first localizes the task-relevant objects and retrieves the current subgoal. It then retargets the SMPL-H trajectory to the humanoid using inverse kinematics and dex-retargeting. The trajectory is warped based on the objects' locations at test time and then sent to the real robot for execution.
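To make the plan-generation step concrete, below is a minimal sketch of segmenting subgoals from object keypoint velocities and assembling a reference plan. The function names, thresholds, and data layout (keypoint trajectories as numpy arrays) are illustrative assumptions, not OKAMI's actual implementation.

```python
# Sketch of subgoal segmentation from object keypoint velocities.
# Assumes keypoint and pose trajectories have already been extracted.
import numpy as np

def find_subgoals(keypoints, vel_threshold=0.01, min_gap=10):
    """Mark frames where tracked object keypoints come to rest.

    keypoints: (T, K, 3) array of 3D keypoint positions over T frames.
    Returns a list of frame indices treated as subgoal boundaries.
    """
    velocities = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1)  # (T-1, K)
    mean_speed = velocities.mean(axis=-1)                             # (T-1,)
    moving = mean_speed > vel_threshold

    subgoals = []
    for t in range(1, len(moving)):
        # A transition from moving to stationary ends a motion segment.
        if moving[t - 1] and not moving[t]:
            if not subgoals or t - subgoals[-1] >= min_gap:
                subgoals.append(t)
    return subgoals

def build_reference_plan(object_poses, smplh_traj, subgoal_frames):
    """Bundle per-step object targets and human motion segments into a plan."""
    plan, start = [], 0
    for end in subgoal_frames + [len(smplh_traj) - 1]:
        plan.append({
            "target_object_pose": object_poses[end],    # where the object ends up
            "human_motion": smplh_traj[start:end + 1],  # SMPL-H segment to retarget
        })
        start = end
    return plan
```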
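The sketch below illustrates the warping idea in its simplest form: shifting the reference end-effector waypoints toward the object's test-time location while keeping the start of the segment anchored. The blending scheme and names are assumptions for illustration; OKAMI's actual warping and IK details may differ.

```python
# Sketch of object-aware trajectory warping with plain 3D numpy arrays.
import numpy as np

def warp_trajectory(ref_waypoints, ref_object_pos, new_object_pos):
    """Shift reference end-effector waypoints toward the object's new location.

    Waypoints near the end of the segment (the object interaction) receive the
    full offset; earlier waypoints are blended so the start of the segment
    stays close to the robot's current configuration.
    """
    offset = new_object_pos - ref_object_pos      # (3,) translation of the object
    T = len(ref_waypoints)
    weights = np.linspace(0.0, 1.0, T)[:, None]   # ramp from 0 to 1 over the segment
    return ref_waypoints + weights * offset

# Each warped waypoint would then go through an inverse-kinematics solver for
# the arm joints, with finger joints obtained from hand retargeting (e.g., the
# dex-retargeting library), before streaming joint targets to the robot.
```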





Object-Aware Retargeting Results

We test object-aware retargeting on the real robot on six representative tasks, which cover a wide range of manipulation skills, including picking, placing, pouring, pushing, manipulating articulated objects, and bimanual cooperation. Our method enables the humanoid to perform the tasks in scenarios with different visual backgrounds, different object instances, and different object layouts. Here we provide the rollout videos of the six tasks.

Task: Bagging

Human demonstration video
Robot rollout video

Task: Sprinkle-salt

Human demonstration video
Robot rollout video

Task: Plush-toy-in-basket

Human demonstration video
Robot rollout video

Task: Place-snacks-on-plate

Human demonstration video
Robot rollout video

Task: Close-the-drawer

Human demonstration video
Robot rollout video

Task: Close-the-laptop

Human demonstration video
Robot rollout video



Closed-Loop Visuomotor Policies

By randomly initializing the object layouts and running the object-aware retargeting pipeline each time, we can efficiently generate a large volume of successful rollout data with OKAMI, without the need for human teleoperation. The rollout data can then be used to train closed-loop visuomotor policies through behavioral cloning (a minimal sketch of this data-collection loop is shown after the rollout videos below). We test on two tasks, Bagging and Sprinkle-salt, where the visuomotor policies achieve success rates of 83.3% and 75%, respectively. Here we provide the rollouts of the visuomotor policies.

Task: Bagging
Task: Sprinkle-salt
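As referenced above, here is a minimal sketch of the rollout-collection loop used to build a behavioral-cloning dataset. The callables `randomize_layout`, `run_okami_retargeting`, and `rollout_succeeded` are placeholders for the real environment reset, the object-aware retargeting pipeline, and the success check, which are not part of this page.

```python
# Illustrative data-collection loop for behavioral cloning; the three callables
# are hypothetical stand-ins, not OKAMI's actual API.
def collect_bc_dataset(num_episodes, randomize_layout,
                       run_okami_retargeting, rollout_succeeded):
    """Collect successful open-loop OKAMI rollouts as training data."""
    dataset = []
    for _ in range(num_episodes):
        initial_obs = randomize_layout()              # new random object layout
        trajectory = run_okami_retargeting(initial_obs)  # open-loop OKAMI rollout
        if rollout_succeeded(trajectory):
            dataset.append(trajectory)                # keep only successful rollouts
    return dataset

# The collected observation-action pairs are then used to train a closed-loop
# visuomotor policy via standard behavioral cloning, i.e., supervised
# regression of actions from observations.
```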



Failure Examples

OKAMI's policies may fail to grasp objects due to inaccuracies in the robot controllers, the human reconstruction model, or the vision models. They may also fail to complete tasks because of unwanted collisions, undesired upper-body rotations, or inaccurate inverse-kinematics solutions. Here we provide typical failure examples.

Failed to grasp the snack because the robot hand didn't move to a proper position for grasping.
Failed to grasp the bottle because the robot hand didn't move to a proper position for grasping.
Failed to complete the task because of inaccurate inverse kinematics results.
Failed to complete the task due to unwanted collisions and unwanted body rotation.



Team


Citation

@inproceedings{okami2024,
    title={OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation},
    author={Jinhan Li and Yifeng Zhu and Yuqi Xie and Zhenyu Jiang and Mingyo Seo and Georgios Pavlakos and Yuke Zhu},
    booktitle={8th Annual Conference on Robot Learning (CoRL)},
    year={2024}
}