OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation

1 The University of Texas at Austin 2 Nvidia Research

*: Equal Contributions

Conference on Robot Learning (CoRL), 2024

Abstract

We study the problem of teaching humanoid robots manipulation skills by imitating from single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalization across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation.



Method

OKAMI is a two-stage method that enables a humanoid robot to imitate a manipulation task from a single human video. The first stage helps the humanoid understand what happens in the video, which contains no action labels, and the second stage allows the humanoid to execute the task in new scenarios.
In the first stage, OKAMI processes the human video to generate a reference manipulation plan, a spatiotemporal abstraction of the video that captures which objects are moved, how they are moved, and how the human moves between adjacent subgoals. It first identifies the task-relevant objects using a vision-language model (VLM), then tracks the object motions throughout the video. A human reconstruction model is used to obtain SMPL-H trajectories. Subgoals are identified from object keypoint velocities, and all of this information is combined into the reference plan.
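To make the subgoal step concrete, below is a minimal sketch of velocity-based segmentation over tracked object keypoints. The function name, threshold value, and the simple moving/still heuristic are illustrative assumptions, not OKAMI's exact implementation.

import numpy as np

def segment_subgoals(keypoints, fps=30.0, vel_thresh=0.02):
    """Split a demonstration into subgoal segments from object keypoint motion.

    keypoints: (T, K, 3) array of 3D object keypoints over T frames.
    Returns a list of frame indices marking candidate subgoal boundaries.
    """
    # Per-frame speed of each keypoint via finite differences (m/s).
    vel = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1) * fps  # (T-1, K)
    moving = (vel > vel_thresh).any(axis=-1)                         # (T-1,)

    subgoals = []
    for t in range(1, len(moving)):
        # A transition from "moving" to "still" ends a motion segment,
        # i.e. a candidate subgoal frame.
        if moving[t - 1] and not moving[t]:
            subgoals.append(t)
    subgoals.append(keypoints.shape[0] - 1)  # the final frame is always a subgoal
    return subgoals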
In the second stage, the humanoid motion is synthesized through object-aware retargeting, which retargets the human motion to the humanoid while adapting to the object locations observed during deployment. OKAMI first localizes the task-relevant objects and retrieves the corresponding subgoal. It then retargets the SMPL-H trajectory to the humanoid using inverse kinematics and dex-retargeting. The retargeted trajectory is warped according to the objects' test-time locations and sent to the real robot for execution.
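The warping step can be illustrated with a simple linear blend: the trajectory keeps its starting pose and is progressively shifted so that its final waypoint lands on the object's test-time location. The sketch below is a hypothetical stand-in for OKAMI's object-aware warping, shown only to convey the idea.

import numpy as np

def warp_trajectory(traj, obj_demo, obj_test):
    """Warp a retargeted end-effector trajectory toward a new object location.

    traj:     (T, 3) end-effector positions retargeted from the human video.
    obj_demo: (3,) object position observed in the demonstration.
    obj_test: (3,) object position observed at deployment time.
    """
    traj = np.asarray(traj, dtype=float)
    offset = np.asarray(obj_test) - np.asarray(obj_demo)   # displacement of the object
    alpha = np.linspace(0.0, 1.0, len(traj))[:, None]      # 0 at the start, 1 at the end
    return traj + alpha * offset                            # progressively shifted waypoints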





Object-Aware Retargeting Results

We test object-aware retargeting on the real robot on six representative tasks, which cover a wide range of manipulation skills, including picking, placing, pouring, pushing, manipulating articulated objects, and bimanual cooperation. Our method enables the humanoid to perform these tasks in scenarios with different visual backgrounds, different object instances, and different object layouts. Here we provide the rollout videos of the six tasks.

Task: Bagging

Human demonstration video
Robot rollout video

Task: Sprinkle-salt

Human demonstration video
Robot rollout video

Task: Plush-toy-in-basket

Human demonstration video
Robot rollout video

Task: Place-snacks-on-plate

Human demonstration video
Robot rollout video

Task: Close-the-drawer

Human demonstration video
Robot rollout video

Task: Close-the-laptop

Human demonstration video
Robot rollout video



Closed-Loop Visuomotor Policies

By randomly initializing the object layouts and running the object-aware retargeting pipeline each time, we can efficiently generate a large volume of successful rollout data with OKAMI, without the need for human teleoperation. The rollout data can then be used to train closed-loop visuomotor policies through behavioral cloning. We evaluate on two tasks, Bagging and Sprinkle-salt, where the visuomotor policies achieve success rates of 83.3% and 75%, respectively. Here we provide the rollouts of the visuomotor policies.
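As a rough sketch of the behavioral cloning step, the snippet below regresses recorded actions from observation features extracted from the OKAMI rollouts. The network architecture, feature dimension, and action dimension are illustrative assumptions, not the exact policy trained in the paper.

import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Small MLP that maps observation features to robot actions."""
    def __init__(self, obs_dim=512, act_dim=26):  # dimensions are placeholders
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def train_bc(policy, loader, epochs=50, lr=1e-4):
    """Standard behavioral cloning: minimize MSE between predicted and
    recorded actions over batches of (observation, action) pairs."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act in loader:
            loss = nn.functional.mse_loss(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy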

Task: Bagging
Task: Sprinkle-salt



Failure Examples

OKAMI's policies may fail to grasp objects due to inaccuracies in the robot controllers, the human reconstruction model, or the vision models, or fail to complete tasks because of unwanted collisions, undesired upper-body rotations, or errors in solving inverse kinematics. Here we provide typical failure examples.

Failed to grasp the snack because the robot hand didn't move to a proper position for grasping.
Failed to grasp the bottle because the robot hand didn't move to a proper position for grasping.
Failed to complete the task because of inaccurate inverse kinematics results.
Failed to complete the task due to unwanted collisions and unwanted body rotation.



Team


Citation

@inproceedings{okami2024,
    title={OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation},
    author={Jinhan Li and Yifeng Zhu and Yuqi Xie and Zhenyu Jiang and Mingyo Seo and Georgios Pavlakos and Yuke Zhu},
    booktitle={8th Annual Conference on Robot Learning (CoRL)},
    year={2024}
}