Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions

Zhenyu Jiang*1,2    Yuqi Xie*1,2    Jinhan Li1    Ye Yuan2    Yifeng Zhu1    Yuke Zhu1,2   

1 The University of Texas at Austin 2 NVIDIA Research   

*: Equal Contributions

Conference on Robot Learning (CoRL), 2024   


Humanoid robots, with their human-like embodiment, have the potential to integrate seamlessly into human environments. Critical to their coexistence and cooperation with humans is the ability to understand natural language communication and exhibit human-like behaviors. This work focuses on generating diverse whole-body motions for humanoid robots from language descriptions. We leverage human motion priors from extensive human motion datasets to initialize humanoid motions and employ the commonsense reasoning capabilities of Vision Language Models (VLMs) to edit and refine these motions. Our approach produces natural, expressive, and text-aligned humanoid motions, validated through both simulated and real-world experiments.




Method

Given a language description of a motion, we first generate the corresponding human motion and retarget it to the humanoid using inverse kinematics. Next, we use a VLM to refine the humanoid motion: we extract finger and head motion descriptions from the initial language description and generate the corresponding motions with the VLM. Given the rendered humanoid motion, the VLM then iteratively evaluates and adjusts the motion until it aligns with the language description. The result is whole-body humanoid motion that accurately matches the language description.
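Below is a minimal Python sketch of this pipeline, written under stated assumptions: every name in it (the text-to-motion prior, the IK retargeting step, the rendering call, and the VLM evaluate/edit calls) is a hypothetical placeholder, not Harmon's actual API, and the stubs exist only so the control flow runs end to end.

from dataclasses import dataclass

@dataclass
class Feedback:
    aligned: bool
    instructions: str = ""

# ---- Stand-in components; swap in real models and solvers. ----
def text_to_human_motion(description):
    """Human motion prior, e.g. a pretrained text-to-motion model."""
    return [{"pose": "neutral"}]        # placeholder motion sequence

def retarget_with_ik(human_motion):
    """Map human joint trajectories onto the humanoid via inverse kinematics."""
    return human_motion                 # placeholder: identity mapping

def render(robot_motion):
    """Render the humanoid executing the motion for the VLM to inspect."""
    return []                           # placeholder frames

def vlm_generate_hand_head(description):
    """VLM extracts finger/head motion descriptions from the input text."""
    return "relax the fingers, look forward"   # placeholder instructions

def vlm_evaluate(frames, description):
    """VLM judges whether the rendered motion matches the description."""
    return Feedback(aligned=True)       # placeholder verdict

def vlm_edit(robot_motion, instructions):
    """Apply a motion-editing primitive suggested by the VLM."""
    return robot_motion                 # placeholder edit

# ---- Pipeline: text -> human motion -> IK retarget -> VLM refinement ----
def generate_humanoid_motion(description, max_iters=5):
    robot_motion = retarget_with_ik(text_to_human_motion(description))
    robot_motion = vlm_edit(robot_motion, vlm_generate_hand_head(description))
    for _ in range(max_iters):
        feedback = vlm_evaluate(render(robot_motion), description)
        if feedback.aligned:
            break
        robot_motion = vlm_edit(robot_motion, feedback.instructions)
    return robot_motion

The iterative loop reflects the evaluate-then-edit cycle described above; the cap on iterations is an assumed safeguard, not a detail from the paper.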



Motion Execution on Real Robots

We execute the generated humanoid motions on two real humanoid robots, GR1-T1 (black) and GR1-T2 (silver), two versions of the same humanoid with different appearances. All videos are played in real time.

Part 1: Performing Hand Sign Language

Using GPT-4, we segment a given sentence and generate a hand-sign-language motion description for each segment. Harmon then converts these descriptions into humanoid motions, which are concatenated to perform the complete sign-language sentence.
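A hypothetical sketch of this step follows. It assumes the OpenAI Python client for the GPT-4 call (the page only says "Using GPT-4"; the prompt wording is illustrative) and reuses the generate_humanoid_motion placeholder from the Method sketch; the naive frame-level concatenation is an assumption, not the authors' blending scheme.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sign_language_motion(sentence):
    # Ask GPT-4 to split the sentence into sign-language segments and
    # write one motion description per line.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Segment this sentence for hand sign language and "
                       "describe the hand/arm motion for each segment, "
                       "one per line:\n" + sentence,
        }],
    )
    descriptions = reply.choices[0].message.content.splitlines()

    # One humanoid motion per segment, chained back to back.
    motions = [generate_humanoid_motion(d) for d in descriptions if d.strip()]
    return [frame for motion in motions for frame in motion]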

Part 2: Following Motion Descriptions

Harmon generates humanoid motion from motion descriptions.



Motion Editing Visualization in Simulation

We also visualize the humanoid motion before (left) and after (right) the VLM-based editing. The green text on the right describes the motion refinements.



Failure Cases

The generated human motion claps more than once. The VLM is unable to fix this high-frequency motion.

The T shape should be formed in front of the chest to indicate a timeout. However, the text-to-human-motion model fails to capture this semantic meaning and generates a motion too far from the ground truth to be refined.

The VLM is unable to fix the orientation of the right wrist because there is no motion-editing primitive for this adjustment.



Citation


@inproceedings{harmon2024,
   title={Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions},
   author={Jiang, Zhenyu and Xie, Yuqi and Li, Jinhan and Yuan, Ye and Zhu, Yifeng and Zhu, Yuke},
   booktitle={8th Annual Conference on Robot Learning (CoRL)},
   year={2024}
}