MimicDroid: In-Context Learning for Humanoid Robot Manipulation
from Human Play Videos

1 The University of Texas at Austin     2 Amazon Consumer Robotics     3 NVIDIA
*Equal Contribution

Overview

Abstract

We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos—continuous, unlabeled videos of people interacting freely with their environment—as a scalable and diverse training data source. We introduce MimicDroid, which enables humanoids to perform ICL using human play videos as the only training data. MimicDroid extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquires ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperforms state-of-the-art methods and achieves nearly twofold higher success rates in the real world.
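
As a rough illustration of the retargeting step described above, the sketch below re-expresses a human wrist pose estimated in the camera frame as a gripper target in the robot base frame. The camera extrinsics, the pose-estimator output format, and the fixed wrist-to-gripper offset are illustrative assumptions and not the released implementation.

# Minimal sketch (not the released code): re-express a human wrist pose,
# estimated from an RGB video in the camera frame, as an end-effector
# target in the robot base frame. The camera extrinsics and the fixed
# wrist-to-gripper offset are illustrative assumptions.
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_matrix(position, quat_xyzw):
    """Build a 4x4 homogeneous transform from a position and an (x, y, z, w) quaternion."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat(quat_xyzw).as_matrix()
    T[:3, 3] = position
    return T

def retarget_wrist_pose(T_cam_wrist, T_base_cam, T_wrist_gripper):
    """Map a wrist pose in the camera frame to a gripper target in the robot base frame.

    T_cam_wrist:     4x4 wrist pose from a hand-pose estimator (camera frame).
    T_base_cam:      4x4 camera extrinsics (robot base -> camera).
    T_wrist_gripper: 4x4 constant offset aligning the human wrist frame with
                     the humanoid gripper frame (hand-tuned assumption).
    """
    return T_base_cam @ T_cam_wrist @ T_wrist_gripper

if __name__ == "__main__":
    # Example with made-up numbers: a wrist 0.5 m in front of the camera.
    T_cam_wrist = pose_to_matrix([0.0, 0.0, 0.5], [0.0, 0.0, 0.0, 1.0])
    T_base_cam = pose_to_matrix([0.3, 0.0, 1.2], R.from_euler("x", 180, degrees=True).as_quat())
    T_wrist_gripper = np.eye(4)
    target = retarget_wrist_pose(T_cam_wrist, T_base_cam, T_wrist_gripper)
    print(target[:3, 3])  # gripper position target in the robot base frame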

Method Overview

Overview of MimicDroid
MimicDroid performs meta-training for in-context learning (Meta-ICL) by constructing context-target pairs from human play videos.
  • Bottom. For a target segment, we retrieve the top-k most similar trajectory segments (bottom-left) based on observation-action similarity (bottom-right) to serve as context.
  • Top. These context-target pairs are used to train an ICL policy (top-left). To overcome the human-robot visual gap and avoid overfitting to human-specific visual cues, we apply visual masking to the input images (top-right), improving transferability (see the masking sketch below).
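
A minimal sketch of the random patch masking step, assuming square non-overlapping patches and an independent drop decision per patch; the patch size and masking ratio below are placeholders, not the values used by MimicDroid.

# Minimal sketch of random patch masking on a batch of input images.
# Patch size and mask ratio are illustrative; they are not taken from the paper.
import torch

def random_patch_mask(images: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.5) -> torch.Tensor:
    """Zero out a random subset of non-overlapping square patches per image.

    images: (B, C, H, W) tensor; H and W must be divisible by patch_size.
    """
    B, C, H, W = images.shape
    gh, gw = H // patch_size, W // patch_size
    # Sample an independent keep/drop decision for every patch in every image.
    keep = (torch.rand(B, 1, gh, gw, device=images.device) > mask_ratio).float()
    # Upsample the patch-level mask to pixel resolution and apply it.
    mask = keep.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return images * mask

if __name__ == "__main__":
    imgs = torch.rand(4, 3, 224, 224)
    masked = random_patch_mask(imgs)
    print(masked.shape)  # torch.Size([4, 3, 224, 224])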

Top-K Similar Trajectory Segments

Below, we visualize top-k similar trajectory segments for queries randomly sampled from the dataset, showing only three of the top ten retrieved videos per query (the query is leftmost). Each such context-target pair serves as a training instance for meta-training of in-context learning.
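
To make the pairing step concrete, here is a simplified sketch of retrieving the top-k context segments for a target segment via cosine similarity over mean-pooled observation-action features. The featurizer and the pooling choice are assumptions rather than the exact similarity measure used in the paper.

# Simplified sketch of context retrieval for Meta-ICL training pairs.
# Each segment is summarized by mean-pooling per-timestep observation-action
# features; the featurizer itself (e.g., a pretrained visual encoder) is assumed.
import numpy as np

def segment_embedding(obs_features: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Mean-pool concatenated observation features and actions over time.

    obs_features: (T, D_obs) per-frame features; actions: (T, D_act).
    """
    return np.concatenate([obs_features, actions], axis=-1).mean(axis=0)

def top_k_similar(query_emb: np.ndarray, candidate_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k candidates with highest cosine similarity to the query."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    c = candidate_embs / (np.linalg.norm(candidate_embs, axis=1, keepdims=True) + 1e-8)
    sims = c @ q
    return np.argsort(-sims)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = rng.normal(size=(100, 64))  # 100 candidate segments, 64-dim pooled embeddings
    query = rng.normal(size=64)
    print(top_k_similar(query, candidates, k=10))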

Simulation Results

Our simulation benchmark evaluates few-shot learning across three progressively challenging levels:

  • L1 (Seen Object, Seen Environment): Manipulation tasks with objects encountered during training in seen environments. This level evaluates the robot's ability to generalize to new object positions.
  • L2 (Unseen Object, Seen Environment): Manipulation tasks with novel objects in seen environments. This level evaluates the robot's ability to adapt to novel objects using few demonstrations.
  • L3 (Unseen Object, Unseen Environment): Manipulation tasks with novel objects in unseen environments. This level tests the robot's ability to generalize to different backgrounds, furniture layouts, and novel objects.

See details of the simulation benchmark here

Baseline Comparisons

See qualitative videos of simulation rollouts here, and quantitative results below.

Number of Prompts Results

MimicDroid adapts instantly and robustly, outperforming test-time finetuning by 26% on the abstract embodiment and 29% on the humanoid embodiment, while incurring only a 3% drop across embodiments. In contrast, finetuning often overfits to the abstract embodiment, showing a larger 10% drop across embodiments, and suffers from catastrophic forgetting under distribution shifts, failing at the most challenging level (L3). Task-conditioned baselines that rely only on task specifications perform 14% worse on the abstract embodiment and 18% worse on the humanoid embodiment.

How does scaling dataset size impact MimicDroid's ability to perform ICL?

Success vs Frames

We observe consistent performance improvements across all generalization levels (L1–L3) as the amount of training data increases, demonstrating the scalability of learning from RGB play videos. These results also motivate a more systematic study of the factors that influence ICL performance on harder generalization tasks like L3.

Simulation Rollout Examples

Below are examples of MimicDroid's performance on the simulation benchmark. Note that the videos below use a wider camera angle than the actual observations for better viewing.

Real-World Rollout Examples

Each rollout first shows the human demonstration video, followed by the robot performing the same task via in-context learning.
The language description is provided only to help viewers understand the task.
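
For readers curious about the prompting interface, the sketch below shows one plausible structure for such a test-time rollout: the human demonstration frames serve as a fixed context, and the policy is queried with that context plus the robot's current observation at every control step. The policy and env interfaces are hypothetical placeholders, not MimicDroid's actual API.

# Hypothetical sketch of an in-context rollout: the policy sees the human
# demonstration as context and the robot's live observations as the query.
# `policy` and `env` are placeholder interfaces, not MimicDroid's actual API.
import numpy as np

def rollout_with_context(policy, env, demo_frames: np.ndarray, max_steps: int = 200):
    """Run one episode conditioned on a human demonstration video.

    demo_frames: (T, H, W, 3) RGB frames of the human play segment used as context.
    policy:      callable mapping (context_frames, current_obs) -> action.
    env:         environment with reset() -> obs and step(action) -> (obs, done).
    """
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(demo_frames, obs)  # context is fixed; only the query changes
        obs, done = env.step(action)
        if done:
            break
    return obs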


Cluttered Environments

Articulated Object Manipulation

Varied Object Positions

Novel Object Categories

Failure Analysis

We further conduct a quantitative analysis of failure rates in simulation to understand common error modes.

Failures in downstream tasks arise from task misidentification (26%), missed grasps (16%), and other errors (8%) such as incomplete cabinet closure and missed placement (Fig. 4). Compared to Vid2Robot, MimicDroid notably reduces both task misidentification (−15%) and grasping errors (−5%) through ICL.

BibTeX

@article{shah2025mimicdroid,
  title={MimicDroid: In-Context Learning for Humanoid Manipulation from Human Play Videos},
  author={Shah, Rutav and Liu, Shuijing and Wang, Qi and Jiang, Zhenyu and Kumar, Sateesh and Seo, Mingyo and Mart{\'\i}n-Mart{\'\i}n, Roberto and Zhu, Yuke},
  journal={arXiv preprint arXiv:2509.09769},
  year={2025}
}