Our simulation benchmark evaluates few-shot learning across three progressively challenging levels:
- L1 (Seen Object, Seen Environment): Manipulation tasks with objects encountered during training in seen environments. This level evaluates the robot's ability to generalize to new object positions.
- L2 (Unseen Object, Seen Environment): Manipulation tasks with novel objects in seen environments. This level evaluates the robot's ability to adapt to novel objects using few demonstrations.
- L3 (Unseen Object, Unseen Environment): Manipulation tasks with novel objects in unseen environments. This level tests the robot's ability to generalize simultaneously to novel objects and to different backgrounds and furniture layouts (see the sketch after this list).
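To make the three settings concrete, here is a minimal Python sketch of how they could be encoded as an evaluation configuration. The names (`EvalLevel`, `unseen_object`, `unseen_environment`, `EVAL_LEVELS`) are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalLevel:
    """One evaluation setting of the simulation benchmark (hypothetical encoding)."""
    name: str
    unseen_object: bool       # True if the manipulated object was not seen during training
    unseen_environment: bool  # True if the background/furniture layout was not seen during training

# L1: seen objects, seen environments -- only object positions change.
# L2: unseen objects, seen environments -- adapt to novel objects from a few demonstrations.
# L3: unseen objects, unseen environments -- novel backgrounds, layouts, and objects.
EVAL_LEVELS = [
    EvalLevel("L1", unseen_object=False, unseen_environment=False),
    EvalLevel("L2", unseen_object=True,  unseen_environment=False),
    EvalLevel("L3", unseen_object=True,  unseen_environment=True),
]

def describe(level: EvalLevel) -> str:
    """Render a level as a short human-readable summary."""
    obj = "unseen" if level.unseen_object else "seen"
    env = "unseen" if level.unseen_environment else "seen"
    return f"{level.name}: {obj} object, {env} environment"

if __name__ == "__main__":
    for level in EVAL_LEVELS:
        print(describe(level))  # e.g. "L3: unseen object, unseen environment"
```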
See details of the simulation benchmark here.
Baseline Comparisons
See qualitative videos of simulation rollouts here, and quantitative results below.


MimicDroid adapts instantly and robustly, outperforming test-time finetuning by 26% on the abstract embodiment and 29% on the humanoid embodiment, while incurring only a 3% drop when transferring across embodiments. In contrast, finetuning often overfits to the abstract embodiment, showing a larger 10% drop across embodiments, and suffers from catastrophic forgetting under distribution shifts, failing at the most challenging level (L3). Task-conditioned baselines that rely only on task specifications underperform MimicDroid by 14% on the abstract embodiment and 18% on the humanoid embodiment.