Simulation Benchmark

MimicDroid's proposed few-shot learning benchmark for humanoid manipulation.

Few-shot Learning Generalization Levels

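Evaluation is organized into three levels of progressively increasing difficulty, obtained by varying the objects and environments: L1 (seen objects, seen environment), L2 (unseen objects, seen environment), and L3 (unseen objects, unseen environment). This structure enables a systematic assessment of few-shot learning. Each level is evaluated on a set of four manipulation tasks.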

The videos below show the demonstrations for each task.

Seen Object, Seen Environment

Manipulation tasks with objects the robot encountered during training, in environments it also saw during training.
This level evaluates the robot's ability to generalize to new object positions.

Close Left Cabinet Door

Pick and Place from Sink to Cabinet

Pick and Place from Sink to Right Counter Plate

Turn On Faucet

Unseen Object, Seen Environment

Manipulation tasks with novel objects the robot has not encountered during training.
This level evaluates the robot's ability to adapt to novel objects using a few demonstrations.

Close Left Cabinet Door

Close Right Cabinet Door

Pick and Place from Sink to Cabinet

Pick and Place from Sink to Right Counter Plate

Unseen Object, Unseen Environment

Manipulation tasks with novel objects in new kitchen environments that the robot has not encountered during training.
This level tests the robot's ability to generalize to different backgrounds, furniture layouts, and novel objects.

Close Left Cabinet Door

Pick and Place from Sink to Microwave Top

Pick and Place from Sink to Right Counter Plate

Turn On Faucet
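To make the benchmark layout concrete, here is a minimal Python sketch that encodes the three levels and their task lists as a small data structure. The names `Level`, `LEVELS`, and `novelty` are hypothetical and used only for illustration; they are not part of any released MimicDroid code. The mapping of L1, L2, and L3 to the three level descriptions is assumed from the order in which they appear above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One generalization level (hypothetical illustration, not MimicDroid's API)."""
    name: str
    seen_objects: bool
    seen_environment: bool
    tasks: tuple[str, ...]


# Task lists copied from the level descriptions above; the L1/L2/L3 labels
# are assumed from their order of increasing difficulty.
LEVELS = (
    Level("L1", seen_objects=True, seen_environment=True, tasks=(
        "Close Left Cabinet Door",
        "Pick and Place from Sink to Cabinet",
        "Pick and Place from Sink to Right Counter Plate",
        "Turn On Faucet",
    )),
    Level("L2", seen_objects=False, seen_environment=True, tasks=(
        "Close Left Cabinet Door",
        "Close Right Cabinet Door",
        "Pick and Place from Sink to Cabinet",
        "Pick and Place from Sink to Right Counter Plate",
    )),
    Level("L3", seen_objects=False, seen_environment=False, tasks=(
        "Close Left Cabinet Door",
        "Pick and Place from Sink to Microwave Top",
        "Pick and Place from Sink to Right Counter Plate",
        "Turn On Faucet",
    )),
)


def novelty(level: Level) -> int:
    """Count how many factors (objects, environment) are unseen at this level."""
    return int(not level.seen_objects) + int(not level.seen_environment)


# Tasks that appear at every level, useful for comparing the same task across levels.
shared_tasks = sorted(set.intersection(*(set(lvl.tasks) for lvl in LEVELS)))

for lvl in LEVELS:
    print(lvl.name, "unseen factors:", novelty(lvl), "tasks:", len(lvl.tasks))
print(shared_tasks)
```

Representing each level by two seen/unseen flags makes the difficulty progression explicit: the number of unseen factors grows from 0 (L1) to 1 (L2) to 2 (L3). Only two tasks, Close Left Cabinet Door and Pick and Place from Sink to Right Counter Plate, appear at all three levels.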

Training Data

We collect human play data in simulation by interacting freely with randomized kitchen environments using a SpaceMouse. Each play session lasts 20 minutes and captures diverse interactions covering a broad range of tasks, object configurations, and manipulation behaviors. In total, we gather 8 hours of simulated play, amounting to 320k timesteps.
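The dataset figures above imply a few derived quantities. The following back-of-the-envelope sketch computes them, assuming every session lasts exactly 20 minutes and that timesteps are evenly spaced in time. The control frequency is not stated, so the rate computed here is only an implied average, and since "320k" is a rounded count, all results are approximate.

```python
# Figures stated in the text; 320k is a rounded count.
SESSION_MINUTES = 20
TOTAL_HOURS = 8
TOTAL_TIMESTEPS = 320_000

# Assumption: all sessions have exactly the stated length.
num_sessions = TOTAL_HOURS * 60 // SESSION_MINUTES      # 480 min / 20 min
timesteps_per_session = TOTAL_TIMESTEPS / num_sessions

# Assumption: timesteps are evenly spaced, so this is an average rate,
# not a documented control frequency.
total_seconds = TOTAL_HOURS * 3600
implied_hz = TOTAL_TIMESTEPS / total_seconds

print(f"sessions: {num_sessions}")
print(f"timesteps per session: {timesteps_per_session:.0f}")
print(f"implied average rate: {implied_hz:.1f} Hz")
```

Under these assumptions, the dataset corresponds to 24 sessions of roughly 13.3k timesteps each, at an average of about 11 timesteps per second.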

The videos below show randomly sampled clips of play data from our training dataset.