SCIZOR: Self-Supervised Data Curation
for Large-Scale Imitation Learning

Yu Zhang1*, Yuqi Xie1,2*, Huihan Liu1†, Rutav Shah1†,
Michael Wan2, Linxi “Jim” Fan2, Yuke Zhu1,2

1: The University of Texas at Austin, 2: NVIDIA

*, †: Equal contribution

Abstract

Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. However, the large-scale datasets used for policy training often vary substantially in quality, which can degrade performance. Automatically curating datasets by filtering out low-quality samples therefore becomes essential. Existing robotic curation approaches rely on costly manual annotations and operate at a coarse granularity, such as the dataset or trajectory level, failing to account for the quality of individual state-action pairs. To address this, we introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies. SCIZOR targets two complementary sources of low-quality data: suboptimal data, whose undesirable actions hinder learning, and redundant data, whose repetitive patterns dilute training. For suboptimal data, SCIZOR leverages a self-supervised task progress predictor to remove samples lacking task progression; for redundant data, it applies a deduplication module operating on a joint state-action representation. Empirically, we show that SCIZOR enables imitation learning policies to achieve higher performance with less data, yielding an average improvement of 15.4% across multiple benchmarks.

Overview

SCIZOR is a self-supervised data curation pipeline that slashes large imitation datasets by automatically pruning two types of low-value samples:

  • Suboptimal transitions: a progress-prediction model flags and removes frames that fail to make meaningful task progress, without needing reward signals.
  • Redundant transitions: joint visual-action embeddings are clustered to detect and discard look-alike segments, ensuring diversity.

By combining both filters, SCIZOR targets distinct sources of noise and repetition, yielding a higher-quality training set.

Method

Suboptimal Transition Removal

We train a self-supervised progress estimator that predicts how much progress a robot makes over short intervals. Transitions where predicted progress falls significantly below the elapsed time are flagged as suboptimal. This lets us automatically remove segments that stall or deviate from the task, all without any reward annotations.
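The pruning criterion described above can be sketched in a few lines. This is a rough illustration, not the paper's implementation: `progress_pred` is a hypothetical stand-in for the self-supervised progress estimator (assumed to output progress normalized to the same scale as elapsed time), and `margin` is an assumed pruning threshold.

```python
def flag_suboptimal(progress_pred, t_start, t_end, margin=0.5):
    """Flag a transition as suboptimal when predicted task progress over
    [t_start, t_end] falls well below the elapsed time.

    progress_pred: callable (t_start, t_end) -> predicted progress; a
        hypothetical stand-in for a learned progress estimator.
    margin: assumed fraction of elapsed time below which we prune.
    """
    elapsed = t_end - t_start                  # time the robot had available
    predicted = progress_pred(t_start, t_end)  # progress it actually made
    return predicted < margin * elapsed        # True -> prune this segment

# A stalled segment makes almost no progress despite elapsed time.
stalled = lambda t0, t1: 0.05 * (t1 - t0)
print(flag_suboptimal(stalled, 0.0, 0.2))  # True
```

A transition from a well-executed demonstration, where progress keeps pace with time, would fall above the margin and be retained.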

State-Action Deduplication

We encode fixed-duration chunks into joint visual-action embeddings and use K-means clustering to group similar samples. Within each cluster, samples with high similarity scores are pruned, eliminating overrepresented patterns while preserving rare variations. This ensures a diverse, compact dataset that improves downstream learning.
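The cluster-then-prune procedure can be illustrated with a minimal NumPy-only sketch. All names here are assumptions for illustration: a toy Lloyd's-algorithm k-means stands in for the clustering step, cosine similarity with an assumed threshold `sim_thresh` stands in for the similarity score, and the embeddings would come from the joint visual-action encoder.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Toy Lloyd's-algorithm k-means; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():  # recenter non-empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def deduplicate(embeddings, k=2, sim_thresh=0.98):
    """Within each cluster, greedily keep a sample only if its cosine
    similarity to every already-kept sample stays below sim_thresh."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = kmeans(X, k)
    keep = []
    for j in range(k):
        kept = []
        for i in np.where(labels == j)[0]:
            if all(X[i] @ X[m] < sim_thresh for m in kept):
                kept.append(i)
        keep.extend(kept)
    return sorted(keep)

# Two near-duplicate chunks and one distinct chunk: one duplicate is pruned.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(deduplicate(emb, k=2))  # one survivor per duplicate group
```

Because pruning happens within clusters rather than globally, rare samples that land in small clusters are never compared against the dominant patterns, which is what preserves diversity.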

Experiments

We evaluate SCIZOR's impact on policy success rates across three benchmarks: RoboMimic, Sirius-Fleet, and Open-X. Compared to training on the full dataset, SCIZOR delivers absolute gains of 5.4% on RoboMimic, 8.1% on OXE-Magic-Soup, and 32.9% on the Sirius-Fleet real-robot tasks. It also surpasses uniform curation by 16.1% on average, indicating that SCIZOR selects the samples to delete in a targeted rather than indiscriminate way. These improvements demonstrate that SCIZOR's data curation consistently filters out low-quality samples and improves policy learning in both simulated and real-world robotic environments.

Visualization of our evaluation environments.
Main Results
Main results illustrating SCIZOR's performance improvements.

We also evaluated SCIZOR against two leading curation methods: DemInf (trajectory-level mutual-information estimation) and Re-Mix (dataset-level domain weighting). In the RT-X mixture setting, SCIZOR outperformed Re-Mix by 3.5% on average. On RoboMimic, DemInf slightly outperformed SCIZOR, as its trajectories are pre-divided into quality tiers that favor trajectory-level selection. However, on the Sirius-Fleet dataset, where quality varies across human and policy data, SCIZOR surpassed DemInf by 19.2%, highlighting the advantage of fine-grained state-action curation under uneven data distributions.

SCIZOR greatly boosts policy performance on real-world datasets.

  • Bowl to Plate: 93.3% (↑28.3%)
  • Mug to Basket: 66.7% (↑31.7%)
  • Book to Caddy: 66.7% (↑26.7%)
  • Bread to Plate: 90.0% (↑40.0%)
  • Mug to Plate: 83.3% (↑48.3%)
  • Bowl to Basket: 96.7% (↑36.7%)
  • Cup to Caddy: 73.3% (↑8.3%)
  • Mug to Oven: 66.7% (↑31.7%)

We visualize the types of suboptimal data identified by SCIZOR.

  • Manipulation Failure: The bowl held by the robot dropped accidentally.
  • Slow Motion: The robot gripper moved toward the bowl at a slow pace.
  • Manipulation Failure: The robot gripper failed to grasp the blue mug.
  • Stuck at Collision: The book held by the robot collided with the caddy, leading to a halt.
  • Pause: The robot arm stopped behind the cereal box for a long time.
  • Move Back and Forth: The robot arm moved aimlessly and did not contribute to task progress.

BibTeX

@misc{zhang2025scizorselfsupervisedapproachdata,
  title={SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning}, 
  author={Yu Zhang and Yuqi Xie and Huihan Liu and Rutav Shah and Michael Wan and Linxi Fan and Yuke Zhu},
  year={2025},
  eprint={2505.22626},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.22626},
}