Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models

Huihan Liu1, Rutav Shah1, Shuijing Liu1, Jack Pittenger1, Mingyo Seo1,
Yuchen Cui2, Yonatan Bisk3, Roberto Martín-Martín1, Yuke Zhu1

1 The University of Texas at Austin   2 The University of California, Los Angeles  
3 Carnegie Mellon University


Casper Demo (Audio On!)


Abstract

Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained vision language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.


Overview

Casper infers user intents and offers help when it is confident. Given user teleoperation input, Casper uses VLMs with commonsense reasoning to predict the human's intent. Upon user confirmation, Casper autonomously executes a skill from its library to fulfill that intent. Casper's background reasoning runs in parallel with foreground human control to minimize disruption.
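To make this foreground/background split concrete, here is a minimal sketch in which intent inference runs in a background thread while the user keeps teleoperating in the foreground. All names (robot, vlm_inference, get_user_command, etc.) are hypothetical placeholders, not the actual Casper implementation.

```python
# Illustrative sketch: background intent inference alongside foreground teleoperation.
# All objects and method names below are assumptions for illustration only.

import threading


def assistive_teleop_loop(robot, vlm_inference, get_user_command):
    proposal = {"intent": None}

    def background_inference():
        # Runs in parallel with teleoperation; writes a proposal once confident.
        proposal["intent"] = vlm_inference()

    threading.Thread(target=background_inference, daemon=True).start()

    while proposal["intent"] is None:
        # Foreground: the user stays in direct control the whole time.
        robot.apply(get_user_command())

    # Hand control to autonomous execution only after the user confirms.
    if robot.ask_user_to_confirm(proposal["intent"]):
        robot.execute_skill(proposal["intent"])
```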



Method

Casper runs VLM-based intent inference in parallel with human teleoperation. It generates task candidates from observations and infers the user's intent among those candidates from teleoperation inputs, repeating until its predictions are self-consistent. Once the user confirms, Casper executes the corresponding skill with estimated parameters.
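The sketch below illustrates this inference loop under stated assumptions: candidate generation, repeated VLM queries until the sampled predictions agree (a simple majority-vote notion of self-consistency), and skill execution after confirmation. Every function and object name (vlm, skill_library, get_teleop_snippet, etc.) is an illustrative assumption, not the actual Casper API.

```python
# Minimal sketch of a Casper-style intent-inference loop (illustrative only).

from collections import Counter


def infer_and_assist(vlm, skill_library, get_observation, get_teleop_snippet,
                     confirm_with_user, num_samples=5, agreement=0.8):
    """Infer the user's intent from teleop input and execute a skill on confirmation."""
    obs = get_observation()

    # Step 1: propose plausible task candidates from the current scene.
    candidates = vlm.generate_task_candidates(obs)

    # Step 2: accumulate teleoperation snippets and re-query the VLM until its
    # sampled predictions agree with each other (self-consistency check).
    history = []
    while True:
        history.append(get_teleop_snippet())
        votes = Counter(
            vlm.infer_intent(obs, history, candidates) for _ in range(num_samples)
        )
        intent, count = votes.most_common(1)[0]
        if count / num_samples >= agreement:
            break  # prediction is confident and self-consistent

    # Step 3: confirm with the user, then execute the corresponding skill
    # with parameters estimated from the observation.
    if confirm_with_user(intent):
        params = vlm.estimate_skill_parameters(obs, intent)
        skill_library[intent].execute(params)
```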



Experiments

We conduct experiments on three multi-step mobile manipulation tasks: Toy, Shelf, and Door. At each step, the robot disambiguates the user's intent among multiple plausible goals, selecting the correct one based on user inputs and visual context.


We conducted an IRB-approved user study with N=13 participants, all of whom gave informed consent. Participants completed a practice session and then used each method in randomized order. After each method, they answered user satisfaction and NASA-TLX questionnaires.


User study: user workload and user satisfaction. Casper achieves significantly lower user workload (left) and higher user satisfaction (right) than the baselines. Note that for user satisfaction scores, "assist helpfully" and "correct intent" are not applicable to Full Teleop.


User study: task success rate and completion time. Casper outperforms baselines in both task success and completion time.


Quantitative results from unit testing and ablation studies:
Left: Casper outperforms all baselines in intent inference success rate. Note that no STD is reported for deterministic baselines. The ablation of Casper vs. Casper - No Visual Prompting (VP) highlights the benefit of visual prompting.
Middle: Success rates improve with longer teleoperation history.
Right: Removing confidence estimation increases false prediction rates across all history lengths.


BibTeX


@misc{liu2025casperinferringdiverseintents,
      title={Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models}, 
      author={Huihan Liu and Rutav Shah and Shuijing Liu and Jack Pittenger and Mingyo Seo and Yuchen Cui and Yonatan Bisk and Roberto Martín-Martín and Yuke Zhu},
      year={2025},
      eprint={2506.14727},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.14727}, 
}