Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze and Bottleneck

Ryo Takizawa*, Izumi Karino, Koki Nakagawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi
The University of Tokyo
*Indicates Corresponding Author

GazeBot achieves high reusability of learned skills for unseen object positions and end-effector poses. Demonstrations are collected within restricted ranges of object positions and end-effector poses, and the success rate is then evaluated for in-distribution (ID) cases within these ranges and for out-of-distribution (OOD) cases outside them.

Abstract

Autonomous agents capable of diverse object manipulation should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of learned motions even when the object positions and end-effector poses differ from those in the provided demonstrations. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves higher generalization performance than state-of-the-art imitation learning methods without sacrificing dexterity or reactivity. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided.

Gaze-based Visual Representation


Although the left and right scenes represent a similar state, the object positions on the table differ. In the conventional gaze-centered image (a), which lacks 3D awareness, the two scenes appear substantially different, whereas in our proposed gaze-centered point cloud (b), their underlying three-dimensional structure is captured as similar.
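As a rough illustration, the sketch below (Python with NumPy) shows one way such a gaze-centered point cloud could be built from an RGB-D frame and a 2D gaze point. The function names, pinhole camera model, and crop radius are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    # Back-project a depth image (H, W) in meters into an (H*W, 3) point cloud.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def gaze_centered_point_cloud(depth, gaze_uv, intrinsics, radius=0.15):
    # Keep only points within `radius` meters of the 3D gaze point and
    # express them relative to that point, so the representation stays
    # similar regardless of where the object sits on the table.
    fx, fy, cx, cy = intrinsics
    points = depth_to_points(depth, fx, fy, cx, cy)
    gu, gv = gaze_uv
    gz = depth[gv, gu]
    gaze_xyz = np.array([(gu - cx) * gz / fx, (gv - cy) * gz / fy, gz])
    local = points - gaze_xyz
    return local[np.linalg.norm(local, axis=1) < radius]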

Bottleneck-aware Action Segmentation


By observing the object manipulation in the gaze-centered point cloud, an action can be automatically segmented into two phases at the bottleneck: (1) the reaching motion and (2) the gaze-centered dexterous action. The motion after the bottleneck is reusable regardless of the object position and the initial end-effector pose.
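The sketch below shows one possible offline segmentation of a demonstration, approximating the bottleneck as the first timestep at which the end-effector enters a fixed-radius region around the gaze point. The radius and the entry criterion are assumptions made for illustration, not the paper's definition of the bottleneck.

import numpy as np

def segment_at_bottleneck(ee_positions, gaze_positions, enter_radius=0.10):
    # ee_positions, gaze_positions: (T, 3) arrays expressed in the same frame.
    # Returns index ranges for the reaching phase and the dexterous phase,
    # plus the bottleneck timestep itself.
    dists = np.linalg.norm(ee_positions - gaze_positions, axis=1)
    inside = np.nonzero(dists < enter_radius)[0]
    t_b = int(inside[0]) if len(inside) > 0 else len(ee_positions) - 1
    return slice(0, t_b), slice(t_b, len(ee_positions)), t_b

# Example: split one demonstration into the two phases.
# reaching, dexterous, t_b = segment_at_bottleneck(ee_traj, gaze_traj)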

Model Architecture


GazeBot architecture. (left) The gaze prediction model is trained to estimate the gaze position over the entire image as a classification problem. (right) The action policy model achieves robust reaching motion by estimating the bottleneck pose and the shape of the trajectory up to the bottleneck, and uses a Transformer to predict the gaze-centered dexterous action in a fully parametric manner. Both actions and gaze transitions are predicted from the gaze-centered point cloud to improve reusability.
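The PyTorch sketch below outlines this two-part design at a high level. The layer sizes, grid resolution, point-cloud encoder, and action parameterization are placeholders introduced for illustration; only the overall split into a gaze classifier and a bottleneck-plus-dexterous-action policy follows the description above.

import torch
import torch.nn as nn

class GazeClassifier(nn.Module):
    # Predicts the gaze location as a distribution over a coarse grid of image cells.
    def __init__(self, grid_h=24, grid_w=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((grid_h, grid_w)),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, image):              # image: (B, 3, H, W)
        logits = self.backbone(image)      # (B, 1, grid_h, grid_w)
        return logits.flatten(1)           # classification logits over grid cells

class ActionPolicy(nn.Module):
    # One head regresses the bottleneck pose and reaching-trajectory shape;
    # a Transformer decodes a chunk of gaze-centered dexterous actions.
    def __init__(self, feat_dim=256, action_dim=8, chunk=20):
        super().__init__()
        self.point_encoder = nn.Sequential(    # per-point MLP + max pool (PointNet-style placeholder)
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.bottleneck_head = nn.Linear(feat_dim, 7 + 3)   # pose + trajectory-shape parameters (assumed sizes)
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(chunk, feat_dim))
        self.action_head = nn.Linear(feat_dim, action_dim)

    def forward(self, gaze_points):            # gaze_points: (B, N, 3) gaze-centered point cloud
        feat = self.point_encoder(gaze_points).max(dim=1).values
        bottleneck = self.bottleneck_head(feat)
        tokens = torch.cat([feat[:, None, :],
                            self.queries.expand(feat.shape[0], -1, -1)], dim=1)
        decoded = self.transformer(tokens)[:, 1:, :]      # drop the context token
        return bottleneck, self.action_head(decoded)      # action chunk in the gaze-centered frame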

Experiments

Evaluation Setup (ID and OOD)


Examples of ID and OOD trials in the PenInCup, OpenCap, and PileBox tasks, where object positions and the initial end-effector poses are controlled. The images show the initial states of each trial. The checkboxes correspond to the method order in the table below and indicate whether each method succeeded in the task from that initial state.


Rollouts

PenInCup Task
ID trials
GazeBot (Ours)
ACT [Zhao et al. 2023]
OOD trials (unseen object positions)
GazeBot
ACT
OOD trials (unseen initial poses)
GazeBot
ACT
OpenCap Task
ID trials
GazeBot
ACT
OOD trials (unseen object positions)
GazeBot
ACT
OOD trials (unseen initial poses)
GazeBot
ACT

Results


Comparison of success rates (%) for ID and OOD trials in the PenInCup, OpenCap, and PileBox tasks. Our method is compared with seven ablation models, two of which are conventional baselines. We conducted 12 trials each for PenInCup and OpenCap, and 20 trials for PileBox. Notably, only the proposed method maintains a high success rate in OOD situations across all tasks.


Reactiveness of GazeBot

BibTeX

@article{Takizawa2025,
    title={Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze and Bottleneck},
    author={Ryo Takizawa and Izumi Karino and Koki Nakagawa and Yoshiyuki Ohmura and Yasuo Kuniyoshi},
    journal={arXiv preprint arXiv:2502.18121},
    year={2025},
}