Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze and Bottleneck

Ryo Takizawa*, Izumi Karino, Koki Nakagawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi
The University of Tokyo
*Indicates Corresponding Author

GazeBot achieves high reusability of learned skills for unseen object positions and end-effector poses. Demonstrations are collected within restricted ranges of object positions and end-effector poses, and the success rate is then evaluated for in-distribution (ID) cases within these ranges and for out-of-distribution (OOD) cases outside them.

Abstract

Autonomous agents capable of diverse object manipulation should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of learned motions even when the object positions and end-effector poses differ from those in the provided demonstrations. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves higher generalization performance than state-of-the-art imitation learning methods without sacrificing dexterity or reactivity. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided.

Gaze-based Visual Representation


Although the left and right scenes represent a similar state, the object positions on the table differ. In the conventional gaze-centered image (a), which lacks 3D awareness, the two scenes appear substantially different, whereas in our proposed gaze-centered point cloud (b), their underlying three-dimensional structure is captured as similar.
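As a rough illustration, the sketch below (Python with NumPy) shows one way such a gaze-centered point cloud could be built from an RGB-D frame and a 2D gaze point. The function names, pinhole camera model, and crop radius are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    # Back-project a depth image (H, W) in meters into an (H*W, 3) point cloud.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def gaze_centered_point_cloud(depth, gaze_uv, intrinsics, radius=0.15):
    # Keep only points within `radius` meters of the 3D gaze point and
    # express them relative to that point, so the representation stays
    # similar regardless of where the object sits on the table.
    fx, fy, cx, cy = intrinsics
    points = depth_to_points(depth, fx, fy, cx, cy)
    gu, gv = gaze_uv
    gz = depth[gv, gu]
    gaze_xyz = np.array([(gu - cx) * gz / fx, (gv - cy) * gz / fy, gz])
    local = points - gaze_xyz
    return local[np.linalg.norm(local, axis=1) < radius]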

Bottleneck-aware Action Segmentation


By observing the object manipulation in the gaze-centered point cloud, an action can be automatically segmented into two phases at the bottleneck: (1) the reaching motion and (2) the gaze-centered dexterous action. The motion after the bottleneck is reusable regardless of the object position and the initial end-effector pose.
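The sketch below shows one possible offline segmentation of a demonstration, approximating the bottleneck as the first timestep at which the end-effector enters a fixed-radius region around the gaze point. The radius and the entry criterion are assumptions made for illustration, not the paper's definition of the bottleneck.

import numpy as np

def segment_at_bottleneck(ee_positions, gaze_positions, enter_radius=0.10):
    # ee_positions, gaze_positions: (T, 3) arrays expressed in the same frame.
    # Returns index ranges for the reaching phase and the dexterous phase,
    # plus the bottleneck timestep itself.
    dists = np.linalg.norm(ee_positions - gaze_positions, axis=1)
    inside = np.nonzero(dists < enter_radius)[0]
    t_b = int(inside[0]) if len(inside) > 0 else len(ee_positions) - 1
    return slice(0, t_b), slice(t_b, len(ee_positions)), t_b

# Example: split one demonstration into the two phases.
# reaching, dexterous, t_b = segment_at_bottleneck(ee_traj, gaze_traj)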

Model Architecture


GazeBot architecture. (left) The gaze prediction model is trained to estimate the gaze position over the entire image as a classification problem. (right) The action policy model achieves robust reaching motion by estimating the bottleneck pose and the shape of the trajectory up to the bottleneck, and uses a Transformer to predict the gaze-centered dexterous action in a fully parametric manner. Both actions and gaze transitions are predicted from the gaze-centered point cloud to improve reusability.
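The PyTorch sketch below outlines this two-part design at a high level. The layer sizes, grid resolution, point-cloud encoder, and action parameterization are placeholders introduced for illustration; only the overall split into a gaze classifier and a bottleneck-plus-dexterous-action policy follows the description above.

import torch
import torch.nn as nn

class GazeClassifier(nn.Module):
    # Predicts the gaze location as a distribution over a coarse grid of image cells.
    def __init__(self, grid_h=24, grid_w=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((grid_h, grid_w)),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, image):              # image: (B, 3, H, W)
        logits = self.backbone(image)      # (B, 1, grid_h, grid_w)
        return logits.flatten(1)           # classification logits over grid cells

class ActionPolicy(nn.Module):
    # One head regresses the bottleneck pose and reaching-trajectory shape;
    # a Transformer decodes a chunk of gaze-centered dexterous actions.
    def __init__(self, feat_dim=256, action_dim=8, chunk=20):
        super().__init__()
        self.point_encoder = nn.Sequential(    # per-point MLP + max pool (PointNet-style placeholder)
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.bottleneck_head = nn.Linear(feat_dim, 7 + 3)   # pose + trajectory-shape parameters (assumed sizes)
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(chunk, feat_dim))
        self.action_head = nn.Linear(feat_dim, action_dim)

    def forward(self, gaze_points):            # gaze_points: (B, N, 3) gaze-centered point cloud
        feat = self.point_encoder(gaze_points).max(dim=1).values
        bottleneck = self.bottleneck_head(feat)
        tokens = torch.cat([feat[:, None, :],
                            self.queries.expand(feat.shape[0], -1, -1)], dim=1)
        decoded = self.transformer(tokens)[:, 1:, :]      # drop the context token
        return bottleneck, self.action_head(decoded)      # action chunk in the gaze-centered frame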

Experiments

Evaluation Setup (ID and OOD)


Examples of ID and OOD trials in the PenInCup, OpenCap, and PileBox tasks, where object positions and the initial end-effector poses are controlled. The images show the initial states of each trial. The checkboxes correspond to the method order in the table below and indicate whether each method succeeded in the task from that initial state.


Rollouts

PenInCup Task
ID trials
GazeBot (Ours)
ACT [Zhao et al. 2023]
OOD trials (unseen object positions)
GazeBot
ACT
OOD trials (unseen initial poses)
GazeBot
ACT
OpenCap Task
ID trials
GazeBot
ACT
OOD trials (unseen object positions)
GazeBot
ACT
OOD trials (unseen initial poses)
GazeBot
ACT

Results


Comparison of success rates (%) for ID and OOD trials in the PenInCup, OpenCap, and PileBox tasks. Our method is compared with seven ablation models, two of which are conventional baselines. We conducted 12 trials each for PenInCup and OpenCap, and 20 trials for PileBox. Notably, only the proposed method maintains a high success rate in OOD situations across all tasks.


Reactiveness of GazeBot

BibTeX

@article{Takizawa2025,
    title={Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze and Bottleneck},
    author={Ryo Takizawa and Izumi Karino and Koki Nakagawa and Yoshiyuki Ohmura and Yasuo Kuniyoshi},
    journal={arXiv preprint arXiv:2502.18121},
    year={2025},
}