Learning to Imitate Object Interactions
from Internet Videos

Imitating object interactions from Internet videos. Each robot video (right) starts with the robot picking up the object from the table and ends with a segment in which the robot imitates the human motion (left).

Abstract

We study the problem of imitating object interactions from Internet videos. This requires understanding hand-object interactions in 4D, i.e., spatially in 3D and over time, which is challenging due to mutual hand-object occlusions. In this paper, we make two main contributions: (1) a novel reconstruction technique, RHOV (Reconstructing Hands and Objects from Videos), which reconstructs 4D trajectories of both the hand and the object using 2D image cues and temporal smoothness constraints; and (2) a system for imitating object interactions in a physics simulator with reinforcement learning. We apply our reconstruction technique to 100 challenging Internet videos and show that we can successfully imitate a range of different object interactions in a physics simulator. Our object-centric approach is not limited to human-like end-effectors and can learn to imitate object interactions using different embodiments, such as a robotic arm with a parallel-jaw gripper.

Approach

We present an approach for imitating object interactions from Internet videos. We first reconstruct hand-object trajectories in 4D (3D plus time). We then train a policy with reinforcement learning to imitate the reconstructed trajectory.



Reconstructions from Internet Videos

To reconstruct hands and objects from Internet videos, we introduce an optimization-based technique called RHOV (Reconstructing Hands and Objects from Videos), which leverages 2D spatial cues (keypoints, masks, depth) and temporal constraints (smoothness of the 4D trajectory, optical flow).
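To make the structure of the optimization concrete, below is a minimal PyTorch sketch of this kind of joint objective, reduced to a single 2D cue (keypoint reprojection) and a translation-only object trajectory coupled by a temporal smoothness penalty. All quantities here are synthetic stand-ins, and the full RHOV objective also includes mask, depth, optical flow, and hand pose terms.

```python
import torch

# Minimal sketch (not the RHOV implementation): fit per-frame object
# translations to synthetic 2D keypoint detections under a pinhole camera,
# with a temporal smoothness penalty coupling all frames of the video.

T, K = 30, 8                                   # frames, keypoints per frame
focal = 500.0                                  # assumed focal length (pixels)

target_kps = torch.rand(T, K, 2) * 100.0       # stand-in for detected 2D keypoints
obj_pts = torch.rand(K, 3)                     # stand-in 3D object model points

trans = torch.zeros(T, 3, requires_grad=True)  # per-frame object translation
with torch.no_grad():
    trans[:, 2] = 2.0                          # initialize in front of the camera

opt = torch.optim.Adam([trans], lr=1e-2)
for step in range(300):
    pts = obj_pts[None] + trans[:, None]                  # (T, K, 3) points
    z = pts[..., 2:3].clamp(min=0.1)                      # avoid divide-by-zero
    proj = focal * pts[..., :2] / z                       # pinhole projection
    loss_reproj = ((proj - target_kps) ** 2).mean()       # 2D keypoint term
    loss_smooth = ((trans[1:] - trans[:-1]) ** 2).mean()  # temporal smoothness
    loss = loss_reproj + 10.0 * loss_smooth
    opt.zero_grad()
    loss.backward()
    opt.step()
```

We show example RHOV reconstructions on Internet videos below: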


Hand-object trajectories overlaid on Internet videos (top) seen from six different viewpoints (bottom).


RHOV reconstructions for Internet videos from the 100 Days of Hands dataset.

Imitating Object Interactions

Given the reconstructed 4D trajectory, we learn to imitate the object interaction with reinforcement learning in a physics simulator. The reward is computed from the distance between the simulated object pose and the reference object pose in the reconstructed trajectory.
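As a concrete illustration, here is a minimal sketch of an object-pose-distance reward, assuming the pose is given as a position vector plus a rotation matrix; the exact error terms and weights used in training are simplified here, and alpha and beta are illustrative.

```python
import numpy as np

def pose_distance_reward(pos, rot, ref_pos, ref_rot, alpha=1.0, beta=0.1):
    """Reward that grows as the simulated object pose approaches the
    reference pose from the reconstructed trajectory (a sketch, not the
    paper's exact formulation)."""
    pos_err = np.linalg.norm(pos - ref_pos)                 # translation error
    rel = ref_rot.T @ rot                                   # relative rotation
    cos = np.clip((np.trace(rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.arccos(cos)                                # geodesic angle (rad)
    return float(np.exp(-(alpha * pos_err + beta * rot_err)))

# A perfectly tracked pose yields the maximum reward of 1.0.
I = np.eye(3)
assert pose_distance_reward(np.zeros(3), I, np.zeros(3), I) == 1.0
```

We show an example run over the course of training: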


Ablation Studies

RHOV jointly reconstructs the hand and object across all frames of the video at once. We conduct a series of ablation studies to further understand the importance of joint reconstruction. In particular, we find that a) reconstructing the hand and the object independently produces reconstructions where the hand is far from the object, and b) reconstructing the hand and object for each video frame separately causes the object model to flip between frames due to symmetry ambiguities (the toy example after the videos below illustrates this).

RHOV (default settings)

a) Independent hand and object

b) Independent video frames
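To build intuition for failure mode b), the toy example below (not from the paper) fits a rectangle, which has 180-degree rotational symmetry, to unordered 2D observations: two rotations 180 degrees apart explain the evidence equally well, so independent per-frame fits can pick either one.

```python
import math
import torch

# Toy illustration (not from the paper): a rectangle has 180-degree
# rotational symmetry, so two pose hypotheses match the unordered 2D
# observations equally well, and per-frame fitting can pick either one.

def rot_z(theta):
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

# Rectangle corners (2D for simplicity), shape (2, 4).
pts = torch.tensor([[1.0, 0.5], [-1.0, 0.5], [-1.0, -0.5], [1.0, -0.5]]).T

theta = torch.tensor(0.3)
obs = rot_z(theta) @ pts                     # "observed" keypoints

for cand in (theta, theta + math.pi):        # two symmetric hypotheses
    pred = rot_z(cand) @ pts
    # Unordered (Chamfer-style) matching: the residual is ~0 for both.
    residual = torch.cdist(pred.T, obs.T).min(dim=1).values.sum()
    print(f"angle={cand.item():.2f} rad, residual={residual.item():.6f}")
```

With a temporal smoothness term linking frames, the temporally consistent hypothesis becomes the cheaper one, which is why joint reconstruction avoids the flipping.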

Failure Cases

RHOV relies on 2D image cues to inform the 3D hand-object reconstruction and is therefore limited by poor 2D segmentation masks and severe mutual hand-object occlusion. We present a series of failure cases below:


Changing object model

Severe hand occlusion

Incorrect 2D segmentation masks

Citation



The website template was borrowed from Jon Barron.