EgoPAT3D: Egocentric Prediction of Action Target in 3D [CVPR 2022]

Yiming Li*, Ziang Cao*, Andrew Liang, Benjamin Liang, Luoyao Chen, Hang Zhao, Chen Feng

News

[2022.07] Our dataset EgoPAT3D 1.0 is available here.

[2022.03] Our paper is available on arXiv.

[2022.03] EgoPAT3D is accepted at CVPR 2022.

Abstract

We are interested in anticipating as early as possible the target location of a person's object manipulation action in a 3D workspace from egocentric vision. It is important in fields like human-robot collaboration, but has not yet received enough attention from vision and learning communities. To stimulate more research on this challenging egocentric vision task, we propose a large multimodality dataset of more than 1 million frames of RGB-D and IMU streams, and provide evaluation metrics based on our high-quality 2D and 3D labels from semi-automatic annotation. Meanwhile, we design baseline methods using recurrent neural networks (RNNs) and conduct various ablation studies to validate their effectiveness. Our results demonstrate that this new task is worthy of further study by researchers in robotics, vision, and learning communities.

Raw Data

EgoPAT3D contains multimodal data generated from RGB-D first-person videos recorded using a helmet-mounted Azure Kinect depth camera. In each recording, the camera wearer reaches for, grabs, and moves objects randomly placed in a household scene.

Each recording features a different configuration of household objects within the scene. The dataset also contains scene point clouds for 3D registration, binary masks for hand presence frame detection, and hand pose inference results for each frame detected to contain a hand.


Specifications

  • 15 household scenes (see scene index table below)
  • 15 point cloud files (one for each scene)
  • 150 total recordings (10 recordings in each scene, with different object configurations in each recording)
  • 15000 hand-object actions (100 per recording)
  • ~600 min of RGB-D video (~4 min per video)
  • ~1,080,000 RGB frames at 30 fps
  • ~900,000 hand action frames (assuming ~2 seconds per hand-object action)
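
The approximate frame and action counts above follow directly from the per-recording numbers. As a quick back-of-envelope check (not part of the dataset tooling):

    # Back-of-envelope check of the dataset-scale numbers listed above.
    scenes, recs_per_scene, actions_per_rec = 15, 10, 100
    fps, minutes_per_video, secs_per_action = 30, 4, 2

    recordings    = scenes * recs_per_scene           # 150 recordings
    actions       = recordings * actions_per_rec      # 15,000 hand-object actions
    total_min     = recordings * minutes_per_video    # ~600 min of RGB-D video
    rgb_frames    = total_min * 60 * fps              # ~1,080,000 RGB frames
    action_frames = actions * secs_per_action * fps   # ~900,000 hand action frames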

Data modalities:

  • RGB-D video (RGB frames plus depth from the Azure Kinect)
  • IMU streams
  • Scene point clouds for 3D registration (one .ply file per scene)
  • Binary masks for hand presence frame detection
  • Hand pose inference results for each frame detected to contain a hand

Azure Kinect recording specifications:

                      Color camera      Depth camera
  Resolution          3840x2160 (4K)    512x512
  Frames per second   30                30
  Recording mode      --                WFOV 2x2 binned
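
For reference, the recording settings above correspond to the following device configuration. This is a minimal sketch assuming the third-party pyk4a Python bindings for the Azure Kinect SDK; it is not part of the dataset tooling.

    # Minimal sketch (assumes the third-party pyk4a bindings; not used by this repo):
    # configure an Azure Kinect to match the recording settings in the table above.
    from pyk4a import PyK4A, Config, ColorResolution, DepthMode, FPS

    config = Config(
        color_resolution=ColorResolution.RES_2160P,   # 3840x2160 (4K) color stream
        depth_mode=DepthMode.WFOV_2X2BINNED,          # 512x512 wide-FOV depth, 2x2 binned
        camera_fps=FPS.FPS_30,                        # 30 fps for both cameras
    )

    k4a = PyK4A(config)
    k4a.start()
    capture = k4a.get_capture()                       # capture.color / capture.depth arrays
    k4a.stop()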

Access dataset:

  • Dataset without raw .MKV recordings and individual pre-extracted RGB frames
  • Raw .MKV recordings

Dataset folder hierarchy

Dataset/
    ├── 1/                      # one folder containing multimedia data for each of the 15 scenes (see scene index table below)
    │   ├── bathroomCabinet_1/
    │   │   ├── hand_frames/
    │   │   │   ├── clip_ranges.txt
    │   │   │   ├── hand_landmarks.txt
    │   │   │   └── masks.zip
    │   │   └── rgb_video.mp4
    │   ├── bathroomCabinet_2/  # all sceneName_# directories share the same folder/subdirectory structure
    │   ├── ...
    │   ├── bathroomCabinet_10/
    │   └── bathroomCabinet.ply
    ├── 2/                      # all 15 scene directories share the same folder/subdirectory structure
    │   └── ...
    ├── ...
    └── 15/
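
A minimal sketch of iterating this hierarchy is shown below. It assumes only the layout above plus OpenCV for reading rgb_video.mp4, and makes no assumption about the internal format of clip_ranges.txt or hand_landmarks.txt.

    # Walk the Dataset/ hierarchy shown above and inspect each recording.
    from pathlib import Path
    import cv2  # pip install opencv-python

    dataset_root = Path("Dataset")

    for scene_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        ply_files = list(scene_dir.glob("*.ply"))       # per-scene point cloud
        for rec_dir in sorted(p for p in scene_dir.iterdir() if p.is_dir()):
            video_path  = rec_dir / "rgb_video.mp4"
            clip_ranges = rec_dir / "hand_frames" / "clip_ranges.txt"
            cap = cv2.VideoCapture(str(video_path))
            n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            cap.release()
            print(scene_dir.name, rec_dir.name, n_frames, clip_ranges.exists())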

Scene index

Index  Scene             Index  Scene              Index  Scene
1      bathroomCabinet   6      kitchenCounter     11     pantryshelf
2      bathroomCounter   7      kitchenCupboard    12     smallbins
3      bin               8      kitchenSink        13     stovetop
4      desk              9      microwave          14     windowsillAC
5      drawer            10     nightstand         15     woodenTable
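
For convenience, the same mapping as a Python dictionary (copied from the table above; not part of the dataset's own tooling):

    # Scene index -> scene name, as listed in the scene index table.
    SCENE_INDEX = {
        1: "bathroomCabinet",  6: "kitchenCounter",   11: "pantryshelf",
        2: "bathroomCounter",  7: "kitchenCupboard",  12: "smallbins",
        3: "bin",              8: "kitchenSink",      13: "stovetop",
        4: "desk",             9: "microwave",        14: "windowsillAC",
        5: "drawer",          10: "nightstand",       15: "woodenTable",
    }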

Alternative access

  1. Direct download using the links above
  2. Manual build using HPC (currently only available on NYU HPC)
    Download the raw multimodal recording data and generate a local copy of the dataset in your NYU HPC /scratch space, following the instructions provided in this repository's model_build folder. The build may take 1-3 days on NYU Greene, depending on compute resource availability.

Groundtruth Generation

Our annotation is semi-automatic, relying on several off-the-shelf machine learning algorithms, and is therefore quite efficient. Given a recording, we manually divide it into multiple action clips. To localize the 3D target in each clip, we use the following procedure. First, we take the last frame of each clip based on the index provided by the manual division. Second, we use an off-the-shelf hand pose estimation model to localize the hand center in the last frame of each clip. Third, we use colored point cloud registration to calculate the transformation matrices between adjacent frames. Finally, for each clip, we transform the hand location from the last frame back to the historical frames according to the results of the third step; the transformed locations describe the 3D action target location in each frame's own coordinate system. Detailed procedures are presented in our paper.
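
The final propagation step can be pictured with the short numpy sketch below. This is an illustration only, not the annotation code: it assumes the registration step yields one 4x4 matrix per adjacent frame pair, each mapping points from the later frame's coordinates into the earlier frame's coordinates; the function name and matrix convention are placeholders.

    import numpy as np

    def propagate_target(p_last, rel_transforms):
        """Express the 3D target location from the clip's last frame in every
        earlier frame's coordinates.

        p_last:         (3,) hand-center location in the LAST frame's coordinates.
        rel_transforms: list of 4x4 matrices; rel_transforms[k] is assumed to map
                        frame k+1 coordinates into frame k coordinates (from the
                        colored point cloud registration of adjacent frames).
        """
        p = np.append(np.asarray(p_last, dtype=float), 1.0)   # homogeneous point
        targets = [np.asarray(p_last, dtype=float)]           # target in the last frame
        for T in reversed(rel_transforms):                     # walk back toward frame 0
            p = T @ p
            targets.append(p[:3] / p[3])
        targets.reverse()                                      # targets[i]: frame i coords
        return targets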

Baseline Method

To obtain the training data from the raw data streams, please follow the instructions. To use our processed data, please follow the baseline method implementation.
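
Our baselines are RNN-based; see the paper and the baseline implementation for details. Purely as an illustration (not the paper's implementation), a minimal recurrent regressor of the 3D target from per-frame features could look like the PyTorch sketch below; the class name, feature dimension, and hidden size are arbitrary placeholders.

    import torch
    import torch.nn as nn

    class TargetRNN(nn.Module):
        """Illustrative only: an LSTM consumes a per-frame feature sequence and
        regresses the 3D action target after every observed frame, so a
        prediction is available as early as possible."""
        def __init__(self, feat_dim=256, hidden_dim=128):
            super().__init__()
            self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 3)      # (x, y, z) in the current frame

        def forward(self, feats):                     # feats: (batch, time, feat_dim)
            out, _ = self.rnn(feats)
            return self.head(out)                     # (batch, time, 3)

    # example: predictions for 4 clips of 60 frames with 256-d per-frame features
    preds = TargetRNN()(torch.randn(4, 60, 256))      # shape (4, 60, 3)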

Citation

If you find our work useful in your research, please cite:

@InProceedings{Li_2022_CVPR,
      title = {Egocentric Prediction of Action Target in 3D},
      author = {Li, Yiming and Cao, Ziang and Liang, Andrew and Liang, Benjamin and Chen, Luoyao and Zhao, Hang and Feng, Chen},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year = {2022}
}