This is the official implementation for our paper:
Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos.
Junyi Ma¹, Jingyi Xu¹, Xieyuanli Chen², Hesheng Wang¹*
¹SJTU  ²NUDT  *Corresponding author
Diff-IP2D is the first work to use a devised denoising diffusion probabilistic model to jointly forecast future hand trajectories and object affordances from only 2D egocentric videos. It provides a foundational generative paradigm for the field of HOI prediction.
White: ours, blue: baseline, red: GT. Thanks to its bidirectional constraints, Diff-IP2D generates plausible future hand waypoints and final hand positions, even when there is a large error in the early stage.
If you find our work helpful to your research, please cite our paper as
@article{ma2024diffip2d,
  title={Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos},
  author={Ma, Junyi and Xu, Jingyi and Chen, Xieyuanli and Wang, Hesheng},
  journal={arXiv preprint arXiv:2405.04370},
  year={2024}
}
Clone the repository (requires git):
git clone https://github.com/IRMVLab/Diff-IP2D.git
cd Diff-IP2D
Create the environment and install dependencies into it:
conda create -n diffip python=3.8 pip
conda activate diffip
pip install -r requirements.txt
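Optionally, you can verify the environment before moving on. This is only a minimal sanity check, assuming PyTorch (with optional CUDA support) is among the dependencies pinned in requirements.txt:

```python
# sanity_check_env.py -- quick check that the conda environment is usable.
# Assumes PyTorch is pinned in requirements.txt (an assumption of this sketch).
import sys

import torch

print(f"Python      : {sys.version.split()[0]}")
print(f"PyTorch     : {torch.__version__}")
print(f"CUDA usable : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU         : {torch.cuda.get_device_name(0)}")
```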
We suggest using the following data structure for faster reproduction:
├── base_models
│ └── model.pth.tar
├── common
│ ├── epic-kitchens-100-annotations # from OCT or merged ourselves
│ │ ├── EPIC_100_test_timestamps.csv
│ │ ├── EPIC_100_test_timestamps.pkl
│ │ ├── EPIC_100_train.csv
│ │ ├── EPIC_100_train.pkl
│ │ ├── EPIC_100_train_val_test.csv
│ │ ├── EPIC_100_verb_classes.csv
│ │ ├── EPIC_100_video_info.csv
│ │ ├── actions.csv
│ │ └── ...
│ └── rulstm # raw rulstm repo
│ ├── FEATEXT
│ ├── FasterRCNN
│ └── RULSTM
├── data
│ ├── ek100 # manually generated ourselves or from OCT or from raw EK
│ │ ├── feats_train
│ │ │ ├── full_data_with_future_train_part1.lmdb
│ │ │ └── full_data_with_future_train_part2.lmdb
│ │ ├── feats_test
│ │ │ └── data.lmdb
│ │ ├── labels # from OCT
│ │ │ ├── label_0.pkl
│ │ │ └── ...
│ │ ├── ek100_eval_labels.pkl
│ │ └── video_info.json
│ ├── raw_images # raw EPIC-KITCHENS dataset
│ │ └── EPIC-KITCHENS
│ ├── homos_train # auto generated when first running
│ ├── homos_test # auto generated when first running
├── diffip_weights # auto generated when first saving checkpoints
│ ├── checkpoint_1.pth.tar
│ └── ...
├── collected_pred_traj # auto generated when first eval traj
├── collected_pred_aff # auto generated when first eval affordance
├── log # auto generated when first running
└── uid2future_file_name.pickle
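After preparing the data (download links for the parts that cannot be auto-generated are listed below), you can sanity-check the layout with a minimal sketch like the following; the paths simply mirror the tree above, so adjust them if you relocate anything:

```python
# check_layout.py -- verify that the manually prepared files from the tree above exist.
from pathlib import Path

ROOT = Path(".")  # repo root, i.e. Diff-IP2D/

REQUIRED = [
    "base_models/model.pth.tar",
    "common/epic-kitchens-100-annotations",
    "common/rulstm",
    "data/ek100/feats_train",
    "data/ek100/feats_test/data.lmdb",
    "data/ek100/labels",
    "data/ek100/ek100_eval_labels.pkl",
    "data/ek100/video_info.json",
    "data/raw_images/EPIC-KITCHENS",
    "uid2future_file_name.pickle",
]

missing = [p for p in REQUIRED if not (ROOT / p).exists()]
if missing:
    print("Missing entries:")
    for p in missing:
        print(f"  - {p}")
else:
    print("All manually prepared files/folders are in place.")
```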
Here we provide links to all the above-mentioned files that cannot be generated automatically by the scripts in this repo:
- base_models/model.pth.tar: Base model from OCT [1].
- common/epic-kitchens-100-annotations: Annotations from raw EK [2] and our manually merged files. Please do not confuse this folder with the one provided by OCT [1].
- common/rulstm: Original RULSTM [3] repo.
- data/ek100/feats_train: Our manually generated feature files for training our model.
- data/ek100/feats_test: Feature files provided by OCT [1] for testing our model.
- data/ek100/labels: Labels from OCT [1] for training models.
- data/ek100/ek100_eval_labels.pkl: Labels from OCT [1] for affordance evaluation. Please refer to the original OCT folder.
- data/ek100/video_info.json: Used video index.
- data/raw_images: Original EK images [2]. Follow the instructions in the EK repo to download the raw RGB frames with `python epic_downloader.py --rgb-frames`, since only raw images are required by Diff-IP2D.
- uid2future_file_name.pickle: Indicator file generated by ourselves.
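To confirm that a downloaded LMDB feature file is readable, you can open it with the `lmdb` package and list a few keys. This is only an illustrative sketch; the exact key/value encoding of the feature files is not documented here:

```python
# peek_lmdb.py -- list a few keys from a feature LMDB to confirm it is readable.
# Illustrative only: the key/value encoding of the features is not specified here.
import lmdb

# If data.lmdb is a single file rather than a directory, also pass subdir=False.
env = lmdb.open("data/ek100/feats_test/data.lmdb",
                readonly=True, lock=False, readahead=False)
with env.begin() as txn:
    print("entries:", txn.stat()["entries"])
    for i, (key, value) in enumerate(txn.cursor()):
        print(key[:64], len(value), "bytes")
        if i >= 4:
            break
env.close()
```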
We have released the deployment of Diff-IP2D on EK100. The relevant code and data for EK55 and EG will be released soon ...
| Version | Download link | Notes |
| --- | --- | --- |
| 1.1 | OneDrive / Google Drive | pretrained on EK100 (two val) |
| 1.2 | OneDrive / Google Drive | pretrained on EK100 (one val) |
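The released checkpoints are standard .pth.tar files, so you can inspect one with torch.load before wiring its path into the run scripts. This is a minimal sketch; the placeholder path and the inner key name "state_dict" are assumptions for illustration:

```python
# inspect_ckpt.py -- peek into a downloaded checkpoint before pointing the
# run scripts at it. The path and the "state_dict" key are assumptions.
import torch

ckpt = torch.load("path/to/pretrained_weights.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state = ckpt.get("state_dict", ckpt)        # fall back to the dict itself
    print("number of parameter tensors:", len(state))
else:
    print("checkpoint object type:", type(ckpt))
```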
Please change the paths to the pretrained weights in run_train.py, run_val_traj.py, and run_val_affordance.py.
Please train Diff-IP2D by
bash train.sh
Please test trajectory prediction by
bash val_traj.sh
Test affordance prediction by
bash val_affordance.sh
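For reference, trajectory forecasting in this line of work is commonly scored with displacement errors (ADE/FDE). The sketch below only illustrates those metrics and is not the exact evaluation pipeline inside val_traj.sh:

```python
# ade_fde_sketch.py -- illustrative displacement-error metrics for 2D hand
# trajectories; not the exact evaluation code used by val_traj.sh.
import numpy as np


def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (T, 2) arrays of future 2D hand waypoints."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-step Euclidean error
    return dists.mean(), dists[-1]              # ADE, FDE


pred = np.array([[0.10, 0.20], [0.15, 0.30], [0.25, 0.45]])
gt = np.array([[0.12, 0.18], [0.18, 0.33], [0.30, 0.50]])
ade, fde = ade_fde(pred, gt)
print(f"ADE={ade:.4f}, FDE={fde:.4f}")
```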
- We are working hard to organize and release a more polished version of the code, along with its application on the new dataset.
- You may obtain results that slightly differ from those presented in the paper due to the stochastic nature of diffusion inference with different seeds. Prediction clusters can be obtained using multiple different seeds (see the sketch after these notes).
- Homographies will be automatically saved to data/homos_train and data/homos_test after the first training/test epoch for quick reuse.
- Separate validation sets will lead to checkpoints at different epochs for the two tasks.
- Please modify the parameters in the config files before training and testing. For example, change the paths in options/expopts.py. You can also set fast_test=True for faster inference without sacrificing much accuracy.
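As noted above, a cluster of predictions can be collected by repeating inference under different random seeds. The snippet below is only a minimal sketch of that idea; run_single_inference is a hypothetical placeholder for whatever entry point you wrap (e.g. the logic behind val_traj.sh), not a function provided by this repo:

```python
# multi_seed_sketch.py -- collect a cluster of stochastic predictions by
# re-running diffusion inference under different seeds.
# NOTE: run_single_inference() is a hypothetical placeholder, not part of this repo.
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Fix the common RNGs so a single inference run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def run_single_inference(seed: int) -> torch.Tensor:
    """Placeholder: replace with your own wrapper around the evaluation code.
    Here we just sample a dummy (T, 2) trajectory to keep the sketch runnable."""
    return torch.rand(4, 2)


predictions = []
for seed in (0, 1, 2, 3, 4):
    seed_everything(seed)
    predictions.append(run_single_inference(seed))
# `predictions` now holds one sampled future per seed, forming a prediction cluster.
```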
We sincerely appreciate the fantastic pioneering works that provide codebases and datasets for this work. Please also cite them if you use the relevant code and data.
[1] Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In CVPR, pages 3282–3292, 2022. Paper Code
[2] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. IJCV, pages 1–23, 2022. Paper Code
[3] Antonino Furnari and Giovanni Maria Farinella. Rolling-unrolling lstms for action anticipation from first-person video. IEEE TPAMI, 43(11):4021–4036, 2020. Paper Code
[4] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. In ICLR, 2023. Paper Code
[5] Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, and Yu Kong. Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting. In ICCV, pages 13702–13711, 2023. Paper Code