The official implementation of our paper "Human-Object Interaction Prediction in Videos through Gaze Following", accepted by CVIU (doi: https://doi.org/10.1016/j.cviu.2023.103741). An arXiv preprint is available at https://arxiv.org/abs/2306.03597.
On VidHOI (Oracle mode):
Future Time | mAP (QPIC) | Person-wise top-5 Recall | Precision | Accuracy | F1-score |
---|---|---|---|---|---|
0s (Detection) | 38.61 | 70.91 | 59.84 | 51.29 | 62.24 |
1s | 37.59 | 72.17 | 59.98 | 51.65 | 62.78 |
3s | 33.14 | 71.88 | 60.44 | 52.08 | 62.87 |
5s | 32.75 | 71.25 | 59.09 | 51.14 | 61.92 |
7s | 31.70 | 70.48 | 58.80 | 50.56 | 61.36 |
On Action Genome (PredCls mode):
Rec@10 | Rec@20 | Rec@50 |
---|---|---|
75.4 | 83.7 | 84.3 |
- TODO: show qualitative results
- Clone the repository recursively:
git clone --recurse-submodules https://github.com/nizhf/hoi-prediction-gaze-transformer.git
- Create the conda environment. We use mamba to accelerate the installation. In addition, as noted in issue #7, opencv from conda-forge seems to be incompatible with torchvision 0.11.0, so we install opencv via pip. A quick sanity check of the installation is sketched after these commands.
conda install mamba -c conda-forge # install mamba in base environment
mamba create -n hoi_torch110 python=3.9 -c conda-forge
conda activate hoi_torch110
mamba install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 torchtext==0.11.0 cudatoolkit=11.3 black Cython easydict gdown imageio ipywidgets matplotlib notebook numpy pandas Pillow PyYAML requests scikit-learn scipy seaborn tqdm tensorboard wandb -c pytorch -c conda-forge
pip install opencv-python
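After installation, you can verify the environment with a short Python snippet (a minimal sanity check, not part of the original setup; it only imports the packages installed above):

# quick environment check (illustrative): confirm versions and CUDA visibility
import torch
import torchvision
import cv2

print("torch:", torch.__version__)              # expected: 1.10.0
print("torchvision:", torchvision.__version__)  # expected: 0.11.0
print("opencv:", cv2.__version__)               # installed via pip
print("CUDA available:", torch.cuda.is_available())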
- Our training uses wandb to record training and validation metrics. You may create an account at https://wandb.ai and follow their instructions to log in on your PC; one way to do this is sketched below.
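Logging in can be done either with the `wandb login` CLI command or from Python (a minimal sketch; it prompts for your API key if none is cached):

# log in to wandb once per machine (prompts for an API key on first use)
import wandb
wandb.login()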
- Some weights are downloaded automatically. Others must be downloaded manually from here: all weights in weights/sttrangaze and weights/yolov5/vidor_yolov5l.pt. Put them into the corresponding weights/... folder in this repo. If the automatic download does not work, you can also obtain those weights from the same link. A small check for the manually placed files is sketched below.
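To confirm the manually downloaded weights are in place, a quick check (a minimal sketch; it only tests the paths named above) could look like this:

# verify the manually placed weights exist (illustrative)
from pathlib import Path

expected = [
    Path("weights/yolov5/vidor_yolov5l.pt"),  # YOLOv5 detector weights
    Path("weights/sttrangaze"),               # folder that should contain all sttrangaze weights
]
for p in expected:
    print(("OK     " if p.exists() else "MISSING"), p)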
- Run the run.py script
# Detection
python run.py --source path/to/video --out path/to/output_folder --future 0 --hoi-thres 0.3 --print
# For anticipation, set future to 1, 3, 5, or 7
This script will create a video with object tracking and gaze following. The HOIs are saved in the output folder and printed at 1 FPS in the console.
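If you want to process several videos in one go, a hypothetical convenience loop (not part of the repo; it only reuses the CLI flags shown above, and the directory names are placeholders) could look like this:

# batch_run.py (hypothetical): call run.py once per video in a folder
import subprocess
from pathlib import Path

video_dir = Path("path/to/videos")        # placeholder: folder with input videos
out_root = Path("path/to/output_folder")  # placeholder: root folder for results

for video in sorted(video_dir.glob("*.mp4")):
    subprocess.run(
        [
            "python", "run.py",
            "--source", str(video),
            "--out", str(out_root / video.stem),
            "--future", "0",        # 0 = detection; 1, 3, 5, or 7 for anticipation
            "--hoi-thres", "0.3",
            "--print",
        ],
        check=True,
    )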
Please follow these instructions to prepare the datasets and train our model.
If our work is helpful for your research, please consider citing our publication:
@article{NI2023103741,
title = {Human–Object Interaction Prediction in Videos through Gaze Following},
journal = {Computer Vision and Image Understanding},
volume = {233},
pages = {103741},
year = {2023},
issn = {1077-3142},
doi = {https://doi.org/10.1016/j.cviu.2023.103741},
url = {https://www.sciencedirect.com/science/article/pii/S1077314223001212},
author = {Zhifan Ni and Esteve {Valls Mascar\'o} and Hyemin Ahn and Dongheui Lee},
}