Code for SAAT from "Syntax-Aware Action Targeting for Video Captioning" (Accepted to CVPR 2020). The implementation is based on "Consensus-based Sequence Training for Video Captioning".
- Python 3.6
- PyTorch 1.1
- CUDA 10.0
- Microsoft COCO Caption Evaluation
- CIDEr
(Check out the coco-caption and cider projects into your working directory.)
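After cloning the two evaluation projects, a quick import check confirms they are visible to Python. This is a minimal sketch, assuming the standard layouts of the public coco-caption (pycocoevalcap) and cider (pyciderevalcap) repositories:

```python
# Sanity check that the evaluation dependencies are on the path.
# Assumes coco-caption and cider were cloned into the working directory;
# the module names below follow the public layouts of those repositories.
import sys

sys.path.append('coco-caption')  # provides pycocoevalcap
sys.path.append('cider')         # provides pyciderevalcap

from pycocoevalcap.bleu.bleu import Bleu
from pyciderevalcap.ciderD.ciderD import CiderD

print('coco-caption and cider are importable')
```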
Data can be downloaded here (1.6GB). This folder contains:
- input/msrvtt: annotated captions (note that val_videodatainfo.json is a symbolic link to train_videodatainfo.json)
- output/feature: extracted features of IRv2, C3D, and category embeddings
- output/metadata: preprocessed annotations
- output/model_svo/xe: model file and generated captions on test videos; the reported result (CIDEr 49.1 for XE training) can be reproduced with the model provided in this folder
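As a quick check of the download, the annotation file can be inspected directly. A minimal sketch, assuming the keys follow the standard MSR-VTT videodatainfo format (an assumption about this dump, consistent with the file names above):

```python
# Minimal sketch: inspect the downloaded annotations and verify the symlink.
# The 'videos' and 'sentences' keys follow the standard MSR-VTT
# videodatainfo format (assumed here, not confirmed by the repo docs).
import json
import os

ann_path = 'input/msrvtt/train_videodatainfo.json'
with open(ann_path) as f:
    data = json.load(f)
print(len(data['videos']), 'videos,', len(data['sentences']), 'captions')

# val_videodatainfo.json should resolve to the train file via the symlink.
print(os.path.realpath('input/msrvtt/val_videodatainfo.json'))
```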
To test a model, run
make -f SpecifiedMakefile test [options]
Please refer to the Makefile (and the opts_svo.py file) for the set of available train/test options. For example, to reproduce the reported result:
make -f Makefile_msrvtt_svo test GID=0 EXP_NAME=xe FEATS="irv2 c3d category" BFEATS="roi_feat roi_box" USE_RL=0 CST=0 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG LAMBDA=20
To train the model using XE loss:
make -f Makefile_msrvtt_svo train GID=0 EXP_NAME=xe FEATS="irv2 c3d category" BFEATS="roi_feat roi_box" USE_RL=0 CST=0 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCH=100 LAMBDA=20
To change the input features, modify the FEATS variable in the commands above.
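For orientation, the make variables are typically expanded into command-line flags that the Python entry point parses. The sketch below illustrates this mechanism only; the flag names mirror the make variables for illustration and are not the actual argument definitions, which live in opts_svo.py:

```python
# Hypothetical sketch of how make variables reach the Python side.
# Flag names here are assumptions for illustration; see opts_svo.py
# for the real argument definitions used by this repository.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--feats', nargs='+', default=['irv2', 'c3d', 'category'])
parser.add_argument('--bfeats', nargs='+', default=['roi_feat', 'roi_box'])
parser.add_argument('--use_rl', type=int, default=0)

# e.g. dropping the category embedding from the inputs:
args = parser.parse_args(['--feats', 'irv2', 'c3d'])
print(args.feats)  # -> ['irv2', 'c3d']
```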
If you use this code, please cite:
@InProceedings{Zheng_2020_CVPR,
author = {Zheng, Qi and Wang, Chaoyue and Tao, Dacheng},
title = {Syntax-Aware Action Targeting for Video Captioning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}