Code for the video captioning methods from "Consensus-based Sequence Training for Video Captioning" (Phan, Henter, Miyao, Satoh. 2017).
- Python 3
- Pytorch 0.4
- Microsoft COCO Caption Evaluation
- CIDEr
(Check out the coco-caption and cider projects into your working directory.)
Data can be downloaded here (643 MB). This folder contains:
- input/msrvtt: annotated captions (note that val_videodatainfo.json is a symbolic link to train_videodatainfo.json)
- output/feature: extracted features
- output/model/cst_best: model file and generated captions on test videos of our best run (CIDEr 54.2)
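After unpacking the data, a quick check that the three folders listed above are in place can save a failed run later. The sketch below is not part of the project; the helper name `missing_paths` and the assumption that the archive is unpacked into the repository root are illustrative only.

```python
from pathlib import Path

# Sub-paths taken from the list above. Where they live on disk is an assumption:
# by default this looks relative to the current working directory.
EXPECTED = ["input/msrvtt", "output/feature", "output/model/cst_best"]


def missing_paths(root="."):
    """Return the expected sub-paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]
```

Calling `missing_paths(".")` from the working directory returns an empty list when everything is in place, and the names of the absent folders otherwise.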
Extract video features
- Pre-extracted ResNet, C3D, MFCC, and category-embedding features are included in the download linked above
Generate metadata
make pre_process
Pre-compute document frequency for CIDEr computation
make compute_ciderdf
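For readers unfamiliar with this step: CIDEr weights each n-gram by how rare it is across the reference captions, so it needs a table of document frequencies computed once up front. The following is a minimal sketch of that idea, not the code behind `make compute_ciderdf`; the function names, whitespace tokenization, and the input format (a dict from video id to a list of caption strings) are assumptions.

```python
from collections import Counter


def ngrams(tokens, max_n=4):
    """Return the set of all n-grams (n = 1..max_n) in one tokenized caption."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def document_frequency(refs_per_video, max_n=4):
    """Count, for each n-gram, the number of videos whose references contain it.

    An n-gram is counted at most once per video, no matter how many of that
    video's captions it appears in.
    """
    df = Counter()
    for captions in refs_per_video.values():
        seen = set()
        for caption in captions:
            # Assumed tokenization: lowercase and split on whitespace.
            seen |= ngrams(caption.lower().split(), max_n)
        df.update(seen)
    return df
```

With two videos whose references are `["a man is singing", "a man sings"]` and `["a dog is running"]`, the unigram `("a",)` gets a count of 2 while the bigram `("a", "man")` gets 1.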
Pre-compute evaluation scores (BLEU_4, CIDEr, METEOR, ROUGE_L) for each caption
make compute_evalscores
make train [options]
make test [options]
Please refer to the Makefile (and opts.py) for the full set of available train/test options.
Train XE model
make train GID=0 EXP_NAME=xe FEATS="resnet c3d mfcc category" USE_RL=0 USE_CST=0 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCHS=50
Train CST_GT_None/WXE model
make train GID=0 EXP_NAME=WXE FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=1 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCHS=50
Train CST_MS_Greedy model (using greedy baseline)
make train GID=0 EXP_NAME=CST_MS_Greedy FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=0 SCB_CAPTIONS=0 USE_MIXER=1 MIXER_FROM=1 USE_EOS=1 LOGLEVEL=DEBUG MAX_EPOCHS=200 START_FROM=output/model/WXE
Train CST_MS_SCB model (using SCB baseline, where SCB is computed from GT captions)
make train GID=0 EXP_NAME=CST_MS_SCB FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=1 USE_MIXER=1 MIXER_FROM=1 SCB_BASELINE=1 SCB_CAPTIONS=20 USE_EOS=1 LOGLEVEL=DEBUG MAX_EPOCHS=200 START_FROM=output/model/WXE
Train CST_MS_SCB(*) model (using SCB baseline, where SCB is computed from model sampled captions)
make train GID=0 MODEL_TYPE=concat EXP_NAME=CST_MS_SCBSTAR FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=1 USE_MIXER=1 MIXER_FROM=1 SCB_BASELINE=2 SCB_CAPTIONS=20 USE_EOS=1 LOGLEVEL=DEBUG MAX_EPOCHS=200 START_FROM=output/model/WXE
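The two SCB variants above differ only in which captions the baseline is computed from. As a rough illustration of how a baseline enters RL-style training, the sketch below subtracts a single baseline value from the reward of each sampled caption. It assumes the baseline is the plain mean of the per-caption scores of a caption set (ground-truth or model-sampled); that choice, and the function names, are assumptions for illustration and are not taken from this repository's code.

```python
def mean_baseline(scores):
    """Assumed baseline: the average score (e.g. CIDEr) over a set of captions."""
    return sum(scores) / len(scores)


def advantages(sample_scores, baseline_scores):
    """Subtract the baseline from each sampled caption's reward.

    sample_scores:   rewards of the captions sampled from the model
    baseline_scores: scores of the captions the baseline is computed from
    """
    b = mean_baseline(baseline_scores)
    return [s - b for s in sample_scores]
```

Captions that score above the baseline get a positive weight and those below get a negative one, which is what lets training push probability toward the better samples.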
To change the input features, modify the FEATS variable in the commands above.
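When comparing several feature subsets, it can be convenient to generate the `make train` invocations from a script. The helper below is a hypothetical convenience wrapper, not part of the project; it only assembles the `VAR=value` arguments already shown in the commands above.

```python
import shlex


def make_train_command(feats, exp_name="xe", gid=0, **options):
    """Build the argument list for `make train` with the given feature names.

    `feats` is a list such as ["resnet", "c3d"]; extra keyword arguments are
    passed through unchanged as VAR=value pairs (e.g. USE_RL=0).
    """
    args = ["make", "train", f"GID={gid}", f"EXP_NAME={exp_name}",
            "FEATS=" + " ".join(feats)]
    args += [f"{key}={value}" for key, value in options.items()]
    return args


def as_shell(args):
    """Render the argument list as one safely quoted shell command line."""
    return shlex.join(args)
```

The list form can be passed directly to `subprocess.run`, while `as_shell` is useful for logging the exact command that was run.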
@article{cst_phan2017,
author = {Sang Phan and Gustav Eje Henter and Yusuke Miyao and Shin'ichi Satoh},
title = {Consensus-based Sequence Training for Video Captioning},
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1712.09532},
year = {2017},
}
- Test on Youtube2Text dataset (different number of captions per video)
- Torch implementation of NeuralTalk2
- PyTorch implementation of Self-critical Sequence Training for Image Captioning (SCST)
- PyTorch Team