Paper | TIP 2022
Figure 1. Overview of the proposed PKOL architecture for video question answering.
- Ubuntu 20.04
- CUDA 11.5
- Python 3.7
- PyTorch 1.7.0+cu110
- Clone this repository:
```bash
git clone https://github.com/zchoi/PKOL.git
```
- Install dependencies:
```bash
conda create -n vqa python=3.7
conda activate vqa
pip install -r requirements.txt
```
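After installing, a minimal sanity check of the environment (the expected version strings in the comments follow the prerequisites above; adjust if your setup differs):

```python
# Quick environment check; expected values follow the prerequisites listed above.
import sys
import torch

print(f"Python : {sys.version.split()[0]}")        # expected 3.7.x
print(f"PyTorch: {torch.__version__}")             # expected 1.7.0+cu110
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```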
- Download pre-extracted text features from here (code: zca5), and place them into `data/{dataset}-qa/` for MSVD-QA and MSRVTT-QA, or `data/tgif-qa/{question_type}/` for TGIF-QA, respectively.
- For appearance and motion features, we used this repo [1].
- For object features, we used Faster R-CNN [2] pre-trained on Visual Genome [3]. Download pre-extracted visual features from here (code: zca5), and place them into `data/{dataset}-qa/` for MSVD-QA and MSRVTT-QA, or `data/tgif-qa/{question_type}/` for TGIF-QA, respectively.
Important: The object features are large (especially ~700 GB for TGIF-QA), so make sure you have enough disk space before downloading. A quick layout check is sketched below.
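A minimal sketch for verifying the expected data layout before training; the concrete directory names (`data/msvd-qa`, `data/msrvtt-qa`, `data/tgif-qa/{question_type}`) are assumptions based on the placeholders above, so adjust them to match your download:

```python
# Check that the feature directories exist; directory names are assumed from
# the data/{dataset}-qa/ and data/tgif-qa/{question_type}/ placeholders above.
from pathlib import Path

datasets = ["data/msvd-qa", "data/msrvtt-qa"]
tgif_tasks = ["action", "transition", "count", "frameqa"]

for d in datasets + [f"data/tgif-qa/{t}" for t in tgif_tasks]:
    status = "ok" if Path(d).is_dir() else "MISSING"
    print(f"{d:<30} {status}")
```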
Training:
```bash
python train_iterative.py --cfg configs/msvd_qa.yml
```
Evaluation:
```bash
python validate_iterative.py --cfg configs/msvd_qa.yml
```
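Each `--cfg` file is a plain YAML config. A minimal sketch of inspecting or tweaking one programmatically; the `train.batch_size` key below is hypothetical and not the repo's actual schema:

```python
# Hypothetical sketch of reading a --cfg YAML file; the keys shown are
# illustrative only, not the repo's actual config schema.
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--cfg", required=True, help="path to a configs/*.yml file")
args = parser.parse_args()

with open(args.cfg) as f:
    cfg = yaml.safe_load(f)  # plain dict of config options

# Read a (hypothetical) option with a fallback default.
batch_size = cfg.get("train", {}).get("batch_size", 32)
print(f"Loaded {args.cfg}: batch_size={batch_size}")
```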
Choose a suitable config file in `configs/{task}.yml` for one of the 4 tasks (action, transition, count, frameqa) to train/validate the model. For example, to train on the action task, run the following commands:
Training:
```bash
python train_iterative.py --cfg configs/tgif_qa_action.yml
```
Evaluation:
```bash
python validate_iterative.py --cfg configs/tgif_qa_action.yml
```
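To sweep all four TGIF-QA tasks in one go, a minimal sketch, assuming the config naming pattern `configs/tgif_qa_{task}.yml` generalizes from the action example above:

```python
# Train on every TGIF-QA task in sequence; the config file naming pattern is
# assumed from the tgif_qa_action.yml example above.
import subprocess

for task in ["action", "transition", "count", "frameqa"]:
    cfg = f"configs/tgif_qa_{task}.yml"
    print(f"=== training {task} ({cfg}) ===")
    subprocess.run(["python", "train_iterative.py", "--cfg", cfg], check=True)
```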
Performance on MSVD-QA and MSRVTT-QA datasets (accuracy, %):

| Model | MSVD-QA | MSRVTT-QA |
| --- | --- | --- |
| PKOL | 41.1 | 36.9 |
Performance on TGIF-QA dataset:

| Model | Count ↓ | FrameQA ↑ | Trans. ↑ | Action ↑ |
| --- | --- | --- | --- | --- |
| PKOL | 3.67 | 61.8 | 82.8 | 74.6 |
[1] Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems 28 (2015).
[3] Krishna, Ranjay, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123.1 (2017): 32-73.
@article{PKOL,
  title   = {Video Question Answering with Prior Knowledge and Object-sensitive Learning},
  author  = {Pengpeng Zeng and Haonan Zhang and Lianli Gao and Jingkuan Song and Heng Tao Shen},
  journal = {IEEE Transactions on Image Processing},
  doi     = {10.1109/TIP.2022.3205212},
  pages   = {5936--5948},
  year    = {2022}
}
Our code implementation is based on this repo.