
Video Question Answering with Prior Knowledge and Object-sensitive Learning

Paper | TIP 2022

Figure 1. Overview of the proposed PKOL architecture for video question answering.

Table of Contents

  • Setups
  • Data Preparation
  • Experiments
  • Results
  • Reference
  • Citation
  • Acknowledgements

Setups

  • Ubuntu 20.04
  • CUDA 11.5
  • Python 3.7
  • PyTorch 1.7.0 + cu110
  1. Clone this repository:
git clone https://github.com/zchoi/PKOL.git
  2. Install dependencies:
conda create -n vqa python=3.7
conda activate vqa
pip install -r requirements.txt
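
To confirm that the environment matches the versions listed above, a quick check can help. This is a minimal sketch (the script name is illustrative, not part of the repo); it only assumes PyTorch was installed per the steps above:

```python
# env_check.py — print environment versions (illustrative sketch)
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")       # expected: 3.7.x
print(f"PyTorch: {torch.__version__}")            # expected: 1.7.0+cu110
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```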

Data Preparation

  • Text Features

    Download the pre-extracted text features from here (code: zca5) and place them under data/{dataset}-qa/ for MSVD-QA and MSRVTT-QA, or under data/tgif-qa/{question_type}/ for TGIF-QA.

  • Visual Features

    • For appearance and motion features, we used this repo [1].

    • For object features, we used a Faster R-CNN [2] pre-trained on Visual Genome [3].

    Download the pre-extracted visual features from here (code: zca5) and place them under data/{dataset}-qa/ for MSVD-QA and MSRVTT-QA, or under data/tgif-qa/{question_type}/ for TGIF-QA.

Important

The object features are very large (roughly 700 GB for TGIF-QA alone), so make sure you have enough disk space before downloading.
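
Before moving on to training, it can be useful to verify that the downloaded features landed in the expected directories. A minimal sketch follows, assuming {dataset} expands to msvd / msrvtt and using the four TGIF-QA question types listed in the Experiments section; it only checks directories, since the exact feature filenames are not listed here:

```python
# check_data_layout.py — verify expected data directories exist (sketch)
from pathlib import Path

# Directories implied by the Data Preparation section above
# (the msvd-qa / msrvtt-qa names are assumed expansions of {dataset}-qa).
expected = [
    "data/msvd-qa",
    "data/msrvtt-qa",
    # TGIF-QA uses one sub-directory per question type:
    "data/tgif-qa/action",
    "data/tgif-qa/transition",
    "data/tgif-qa/count",
    "data/tgif-qa/frameqa",
]

for d in expected:
    status = "ok" if Path(d).is_dir() else "MISSING"
    print(f"{status:8s}{d}")
```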

Experiments

For MSVD-QA and MSRVTT-QA:

Training:

python train_iterative.py --cfg configs/msvd_qa.yml

Evaluation:

python validate_iterative.py --cfg configs/msvd_qa.yml
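
The --cfg flag points at a YAML config file. To inspect the options before launching a run, a small sketch like the following works (it assumes PyYAML is available in the environment; the keys printed are whatever the config defines, not a schema documented here):

```python
# inspect_cfg.py — dump the options in a config file (illustrative sketch)
import yaml  # PyYAML

with open("configs/msvd_qa.yml") as f:
    cfg = yaml.safe_load(f)

# Print every top-level option defined in the config.
for key, value in cfg.items():
    print(f"{key}: {value}")
```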

For TGIF-QA:

Choose the corresponding config file configs/{task}.yml for one of the four tasks: action, transition, count, or frameqa, to train or evaluate the model. For example, for the action task:

Training:

python train_iterative.py --cfg configs/tgif_qa_action.yml

Evaluation:

python validate_iterative.py --cfg configs/tgif_qa_action.yml
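
To train all four TGIF-QA tasks back to back, a small driver can loop over the configs. This is a sketch (the script name is hypothetical), assuming all four config files follow the configs/tgif_qa_{task}.yml naming shown above:

```python
# run_tgif_tasks.py — train each TGIF-QA task in sequence (sketch)
import subprocess

TASKS = ["action", "transition", "count", "frameqa"]

for task in TASKS:
    cfg = f"configs/tgif_qa_{task}.yml"
    print(f"=== Training {task} with {cfg} ===")
    # check=True stops the loop if any run fails.
    subprocess.run(["python", "train_iterative.py", "--cfg", cfg], check=True)
```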

Results

Performance on the MSVD-QA and MSRVTT-QA datasets (accuracy, %):

| Model | MSVD-QA | MSRVTT-QA |
| ----- | ------- | --------- |
| PKOL  | 41.1    | 36.9      |

Performance on the TGIF-QA dataset (Count is reported as mean squared error, lower is better; the other tasks as accuracy, %):

| Model | Count ↓ | FrameQA ↑ | Trans. ↑ | Action ↑ |
| ----- | ------- | --------- | -------- | -------- |
| PKOL  | 3.67    | 61.8      | 82.8     | 74.6     |

Reference

[1] Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems 28 (2015).

[3] Krishna, Ranjay, et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." International journal of computer vision 123.1 (2017): 32-73.

Citation

@article{PKOL,
  title   = {Video Question Answering with Prior Knowledge and Object-sensitive Learning},
  author  = {Pengpeng Zeng and 
             Haonan Zhang and 
             Lianli Gao and 
             Jingkuan Song and 
             Heng Tao Shen
             },
  journal = {IEEE Transactions on Image Processing},
  doi     = {10.1109/TIP.2022.3205212},
  volume  = {31},
  pages   = {5936--5948},
  year    = {2022}
}

Acknowledgements

Our implementation is based on this repo.
