This is the PyTorch code for our paper "Cross-view and Multi-step Interaction for Change Captioning" (under review).
- Clone this repository
- cd CVMSI
- Create a virtual environment with Python 3.10.14
- Install the requirements:
```
pip install -r requirements.txt
```
- Set up the COCO caption eval tools (github); a minimal usage sketch is given at the end of this README.
- An NVIDIA 4090 GPU or a comparable one.
- Download data from the Baidu drive link.
- Download the CLEVR-Change dataset from RobustChangeCaptioning.
- Extract visual features using an ImageNet-pretrained ResNet-101:
```
# processing default images
python scripts/extract_features.py --input_image_dir ./data/images --output_dir ./data/features --batch_size 128

# processing semantically changed images
python scripts/extract_features.py --input_image_dir ./data/sc_images --output_dir ./data/sc_features --batch_size 128

# processing distractor images
python scripts/extract_features.py --input_image_dir ./data/nsc_images --output_dir ./data/nsc_features --batch_size 128
```
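For reference, here is a minimal sketch of ImageNet-pretrained ResNet-101 feature extraction with torchvision. The layer cut, input size, normalization, and the `extract` helper are illustrative assumptions, not the actual internals of scripts/extract_features.py:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# ImageNet-pretrained ResNet-101 with the avgpool/fc head removed,
# so the output is the final convolutional feature map (illustrative cut).
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval().to(device)

preprocess = T.Compose([
    T.Resize((224, 224)),  # input size is an assumption
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    return backbone(img)  # shape (1, 2048, 7, 7) for a 224x224 input
```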
- We provide pre-trained weights; download them from the Baidu drive link.
- Test the pre-trained model:
```
python test_trans_c.py --cfg configs/transformer-c.yaml --snapshot 25000 --gpu 0
```
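The COCO caption eval tools set up earlier are typically driven as sketched below. The image ids and captions are made up for illustration, and the repository's own evaluation wiring may differ; note that the METEOR scorer also needs a Java runtime:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Ground-truth and generated captions keyed by image id; each value is a
# list of caption strings (ids and captions here are illustrative only).
gts = {"img_0": ["the red cube moved to the left", "a red cube changed position"]}
res = {"img_0": ["the red cube has moved"]}

# Each scorer returns (corpus-level score, per-image scores); Bleu(4)
# reports BLEU-1 through BLEU-4 as a list.
for scorer, name in [(Bleu(4), "BLEU"), (Meteor(), "METEOR"),
                     (Rouge(), "ROUGE_L"), (Cider(), "CIDEr")]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```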