Official implementation for the paper "Exploring Discrete Diffusion Models for Image Captioning"
You can use Docker. Alternatively, you can create a conda environment and install the dependencies:
conda env create -f environment.yml
or
bash install_req.sh
or
pip install -r requirements.txt
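After installing by any of these routes, a quick sanity check can confirm the environment is usable. This is a minimal sketch, assuming PyTorch was installed with CUDA support (it is not part of the repository):

```python
import torch

# Confirm that PyTorch is installed and that the GPUs needed for
# distributed training are visible.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```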
Download train_captions.
Download the training and validation images and unzip them (we use the Karpathy et al. split), arranging them as shown below; a small layout check follows the tree.
Download oscar_split_ViT-B_32_train_512.pkl and place it in ./data/coco/.
MSCOCO_Caption/
├── annotations/
│   ├── captions_train2014.json
│   └── captions_val2014.json
├── train2014/
│   ├── COCO_train2014_000000000009.jpg
│   └── ......
└── val2014/
    ├── COCO_val2014_000000000042.jpg
    └── ......
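If you want to verify the layout before training, a sketch like the following can help. The root path MSCOCO_Caption is an assumption here; point it at wherever you unzipped the data:

```python
from pathlib import Path

# Hypothetical root directory; adjust to your own unzipped data location.
root = Path("MSCOCO_Caption")

expected = [
    root / "annotations" / "captions_train2014.json",
    root / "annotations" / "captions_val2014.json",
    root / "train2014",
    root / "val2014",
]
for path in expected:
    status = "ok" if path.exists() else "MISSING"
    print(f"{path}: {status}")
```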
Change the working directory and set up the evaluation code:
cd ./captioneval/coco_caption
bash ./get_stanford_models.sh
To train the model on 8 GPUs, run:
MKL_THREADING_LAYER=GPU python -m torch.distributed.launch --nproc_per_node 8 train.py --out_dir /results_diff --tag caption_diff_vitb16
If you want to train the model with a trainable CLIP image encoder, use the following command:
MKL_THREADING_LAYER=GPU python -m torch.distributed.launch --nproc_per_node 8 train_tclip.py --out_dir /results_diff --tag caption_diff_vitb16
Please note that we detach the gradients of the [CLS] tokens during training of the CLIP model, because we observe that when the image encoder (CLIP) is trainable, backpropagating gradients through the [CLS] tokens damages the training of the image encoder.
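For intuition, the idea looks roughly like the sketch below. This is a minimal illustration, not the repository's actual code; it assumes a ViT-style CLIP encoder whose output places the [CLS] token first, followed by the patch tokens:

```python
import torch

def detach_cls_token(visual_tokens: torch.Tensor) -> torch.Tensor:
    """Stop gradients from flowing back into the image encoder through the
    [CLS] token, while leaving the patch tokens trainable.

    visual_tokens: (batch, 1 + num_patches, dim), with the [CLS] token first.
    """
    cls_token = visual_tokens[:, :1, :].detach()  # no gradient through [CLS]
    patch_tokens = visual_tokens[:, 1:, :]        # gradients flow as usual
    return torch.cat([cls_token, patch_tokens], dim=1)
```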
If you use this code for your research, please cite:
@article{zhu2022exploring,
title={Exploring Discrete Diffusion Models for Image Captioning},
author={Zhu, Zixin and Wei, Yixuan and Wang, Jianfeng and Gan, Zhe and Zhang, Zheng and Wang, Le and Hua, Gang and Wang, Lijuan and Liu, Zicheng and Hu, Han},
journal={arXiv preprint arXiv:2211.11694},
year={2022}
}
This repository is heavily based on the CLIP, CLIP_prefix_caption, and Hugging Face repositories. For training, we used the COCO dataset.