Official implementation for the paper "Exploring Discrete Diffusion Models for Image Captioning"
You can use Docker. Alternatively, you can create a conda environment and install the dependencies:
conda env create -f environment.yml
or
bash install_req.sh
or
pip install -r requirements.txt
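After installing by any of these routes, a quick sanity check can confirm the environment is usable. This is a minimal sketch, assuming PyTorch was installed with CUDA support (it is not part of the repository):

```python
import torch

# Confirm that PyTorch is installed and that the GPUs needed for
# distributed training are visible.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```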
Download train_captions.
Download the training and validation images and unzip them (we use the Karpathy et al. split), arranging them as shown below; a small layout check follows the tree.
Download oscar_split_ViT-B_32_train_512.pkl and place it in ./data/coco/.
MSCOCO_Caption/
├── annotations/
│   ├── captions_train2014.json
│   └── captions_val2014.json
├── train2014/
│   ├── COCO_train2014_000000000009.jpg
│   └── ......
└── val2014/
    ├── COCO_val2014_000000000042.jpg
    └── ......
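If you want to verify the layout before training, a sketch like the following can help. The root path MSCOCO_Caption is an assumption here; point it at wherever you unzipped the data:

```python
from pathlib import Path

# Hypothetical root directory; adjust to your own unzipped data location.
root = Path("MSCOCO_Caption")

expected = [
    root / "annotations" / "captions_train2014.json",
    root / "annotations" / "captions_val2014.json",
    root / "train2014",
    root / "val2014",
]
for path in expected:
    status = "ok" if path.exists() else "MISSING"
    print(f"{path}: {status}")
```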
Change the working directory and set up the evaluation code:
cd ./captioneval/coco_caption
bash ./get_stanford_models.sh
To train the model on 8 GPUs, run:
MKL_THREADING_LAYER=GPU python -m torch.distributed.launch --nproc_per_node 8 train.py --out_dir /results_diff --tag caption_diff_vitb16
If you want to train the model with a trainable CLIP image encoder, use the following command:
MKL_THREADING_LAYER=GPU python -m torch.distributed.launch --nproc_per_node 8 train_tclip.py --out_dir /results_diff --tag caption_diff_vitb16
Please note that we detach the gradients of the [CLS] tokens during training of the CLIP model, because we observe that when the image encoder (CLIP) is trainable, backpropagating gradients through the [CLS] tokens damages the training of the image encoder.
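For intuition, the idea looks roughly like the sketch below. This is a minimal illustration, not the repository's actual code; it assumes a ViT-style CLIP encoder whose output places the [CLS] token first, followed by the patch tokens:

```python
import torch

def detach_cls_token(visual_tokens: torch.Tensor) -> torch.Tensor:
    """Stop gradients from flowing back into the image encoder through the
    [CLS] token, while leaving the patch tokens trainable.

    visual_tokens: (batch, 1 + num_patches, dim), with the [CLS] token first.
    """
    cls_token = visual_tokens[:, :1, :].detach()  # no gradient through [CLS]
    patch_tokens = visual_tokens[:, 1:, :]        # gradients flow as usual
    return torch.cat([cls_token, patch_tokens], dim=1)
```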
If you use this code for your research, please cite:
@article{zhu2022exploring,
title={Exploring Discrete Diffusion Models for Image Captioning},
author={Zhu, Zixin and Wei, Yixuan and Wang, Jianfeng and Gan, Zhe and Zhang, Zheng and Wang, Le and Hua, Gang and Wang, Lijuan and Liu, Zicheng and Hu, Han},
journal={arXiv preprint arXiv:2211.11694},
year={2022}
}
This repository is heavily based on the CLIP, CLIP_prefix_caption, and Hugging Face repositories. For training, we used the COCO dataset.