【ICCV'2023 🔥】DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

The implementation of the paper DiffusionRet: Generative Text-Video Retrieval with Diffusion Model.

In this paper, we propose a novel diffusion-based text-video retrieval framework, called DiffusionRet, which addresses the limitations of current discriminative solutions from a generative perspective.

📌 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@inproceedings{jin2023diffusionret,
  title={DiffusionRet: Generative Text-Video Retrieval with Diffusion Model},
  author={Jin, Peng and Li, Hao and Cheng, Zesen and Li, Kehan and Ji, Xiangyang and Liu, Chang and Yuan, Li and Chen, Jie},
  booktitle={ICCV},
  pages={2470-2481},
  year={2023}
}

💡 I also have other text-video retrieval projects that may interest you ✨.

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Accepted by CVPR 2023 (Highlight) | [HBI Code]
Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Accepted by NeurIPS 2022 | [EMCL Code]
Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, Jie Chen

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Accepted by IJCAI 2023 | [DiCoSA Code]
Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

📣 Updates

[2023/08/27]: We release the training code.
[2023/07/14]: Our DiffusionRet has been accepted by ICCV 2023! We will release the training code asap.
[2023/06/28]: We release the inference code.
[2023/03/31]: Our paper is under review. After our paper is published, we will release the code as soon as possible.

📕 Overview

Existing text-video retrieval solutions are, in essence, discriminative models that maximize the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we tackle the task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise.
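
To make the distinction between p(candidates|query) and p(candidates,query) concrete, here is a toy sketch. It is not the repository's code and does not involve diffusion at all: the similarity values and the helper names `conditional` and `joint` are invented for illustration. It only shows that a row-wise softmax stays normalized for every query, while a joint distribution also carries the marginal p(query), which is small for a query that matches nothing well.

```python
import math

def conditional(sim):
    """Row-wise softmax: p(candidate | query) for each query (one row per query)."""
    out = []
    for row in sim:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def joint(sim):
    """Softmax over all entries: a toy stand-in for p(candidate, query)."""
    m = max(max(row) for row in sim)
    exps = [[math.exp(s - m) for s in row] for row in sim]
    z = sum(sum(row) for row in exps)
    return [[e / z for e in row] for row in exps]

# Hypothetical similarity scores: 3 text queries x 3 candidate videos.
# The third query matches no video well, as an out-of-distribution query might.
sim = [
    [4.0, 0.5, 0.2],
    [0.3, 3.5, 0.4],
    [0.1, 0.2, 0.3],
]

p_cond = conditional(sim)
p_joint = joint(sim)
p_query = [sum(row) for row in p_joint]  # marginal p(query), summed over candidates

for i, (cond_row, pq) in enumerate(zip(p_cond, p_query)):
    print(f"query {i}: max p(c|q) = {max(cond_row):.3f}, p(q) = {pq:.3f}")
```

In this toy setting the joint factorizes as p(candidates,query) = p(candidates|query) · p(query), so the conditional alone cannot reveal that the third query is unlikely, whereas the marginal can.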

🚀 Quick Start

Setup

Setup code environment

conda create -n DiffusionRet python=3.9
conda activate DiffusionRet
pip install -r requirements.txt
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
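
After installation, a quick check that the pinned versions are the ones actually present can save debugging time later. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_environment` are hypothetical, and the exact version strings reported on your machine may differ from the pins above depending on how the wheels were built.

```python
from importlib import metadata

# Versions pinned in the pip command above.
EXPECTED = {"torch": "1.8.1+cu102", "torchvision": "0.9.1+cu102"}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_environment(expected=EXPECTED):
    """Map each package name to (installed version, whether it matches the pin)."""
    report = {}
    for name, want in expected.items():
        have = installed_version(name)
        report[name] = (have, have == want)
    return report

for name, (have, ok) in check_environment().items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={have}, expected={EXPECTED[name]} [{status}]")
```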

Download CLIP Model

cd DiffusionRet/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
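
The long hexadecimal path component in each URL above appears to be the SHA-256 digest of the checkpoint file, which is how the official CLIP loader validates its own downloads. Since `wget` does not check it, the sketch below verifies a downloaded file by hand. The helper names (`expected_sha256`, `file_sha256`, `verify_checkpoint`) are hypothetical, and the assumption that the digest is the second-to-last path component is stated in the code.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse

CLIP_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)

def expected_sha256(url):
    """Return the second-to-last URL path component (assumed to be the digest)."""
    return urlparse(url).path.rstrip("/").split("/")[-2]

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large checkpoint is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoint(path, url):
    """True if the file's SHA-256 matches the digest embedded in the URL."""
    return file_sha256(path) == expected_sha256(url)

if __name__ == "__main__":
    ckpt = Path("ViT-B-32.pt")  # run from DiffusionRet/models after the wget step
    if ckpt.exists():
        print("checksum OK" if verify_checkpoint(ckpt, CLIP_URL) else "checksum MISMATCH")
    else:
        print(f"{ckpt} not found; download it first")
```

A mismatch usually points to a truncated download, in which case re-running `wget` is the simplest fix.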

Download Datasets

Datasets	Google Cloud	Baidu Yun	Peking University Yun
MSR-VTT	Download	Download	Download
MSVD	Download	Download	Download
ActivityNet	TODO	Download	Download
DiDeMo	TODO	Download	Download

Model Zoo

Checkpoint	Google Cloud	Baidu Yun	Peking University Yun
MSR-VTT	Download	Download	Download
ActivityNet	Download	Download	Download

Evaluate

Eval on MSR-VTT

CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=1 \
eval.py \
--workers 8 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--diffusion_steps 50 \
--noise_schedule cosine \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}

Eval on ActivityNet Captions

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
eval.py \
--workers 8 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--diffusion_steps 50 \
--noise_schedule cosine \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}

Train

Discrimination Pretrain

Train the feature extractor from the discrimination perspective.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--stage discrimination \
--output_dir ${OUTPUT_PATH}

Generation Finetune

Optimize the generator from the generation perspective.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--stage generation \
--diffusion_steps 50 \
--noise_schedule cosine \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
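
The `--diffusion_steps 50` and `--noise_schedule cosine` flags set the length and shape of the diffusion process. As a rough illustration only, and assuming the flag refers to the widely used cosine schedule from the improved-DDPM formulation (this has not been checked against the repository's `diffusion_models` code), the per-step noise levels could be computed as below. The function name `cosine_betas`, the offset `s = 0.008`, and the clip at `0.999` come from that general formulation and are assumptions here.

```python
import math

def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Per-step noise variances beta_t derived from a cosine-shaped alpha_bar curve."""
    def alpha_bar(t):
        # Fraction of signal retained at normalized time t in [0, 1].
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # beta_t = 1 - alpha_bar(t2) / alpha_bar(t1), clipped for numerical stability.
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

betas = cosine_betas(50)  # matches --diffusion_steps 50

# Cumulative signal retention: product of (1 - beta_k) up to each step.
alphas_cumprod = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alphas_cumprod.append(running)

print(f"first beta: {betas[0]:.5f}, last beta: {betas[-1]:.5f}")
print(f"signal kept after step 1: {alphas_cumprod[0]:.4f}")
print(f"signal kept after step 50: {alphas_cumprod[-1]:.2e}")
```

Under this schedule noise is added slowly at first and quickly near the end, so the final step leaves almost pure noise, which is the starting point for the reverse (generation) process.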

🎗️ Acknowledgments

Our code is based on EMCL, CLIP, CLIP4Clip and DRL. We sincerely appreciate their contributions.

