Multilingual text-video retrieval methods have improved significantly in recent years, but performance in other languages still lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross entropy based objective which forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers.
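To make the objective concrete, here is a minimal sketch of a cross entropy distillation loss over text-video similarity scores. It is an illustration only, not the repository's exact implementation: the tensor names and the temperature value are assumptions.

# A minimal sketch (not the repository's implementation) of a cross entropy
# distillation loss over text-video similarity scores. `student_sims` and
# `teacher_sims` are hypothetical [batch_size x num_videos] similarity matrices:
# the teacher scores come from English text, the student scores from the same
# videos paired with text in another language. The temperature is an assumption.
import torch
import torch.nn.functional as F

def distillation_loss(student_sims: torch.Tensor,
                      teacher_sims: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    # Soften the teacher's similarity scores into a target distribution over the
    # candidate videos, then push the student's distribution toward it.
    targets = F.softmax(teacher_sims / temperature, dim=-1)
    log_probs = F.log_softmax(student_sims / temperature, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Toy example: 8 text queries scored against 8 candidate videos.
print(distillation_loss(torch.randn(8, 8), torch.randn(8, 8)))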
Check out our ICASSP presentation on YouTube!
We support two demos:
(1) Multilingual text-video retrieval: given a text query and a candidate set of videos, rank the videos according to the text-video similarity.
(2) Multilingual text-video moment detection: given a text query and clips from a single video, find the most relevant clips in the video according to the text-video similarity.
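Both demos reduce to the same ranking step: score every candidate (whole videos in the first demo, clips of one video in the second) against the text query and sort by similarity. A minimal sketch, assuming hypothetical embeddings that the model has already produced:

# A minimal sketch of the ranking step shared by both demos, assuming hypothetical
# text and candidate embeddings that the model has already produced.
import torch
import torch.nn.functional as F

def rank_candidates(text_emb: torch.Tensor, candidate_embs: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between one text query [dim] and N candidates [N, dim],
    # returned as indices sorted from most to least similar. The candidates are
    # whole videos for retrieval and clips of a single video for moment detection.
    text_emb = F.normalize(text_emb, dim=-1)
    candidate_embs = F.normalize(candidate_embs, dim=-1)
    scores = candidate_embs @ text_emb
    return torch.argsort(scores, descending=True)

# Toy example: a 512-dimensional query against 10 candidates.
print(rank_candidates(torch.randn(512), torch.randn(10, 512)))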
The model we demo was trained on Multi-MSRVTT on text-video pairs in English, Dutch, French, Mandarin, Czech, Russian, Vietnamese, Swahili, and Spanish. However, thanks to LaBSE's pre-training on over 100 languages (https://aclanthology.org/2022.acl-long.62.pdf), text-video retrieval works in many more languages, such as Ukrainian and Igbo (shown in the demo). You can try it in whatever language you speak or write.
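For intuition on why unseen languages still work: LaBSE maps text from different languages into a shared embedding space, so a query in Ukrainian is encoded the same way as one in English. Below is a minimal sketch of embedding multilingual queries with the Hugging Face LaBSE checkpoint; the repository's own text pipeline may differ, and the example sentences are made up.

# A minimal sketch of embedding multilingual queries with the Hugging Face LaBSE
# checkpoint (sentence-transformers/LaBSE). Not the repository's exact pipeline.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE")

queries = ["someone is chopping onions", "хтось ріже цибулю"]  # English and Ukrainian
inputs = tokenizer(queries, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# LaBSE sentence embeddings are the pooled [CLS] output, L2-normalized.
embeddings = F.normalize(outputs.pooler_output, dim=-1)
print(embeddings @ embeddings.T)  # the two queries should be highly similar across languages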
Multilingual text-video retrieval demo:
Multilingual video moment detection demo:
This repository contains:
- code for the main experiments
- model weights to obtain main results
- data for fine-tuning and evaluation on the Multi-MSRVTT, Multi-YouCook2, VATEX, and RUDDER datasets
- Create an environment (tested on May 1st, 2023):
  conda create python=3.8 -y -n c2kd
  conda activate c2kd
  conda install -y pytorch==1.11.0 cudatoolkit=10.2 -c pytorch
  pip install numpy==1.19.2 transformers==4.16.2 librosa==0.8.1 timm==0.5.4 scipy==1.5.2 gensim==3.8.3 sacred==0.8.2 humanize==3.14.0 braceexpand typing-extensions psutil ipdb dominate
  # optional - for neptune.ai experiment logging
  pip install numpy==1.19.2 neptune-sacred
- Download the model weights here and the data here. Extract the tars:
  mkdir data && tar -xvf data.tar.gz -C data
  mkdir weights && tar -xvf weights.tar.gz -C weights
  They should end up in the data and weights directories, respectively.
- See ./scripts/ for the commands to train the models with our proposed C2KD knowledge distillation, as well as the baseline translate-train and zero-shot (English-only training) methods.
- Note: the results in the paper are the average of 3 runs, so your results might be slightly different than ours.
- Note: for YouCook2, the final results are reported with S3D features from MIL-NCE as the performance was better than with the CLIP features. We include S3D and CLIP features for YouCook2 and MSR-VTT.
This repository uses Sacred with neptune.ai for logging and tracking experiments. If you want to activate this:
- Create a neptune.ai account.
- Create a project and copy your credentials (api_token, project_name) into train.py
- Add the --neptune flag to the training command (e.g. python train.py --neptune ...)
If you use this code in your research, please cite:
@inproceedings{rouditchenko2023c2kd,
title={C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval},
author={Rouditchenko, Andrew and Chuang, Yung-Sung and Shvetsova, Nina and Thomas, Samuel and Feris, Rogerio and Kingsbury, Brian and Karlinsky, Leonid and Harwath, David and Kuehne, Hilde and Glass, James},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}
If you have any problems with the code or have a question, please open an issue or send an email.
The main structure of the code is based on everything-at-once https://github.com/ninatu/everything_at_once and frozen-in-time https://github.com/m-bain/frozen-in-time, which itself is based on the pytorch-template https://github.com/victoresque/pytorch-template.
The code in davenet.py, layers.py, and avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).