This is the implementation for the paper "Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval"
torch >= 1.7.1
transformers
opencv-python
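For example, the dependencies can be installed with pip:

pip install "torch>=1.7.1" transformers opencv-python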
The pretrained models used in DASD (CLIP & mBERT, for initialization) can be downloaded here:
unzip pretrained_model.zip
If you do not want the dataset and code to be placed together, please modify the 'datapath' parameter in the configuration file.
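For example, with jq installed, the 'datapath' entry can be updated in place (the path below is a placeholder; point it at wherever your data actually lives):

jq '.datapath = "/path/to/your/data"' expr/vitb32/CMA/config.json > tmp.json && mv tmp.json expr/vitb32/CMA/config.json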
Download the captions used in our experiments and unzip them to ./dataset/:
unzip dataset.zip
Conceptual Caption images can be crawled here. After crawling, place all images under dataset/ConceptualCaption/images.
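If a crawler is needed, the sketch below is one minimal way to do it with wget, assuming a hypothetical urls.txt with one image URL per line (the file name and the image naming scheme are illustrative; adapt them to match what the captions file expects):

mkdir -p dataset/ConceptualCaption/images
# Number each URL and save the image under that index (naming is illustrative).
nl -ba urls.txt | while read idx url; do
    wget -q -c "$url" -O "dataset/ConceptualCaption/images/${idx}.jpg"
done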
CC300K is also used to train the released models. This subset can be found at dataset/ConceptualCaption/cc300k.json.
Flickr30K images can be requested here. Untar the archive to dataset/Multi30k:
mkdir -p dataset/Multi30k
tar -xzvf flickr30k_images.tar.gz -C dataset/Multi30k
MSCOCO images can be downloaded and prepared with the following scripts:
wget -c http://images.cocodataset.org/zips/train2014.zip
wget -c http://images.cocodataset.org/zips/val2014.zip
wget -c http://images.cocodataset.org/zips/test2014.zip
mkdir -p dataset/MSCOCO/images
unzip -d dataset/MSCOCO/images train2014.zip
unzip -d dataset/MSCOCO/images val2014.zip
unzip -d dataset/MSCOCO/images test2014.zip
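As a quick sanity check, the standard MSCOCO 2014 splits contain 82,783 train, 40,504 val, and 40,775 test images:

ls dataset/MSCOCO/images/train2014 | wc -l   # expected: 82783
ls dataset/MSCOCO/images/val2014 | wc -l     # expected: 40504
ls dataset/MSCOCO/images/test2014 | wc -l    # expected: 40775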
We conduct experiments under two CCR settings:
(1) Cross-lingual Finetune: we first train models on the English data of the Downstream Task Dataset (DTD) and then further finetune them with target-language data produced by machine translation (MT) tools. Finally, models are tested on the target-language test data of the DTD.
(2) Zero-shot: models are trained on commonly-used datasets (e.g., CC300K) and then directly evaluated on the DTD without any DTD finetuning.
Under the Cross-lingual Finetune setting, we train the model using the following scripts:
# Finetune on DTD English data:
bash train.sh expr/vitb32/Cross-lingual_Finetune/config.json 0
# For cross-lingual cross-modal alignment:
bash CLCMA.sh 0
Under the Zero-shot setting, we train the model using the following scripts:
# For cross-lingual cross-modal alignment:
bash CLCMA.sh 0
For both settings, please specify the training dataset in the corresponding configuration file (config.json).
For both settings, we use the same command for evaluation:
bash inference.sh expr/vitb32/CMA/config.json 0
You can specify the test dataset and trained model in the corresponding configuration file (config.json).
We release some checkpoints trained on Multi30k and MSCOCO, which can be obtained here.
If you find the package useful, please consider citing our paper:
@inproceedings{Cai2025Dynamic,
title={Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval},
author={Rui Cai and Zhiyu Dong and Jianfeng Dong and Xun Wang},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2025}
}