This is the implementation for the paper "Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval"
torch >= 1.7.1
transformers
opencv-python
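For example, the dependencies can be installed with pip:

pip install "torch>=1.7.1" transformers opencv-python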
The pretrained models used in DASD (CLIP & mBERT, for initialization) can be downloaded here:
unzip pretrained_model.zip
If you do not want the dataset and code to be placed together, please modify the 'datapath' parameter in the configuration file.
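For example, with jq installed, the 'datapath' entry can be updated in place (the path below is a placeholder; point it at wherever your data actually lives):

jq '.datapath = "/path/to/your/data"' expr/vitb32/CMA/config.json > tmp.json && mv tmp.json expr/vitb32/CMA/config.json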
Download the captions used in our experiments and unzip them to ./dataset/:
unzip dataset.zip
Conceptual Caption images can be crawled here. After crawling, place all images under dataset/ConceptualCaption/images.
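If a crawler is needed, the sketch below is one minimal way to do it with wget, assuming a hypothetical urls.txt with one image URL per line (the file name and the image naming scheme are illustrative; adapt them to match what the captions file expects):

mkdir -p dataset/ConceptualCaption/images
# Number each URL and save the image under that index (naming is illustrative).
nl -ba urls.txt | while read idx url; do
    wget -q -c "$url" -O "dataset/ConceptualCaption/images/${idx}.jpg"
done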
CC300K is also used to train the released models. This subset can be found at dataset/ConceptualCaption/cc300k.json.
Flickr30K images can be requested here. Untar the archive to dataset/Multi30k:
mkdir -p dataset/Multi30k
tar -xzvf flickr30k_images.tar.gz -C dataset/Multi30k
MSCOCO images can be downloaded and prepared with the following scripts:
wget -c http://images.cocodataset.org/zips/train2014.zip
wget -c http://images.cocodataset.org/zips/val2014.zip
wget -c http://images.cocodataset.org/zips/test2014.zip
mkdir -p dataset/MSCOCO/images
unzip -d dataset/MSCOCO/images train2014.zip
unzip -d dataset/MSCOCO/images val2014.zip
unzip -d dataset/MSCOCO/images test2014.zip
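As a quick sanity check, the standard MSCOCO 2014 splits contain 82,783 train, 40,504 val, and 40,775 test images:

ls dataset/MSCOCO/images/train2014 | wc -l   # expected: 82783
ls dataset/MSCOCO/images/val2014 | wc -l     # expected: 40504
ls dataset/MSCOCO/images/test2014 | wc -l    # expected: 40775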
We conduct experiments under two CCR settings:
(1) Cross-lingual Finetune: we first train models on the English data of the Downstream Task Dataset (DTD) and then further finetune them with target-language data produced by machine translation (MT) tools. Finally, models are tested on the target-language test data of the DTD.
(2) Zero-shot: models are trained on commonly-used datasets (e.g., CC300K) and then directly evaluated on the DTD without any DTD finetuning.
Under the Cross-lingual Finetune setting, we train the model using the following scripts:
# Finetune on DTD English data:
bash train.sh expr/vitb32/Cross-lingual_Finetune/config.json 0
# For cross-lingual cross-modal alignment:
bash CLCMA.sh 0
Under the Zero-shot setting, we train the model using the following scripts:
# For cross-lingual cross-modal alignment:
bash CLCMA.sh 0
For both settings, please specify the training dataset in the corresponding configuration file (config.json).
For both settings, we use the same command for evaluation:
bash inference.sh expr/vitb32/CMA/config.json 0
You can specify the test dataset and trained model in the corresponding configuration file (config.json).
We release some checkpoints trained on Multi30k and MSCOCO, which can be obtained here.
If you find the package useful, please consider citing our paper:
@inproceedings{Cai2025Dynamic,
title={Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval},
author={Rui Cai and Zhiyu Dong and Jianfeng Dong and Xun Wang},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2025}
}