
Dynamic Adapter with Semantics Disentangling for Cross-Lingual Cross-Modal Retrieval

This is the implementation of the paper "Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval" (AAAI 2025).

Requirements

torch >= 1.7.1
transformers
opencv-python
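
If you are setting up a fresh environment, a minimal install might look like the sketch below (only the torch version constraint comes from the list above; nothing else is pinned by the repository, so adjust as needed):

# Minimal install sketch; versions other than torch>=1.7.1 are not pinned by the repository
pip install "torch>=1.7.1" transformers opencv-python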

Pretrained models used in our DASD

The pretrained models used in DASD (CLIP & mBERT, for initialization) can be downloaded here:

unzip pretrained_model.zip

Datasets

If you do not want the datasets and the code to be placed together, modify the 'datapath' parameter in the configuration file.
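
As a rough illustration, you can locate and edit the 'datapath' entry directly; the config path below is the one used in the Training section, while the exact JSON layout and target directory are assumptions, so verify them before relying on the sed command:

# Find the 'datapath' entry (config path taken from the Training section)
grep -n '"datapath"' expr/vitb32/Cross-lingual_Finetune/config.json

# Point it at an external dataset directory (layout assumption: "datapath": "<path>")
sed -i 's#"datapath": *"[^"]*"#"datapath": "/path/to/your/dataset"#' expr/vitb32/Cross-lingual_Finetune/config.json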

Download the captions used in our experiments and unzip them to ./dataset/:

unzip dataset.zip

Conceptual Caption images can be crawled here. After they have been crawled from the web, place all images under dataset/ConceptualCaption/images.

The CC300K subset is also used to train the released models; it can be found in dataset/ConceptualCaption/cc300k.json.
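
As a quick sanity check that the subset file is in place, you can count its entries; this assumes cc300k.json parses as a top-level JSON list or mapping, which is not documented here:

# Count entries in the CC300K subset (schema assumption: top-level list or mapping)
python -c "import json; print(len(json.load(open('dataset/ConceptualCaption/cc300k.json'))))"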

Flickr30K images can be requested here. Untar the archive to dataset/Multi30k:

tar -xzvf flickr30k_images.tar.gz -C dataset/Multi30k
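
Optionally, verify the extraction; Flickr30K contains 31,783 images, and the find pattern below only assumes the images end up as .jpg files somewhere under dataset/Multi30k:

# Expect 31,783 images; adjust the pattern if the archive extracts with a different layout
find dataset/Multi30k -name '*.jpg' | wc -l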

MSCOCO images can be downloaded and prepared with the following commands:

wget -c http://images.cocodataset.org/zips/train2014.zip
wget -c http://images.cocodataset.org/zips/val2014.zip
wget -c http://images.cocodataset.org/zips/test2014.zip

mkdir -p dataset/MSCOCO/images

unzip -d dataset/MSCOCO/images train2014.zip
unzip -d dataset/MSCOCO/images val2014.zip
unzip -d dataset/MSCOCO/images test2014.zip
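
Optionally, verify the extraction against the standard 2014 COCO split sizes:

# Expected counts: 82,783 (train2014), 40,504 (val2014), 40,775 (test2014)
ls dataset/MSCOCO/images/train2014 | wc -l
ls dataset/MSCOCO/images/val2014 | wc -l
ls dataset/MSCOCO/images/test2014 | wc -l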

CCR settings

We conduct experiments under two CCR settings:

(1) Cross-lingual Finetune: we first train models on the English data of the Downstream Task Dataset (DTD) and then further finetune them with target-language data produced by machine translation (MT) tools. Finally, the models are tested on the DTD target-language datasets.

(2) Zero-shot: models are trained on commonly used datasets (e.g., CC300K) and then directly evaluated on the DTD without any DTD finetuning.

Training

Under the Cross-lingual Finetune setting, we train the model with the following scripts:

# Finetune on DTD English data:
bash train.sh  expr/vitb32/Cross-lingual_Finetune/config.json 0

# For cross-lingual cross-modal alignment:
bash CLCMA.sh 0

Under the Zero-shot setting, we train the model with the following script:

# For cross-lingual cross-modal alignment:
bash CLCMA.sh 0

For both settings, please specify the training dataset in the corresponding configuration file (config.json).

Evaluation

For both settings, we use the same command for evaluation:

bash inference.sh  expr/vitb32/CMA/config.json 0

You can specify the test dataset and trained model in the corresponding configuration file (config.json).

We release some checkpoints trained on Multi30k and MSCOCO, which can be obtained here.

Reference

If you find the package useful, please consider citing our paper:

@inproceedings{Cai2025Dynamic,
  title={Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval},
  author={Rui Cai and Zhiyu Dong and Jianfeng Dong and Xun Wang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2025}
}
