This is the PyTorch implementation of the SIGIR 2024 paper "Universal Adversarial Perturbations for Vision-Language Pre-trained Models".
- pytorch 1.10.2
- transformers 4.8.1
- timm 0.4.9
- bert_score 0.3.11
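The dependencies above can be installed with pip, e.g. (a minimal sketch; note that the PyPI package for bert_score is published as bert-score, and in practice you may want the torch build matching your CUDA version):
pip install torch==1.10.2 transformers==4.8.1 timm==0.4.9 bert-score==0.3.11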
Download the Flickr30k and MSCOCO datasets (the annotations are provided in ./data_annotation/) and put them into ./Dataset. Then set the dataset root path (image_root) in ./configs/Retrieval_flickr.yaml.
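For example, the entry in ./configs/Retrieval_flickr.yaml might look as follows (the image directory name is illustrative and depends on how the dataset is unpacked):
image_root: './Dataset/flickr30k-images/'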
The checkpoints of the fine-tuned VLP models are accessible from CLIP, ALBEF, TCL, and BLIP; download them and put them into ./checkpoint.
Before running the main files, set the required paths in them: source/target model names and checkpoints, dataset names and roots, the test file path, original_rank_index_path, and so on. A rough, hypothetical illustration of the kind of settings involved is given below.
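The variable names and file paths in this sketch are placeholders, not the actual code in this repository; adapt them to the corresponding main file.
# Hypothetical placeholders only -- adapt to the actual variables in the main files.
source_model = 'CLIP'                                   # victim model used to learn the UAP
source_ckpt = './checkpoint/clip_retrieval_flickr.pth'  # placeholder checkpoint path
dataset = 'flickr'                                      # dataset name
image_root = './Dataset/flickr30k-images/'              # placeholder dataset root
test_file = './data_annotation/flickr30k_test.json'     # placeholder test annotation file
original_rank_index_path = './original_rank_index/'     # placeholder rank-index path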
# Learn UAPs by taking CLIP as the victim
python Attack_CLIP.py
# Learn UAPs by taking ALBEF/TCL as the victim
python Attack_ALBEFTCL.py
# Eval CLIP models:
python Eval_Retrieval_CLIP.py
# Eval ALBEF models:
python Eval_Retrieval_ALBEF.py
# Eval TCL models:
python Eval_Retrieval_TCL.py
Download the RefCOCO+ dataset from its original website, and set 'image_root' in configs/Grounding.yaml accordingly.
# Eval:
python Eval_Grounding.py
Download the MSCOCO dataset from its original website, and set 'image_root' in configs/caption_coco.yaml accordingly.
# Eval:
python Eval_ImgCap_BLIP.py
If you find this code useful for your research, please consider citing our paper:
@inproceedings{zhang2024universal,
  title={Universal Adversarial Perturbations for Vision-Language Pre-trained Models},
  author={Zhang, Peng-Fei and Huang, Zi and Bai, Guangdong},
  booktitle={Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={862--871},
  year={2024}
}