
# Universal Vision-Language Dense Retrieval (UniVL-DR)

This repository contains the source code for our paper, Universal Vision-Language Dense Retrieval.

## Requirements

- Python==3.7
- PyTorch
- transformers
- clip
- faiss-cpu==1.7.0
- tqdm
- numpy
- base64
- pytrec_eval (install it from https://github.com/cvangysel/pytrec_eval)

## Data and Checkpoints

- Download all of the following files and put them in the corresponding folders.
- All data can be found at Ali Drive. Note that the `imgs.tsv` file should be downloaded from the WebQA project (download the data from this link and run `7z x imgs.7z.001`); a minimal loading sketch is given after this list.
- `checkpoint_multi_inb` (the checkpoint of CLIP-DPR) can be found at Ali Drive.
- `checkpoint_multi_hn` (the checkpoint of UniVL-DR) can be found at Ali Drive.
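
As a convenience, the snippet below shows one way to peek at `imgs.tsv`. It assumes the WebQA layout in which each line holds an image id and a base64-encoded image separated by a tab; if the release you download differs, adjust the parsing accordingly.

```python
# Minimal sketch for reading imgs.tsv from the WebQA release.
# Assumption: each line is "<image_id>\t<base64-encoded image bytes>".
import base64
import io

from PIL import Image


def load_image(line):
    """Decode one imgs.tsv line into (image_id, PIL.Image)."""
    image_id, img_base64 = line.rstrip("\n").split("\t")
    img_bytes = base64.b64decode(img_base64)
    return image_id, Image.open(io.BytesIO(img_bytes)).convert("RGB")


if __name__ == "__main__":
    with open("imgs.tsv", encoding="utf-8") as f:
        first_line = next(f)
    image_id, image = load_image(first_line)
    print(image_id, image.size)
```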

## Train UniVL-DR

- UniVL-DR is initialized from CLIP (ViT-B/32), so texts must be truncated to 77 tokens (see the tokenization sketch after this list). You can also try different vision-language models; as shown in our experiments, we suggest using dual-encoder models.
- There are three steps to train UniVL-DR:
- First step: go to the `CLIP-DPR` folder and train models with in-batch negatives: `bash train_multi.sh`
- Second step: use CLIP-DPR to generate hard negatives for training UniVL-DR: `bash get_hn.sh`
- Final step: go to the `UniVL-DR` folder and train models with these hard negatives: `bash train_multi.sh`
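
Because the text encoder comes from CLIP, any query or caption longer than CLIP's 77-token context window has to be truncated before encoding. The sketch below illustrates this constraint with the official `clip` package; it is only an illustration with placeholder texts, not the repository's training code.

```python
# Minimal sketch: encode texts with CLIP (ViT-B/32) under the 77-token limit.
# Illustrative only; the example texts are placeholders.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

texts = [
    "a short query about a tree frog on a leaf",
    "a very long caption " * 100,  # far longer than 77 tokens
]

# truncate=True keeps only the first 77 tokens instead of raising an error.
tokens = clip.tokenize(texts, context_length=77, truncate=True).to(device)

with torch.no_grad():
    text_embeds = model.encode_text(tokens)

print(tokens.shape)       # torch.Size([2, 77])
print(text_embeds.shape)  # torch.Size([2, 512]) for ViT-B/32
```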

## Evaluate Retrieval Effectiveness

- These experimental results are shown in Table 1 of our paper.
- Go to the `CLIP-DPR` or `UniVL-DR` folder and evaluate model performance as follows: first run `bash gen_embeds.sh`, then run `bash retrieval.sh`. A minimal scoring sketch is given after this list.
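
For orientation, dense retrieval and scoring of this kind typically amount to an exact inner-product search over the candidate embeddings followed by standard ranking metrics. The sketch below uses FAISS and pytrec_eval for this; the file names, id lists, and qrels here are hypothetical placeholders, and `retrieval.sh` remains the authoritative pipeline.

```python
# Minimal sketch: exact inner-product search with FAISS plus ranking metrics
# with pytrec_eval. File names and data layouts are hypothetical placeholders.
import faiss
import numpy as np
import pytrec_eval

# Embeddings produced by an encoding step (shapes are illustrative).
query_embeds = np.load("query_embeds.npy").astype("float32")  # (num_queries, dim)
cand_embeds = np.load("cand_embeds.npy").astype("float32")    # (num_candidates, dim)
query_ids = [line.strip() for line in open("query_ids.txt")]
cand_ids = [line.strip() for line in open("cand_ids.txt")]

# Brute-force inner-product index (cosine similarity if embeddings are L2-normalized).
index = faiss.IndexFlatIP(cand_embeds.shape[1])
index.add(cand_embeds)
scores, neighbors = index.search(query_embeds, 100)  # top-100 per query

# Build a run in pytrec_eval's {query_id: {doc_id: score}} format.
run = {
    qid: {cand_ids[idx]: float(score) for score, idx in zip(score_row, nbr_row)}
    for qid, score_row, nbr_row in zip(query_ids, scores, neighbors)
}

# qrels: {query_id: {doc_id: relevance}} -- load the real relevance labels here.
qrels = {qid: {} for qid in query_ids}  # placeholder

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"recip_rank", "ndcg_cut.10,20", "recall.20,100"}
)
per_query = evaluator.evaluate(run)
mrr = np.mean([m["recip_rank"] for m in per_query.values()])
ndcg10 = np.mean([m["ndcg_cut_10"] for m in per_query.values()])
print(f"MRR: {mrr:.4f}  NDCG@10: {ndcg10:.4f}")
```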

## Results

The results are as follows.

| Setting | Model | MRR@10 | NDCG@10 | MRR@20 | NDCG@20 | Rec@20 | Rec@100 |
|---|---|---|---|---|---|---|---|
| Single Modality (Text Only) | BM25 | 53.75 | 49.60 | 54.10 | 51.72 | 68.16 | 80.69 |
| | DPR (Zero-Shot) | 22.72 | 20.06 | 23.14 | 21.79 | 32.78 | 45.43 |
| | CLIP (Zero-Shot) | 18.16 | 16.76 | 18.60 | 18.27 | 27.97 | 39.83 |
| | BERT-DPR | 42.16 | 39.57 | 42.76 | 42.26 | 60.85 | 77.10 |
| | NQ-DPR | 41.88 | 39.65 | 42.44 | 42.35 | 61.71 | 78.57 |
| | NQ-ANCE | 45.54 | 42.05 | 45.93 | 43.83 | 58.42 | 69.31 |
| Divide-Conquer | VinVL-DPR | 22.11 | 22.92 | 22.80 | 25.41 | 46.27 | 62.82 |
| | CLIP-DPR | 37.35 | 37.56 | 37.93 | 40.77 | 69.38 | 85.53 |
| | BM25 & CLIP-DPR | 42.27 | 41.58 | 42.79 | 44.69 | 73.34 | 87.50 |
| | BM25 & CLIP-DPR (Oracle Modality) | 61.05 | 58.18 | 61.37 | 60.45 | 80.82 | 90.83 |
| UnivSearch | CLIP (Zero-Shot) | 10.59 | 8.69 | 10.80 | 9.52 | 14.32 | 20.21 |
| | VinVL-DPR | 38.14 | 35.43 | 38.74 | 37.79 | 53.89 | 69.42 |
| | CLIP-DPR | 48.83 | 46.32 | 49.34 | 49.11 | 69.84 | 86.43 |
| | UniVL-DR | 62.40 | 59.32 | 62.69 | 61.22 | 80.37 | 89.42 |

## Citation

@inproceedings{liu2023univldr,
  title={Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval},
  author={Liu, Zhenghao and Xiong, Chenyan and Lv, Yuanhuiyi and Liu, Zhiyuan and Yu, Ge},
  booktitle={Proceedings of ICLR},
  year={2023}
}

## Contact

If you have questions, suggestions, or bug reports, please email:

liuzhenghao0819@gmail.com