- Multi30K (multilingual version of Flickr30K) [English|German|French|Czech]: [ACL-16] Multi30K: Multilingual English-German Image Descriptions. [paper] [dataset]
- MSCOCO [English|Chinese|Japanese]:
  - (English) [ARXIV-15] Microsoft COCO Captions: Data Collection and Evaluation Server. [paper] [dataset]
  - (Chinese) [TMM-19] COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval. [paper] [dataset]
  - (Japanese) [ACL-17] STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset. [paper] [dataset]
- VATEX [English|Chinese]: [ICCV-19] VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. [paper] [dataset]
- MSRVTT-CN (multilingual version of MSRVTT) [English|Chinese]: [ACM MM-22] Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning. [paper] [dataset]
Note: This repository provides English captions, together with machine-translated captions in the other languages, for Multi30K, MSCOCO, VATEX, and MSRVTT-CN.
- [Wang et al. TIP] Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval. [paper]
- [Wang et al. AAAI] CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer. [paper]
- [Wang et al. ACM MM] Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval.
- [Cai et al. TKDE] Cross-Lingual Cross-Modal Retrieval with Noise-Robust Fine-Tuning. [paper]
- [Zeng et al. ACL] Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training. [paper] [code]
- [Li et al. ACL] Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training. [paper]
- [Rouditchenko et al. ICASSP] C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval. [paper] [code]
- [Zhou et al. CVPR21] UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training. [paper] [code]
- [Ni et al. CVPR21] M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. [paper] [code]
- [Huang et al. NAACL21] Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. [paper]
- [Fei et al. NAACL21] Cross-lingual Cross-modal Pretraining for Multimodal Retrieval. [paper]
- [Aggarwal et al. ARXIV] Towards Zero-shot Cross-lingual Image Retrieval. [paper]
- [Portaz et al. ARXIV] Image search using multilingual texts: a cross-modal learning approach between image and text. [paper]