Skip to content

multi30k/dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Multi30k Data Repository

Getting ready

Along with the data files, we also provide:

  • subword-nmt as a GIT submodule
  • A recent (December 2017) snapshot of Moses preprocessing scripts

under scripts/ in order to minimize processing differences across the users.

In order to fetch everything correctly, you need to clone the repository with --recursive flag:

$ git clone --recursive https://github.com/multi30k/dataset.git multi30k-dataset

Visual features

Pre-extracted visual features can be downloaded from Google Drive and the raw images can be requested here for Flickr30k. test_2017_flickr and test_2018_flickr images can be downloaded from here.

Task 1

Multi30K 2018 test set

You can evaluate your model on the 2018 test sets using the ongoing Codalab competition.

Statistics

train
 (en) 29000 sentences, 377534 words, 13.0 words/sent
 (de) 29000 sentences, 360706 words, 12.4 words/sent
 (fr) 29000 sentences, 409845 words, 14.1 words/sent
 (cs) 29000 sentences, 297212 words, 10.2 words/sent
val
 (en) 1014 sentences, 13308 words, 13.1 words/sent
 (de) 1014 sentences, 12828 words, 12.7 words/sent
 (fr) 1014 sentences, 14381 words, 14.2 words/sent
 (cs) 1014 sentences, 10342 words, 10.2 words/sent
test_2016_flickr
 (en) 1000 sentences, 12968 words, 13.0 words/sent
 (de) 1000 sentences, 12103 words, 12.1 words/sent
 (fr) 1000 sentences, 13988 words, 14.0 words/sent
 (cs) 1000 sentences, 10497 words, 10.5 words/sent
test_2017_flickr
 (en) 1000 sentences, 11376 words, 11.4 words/sent
 (de) 1000 sentences, 10758 words, 10.8 words/sent
 (fr) 1000 sentences, 12596 words, 12.6 words/sent
test_2017_mscoco
 (en) 461 sentences, 5239 words, 11.4 words/sent
 (de) 461 sentences, 5158 words, 11.2 words/sent
 (fr) 461 sentences, 5710 words, 12.4 words/sent

If you use these resources in your research, please consider citing the following papers:

English and German data:

@InProceedings{W16-3210,
  author = 	"Elliott, Desmond
		and Frank, Stella
		and Sima'an, Khalil
		and Specia, Lucia",
  title = 	"Multi30K: Multilingual English-German Image Descriptions",
  booktitle = 	"Proceedings of the 5th Workshop on Vision and Language",
  year = 	"2016",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"70--74",
  location = 	"Berlin, Germany",
  doi = 	"10.18653/v1/W16-3210",
  url = 	"http://www.aclweb.org/anthology/W16-3210"
}

French data, Ambiguous COCO evaluation data, and Test 2017 data:

@InProceedings{elliott-EtAl:2017:WMT,
  author    = {Elliott, Desmond  and  Frank, Stella  and  Barrault, Lo\"{i}c  and  Bougares, Fethi  and  Specia, Lucia},
  title     = {Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {215--233},
  url       = {http://www.aclweb.org/anthology/W17-4718}
}

Czech data:

@inproceedings{barrault2018findings,
  title={Findings of the Third Shared Task on Multimodal Machine Translation},
  author={Barrault, Lo{\"\i}c and Bougares, Fethi and Specia, Lucia and Lala, Chiraag and Elliott, Desmond and Frank, Stella},
  booktitle={Proceedings of the Third Conference on Machine Translation: Shared Task Papers},
  pages={304--323},
  year={2018}
}

Special Thanks

Thanks to Oliver Maunoury and Laure Behue for producing most of French Translations of 2018 Test set.