Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning

Below are the steps to reproduce the primary results from "Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning" by Sean MacAvaney, Luca Soldaini, and Nazli Goharian (ECIR 2020, short paper).

The code to reproduce the paper is incorporated into OpenNIR. Refer to the OpenNIR instructions to get started.

Note that the precise values differ slightly from those reported in the original paper (they are usually higher) because Anserini and other dependencies have improved since the experiments were originally run. The trends are the same as those reported in the paper.

First Step

Before all else, you need to initialize the TREC Robust 2004 dataset:

$ scripts/ dataset=robust
dataset is initialized (528030 documents)


TREC Arabic

Start by initializing the dataset. You'll need a copy of LDC2001T55 as the document collection; all other files will be downloaded from TREC.

$ scripts/ dataset=trec_arabic
dataset is initialized (383743 documents)

Let's run BM25 as a baseline:

$ bash scripts/ config/trivial/bm25 config/multiling/arabic_2001
map=0.3582 ndcg@20=0.6018 p@20=0.5420

$ bash scripts/ config/trivial/bm25 config/multiling/arabic_2002
map=0.2925 ndcg@20=0.4066 p@20=0.3670
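The three numbers printed for each run are standard ranking metrics. As a rough illustration of what they measure, the following Python sketch computes precision at k, average precision (MAP is its mean over all queries), and nDCG at k for a single toy ranking. The function names and the toy data are hypothetical, and the nDCG variant shown (linear gain, log2 discount) is an assumption; the evaluation tooling behind these scripts may differ in its details.

```python
import math


def precision_at_k(rels, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for r in rels[:k] if r > 0) / k


def average_precision(rels, num_relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


def ndcg_at_k(rels, all_rels, k):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0


# Toy example: relevance labels of five retrieved documents, in rank order.
# The query is assumed to have exactly two relevant documents in total.
ranking = [1, 0, 1, 0, 0]
p5 = precision_at_k(ranking, 5)
ap = average_precision(ranking, num_relevant=2)
ndcg5 = ndcg_at_k(ranking, all_rels=[1, 1], k=5)
print(f"p@5={p5:.4f} ap={ap:.4f} ndcg@5={ndcg5:.4f}")
```

In the commands in this README the cutoff is 20 rather than 5, and each reported value is averaged over all queries in the test collection.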

Now train a multi-lingual BERT model on TREC Robust, and evaluate it on TREC Arabic:

$ bash scripts/ config/vanilla_bert config/multiling/arabic_2001
map=0.3645 ndcg@20=0.6464 p@20=0.5840

$ bash scripts/ config/vanilla_bert config/multiling/arabic_2002
map=0.3073 ndcg@20=0.4223 p@20=0.3830
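When collecting results from several runs, it can help to parse the printed metric lines rather than copy values by hand. The snippet below is a minimal sketch that assumes each result line has the `name=value` format shown above; `parse_metrics` is a hypothetical helper, not part of OpenNIR.

```python
def parse_metrics(line):
    """Turn 'map=0.3582 ndcg@20=0.6018 p@20=0.5420' into a dict of floats."""
    fields = (field.split("=", 1) for field in line.split())
    return {name: float(value) for name, value in fields}


# Result lines copied from the TREC Arabic 2001 runs above.
bm25 = parse_metrics("map=0.3582 ndcg@20=0.6018 p@20=0.5420")
bert = parse_metrics("map=0.3645 ndcg@20=0.6464 p@20=0.5840")

# Print the two systems side by side with the absolute difference.
for name in bm25:
    diff = bert[name] - bm25[name]
    print(f"{name}: BM25={bm25[name]:.4f} BERT={bert[name]:.4f} diff={diff:+.4f}")
```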


TREC Mandarin

Start by initializing the dataset. You'll need a copy of LDC2000T52 as the document collection; all other files will be downloaded from TREC.

$ scripts/ dataset=trec_mandarin
dataset is initialized (164778 documents)

Let's run BM25 as a baseline:

$ bash scripts/ config/trivial/bm25 config/multiling/mandarin_5
map=0.2953 ndcg@20=0.4125 p@20=0.3946

$ bash scripts/ config/trivial/bm25 config/multiling/mandarin_6
map=0.3720 ndcg@20=0.6272 p@20=0.5885

Now train a multi-lingual BERT model on TREC Robust, and evaluate it on TREC Mandarin:

$ bash scripts/ config/vanilla_bert config/multiling/mandarin_5
map=0.3490 ndcg@20=0.5256 p@20=0.5107

$ bash scripts/ config/vanilla_bert config/multiling/mandarin_6
map=0.4093 ndcg@20=0.7169 p@20=0.6788


TREC Spanish

Start by initializing the dataset. You'll need a copy of LDC2000T51 as the document collection; all other files will be downloaded from TREC.

$ bash scripts/ dataset=trec_spanish
dataset is initialized (57868 documents)

Let's run BM25 as a baseline (note that TREC 4 only has description queries, so we use those there):

$ bash scripts/ config/trivial/bm25 config/multiling/spanish_3
map=0.3425 ndcg@20=0.5149 p@20=0.5000

$ bash scripts/ config/trivial/bm25 config/multiling/spanish_4
map=0.2099 ndcg@20=0.4197 p@20=0.3820

Now train a multi-lingual BERT model on TREC Robust, and evaluate it on TREC Spanish:

$ bash scripts/ config/vanilla_bert config/multiling/spanish_3
map=0.3684 ndcg@20=0.6344 p@20=0.6200

$ bash scripts/ config/vanilla_bert config/multiling/spanish_4
map=0.2158 ndcg@20=0.4780 p@20=0.4400

Result summary

| Dataset | BM25 P@20 | BERT P@20 | BM25 nDCG@20 | BERT nDCG@20 | BM25 MAP | BERT MAP |
|---|---|---|---|---|---|---|
| TREC Arabic 2001 | 0.5420 | 0.5840 | 0.6018 | 0.6464 | 0.3582 | 0.3645 |
| TREC Arabic 2002 | 0.3670 | 0.3830 | 0.4066 | 0.4223 | 0.2925 | 0.3073 |
| TREC Mandarin 5 | 0.3946 | 0.5107 | 0.4125 | 0.5256 | 0.2953 | 0.3490 |
| TREC Mandarin 6 | 0.5885 | 0.6788 | 0.6272 | 0.7169 | 0.3720 | 0.4093 |
| TREC Spanish 3 | 0.5000 | 0.6200 | 0.5149 | 0.6344 | 0.3425 | 0.3684 |
| TREC Spanish 4 | 0.3820 | 0.4400 | 0.4197 | 0.4780 | 0.2099 | 0.2158 |
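To make the size of the gains easier to compare across collections, the P@20 columns can be converted into relative improvements of BERT over BM25. The following is an illustrative sketch only: the values are copied from the summary above, and the variable names are hypothetical.

```python
# (BM25, BERT) P@20 values copied from the result summary.
p20 = {
    "TREC Arabic 2001": (0.5420, 0.5840),
    "TREC Arabic 2002": (0.3670, 0.3830),
    "TREC Mandarin 5": (0.3946, 0.5107),
    "TREC Mandarin 6": (0.5885, 0.6788),
    "TREC Spanish 3": (0.5000, 0.6200),
    "TREC Spanish 4": (0.3820, 0.4400),
}

# Relative improvement of BERT over BM25, in percent.
rel_gain = {
    name: (bert - bm25) / bm25 * 100
    for name, (bm25, bert) in p20.items()
}

# List the collections from largest to smallest relative gain.
for name, gain in sorted(rel_gain.items(), key=lambda item: -item[1]):
    print(f"{name}: {gain:+.1f}% P@20")
```

The same calculation applies to the nDCG@20 and MAP columns by swapping in their values.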


If you use this work, please cite:

@inproceedings{macavaney2020teaching,
  author = {MacAvaney, Sean and Soldaini, Luca and Goharian, Nazli},
  title = {Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning},
  booktitle = {ECIR},
  year = {2020}
}

