Reproducing CEDR-KNRM results on ANTIQUE #20

stepgazaille opened this issue Jun 11, 2020 · 7 comments

@stepgazaille

stepgazaille commented Jun 11, 2020

Hello,
I'm trying to reproduce results from the OpenNIR paper using the Vanilla BERT and CEDR-KNRM models on the ANTIQUE dataset.

Taking my cues from the wsdm2020_demo.sh script, I trained my models as follows:

  1. First I fine-tuned and tested a Vanilla BERT model:
BERT_MODEL_PARAMS="trainer.grad_acc_batch=1 valid_pred.batch_size=4 test_pred.batch_size=4"
python -m onir.bin.pipeline config/antique config/vanilla_bert $BERT_MODEL_PARAMS 
python -m onir.bin.pipeline config/antique config/vanilla_bert $BERT_MODEL_PARAMS  pipeline.test=true

Which produced the following results: test epoch=60 judged@10=0.6110 map_rel-3=0.2540 [mrr_rel-3=0.7288] p_rel-3@1=0.6450 p_rel-3@3=0.4917
However, published results for Vanilla BERT are as follows:

  • MAP: 0.2801
  • MRR: 0.7101
  • P@1: 0.5950
  • P@3: 0.4967
  2. I then initialized a CEDR-KNRM model using weights from the fine-tuned Vanilla BERT model and trained and tested it:
MODEL_PATH=[PATH_TO_FINE_TUNED_BERT]/60.p
BERT_MODEL_PARAMS="trainer.grad_acc_batch=1 valid_pred.batch_size=4 test_pred.batch_size=4"

python -m onir.bin.extract_bert_weights config/antique config/vanilla_bert $BERT_MODEL_PARAMS pipeline.bert_weights=$MODEL_PATH pipeline.overwrite=True
python -m onir.bin.pipeline config/antique config/cedr/knrm $BERT_MODEL_PARAMS vocab.bert_weights=$MODEL_PATH pipeline.overwrite=True
python -m onir.bin.pipeline config/antique config/cedr/knrm $BERT_MODEL_PARAMS vocab.bert_weights=$MODEL_PATH pipeline.test=true

Which produced the following results: test epoch=30 judged@10=0.6030 map_rel-3=0.2563 [mrr_rel-3=0.7302] p_rel-3@1=0.6400 p_rel-3@3=0.5083
However, published results for CEDR-KNRM are as follows:

  • MAP: 0.2861
  • MRR: 0.7238
  • P@1: 0.6300
  • P@3: 0.4933

According to the logs, I understand that the inference is deterministic ([trainer:pairwise][DEBUG] using GPU (deterministic)).
Could anyone let me know what I am doing wrong?
Where do the differences come from (especially w.r.t. MAP)?

@seanmacavaney
Contributor

Hi Stéphane,

Unfortunately, the deterministic indicator only corresponds to the torch.backends.cudnn.deterministic flag, which doesn't actually control for differences across specific GPUs or CUDA versions. Anecdotally, I've seen that different GPUs can yield different results. So I suspect that these differences lead to the performance discrepancies you're observing. Which GPU do you have? What version of CUDA are you using?
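
For reference, the flag amounts to roughly the sketch below (not OpenNIR's exact setup; the seed value is just illustrative):

# Rough sketch of what the "deterministic" indicator covers. These flags make
# cuDNN pick deterministic kernels on a fixed GPU/CUDA combination, but they
# do not make results identical across different GPUs or CUDA versions.
import random

import numpy as np
import torch

SEED = 42  # illustrative only; not OpenNIR's actual seed handling
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

torch.backends.cudnn.deterministic = True  # force deterministic cuDNN algorithm selection
torch.backends.cudnn.benchmark = False     # disable autotuning, which can vary between runs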

  • sean

@stepgazaille
Author

Hello Sean,
Thank you for the quick answer! Here's my current setup:

  • OS: Ubuntu 19.10 Eoan Ermine
  • GPU: GeForce RTX 2080 SUPER
  • NVIDIA Driver Version: 440.33.01
  • CUDA Version: 10.2

Is that very different from your setup?
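
In case it helps the comparison, here's a quick way to dump the relevant library versions from within PyTorch (the driver version itself comes from nvidia-smi):

import torch

# Print the library and GPU versions that matter for reproducibility.
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))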

@seanmacavaney
Contributor

Our setup for running the experiment was:

  • OS: Ubuntu 18.04
  • GPU: GeForce GTX 1080 Ti
  • NVIDIA Driver Version: 418.67
  • CUDA Version: 10.1

So there are differences there. To rule out other possibilities, do you get the same results as reported for BM25? The version of Anserini in the repository was updated since OpenNIR was originally released.

@stepgazaille
Author

stepgazaille commented Jun 11, 2020

Executing the following commands:

scripts/pipeline.sh config/grid_search config/antique
scripts/pipeline.sh config/grid_search config/antique pipeline.test=True

I obtain the following results: test bm25_k1-1.4_b-0.40 judged@10=0.5960 map_rel-3=0.1945 [mrr_rel-3=0.5793] p_rel-3@1=0.4550 p_rel-3@3=0.3650

Published results for BM25 are as follows:

  • MAP: 0.1888
  • MRR: 0.5464
  • P@1: 0.4450
  • P@3: 0.3467

So a couple of differences here too.
Do you remember which commit you were at when you ran the tests that led to the reported results?
I could try re-executing the commands above using that version of OpenNIR.
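
For reference, here's a rough way to recompute MAP/MRR from a run file outside of OpenNIR (a sketch with placeholder paths; I'm assuming the _rel-3 metrics treat judgments >= 3 as relevant):

from collections import defaultdict

import pytrec_eval

QRELS_PATH = "path/to/antique/test.qrels"  # placeholder: ANTIQUE test qrels
RUN_PATH = "path/to/runs/epoch.run"        # placeholder: a run file from the pipeline

# Qrels format: "qid 0 docid rel"; binarize at rel >= 3 to mirror the _rel-3 metrics.
qrels = defaultdict(dict)
with open(QRELS_PATH) as f:
    for line in f:
        qid, _, docid, rel = line.split()
        qrels[qid][docid] = 1 if int(rel) >= 3 else 0

# TREC run format: "qid Q0 docid rank score runname".
run = defaultdict(dict)
with open(RUN_PATH) as f:
    for line in f:
        qid, _, docid, rank, score, _ = line.split()
        run[qid][docid] = float(score)

evaluator = pytrec_eval.RelevanceEvaluator(dict(qrels), {"map", "recip_rank"})
per_query = evaluator.evaluate(dict(run))
for measure in ("map", "recip_rank"):
    print(measure, sum(q[measure] for q in per_query.values()) / len(per_query))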

@seanmacavaney
Contributor

It should be the initial commit: ca14dfa5e7...

Note that you'll need to clear the ~/data/onir directory (or rename it), otherwise it will use the indices built from the newer version.

@stepgazaille
Author

stepgazaille commented Jun 12, 2020

Hello Sean,

Today I cleaned up my ~/data/onir directory, pulled the initial commit and re-ran the experiments.

The BM25 baseline produced the following results: test bm25_k1-1.4_b-0.40 judged@10=0.5960 map_rel-3=0.1945 [mrr_rel-3=0.5797] p_rel-3@1=0.4550 p_rel-3@3=0.3667
So here P@1 and P@3 do match the reported results (0.4450 and 0.3467 respectively); however, I'm surprised to find that MAP and MRR do not.

Fine-tuning BERT produced the following results: test epoch=22 judged@10=0.6050 map_rel-3=0.2536 [mrr_rel-3=0.7125] p_rel-3@1=0.6200 p_rel-3@3=0.5033, which do not match the reported results.

Training CEDR-KNRM model (initialised using the newly fine-tuned BERT weights) produced the following results: test epoch=14 judged@10=0.6105 map_rel-3=0.2537 [mrr_rel-3=0.7105] p_rel-3@1=0.6100 p_rel-3@3=0.5017, which do not match the reported results either.

Here I'm surprised to find out that CEDR-KNRM's performance is lower than the fine-tuned BERT's.
I used all the same commands as in my previous comments.
Please let me know if you have any other leads I might try.

On another subject, is there any way to produce a human-readable version of the models' output?
I'd like to do an ad-hoc evaluation of the models I've trained so far (compare the predictions to the gold standard, etc.).

Thank you for all your help!

@seanmacavaney
Contributor

Hmmm, fascinating! Thanks for running these tests. The BM25 discrepancies are puzzling, as is the performance difference between Vanilla BERT and CEDR-KNRM. I'm out of ideas about what could cause these differences.

The pipeline saves run files in the standard TREC run format under ~/data/onir/models/.../runs/[epoch].run (the exact path should appear in the pipeline output). You can find the queries in ~/data/onir/datasets/antique/[subset].queries.txt, and the document content (which is indexed) can be found here. If you'd like to run the system over arbitrary queries/documents, you can use the flex dataset.
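
A rough sketch of joining the two for inspection (the paths are placeholders; I'm also assuming the queries file uses a qid<TAB>text layout, so adjust the parsing if it differs):

import os
from collections import defaultdict

# Placeholders: use the actual paths printed in the pipeline output.
RUN_PATH = os.path.expanduser("~/data/onir/models/.../runs/60.run")
QUERIES_PATH = os.path.expanduser("~/data/onir/datasets/antique/test.queries.txt")

# Load query text, assuming a "qid<TAB>text" layout.
queries = {}
with open(QUERIES_PATH) as f:
    for line in f:
        qid, text = line.rstrip("\n").split("\t", 1)
        queries[qid] = text

# Load the TREC-format run: "qid Q0 docid rank score runname".
ranked = defaultdict(list)
with open(RUN_PATH) as f:
    for line in f:
        qid, _, docid, rank, score, _ = line.split()
        ranked[qid].append((int(rank), docid, float(score)))

# Show the top 5 documents per query next to the query text.
for qid, entries in sorted(ranked.items()):
    print(f"{qid}: {queries.get(qid, '(query text not found)')}")
    for rank, docid, score in sorted(entries)[:5]:
        print(f"  {rank:>3}  {docid}  {score:.4f}")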
