Reproducing CEDR-KNRM results on ANTIQUE #20

stepgazaille opened this issue Jun 11, 2020 · 7 comments

@stepgazaille

stepgazaille commented Jun 11, 2020

Hello,
I'm trying to reproduce results from the OpenNIR paper using the Vanilla BERT and CEDR-KNRM models on the ANTIQUE dataset.

Taking my cues from the wsdm2020_demo.sh script, I trained my models as follows:

  1. First I fine-tuned and tested a Vanilla BERT model:
BERT_MODEL_PARAMS="trainer.grad_acc_batch=1 valid_pred.batch_size=4 test_pred.batch_size=4"
python -m onir.bin.pipeline config/antique config/vanilla_bert $BERT_MODEL_PARAMS 
python -m onir.bin.pipeline config/antique config/vanilla_bert $BERT_MODEL_PARAMS  pipeline.test=true

Which produced the following results: test epoch=60 judged@10=0.6110 map_rel-3=0.2540 [mrr_rel-3=0.7288] p_rel-3@1=0.6450 p_rel-3@3=0.4917
However, published results for Vanilla BERT are as follows:

  • MAP: 0.2801
  • MRR: 0.7101
  • P@1: 0.5950
  • P@3: 0.4967
  2. I then initialized a CEDR-KNRM model using weights from the fine-tuned Vanilla BERT model and trained and tested it:
MODEL_PATH=[PATH_TO_FINE_TUNED_BERT]/60.p
BERT_MODEL_PARAMS="trainer.grad_acc_batch=1 valid_pred.batch_size=4 test_pred.batch_size=4"

python -m onir.bin.extract_bert_weights config/antique config/vanilla_bert $BERT_MODEL_PARAMS pipeline.bert_weights=$MODEL_PATH pipeline.overwrite=True
python -m onir.bin.pipeline config/antique config/cedr/knrm $BERT_MODEL_PARAMS vocab.bert_weights=$MODEL_PATH pipeline.overwrite=True
python -m onir.bin.pipeline config/antique config/cedr/knrm $BERT_MODEL_PARAMS vocab.bert_weights=$MODEL_PATH pipeline.test=true

Which produced the following results: test epoch=30 judged@10=0.6030 map_rel-3=0.2563 [mrr_rel-3=0.7302] p_rel-3@1=0.6400 p_rel-3@3=0.5083
However, published results for CEDR-KNRM are as follows:

  • MAP: 0.2861
  • MRR: 0.7238
  • P@1: 0.6300
  • P@3: 0.4933

According to the logs, I understand that the inference is deterministic ([trainer:pairwise][DEBUG] using GPU (deterministic)).
Could anyone let me know what I am doing wrong?
Where do the differences come from (especially w.r.t. MAP)?

@seanmacavaney
Contributor

Hi Stéphane,

Unfortunately, the deterministic indicator only corresponds to the torch.backends.cudnn.deterministic flag, which doesn't actually control for differences across specific GPUs or CUDA versions. Anecdotally, I've seen that different GPUs can yield different results. So I suspect that these differences lead to the performance discrepancies you're observing. Which GPU do you have? What version of CUDA are you using?
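
For reference, the flag amounts to roughly the sketch below (not OpenNIR's exact setup; the seed value is just illustrative):

# Rough sketch of what the "deterministic" indicator covers. These flags make
# cuDNN pick deterministic kernels on a fixed GPU/CUDA combination, but they
# do not make results identical across different GPUs or CUDA versions.
import random

import numpy as np
import torch

SEED = 42  # illustrative only; not OpenNIR's actual seed handling
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

torch.backends.cudnn.deterministic = True  # force deterministic cuDNN algorithm selection
torch.backends.cudnn.benchmark = False     # disable autotuning, which can vary between runs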

  • sean

@stepgazaille
Author

Hello Sean,
Thank you for the quick answer! Here's my current setup:

  • OS: Ubuntu 19.10 Eoan Ermine
  • GPU: GeForce RTX 2080 SUPER
  • NVIDIA Driver Version: 440.33.01
  • CUDA Version: 10.2

Is that very different from your setup?
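
In case it helps the comparison, here's a quick way to dump the relevant library versions from within PyTorch (the driver version itself comes from nvidia-smi):

import torch

# Print the library and GPU versions that matter for reproducibility.
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))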

@seanmacavaney
Contributor

Our setup for running the experiment was:

  • OS: Ubuntu 18.04
  • GPU: GeForce GTX 1080 Ti
  • NVIDIA Driver Version: 418.67
  • CUDA Version: 10.1

So there are differences there. To rule out other possibilities, do you get the same results as reported for BM25? The version of Anserini in the repository was updated since OpenNIR was originally released.

@stepgazaille
Author

stepgazaille commented Jun 11, 2020

Executing the following commands:

scripts/pipeline.sh config/grid_search config/antique
scripts/pipeline.sh config/grid_search config/antique pipeline.test=True

I obtain the following results: test bm25_k1-1.4_b-0.40 judged@10=0.5960 map_rel-3=0.1945 [mrr_rel-3=0.5793] p_rel-3@1=0.4550 p_rel-3@3=0.3650

Published results for BM25 are as follows:

  • MAP: 0.1888
  • MRR: 0.5464
  • P@1: 0.4450
  • P@3: 0.3467

So a couple of differences here too.
Do you remember which commit you were at when you ran the tests that led to the reported results?
I could try re-executing the commands above using that version of OpenNIR.
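
For reference, here's a rough way to recompute MAP/MRR from a run file outside of OpenNIR (a sketch with placeholder paths; I'm assuming the _rel-3 metrics treat judgments >= 3 as relevant):

from collections import defaultdict

import pytrec_eval

QRELS_PATH = "path/to/antique/test.qrels"  # placeholder: ANTIQUE test qrels
RUN_PATH = "path/to/runs/epoch.run"        # placeholder: a run file from the pipeline

# Qrels format: "qid 0 docid rel"; binarize at rel >= 3 to mirror the _rel-3 metrics.
qrels = defaultdict(dict)
with open(QRELS_PATH) as f:
    for line in f:
        qid, _, docid, rel = line.split()
        qrels[qid][docid] = 1 if int(rel) >= 3 else 0

# TREC run format: "qid Q0 docid rank score runname".
run = defaultdict(dict)
with open(RUN_PATH) as f:
    for line in f:
        qid, _, docid, rank, score, _ = line.split()
        run[qid][docid] = float(score)

evaluator = pytrec_eval.RelevanceEvaluator(dict(qrels), {"map", "recip_rank"})
per_query = evaluator.evaluate(dict(run))
for measure in ("map", "recip_rank"):
    print(measure, sum(q[measure] for q in per_query.values()) / len(per_query))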

@seanmacavaney
Contributor

It should be the initial commit: ca14dfa5e7...

Note that you'll need to clear the ~/data/onir directory (or rename it), otherwise it will use the indices built from the newer version.

@stepgazaille
Author

stepgazaille commented Jun 12, 2020

Hello Sean,

Today I cleaned up my ~/data/onir directory, pulled the initial commit and re-ran the experiments.

The BM25 baseline produced the following results: test bm25_k1-1.4_b-0.40 judged@10=0.5960 map_rel-3=0.1945 [mrr_rel-3=0.5797] p_rel-3@1=0.4550 p_rel-3@3=0.3667
So here P@1 and P@3 do match the reported results (0.4450 and 0.3467 respectively); however, I'm surprised to find that MAP and MRR do not.

Fine-tuning BERT produced the following results: test epoch=22 judged@10=0.6050 map_rel-3=0.2536 [mrr_rel-3=0.7125] p_rel-3@1=0.6200 p_rel-3@3=0.5033, which do not match the reported results.

Training CEDR-KNRM model (initialised using the newly fine-tuned BERT weights) produced the following results: test epoch=14 judged@10=0.6105 map_rel-3=0.2537 [mrr_rel-3=0.7105] p_rel-3@1=0.6100 p_rel-3@3=0.5017, which do not match the reported results either.

Here I'm surprised to find out that CEDR-KNRM's performance is lower than the fine-tuned BERT's.
I used all the same commands as in my previous comments.
Please let me know if you have any other leads I might try.

On another subject, is there any way to produce a human-readable version of the models' output?
I'd like to do an ad-hoc evaluation of the models I've trained so far (compare the predictions to the gold standard, etc.).

Thank you for all your help!

@seanmacavaney
Contributor

Hmmm, fascinating! Thanks for running these tests. The BM25 discrepancies are puzzling, as is the performance difference between Vanilla BERT and CEDR-KNRM. I'm out of ideas about what could cause these differences.

The pipeline saves run files in the standard TREC run format under ~/data/onir/models/.../runs/[epoch].run (the exact path should appear in the pipeline output). You can find the queries in ~/data/onir/datasets/antique/[subset].queries.txt, and the document content (which is indexed) can be found here. If you'd like to run the system over arbitrary queries/documents, you can use the flex dataset.
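
A rough sketch of joining the two for inspection (the paths are placeholders; I'm also assuming the queries file uses a qid<TAB>text layout, so adjust the parsing if it differs):

import os
from collections import defaultdict

# Placeholders: use the actual paths printed in the pipeline output.
RUN_PATH = os.path.expanduser("~/data/onir/models/.../runs/60.run")
QUERIES_PATH = os.path.expanduser("~/data/onir/datasets/antique/test.queries.txt")

# Load query text, assuming a "qid<TAB>text" layout.
queries = {}
with open(QUERIES_PATH) as f:
    for line in f:
        qid, text = line.rstrip("\n").split("\t", 1)
        queries[qid] = text

# Load the TREC-format run: "qid Q0 docid rank score runname".
ranked = defaultdict(list)
with open(RUN_PATH) as f:
    for line in f:
        qid, _, docid, rank, score, _ = line.split()
        ranked[qid].append((int(rank), docid, float(score)))

# Show the top 5 documents per query next to the query text.
for qid, entries in sorted(ranked.items()):
    print(f"{qid}: {queries.get(qid, '(query text not found)')}")
    for rank, docid, score in sorted(entries)[:5]:
        print(f"  {rank:>3}  {docid}  {score:.4f}")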
