
Cannot reproduce "monot5-base-msmarco-10k" via pytorch script #307

Open
polgrisha opened this issue Nov 25, 2022 · 6 comments

@polgrisha

polgrisha commented Nov 25, 2022

Hello!

I am trying to reproduce the quality of monoT5 on the BEIR benchmark from the recent article. But after running the script finetune_monot5.py for one epoch, as stated in the description of the "monot5-base-msmarco-10k" checkpoint, my results are noticeably lower.

For example, on NQ, my checkpoint gets 0.5596 nDCG@10, while the original checkpoint gets 0.5676 nDCG@10. On NFCorpus: 0.3604 nDCG@10 with my checkpoint vs. 0.3778 nDCG@10 with the original.

So, is one epoch of training monoT5 with the PyTorch script equivalent to one epoch of training with TF? And with which hyperparameters can I reproduce the performance of "monot5-base-msmarco-10k"?
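
For reference, a minimal sketch of how a monoT5 checkpoint is typically used to score a query-document pair (the standard "Query: ... Document: ... Relevant:" prompt, with a softmax over the "true"/"false" token logits). This is not necessarily the exact evaluation pipeline used here, just the usual scoring recipe:

```python
# Minimal monoT5 scoring sketch; shown for reference only.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "castorini/monot5-base-msmarco-10k"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

# Look up the ids of the "true"/"false" target tokens instead of hard-coding them.
true_id = tokenizer.encode("true")[0]
false_id = tokenizer.encode("false")[0]

def monot5_score(query: str, doc: str) -> float:
    text = f"Query: {query} Document: {doc} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Feed only the decoder start token, read the logits of the first generated
    # position, then softmax over the true/false tokens.
    decoder_input_ids = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()  # probability of "true" = relevance score

print(monot5_score("what causes fever", "Fever is usually caused by infections."))
```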

@rodrigonogueira4
Member

Hi @polgrisha, we haven't tested that PyTorch script extensively, especially in the zero-shot setting, but it seems that some of its hyperparameters were wrong.

I opened a PR with the ones we used to train the model on TPUs + TF:
#308

Could you please give it a try?

@rodrigonogueira4
Member

I was looking at my logs, and I was never able to reproduce the results on PyTorch+GPU using the same hyperparameters we used to finetune on TF+TPUs. The best ones I found were those already in the repo.

However, in another project, I found that this configuration gives good results for finetuning T5 on PT+GPUs:

--train_batch_size=4
--accumulate_grad_batches=32
--optimizer=AdamW
--lr=3e-4 (or 3e-5)
--weight_decay=5e-5
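
If it helps, here is roughly how those flags are usually interpreted in a plain PyTorch loop (a sketch only, assuming they map onto torch.optim.AdamW and standard gradient accumulation; the actual finetune_monot5.py wiring may differ):

```python
# Sketch of the suggested configuration: AdamW with lr=3e-4 (or 3e-5),
# weight_decay=5e-5, micro-batches of 4, and gradient accumulation over 32
# micro-batches (effective batch of 128 per optimizer step). The toy data
# below only stands in for the real MS MARCO triples.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=5e-5)
accumulate_grad_batches = 32

# Toy micro-batches of 4 monoT5-style prompts with "true" targets.
inputs = tokenizer(["Query: q Document: d Relevant:"] * 4, return_tensors="pt", padding=True)
labels = tokenizer(["true"] * 4, return_tensors="pt", padding=True).input_ids
train_loader = [dict(**inputs, labels=labels)] * 64

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss / accumulate_grad_batches
    loss.backward()
    if (step + 1) % accumulate_grad_batches == 0:
        optimizer.step()
        optimizer.zero_grad()
```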

Could you please give it a try?

@polgrisha
Author

@rodrigonogueira4 Thanks for your response

I tried the hyperparams you suggested:

--train_batch_size=4
--accumulate_grad_batches=32
--optimizer=AdamW
--lr=3e-5
--weight_decay=5e-5

So far, the closest result was obtained by training monoT5 for 9k steps (10k steps is one epoch with batch_size=4, accum_steps=32, and 2 GPUs)

(TREC-COVID: original 0.7845, mine 0.7899; NFCorpus: original 0.3778, mine 0.3731; NQ: original 0.5676, mine 0.5688; FIQA-2018: original 0.4129, mine 0.4130)

@polgrisha reopened this Dec 6, 2022
@rodrigonogueira4
Member

Hi @polgrisha, thanks for running this experiment. It seems that you got pretty close to the original training in mesh-tensorflow+TPUs.

Those small differences on the individual BEIR datasets are expected, especially since you are using a different optimizer.
However, to be really sure, I would run on a few more datasets and compare the average against the results reported in the "No Parameter Left Behind" paper.

@zlh-source

zlh-source commented Jul 2, 2023

(quoting @polgrisha's comment above)

@rodrigonogueira4 Thanks for your response

I tried the hyperparams you suggested:

--train_batch_size=4 --accumulate_grad_batches=32 --optimizer=AdamW --lr=3e-5 --weight_decay=5e-5

So far, the closest result was obtained by training monoT5 for 9k steps (10k steps is one epoch with batch_size=4, accum_steps=32, and 2 GPUs)

(TREC-COVID: original 0.7845, mine 0.7899; NFCorpus: original 0.3778, mine 0.3731; NQ: original 0.5676, mine 0.5688; FIQA-2018: original 0.4129, mine 0.4130)

Hello, thank you very much for your work! But I still have some questions.
With batch_size=4, accum_steps=32, and 2 GPUs, one step corresponds to a batch size of 4*32*2 = 256. The Hugging Face checkpoint "monot5-base-msmarco-10k" was trained for 10k steps with a batch size of 128, using the first 6.4e5 lines of the training set. So:
(1) Did you use twice as much data as "monot5-base-msmarco-10k"?
(2) Or did you also use the first 6.4e5 lines, but go over them twice?
(3) Or did you also use the first 6.4e5 lines, but train for only 5k steps because the batch size is twice as large?
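
For concreteness, a quick sketch of the examples-seen arithmetic behind these three readings (plain counting only; it does not say which reading matches the released checkpoint):

```python
# Number of training examples seen per configuration, using the figures quoted above.
def examples_seen(steps, per_gpu_batch, accum_steps=1, num_gpus=1):
    return steps * per_gpu_batch * accum_steps * num_gpus

print(examples_seen(10_000, 4, accum_steps=32, num_gpus=2))  # 2,560,000 (256 per step, readings 1/2)
print(examples_seen(10_000, 128))                            # 1,280,000 (the 10k-step, batch-128 recipe)
print(examples_seen(5_000, 4, accum_steps=32, num_gpus=2))   # 1,280,000 (reading 3)
```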

@rodrigo-f-nogueira

Sorry about the late reply. The correct configuration should be batches of 128 examples, so 10k steps means 6.4M lines of the triples.train.small.tsv file.
