Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MonoBERT and MonoT5 defaults #86

Closed
lintool opened this issue Sep 13, 2020 · 3 comments · Fixed by #93
Closed

MonoBERT and MonoT5 defaults #86

lintool opened this issue Sep 13, 2020 · 3 comments · Fixed by #93

Comments

@lintool
Copy link
Member

lintool commented Sep 13, 2020

Follow up to #83, we have:

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

model_name = 'castorini/monot5-base-msmarco'
tokenizer_name = 't5-base'
reranker =  MonoT5(model_name, tokenizer_name)

I think the model_name and tokenizer_name should have defaults? So we can boil down to:

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker =  MonoT5()

Also, the regression scripts should be modified to use this new abstraction?

@yuxuan-ji
Copy link
Contributor

yuxuan-ji commented Sep 13, 2020

@lintool it already defaults to 'castorini/monot5-base-msmarco' and 't5-base'! I included it in the example because I thought it'd be helpful to show the user that they can easily change model/tokenizer using pretrained stuff from huggingface., but maybe it's confusing?

as for regression scripts (if you mean the evaluation scripts), there shouldn't be any changes necessary because monoBERT/T5 also accepts instances of PretrainedModel or PretrainedTokenizer

@lintool
Copy link
Member Author

lintool commented Sep 13, 2020

I see. Perhaps change to a comment?

Something like:

reranker =  MonoT5()
# Initializes with defaults, model_name = 'castorini/monot5-base-msmarco', tokenizer_name = 't5-base'
# Same as: MonoT5('castorini/monot5-base-msmarco', 't5-base')

README docs should be as simple as possible IMO.

re: regression scripts, e.g., https://github.com/castorini/pygaggle/blob/master/pygaggle/run/evaluate_passage_ranker.py#L80

Wouldn't it make sense to use this new abstraction so we don't have multiple code paths (that might subtly diverge down the road...)? And same with MonoBERT?

@yuxuan-ji
Copy link
Contributor

yuxuan-ji commented Sep 13, 2020

agree with changing it to a comment! will make a PR for it

as for the regression scripts, I did think of changing those, but wasn't sure how because of PassageRankingEvaluationOptions, since values like options.device, options.from_tf, options.batch_size don't fit in nicely in the current abstraction.

Agreed that forcing the manual construction of models/tokenizers can cause divergence issues in the future, but I wouldn't particularly want to add more parameters to MonoBERT/MonoT5's __init__ method for these options as it'd be imo an anti-pattern, maybe two separate model_options: dict/ tokenizer_options: dict parameters? or is that too dirty?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants