
Simplify boilerplate for monoT5 and monoBERT #83

Merged: 3 commits into castorini:master on Sep 13, 2020

Conversation

@yuxuan-ji (Contributor) commented Sep 10, 2020

closes #80

Usage is like so:

from pygaggle.rerank.pretrained import monoBERT, monoT5

reranker = monoBERT()
# or
reranker = monoBERT('model-name', 'tokenizer-name')
# or
reranker = monoBERT(model, tokenizer)

I tested it; the outputs are the same as https://github.com/castorini/pygaggle#a-simple-reranking-example
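
For reference, a minimal end-to-end sketch of the simplified flow (assuming the Query/Text classes and the rerank() call from the README example linked above; the passages here are made up purely for illustration):

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.pretrained import monoT5

# Default pretrained model + tokenizer, no boilerplate.
reranker = monoT5()

query = Query('who proposed the geocentric theory')
passages = [('doc1', 'Ptolemy proposed the geocentric model in the Almagest.'),
            ('doc2', 'Copernicus proposed a heliocentric model of the solar system.')]
texts = [Text(text, {'docid': docid}, 0) for docid, text in passages]

# rerank() returns the texts with scores attached.
for result in reranker.rerank(query, texts):
    print(result.metadata['docid'], result.score)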

I was a bit confused about the goal of these two functions; see my comment here: #80 (comment)

@rodrigonogueira4 (Member) left a comment

Great, thanks for implementing this!

@rodrigonogueira4 (Member)

Can you also change the README accordingly?

@lintool (Member) commented Sep 10, 2020

hey @rodrigonogueira4 - do you prefer this impl or the alternative of folding the boilerplate code into constructors for existing models? e.g., https://github.com/castorini/pygaggle/blob/master/pygaggle/rerank/transformer.py#L25

@rodrigonogueira4 (Member)

> hey @rodrigonogueira4 - do you prefer this impl or the alternative of folding the boilerplate code into constructors for existing models? e.g., https://github.com/castorini/pygaggle/blob/master/pygaggle/rerank/transformer.py#L25

Yes, it is actually better to rename T5Reranker to monoT5 and add the option to initialize it without arguments.

@yuxuan-ji (Contributor, Author) commented Sep 10, 2020

@rodrigonogueira4 @lintool which do you prefer?

Option 1, folding into constructors:

class T5Reranker(Reranker):
    def __init__(self,
                 model: T5ForConditionalGeneration = None,
                 tokenizer: QueryDocumentBatchTokenizer = None):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco').to(device).eval()
        self.model = model

        if not tokenizer:
            tokenizer = T5BatchTokenizer(AutoTokenizer.from_pretrained('t5-base'), batch_size=8)
        self.tokenizer = tokenizer

        self.device = next(self.model.parameters(), None).device


class SequenceClassificationTransformerReranker(Reranker):
    def __init__(self,
                 model: PreTrainedModel = None,
                 tokenizer: PreTrainedTokenizer = None):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = AutoModelForSequenceClassification.from_pretrained('castorini/monobert-large-msmarco').to(device).eval()
        self.model = model

        if not tokenizer:
            tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
        self.tokenizer = tokenizer

        self.device = next(model.parameters()).device

Option 2, making subclasses:

class MonoT5(T5Reranker):
    def __init__(self,
                 model: T5ForConditionalGeneration = None,
                 tokenizer: QueryDocumentBatchTokenizer = None):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco').to(device).eval()
        if not tokenizer:
            tokenizer = T5BatchTokenizer(AutoTokenizer.from_pretrained('t5-base'), batch_size=8)
        super().__init__(model, tokenizer)

class MonoBERT(SequenceClassificationTransformerReranker):
    def __init__(self,
                 model: PreTrainedModel = None,
                 tokenizer: PreTrainedTokenizer = None):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = AutoModelForSequenceClassification.from_pretrained('castorini/monobert-large-msmarco').to(device).eval()
        if not tokenizer:
            tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
        super().__init__(model, tokenizer)

Option 2 is for the case where T5Reranker & SequenceClassificationTransformerReranker have some lower-level meaning we'd like to keep, but I don't mind option 1 if MonoBERT & MonoT5 would mean the same thing.
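
To make the practical difference concrete, a tiny sketch (assuming the class definitions from the two snippets above are in scope; constructing these downloads the default checkpoints, so it's illustrative rather than something to run casually):

# Option 2: the generic reranker keeps its role, and MonoBERT is just a
# preconfigured subclass of it.
reranker = MonoBERT()
assert isinstance(reranker, SequenceClassificationTransformerReranker)

# Option 1: a single class covers both the default checkpoint and a
# user-supplied model/tokenizer.
reranker = SequenceClassificationTransformerReranker()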

@lintool (Member) commented Sep 10, 2020

My vote is for option 1, but renaming the classes to monoT5 and monoBERT. I know this is against Python conventions, but is starting with a lowercase "m" really that bad? It makes them consistent with our papers, etc.

@ronakice should chime in also...

Also, once we build these abstractions, we should propagate them to the replications as well, e.g.:
https://github.com/castorini/pygaggle/blob/master/pygaggle/run/evaluate_passage_ranker.py
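
As a rough sketch of what that propagation might look like (hypothetical: construct_t5 and the options fields below are placeholders, not the actual code in evaluate_passage_ranker.py), the replication script's model construction could collapse to:

from transformers import AutoTokenizer, T5ForConditionalGeneration

from pygaggle.rerank.base import Reranker
from pygaggle.rerank.transformer import T5Reranker
from pygaggle.model import T5BatchTokenizer  # module paths assumed, as used elsewhere in pygaggle

def construct_t5(options) -> Reranker:
    # Default checkpoint: the device/model/tokenizer boilerplate now lives
    # inside the reranker's constructor.
    if options.model_name is None:
        return T5Reranker()
    # Explicit checkpoint: build the pieces as before and pass them in.
    model = T5ForConditionalGeneration.from_pretrained(options.model_name).eval()
    tokenizer = T5BatchTokenizer(AutoTokenizer.from_pretrained(options.tokenizer_name),
                                 batch_size=options.batch_size)
    return T5Reranker(model, tokenizer)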

@ronakice (Member)

I agree with @lintool, lowercase "m" as it is consistent with our previous work!

@lintool (Member) commented Sep 10, 2020

> I agree with @lintool, lowercase "m" as it is consistent with our previous work!

But re: Option 1 vs. Option 2? I.e., is there something special about our current abstractions that we should keep?

@yuxuan-ji (Contributor, Author) commented Sep 10, 2020

The main issues I can think of with using lowercase are:

  1. pretty much every linter will complain unless it's added as an exception
  2. it's not exactly clear from seeing from pygaggle.rerank.transformer import monoBERT, monoT5 that those are classes

I'd avoid it unless consistency with the papers is particularly important and worth ignoring Python's conventions.

As for the current abstractions, I was mainly concerned about SequenceClassificationTransformerReranker; the way it's written, it seems intended to work with any model architecture supported by AutoModelForSequenceClassification, i.e. you could use a DistilBERT, RoBERTa, etc. model. If so, then I think calling it monoBERT wouldn't exactly fit its usage.
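
For example, nothing in that constructor is BERT-specific; a hypothetical sketch ('roberta-base' is just a stand-in checkpoint here, loaded with an untrained classification head, purely to illustrate the architecture point):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

from pygaggle.rerank.transformer import SequenceClassificationTransformerReranker  # path assumed

# Any sequence-classification architecture loads through the same path.
model = AutoModelForSequenceClassification.from_pretrained('roberta-base').eval()
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
reranker = SequenceClassificationTransformerReranker(model, tokenizer)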

@ronakice (Member)

Hey @yuxuan-ji, sorry, I got a bit carried away with some other work.

Firstly, yes, I think option 1 is better. You make a fair point about linters complaining about monoBERT/monoT5, as well as the lack of clarity to the general dev. So I concede: I think it is better to go with MonoBERT and MonoT5. @lintool, what do you think?

As to the usage of MonoBERT in the general form, I think it is mostly fine, since all these other sequence classification transformers are kind of BERT-derived and use the same "principles", unlike monoT5, which signified a bit of a shift. Also, we currently only support MonoBERT, so until we find a better name / add support for other models, we shouldn't worry about it. :)

@lintool (Member) commented Sep 13, 2020

Okay, we've converged. Option 1, MonoBERT and MonoT5 as class names.

FWIW - huggingface made the model names fugly, favoring conformance to conventions - see Bert and MBart.

@yuxuan-ji please execute.
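
For what it's worth, the converged usage would then look roughly like this (a sketch, assuming the classes end up alongside the existing rerankers in pygaggle.rerank.transformer):

from pygaggle.rerank.transformer import MonoBERT, MonoT5

# Default pretrained checkpoints, no boilerplate.
t5_reranker = MonoT5()
bert_reranker = MonoBERT()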

@ronakice (Member) left a comment

LGTM! Merging! Thanks @yuxuan-ji for swiftly finishing this :)

@ronakice merged commit 41513a9 into castorini:master on Sep 13, 2020
@yuxuan-ji deleted the simplify-boilerplate branch on September 15, 2020 18:50