Add Roberta, XLNet and redesign Tokenizer #125
Conversation
…nge featurization for text_classif and ner
…functional tests ok. Not tested for performance + large scale
…s. fix pooled_output for xlnet. add tests
…tializing LM and PH seems to matter here: initialize the LM first, then the PH. Further investigation needed for the reason.
…Tokenizer.load() and LanguageModel.load() style.
Looking good. I like how you adjust the model inputs in the forward pass of each LM model type.
Let's postpone the discussion about whitespace tokenization vs. normal tokenization and continue; that behavior doesn't seem to change in the current PR.
Let's go forward with this. It's amazing to have other models in here.
Looks good
Processors saved via FARM <= 0.2.2 stored the information about "lower_case" in the processor config. To still load these old processors correctly after the latest tokenizer refactoring (see #125), we can do a simple check and forward the flag to the tokenizer.
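A minimal sketch of such a backward-compatibility check. The helper name is hypothetical, and passing `do_lower_case` through `Tokenizer.load()` is an assumption for illustration, not the exact code of this PR:

```python
from farm.modeling.tokenization import Tokenizer

def load_tokenizer_compat(processor_config, model_name_or_path):
    """Hypothetical helper: forward the legacy "lower_case" flag from
    old (FARM <= 0.2.2) processor configs to the new Tokenizer.load()."""
    kwargs = {}
    # Old processor configs stored the flag under the key "lower_case".
    if "lower_case" in processor_config:
        kwargs["do_lower_case"] = processor_config["lower_case"]
    return Tokenizer.load(model_name_or_path, **kwargs)
```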
We have more models joining the FARM: XLNet and Roberta 🎉
This PR includes fundamental changes in FARM to make it easier to add other language models.
Support for tokenizers from the transformers repo.
Pros:
Cons:
Generic loading of tokenizers:
`BertTokenizer.from_pretrained("bert-base-cased")` -> `Tokenizer.load("bert-base-cased")`
Generic loading of language models:
`Bert.load("bert-base-cased")` -> `LanguageModel.load("bert-base-cased")`
Using other models is as easy as switching the name (see the sketch below):
`LanguageModel.load("roberta-base")`
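A minimal usage sketch of the generic loading described above (import paths assumed from the FARM package layout):

```python
from farm.modeling.tokenization import Tokenizer
from farm.modeling.language_model import LanguageModel

# The same two calls work for every supported model; only the name changes.
tokenizer = Tokenizer.load("roberta-base")           # resolves the matching tokenizer class
language_model = LanguageModel.load("roberta-base")  # resolves the matching LM class
```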
Downstream tasks
So far, the new models are available for text classification, multilabel text classification, regression and NER; a usage sketch for text classification follows below.
For QA and LM finetuning, the processors still need to be updated.
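As an illustration, a hedged sketch of plugging one of the new models into a text classification setup. The processor arguments, data directory, label list, and head dimensions are placeholder assumptions following FARM's public examples, not part of this PR's diff:

```python
import torch

from farm.data_handler.processor import TextClassificationProcessor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.tokenization import Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Swapping "roberta-base" for "xlnet-base-cased" or "bert-base-cased"
# is all it takes to change the language model.
tokenizer = Tokenizer.load("roberta-base")
language_model = LanguageModel.load("roberta-base")

# data_dir, label_list and metric are placeholder values for illustration.
processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/my_classification_task",
    label_list=["negative", "positive"],
    metric="acc",
)

# One sequence-level classification head on top of the language model:
# 768 = hidden size of roberta-base, 2 = number of labels.
prediction_head = TextClassificationHead(layer_dims=[768, 2])
model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=0.1,
    lm_output_types=["per_sequence"],
    device=device,
)
```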