Add Roberta, XLNet and redesign Tokenizer #125

Merged
merged 22 commits into master from roberta on Oct 25, 2019

Conversation

@tholor (Member) commented Oct 23, 2019

We have more models joining the FARM: XLNet and Roberta 🎉
This PR includes fundamental changes in FARM to make it easier to add other language models.

Support for tokenizers from the transformers repo.

Pros:

  • It's quite easy to add a tokenizer for any of the models implemented in transformers.
  • We'd rather support the development there than build something in parallel.
  • The additional metadata created during tokenization (offsets, start_of_word) is still produced via tokenize_with_metadata (see the sketch after the Cons list below).
  • We can use encode_plus to add model-specific special tokens (CLS, SEP, ...).

Cons:

  • We had to deprecate our attribute "never_split_chars", which allowed adjusting the BasicTokenizer of BERT.
  • Custom vocab is now realized by increasing vocab_size instead of replacing unused tokens.
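
A minimal sketch of the tokenizer workflow described above, assuming the FARM API introduced in this PR (the return keys of tokenize_with_metadata and the exact encode_plus call are from memory and may differ slightly):

```python
from farm.modeling.tokenization import Tokenizer, tokenize_with_metadata

# Generic loading resolves the right tokenizer class from the model name
tokenizer = Tokenizer.load("bert-base-cased")

text = "More models joining the FARM"
# tokenize_with_metadata keeps the extra metadata alongside the tokens
tokenized = tokenize_with_metadata(text, tokenizer, max_seq_len=128)
print(tokenized["tokens"])         # subword tokens
print(tokenized["offsets"])        # character offset of each token in `text`
print(tokenized["start_of_word"])  # True where a token begins a new word

# encode_plus (from transformers) adds the model-specific special tokens (CLS, SEP, ...)
input_ids = tokenizer.encode_plus(text, add_special_tokens=True)["input_ids"]
```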

Generic loading of tokenizers:

BertTokenizer.from_pretrained("bert-base-cased") -> Tokenizer.load("bert-base-cased")

Generic loading of language models:

Bert.load("bert-base-cased") -> LanguageModel.load("bert-base-cased")

Using other models is as easy as switching the name: LanguageModel.load("roberta-base")
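
As a sketch of the loading pattern: class resolution happens internally from the model name, so the calling code does not change per architecture.

```python
from farm.modeling.language_model import LanguageModel
from farm.modeling.tokenization import Tokenizer

# The same two lines work for BERT, RoBERTa or XLNet; only the name changes.
tokenizer = Tokenizer.load("roberta-base")           # resolves to a RoBERTa tokenizer
language_model = LanguageModel.load("roberta-base")  # resolves to a RoBERTa model

# Switching architectures:
# tokenizer = Tokenizer.load("xlnet-base-cased")
# language_model = LanguageModel.load("xlnet-base-cased")
```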

Downstream tasks

So far, the new models are available for text classification, multilabel text classification, regression and NER.
For QA and LM finetuning, the processors still have to be updated.
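
To make the downstream usage concrete, here is a minimal sketch of wiring one of the new models into text classification, following the FARM building blocks (Processor, LanguageModel, PredictionHead, AdaptiveModel). Parameter names such as label_list and layer_dims are assumptions based on the examples of this era and may differ:

```python
from farm.data_handler.processor import TextClassificationProcessor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.tokenization import Tokenizer

tokenizer = Tokenizer.load("roberta-base")

# The processor converts raw text into model-ready tensors
processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/my_task",  # hypothetical path
    label_list=["negative", "positive"],
    metric="acc",
)

# Language model + task-specific head combined into one AdaptiveModel
language_model = LanguageModel.load("roberta-base")
prediction_head = TextClassificationHead(layer_dims=[768, 2])
model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=0.1,
    lm_output_types=["per_sequence"],
    device="cpu",
)
```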

@Timoeller (Contributor) left a comment

Looking good. I like how you adjust the model inputs in the forward pass of each language model type.

Let's postpone the discussion about whitespace tokenization vs. normal tokenization until afterwards and continue; it doesn't seem to change in the current PR.

Let's go forward with this; it's great to have other models in here.

@brandenchan (Contributor) left a comment

Looks good

@tholor tholor changed the title WIP Add Roberta, XLNet and redesign Tokenizer Add Roberta, XLNet and redesign Tokenizer Oct 25, 2019
@tholor tholor merged commit 3f10b82 into master Oct 25, 2019
tanaysoni pushed a commit that referenced this pull request Oct 28, 2019
Processors saved via FARM <= 0.2.2 stored the information about "lower_case" in the processor config.

In order to still load these old processors correctly after the latest tokenizer refactoring (see #125), we can do a simple check + forwarding.
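
A hypothetical sketch of such a check + forwarding (the key names and file handling are assumptions for illustration, not the actual diff):

```python
import json

# Load the processor config written by an older FARM version
with open("processor_config.json") as f:
    config = json.load(f)

# Processors saved with FARM <= 0.2.2 stored "lower_case" in the processor
# config; forward it to the key expected after the tokenizer refactoring.
if "lower_case" in config:
    config["do_lower_case"] = config.pop("lower_case")
```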
@tholor tholor deleted the roberta branch April 28, 2020 07:30