Add Roberta, XLNet and redesign Tokenizer #125

Merged
merged 22 commits into master from roberta on Oct 25, 2019

Conversation

@tholor (Member) commented Oct 23, 2019

We have more models joining the FARM: XLNet and Roberta 🎉
This PR includes fundamental changes in FARM to make it easier to add other language models.

Support for tokenizers from the transformers repo.

Pros:

  • It's quite easy to add a tokenizer for any of the models implemented in transformers.
  • We'd rather support the development there than build something in parallel.
  • The additional metadata created during tokenization (offsets, start_of_word) is still produced via tokenize_with_metadata (see the sketch after the Cons list below).
  • We can use encode_plus to add model-specific special tokens (CLS, SEP, ...).

Cons:

  • We had to deprecate our attribute "never_split_chars", which allowed adjusting the BasicTokenizer of BERT.
  • Custom vocab is now realized by increasing vocab_size instead of replacing unused tokens.
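
A minimal sketch of the tokenizer workflow described above, assuming the FARM API introduced in this PR (the return keys of tokenize_with_metadata and the exact encode_plus call are from memory and may differ slightly):

```python
from farm.modeling.tokenization import Tokenizer, tokenize_with_metadata

# Generic loading resolves the right tokenizer class from the model name
tokenizer = Tokenizer.load("bert-base-cased")

text = "More models joining the FARM"
# tokenize_with_metadata keeps the extra metadata alongside the tokens
tokenized = tokenize_with_metadata(text, tokenizer, max_seq_len=128)
print(tokenized["tokens"])         # subword tokens
print(tokenized["offsets"])        # character offset of each token in `text`
print(tokenized["start_of_word"])  # True where a token begins a new word

# encode_plus (from transformers) adds the model-specific special tokens (CLS, SEP, ...)
input_ids = tokenizer.encode_plus(text, add_special_tokens=True)["input_ids"]
```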

Generic loading of tokenizers:

BertTokenizer.from_pretrained("bert-base-cased") -> Tokenizer.load("bert-base-cased")

Generic loading of language models:

Bert.load("bert-base-cased") -> LanguageModel.load("bert-base-cased")

Using other models is as easy as switching the name: LanguageModel.load("roberta-base")
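
As a sketch of the loading pattern: class resolution happens internally from the model name, so the calling code does not change per architecture.

```python
from farm.modeling.language_model import LanguageModel
from farm.modeling.tokenization import Tokenizer

# The same two lines work for BERT, RoBERTa or XLNet; only the name changes.
tokenizer = Tokenizer.load("roberta-base")           # resolves to a RoBERTa tokenizer
language_model = LanguageModel.load("roberta-base")  # resolves to a RoBERTa model

# Switching architectures:
# tokenizer = Tokenizer.load("xlnet-base-cased")
# language_model = LanguageModel.load("xlnet-base-cased")
```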

Downstream tasks

So far, the new models are available for text classification, multilabel text classification, regression and NER.
For QA and LM finetuning, the processors still have to be updated.
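
To make the downstream usage concrete, here is a minimal sketch of wiring one of the new models into text classification, following the FARM building blocks (Processor, LanguageModel, PredictionHead, AdaptiveModel). Parameter names such as label_list and layer_dims are assumptions based on the examples of this era and may differ:

```python
from farm.data_handler.processor import TextClassificationProcessor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.tokenization import Tokenizer

tokenizer = Tokenizer.load("roberta-base")

# The processor converts raw text into model-ready tensors
processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/my_task",  # hypothetical path
    label_list=["negative", "positive"],
    metric="acc",
)

# Language model + task-specific head combined into one AdaptiveModel
language_model = LanguageModel.load("roberta-base")
prediction_head = TextClassificationHead(layer_dims=[768, 2])
model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=0.1,
    lm_output_types=["per_sequence"],
    device="cpu",
)
```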

@Timoeller (Contributor) left a comment

Looking good. I like how you adjust the model inputs in the forward pass of each language model type.

Let's postpone the discussion about whitespace tokenization vs. normal tokenization until afterwards and continue; it doesn't seem to change in the current PR.

Let's go forward with this; it's great to have other models in here.

@brandenchan (Contributor) left a comment

Looks good

@tholor tholor changed the title WIP Add Roberta, XLNet and redesign Tokenizer Add Roberta, XLNet and redesign Tokenizer Oct 25, 2019
@tholor tholor merged commit 3f10b82 into master Oct 25, 2019
tanaysoni pushed a commit that referenced this pull request Oct 28, 2019
Processors saved via FARM <= 0.2.2 stored the information about "lower_case" in the processor config.

In order to still load these old processors correctly after the latest tokenizer refactoring (see #125), we can do a simple check + forwarding.
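
A hypothetical sketch of such a check + forwarding (the key names and file handling are assumptions for illustration, not the actual diff):

```python
import json

# Load the processor config written by an older FARM version
with open("processor_config.json") as f:
    config = json.load(f)

# Processors saved with FARM <= 0.2.2 stored "lower_case" in the processor
# config; forward it to the key expected after the tokenizer refactoring.
if "lower_case" in config:
    config["do_lower_case"] = config.pop("lower_case")
```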
@tholor tholor deleted the roberta branch April 28, 2020 07:30