Add option to use fast HF tokenizer. #482
Conversation
I just saw that there is a very similar PR here: #205. @tholor @Timoeller do you want to have fast set by default? HF not having it as default in
Hey @PhilipMay, it would be great if you could continue with the integration of the fast Rust tokenizers. We were actually blocked in #205 for quite some time due to missing support for multiprocessing, but we should be good to go now.
For 2) we would need to understand in more detail if there were any breaking changes (e.g. are all model types supported by now?) or tokenization differences (e.g. Roberta's prefix whitespace). We really appreciate any support from you here, but we can also take over at any point (especially for 2). BTW: Great that you'll open-source a German Electra model! We have trained some variants there too and found them to be very effective. We will publish them soon together with a paper.
Hey @tholor - thanks for the answer. Is it ok for you to separate points 1 and 2 into different PRs? I think they can be handled in steps.
Yes, absolutely!
Great - so I will continue with this... :-)
@tholor Is it ok with you to set "slow" as the default, so we have no breaking change?
- set num_processes=0 for Inferencer
If "slow" tokenizer as default is ok with you - this PR can be merged from my point of view. |
Looking great! Thanks for working on this! Two small suggestions from my side...
- electra
- roberta
@tholor When doing "normal" text classification with a fast tokenizer like in https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification.py I have the following problem:
The error is coming from here: https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/processor.py#L299 Do you have an idea?
Hmm... what does the sample that causes trouble here look like? Is it printed after the above error message? If not, maybe set a breakpoint there. Is that happening for all samples or only for some in your dataset?
Force-pushed from f286d43 to aec7d2d
Sure. I am currently on vacation with only partial internet access, but I will try to have a look at it tomorrow.
That would be great! Thanks. ;-)
- add shape assert
- fix embedding assert
I investigated the problem with
There seem to be more bugs with fast and slow tokenizers. To be honest, I do not feel like debugging all this stuff at the moment...
Ok, sure. We will take over.
Well, thanks! :-)
Fast tokenizers seem to have the same problem with
Did you see this comment: huggingface/transformers#6046 (comment)?
Yes, we saw that. Currently trying to come up with a solution for this.
A quick update on what I have done so far, @PhilipMay. One question that came up concerns line 107 in tokenization.py: why is an error raised if the user wants to use a fast tokenizer with a Roberta model? According to the Huggingface documentation, fast tokenizers are available for Roberta models. Do they behave differently from the other tokenizers?
My 1.5 cents:
This PR, as it is now, makes it possible to use fast tokenizers with Bert, Distilbert and Electra models. Although fast tokenizers exist for Roberta models, fast Roberta tokenizers are not supported yet. This is because, to get the input IDs, for certain tasks we need to detokenize the input. However, Roberta treats spaces as part of the tokens, which makes detokenizing more complex. In a future PR we should implement fast tokenizers such that we get the samples' features while tokenizing and extract the offsets. That way we would save one tokenization step and could use fast tokenizers with Roberta models as well.
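To illustrate the "offsets while tokenizing" idea (not part of this PR, just a sketch against the Huggingface transformers API): fast tokenizers can return character offsets directly, which removes the need for a detokenization round-trip, and the 'Ġ' in the Roberta output shows the prefix space that makes detokenizing tricky.

```python
from transformers import AutoTokenizer

# Rust-based fast tokenizer; offsets come for free instead of via detokenization.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

encoded = tokenizer("Hello world", return_offsets_mapping=True)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['<s>', 'Hello', 'Ġworld', '</s>'] - 'Ġ' marks the leading space Roberta keeps
print(encoded["offset_mapping"])
# list of (start, end) character spans per token (exact values depend on tokenizer settings)
```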
Looking good
I think that was the reason why I did that in my early PR commits.
This PR adds the option to use the fast HF tokenizer.
The reason why I need this is the following:
I plan to open source a German Electra model which is lower-case but does not strip accents. To do that you have to specify
strip_accents=False
to the tokenizer. But this option is only available for the fast tokenizer. Also see here: huggingface/transformers#6186 and here: google-research/electra#88
Might also solve #157
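For context, a minimal sketch of the tokenizer setup this is about; `strip_accents` is only exposed by the fast Bert-style tokenizer (see the issues linked above), and the checkpoint below is just a public stand-in since the German Electra model is not published yet:

```python
from transformers import BertTokenizerFast

# Lower-case the text but keep accents (e.g. German umlauts). The checkpoint is a
# stand-in; how the accented text splits into word pieces depends on its vocabulary.
tokenizer = BertTokenizerFast.from_pretrained(
    "bert-base-uncased",
    do_lower_case=True,
    strip_accents=False,
)

print(tokenizer.tokenize("Äpfel und Öl"))  # umlauts survive normalization
```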
To-do before done