Enable dropout = 0.0 as an equivalent to none in BPE #1550

Merged: 6 commits merged into huggingface:main from mcognetta:dropout_zero on Jun 24, 2024

Conversation

@mcognetta (Contributor)

This is related to the discussion in #1541.

This PR allows 0.0 to be used as the dropout value in BPE models, with functionality equivalent to None. Previously, the docs and implementation were inconsistent:

  • the docs said 0.0 to 1.0 were acceptable values for dropout and that the default was 0
    • the default was actually None
  • it was possible to set dropout to 0.0 once the model was built (and it behaved the same as if it were None)
    • however, in this case caching was disabled
    • on a small benchmark, I measured a 50% increase in encoding time with dropout = 0.0 compared to None (this discrepancy goes away after this PR is applied; see the timing sketch after this list)
  • serialization worked, but deserialization failed, since there was a check that dropout ∈ (0.0, 1.0]
  • initializing a Python BPE model with dropout 0.0 failed (e.g. BPE(dropout = 0.0))
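
For reference, the kind of timing comparison mentioned in the second bullet can be reproduced roughly as follows. This is only a sketch, not the benchmark used for this PR: the toy vocab and merges are invented, and it relies on the fact noted above that dropout could already be set on a built model (which, pre-PR, disabled the cache).

>>> import timeit
>>> from tokenizers import Tokenizer, models
>>> vocab = {"a": 0, "b": 1, "ab": 2}
>>> merges = [("a", "b")]
>>> tok_none = Tokenizer(models.BPE(vocab=vocab, merges=merges))  # dropout left as None
>>> tok_zero = Tokenizer(models.BPE(vocab=vocab, merges=merges))
>>> tok_zero.model.dropout = 0.0  # settable after construction, but pre-PR this turned off caching
>>> timeit.timeit(lambda: tok_none.encode("ab"), number=100_000)
>>> timeit.timeit(lambda: tok_zero.encode("ab"), number=100_000)  # pre-PR: noticeably slower; after this PR the two times match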

This PR simply makes 0.0 an acceptable value during initialization and enables caching when tokenizing if dropout == 0.0.

E.g., the following now works:

>>> from tokenizers import Tokenizer, models
>>> tokenizer = Tokenizer(models.BPE(dropout = 0.0))
>>> s = tokenizer.to_str()
>>> s
'{"version":"1.0","truncation":null,"padding":null,"added_tokens":[],"normalizer":null,"pre_tokenizer":null,"post_processor":null,"decoder":null,"model":{"type":"BPE","dropout":0.0,"unk_token":null,"continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"byte_fallback":false,"ignore_merges":false,"vocab":{},"merges":[]}}'
>>> deserialized = Tokenizer.from_str(s)
>>> deserialized.model.dropout
0.0

whereas before it errored.
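
To make the equivalence concrete, here is a minimal sketch (not part of the PR; the toy vocab and merges are invented for illustration) showing that dropout = 0.0 and dropout = None now tokenize identically:

>>> from tokenizers import Tokenizer, models
>>> vocab = {"a": 0, "b": 1, "ab": 2}
>>> merges = [("a", "b")]
>>> tok_none = Tokenizer(models.BPE(vocab=vocab, merges=merges, dropout=None))
>>> tok_zero = Tokenizer(models.BPE(vocab=vocab, merges=merges, dropout=0.0))
>>> tok_none.encode("ab").tokens
['ab']
>>> tok_zero.encode("ab").tokens
['ab']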


As future work, I think that dropout should be made non-optional, with the default being 0.0. This would remove the checks for dropout.is_none(), etc., but keep the functionality the same. However, I guess this would be a breaking change (since all tokenizers serialized before this change would then be invalid?).

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

LGTM! Thanks for adding tests in Python and Rust.

@ArthurZucker (Collaborator)

Rebasing on main should fix the clippy issues.

@mcognetta (Contributor, Author)

Thanks. I fixed one of the lint errors, which was a range readability issue.

@mcognetta (Contributor, Author)

Fixed one more formatting issue. Now I think it should be all good!

@ArthurZucker (Collaborator) left a comment

Thanks! Let's remove the unrelated change (maybe from rebasing?).

Collaborator

Actually, this should not be in the diff.

@mcognetta (Contributor, Author)

I goofed, and now it's getting worse 😬

@mcognetta (Contributor, Author)

Ok, it's all good now, unless you want me to squash the commits.

@ArthurZucker (Collaborator) left a comment

No, don't worry, this LGTM! Thanks for updating.

@ArthurZucker merged commit fdd26ba into huggingface:main on Jun 24, 2024
7 of 12 checks passed
@mcognetta deleted the dropout_zero branch on Jun 25, 2024 at 03:16
ArthurZucker pushed a commit that referenced this pull request Jul 12, 2024
* enable dropout = 0.0

* typo

* lint

* formatter

* enable dropout = 0.0

* formatter