[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder #1513
Conversation
Causes issues with `ByteLevel` messing up some `AddedTokens` whose characters fall in the UTF-8 range used in the byte-level mapping. This commit tests the extent of the damage of ignoring the decoder for those tokens.
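To illustrate the failure mode described above, here is a minimal sketch, assuming the Rust crate's `Decoder` trait and `ByteLevel` decoder (the token itself is made up):

```rust
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::Decoder;

fn main() {
    let decoder = ByteLevel::default();
    // 'é' (U+00E9) is one of the characters the byte-level alphabet
    // uses to represent a raw byte (0xE9), so the decoder maps it
    // back to that single byte, which is not valid UTF-8 on its own.
    let decoded = decoder.decode(vec!["café_token".to_string()]).unwrap();
    println!("{decoded}"); // "caf�_token" instead of "café_token"
}
```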
LGTM, let's add a test for the decoder to make sure the behavior we are looking for is enabled.
tokenizers/src/tokenizer/mod.rs
Outdated
if !result.is_empty() {
    result.push(' ');
}
Why are we adding an extra space before the added token? IMO we should only do this if there is no decoder (the default is to join on " ").
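A hypothetical sketch of that suggestion, reusing `result` and assuming a `self.decoder` option as in the surrounding code:

```rust
// Only fall back to space-joining when no decoder is configured;
// with a decoder present, leave the joining entirely to it.
if self.decoder.is_none() && !result.is_empty() {
    result.push(' ');
}
```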
Because this breaks existing tests otherwise (in Node and Python, there are tests that just add a regular token and expect the decoding to happen). I've updated to make the tests pass without modification; we could envision changing those tests.
These tests are done when there is no decoder, no? I mean, the previous behaviour does not add spaces between added tokens if there is a decoder.
Yes it did (because the test wasn't using a decoder, and with no decoder the default is to join on spaces).
Gents, thanks. FWIW I ran the tokenizers tests on transformers and didn't find any related error in the test suite. I figured the issue has nothing to do with it.
More tests with actual failures. The failing test is basically the same one run on BERT.
Actually this kind of breaks the way we decode: because we don't decode everything at once, `Metaspace` or anything that relies on the index of the token is broken. I'll fix it.
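A self-contained sketch of why index-dependent decoders break when the sequence is decoded in chunks around an added token (the decoder below is a toy stand-in for `Metaspace`, not the crate's implementation):

```rust
// Toy decoder that, like Metaspace, treats the token at index 0
// specially: the word-boundary marker is stripped only there.
fn decode_chain(tokens: &[&str]) -> String {
    tokens
        .iter()
        .enumerate()
        .map(|(i, t)| {
            let t = t.replace('\u{2581}', " "); // '▁' marks a word boundary
            if i == 0 { t.trim_start().to_string() } else { t }
        })
        .collect()
}

fn main() {
    let tokens = ["\u{2581}Hello", "\u{2581}world"];
    // Decoding everything at once: the index-0 special case fires once.
    assert_eq!(decode_chain(&tokens), "Hello world");
    // Decoding in chunks around an added token: every chunk has its own
    // index 0, so the space in front of "world" is lost.
    let chunked = format!(
        "{}<added>{}",
        decode_chain(&tokens[..1]),
        decode_chain(&tokens[1..])
    );
    assert_eq!(chunked, "Hello<added>world"); // expected "Hello<added> world"
}
```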