TL;DR: When we use the vocab provided in the multilingual data and call `tokenize()` of `FullTokenizer` in `tokenization.py` with `do_lower_case=True`, every token containing Korean subtokens is processed as UNK.

What Happens?
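A minimal repro might look like the following; the vocab path and the sample sentence are placeholders for illustration, not the exact snippet from this report:

```python
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="multilingual/vocab.txt",  # path to the multilingual vocab (placeholder)
    do_lower_case=True)                   # the default value, which triggers the bug

tokens = tokenizer.tokenize(u"한국어 텍스트 예시")  # any text containing Korean (placeholder)
print(tokens)
# With the bug present, every token containing Korean comes out as [UNK].
```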
The result of running this code is that all tokens containing Korean subtokens come out as UNK. This happens because `_run_strip_accents()` is called when `do_lower_case` is True, and `unicodedata.normalize()` is called with NFD inside `_run_strip_accents()`. Korean strings should not be normalized with NFD, because NFD disassembles each Korean character into smaller units (jamo). When the disassembled units are displayed together, they look like normal Korean strings but are in fact different sequences of bytes.
For example, compare a composed (NFC) Korean string with its NFD-normalized form:
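(A sketch of such a comparison; the sample string is a placeholder.)

```python
import unicodedata

s = u"한국어"                        # a composed (NFC) Korean string, as it appears in the vocab
t = unicodedata.normalize("NFD", s)  # NFD decomposes each syllable into its constituent jamo

print(s == t)          # False: the byte sequences differ
print(len(s), len(t))  # 3 vs. 8 code points, even though both render identically
print(s, t)            # visually indistinguishable when printed
```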
When we run this code, we can see that the two seemingly identical strings are in fact different.
Why Fix This?
Although this bug occurs only when `do_lower_case` is True, that option is True by default, so it is likely to confuse users who process data containing Korean characters. (BTW, I think the line `token = self._run_strip_accents(token)` might be better placed outside the `if self.do_lower_case` block, since accents are not necessarily related to case.)

Therefore, I added a few lines of code to skip normalization for Korean subtokens.
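Roughly, the guard could look like the following inside `BasicTokenizer._run_strip_accents()`; this is a sketch, and the exact regex range, helper name, and placement may differ from the actual change:

```python
import re
import unicodedata

# Precomposed Hangul syllables (U+AC00 - U+D7A3); the pattern could be extended
# if other NFD-sensitive scripts turn up.
KOREAN_PATTERN = re.compile(u"[\uAC00-\uD7A3]")

def _run_strip_accents(self, text):
  """Strips accents from a piece of text, leaving Korean text untouched."""
  if KOREAN_PATTERN.search(text):
    # NFD would decompose Hangul syllables into jamo and change the byte
    # sequence, so skip normalization for tokens that contain Korean.
    return text
  text = unicodedata.normalize("NFD", text)
  output = []
  for char in text:
    cat = unicodedata.category(char)
    if cat == "Mn":
      continue
    output.append(char)
  return "".join(output)
```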
(This bug may also occur in other languages that have trouble with NFD normalization, but I am not aware of such cases. If another affected language is found, expanding the regex pattern, which I currently defined only for Korean, would solve the problem.)
After the fix, texts with Korean subtokens are tokenized correctly even with `do_lower_case` set to True.
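(A sketch of re-running the earlier repro against the patched tokenizer; the exact subword splits depend on the vocab, so the comment below is only illustrative.)

```python
tokens = tokenizer.tokenize(u"한국어 텍스트 예시")
print(tokens)
# Korean tokens now map to real vocab entries or subword pieces instead of
# collapsing to [UNK]; the exact pieces depend on the multilingual vocab.
```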