
Fix Korean tokenization bug #228

Closed
soheeyang wants to merge 1 commit

Conversation


soheeyang commented on Dec 4, 2018

TL;DR: When we use the vocab provided with the multilingual model and call tokenize() of FullTokenizer in tokenization.py with do_lower_case=True, all tokens that contain Korean subtokens are processed as UNK.

What Happens?

[screenshot: test code that tokenizes a Korean sentence with FullTokenizer]
The result of running this code is:

[screenshot: output under Python 2, before the fix]
[screenshot: output under Python 3, before the fix]

As can be seen, all tokens that have Korean subtokens are processed as UNK. This happens because _run_strip_accents() is called when do_lower_case is True, and unicodedata.normalize() is called with NFD inside _run_strip_accents(). Korean strings should not be normalized with NFD, because NFD disassembles each Korean character into smaller units (Jamo). When the disassembled units are joined back together, the result looks like a normal Korean string but is in fact a different sequence of bytes.
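
For reference, the behaviour can be reproduced with something along the lines of the sketch below. The vocab path and the example sentence are assumptions for illustration; the exact code and strings in the screenshots above may differ.

```python
# Minimal reproduction sketch. Assumes tokenization.py from this repo is
# importable and that vocab.txt from the multilingual BERT release is
# available; the path and the example sentence are placeholders.
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="multilingual_L-12_H-768_A-12/vocab.txt",
    do_lower_case=True)

text = u"한국어 토큰화 테스트"  # "Korean tokenization test"
print(tokenizer.tokenize(text))
# Before the fix, every Korean word is mapped to '[UNK]', e.g.:
# ['[UNK]', '[UNK]', '[UNK]']
```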

For example,
[screenshot: test code comparing a Korean string with its NFD-normalized form]

When we run this code, we can see that the two seemingly identical strings are in fact different.
[screenshot: execution result under Python 2]
[screenshot: execution result under Python 3]
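
The gist of that comparison can be sketched as follows (the exact string used in the screenshot is not known, so the example word is an assumption):

```python
import unicodedata

# A precomposed Hangul string and its NFD-decomposed form render identically,
# but they are different sequences of code points (and therefore bytes).
original = u"한국어"
decomposed = unicodedata.normalize("NFD", original)

print(original)                        # 한국어
print(decomposed)                      # also renders as 한국어
print(original == decomposed)          # False
print(len(original), len(decomposed))  # 3 precomposed syllables vs. 8 Jamo
```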

Why Fix This?

Although this bug occurs only when do_lower_case is True, that option is True by default, so there is a real chance it will confuse users who process data containing Korean characters. (As an aside, I think the line token = self._run_strip_accents(token) might better live outside the if self.do_lower_case block, since stripping accents is not really related to casing.)
Therefore, I added a few lines of code to skip normalization for Korean subtokens.
(This bug may also occur in other languages that have trouble with NFD normalization, but I am not aware of such cases. If another affected language is found, expanding the regex pattern, which I currently define only for Korean, would solve the problem.)
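
The change is along the lines of the sketch below; the exact regex and its placement in this PR may differ, and the Hangul range shown here is an assumption for illustration.

```python
import re
import unicodedata

# Precomposed Hangul syllables; the range used in the actual patch may differ.
_KOREAN_CHARS = re.compile(u"[\uAC00-\uD7A3]")

# Inside BasicTokenizer in tokenization.py:
def _run_strip_accents(self, text):
  """Strips accents from a piece of text, leaving Korean text untouched."""
  if _KOREAN_CHARS.search(text):
    # NFD would decompose Hangul syllables into Jamo, which the vocab does
    # not contain, so skip normalization for tokens with Korean characters.
    return text
  text = unicodedata.normalize("NFD", text)
  output = []
  for char in text:
    cat = unicodedata.category(char)
    if cat == "Mn":
      continue
    output.append(char)
  return "".join(output)
```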

After the fix, text with Korean subtokens is processed correctly, as shown below, even with do_lower_case set to True.

[screenshot: the same test code as above]

[screenshot: output under Python 2, after the fix]
[screenshot: output under Python 3, after the fix]
