TL;DR: When we use the vocab provided in the multilingual data and call `tokenize()` of `FullTokenizer` in `tokenization.py` with `do_lower_case=True`, every token containing Korean subtokens is processed as UNK.

What Happens?
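A minimal repro might look like the following; the vocab path and the sample sentence are placeholders for illustration, not the exact snippet from this report:

```python
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="multilingual/vocab.txt",  # path to the multilingual vocab (placeholder)
    do_lower_case=True)                   # the default value, which triggers the bug

tokens = tokenizer.tokenize(u"한국어 텍스트 예시")  # any text containing Korean (placeholder)
print(tokens)
# With the bug present, every token containing Korean comes out as [UNK].
```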
The result of running this code is that all tokens containing Korean subtokens come out as UNK. This happens because `_run_strip_accents()` is called when `do_lower_case` is True, and `unicodedata.normalize()` is called with NFD inside `_run_strip_accents()`. Korean strings should not be normalized with NFD, because NFD disassembles each Korean character into smaller units (jamo). When the disassembled units are displayed together, they look like normal Korean strings but are in fact different sequences of bytes.
For example, compare a composed (NFC) Korean string with its NFD-normalized form:
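(A sketch of such a comparison; the sample string is a placeholder.)

```python
import unicodedata

s = u"한국어"                        # a composed (NFC) Korean string, as it appears in the vocab
t = unicodedata.normalize("NFD", s)  # NFD decomposes each syllable into its constituent jamo

print(s == t)          # False: the byte sequences differ
print(len(s), len(t))  # 3 vs. 8 code points, even though both render identically
print(s, t)            # visually indistinguishable when printed
```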
When we run this code, we can see that the two seemingly identical strings are in fact different.
Why Fix This?
Although this bug occurs only when `do_lower_case` is True, that option is True by default, so it is likely to confuse users who process data containing Korean characters. (BTW, I think the line `token = self._run_strip_accents(token)` might be better placed outside the `if self.do_lower_case` block, since accents are not necessarily related to case.)

Therefore, I added a few lines of code to skip normalization for Korean subtokens.
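Roughly, the guard could look like the following inside `BasicTokenizer._run_strip_accents()`; this is a sketch, and the exact regex range, helper name, and placement may differ from the actual change:

```python
import re
import unicodedata

# Precomposed Hangul syllables (U+AC00 - U+D7A3); the pattern could be extended
# if other NFD-sensitive scripts turn up.
KOREAN_PATTERN = re.compile(u"[\uAC00-\uD7A3]")

def _run_strip_accents(self, text):
  """Strips accents from a piece of text, leaving Korean text untouched."""
  if KOREAN_PATTERN.search(text):
    # NFD would decompose Hangul syllables into jamo and change the byte
    # sequence, so skip normalization for tokens that contain Korean.
    return text
  text = unicodedata.normalize("NFD", text)
  output = []
  for char in text:
    cat = unicodedata.category(char)
    if cat == "Mn":
      continue
    output.append(char)
  return "".join(output)
```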
(This bug may also occur in other languages that have trouble with NFD normalization, but I am not aware of such cases. If another affected language is found, expanding the regex pattern, which I currently defined only for Korean, would solve the problem.)
After the fix, texts with Korean subtokens are tokenized correctly even with `do_lower_case` set to True.
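(A sketch of re-running the earlier repro against the patched tokenizer; the exact subword splits depend on the vocab, so the comment below is only illustrative.)

```python
tokens = tokenizer.tokenize(u"한국어 텍스트 예시")
print(tokens)
# Korean tokens now map to real vocab entries or subword pieces instead of
# collapsing to [UNK]; the exact pieces depend on the multilingual vocab.
```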