avoid collisions when decoding bad unicode #2411
Conversation
@piskvorky Please have a look at this when you have the chance. This is the last PR we need to merge before the next release (sorry I missed it in my earlier email).
How does FB itself deal with invalid utf8? IMO we should mimic that.
Or do they operate on raw bytes, instead of characters?
gensim/models/_fasttext_bin.py (outdated diff)
word = word_bytes.decode('latin1')
logger.error(
    'failed to decode invalid unicode bytes %r; falling back to latin1, using %r; '
    'consider upgrading to Python 3 for better Unicode handling (and so much more)',
Seems unrelated. If I understand correctly, the problem is with badly encoded strings in the model (FB's fault), or possibly our choice to accept and recover from such inputs, instead of failing explicitly.
There's nothing wrong with Python 2, and Python 3 won't fix the problem (just mask it in a different way). The recommendation that projects switch to a different Python major version because of some error in a FB model seems out of place.
Under Py2, you get a string of complete garbage. The term is essentially lost from the vocabulary. Under Py3, you get a mostly OK string with the broken characters escaped. The term remains in the vocabulary, and can contribute vectors to the model. See the unit tests for an example.
So under Py2, our handling of this edge case is worse than under Py3. We're not recommending people switch to Py3, we're just telling them that switching will give them better edge case handling. So, IMO the message is accurate.
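For illustration only (not gensim's actual loader code, and assuming the Python 3 path escapes the broken bytes with a backslashreplace-style handler, as described above):

# -*- coding: utf-8 -*-
bad_bytes = u"Žluťoučký".encode("latin2")   # latin2-encoded bytes, not valid UTF-8

# Python 3: decodable substrings survive and only the broken bytes are escaped:
#   bad_bytes.decode("utf-8", "backslashreplace")  ->  u'\\xaelu\\xbbou\\xe8k\\xfd'

# Python 2 fallback from the diff above: latin1 never fails, but maps every byte
# to some latin-1 character, so the term no longer resembles the original word:
#   bad_bytes.decode("latin1")  ->  u'®lu»ouèký'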
Both terms are garbage and lost from the vocabulary, I see no difference. No user query or input text will ever match them, whether it's latin1-garbage or backslash-escaped-garbage.
The only useful property might be substrings that still happen to be decoded correctly within that term (probably ~ASCII subset; true both for latin1 and backslash-escaped).
IMO the "pure", correct solution would be to "refuse to guess" and either:
- fail the model loading with an exception (see the sketch below), in case valid unicode (via utf8?) is the contract of the FT format, aka "not our problem"; or
- work with bytes instead of text, in case there's no decoding contract (which I hope is not the case, because that's a lot of work, fixing all the interfaces that expect text!).
Which approach does Facebook's reference implementation use?
Either way, if we want to go down the "impure" road of patching the models on-the-fly, the difference of py2 vs py3 is moot. Anything you can do with unicode in python3, you can also do in python2. We'd just implement backslashreplace (looks relatively easy) and be consistent, instead of muddying the waters by logging arcane py2/3 upgrade recommendations.
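For reference, a minimal sketch of the fail-fast option (names are illustrative; this is not the actual gensim loader):

def read_word_strict(word_bytes):
    # Treat valid UTF-8 as the contract of the FT binary format: decode
    # strictly and surface a clear error instead of guessing an encoding.
    try:
        return word_bytes.decode('utf-8')
    except UnicodeDecodeError:
        raise ValueError('vocabulary entry is not valid UTF-8: %r' % (word_bytes,))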
We'd just implement backslashreplace (looks relatively easy)
Googling turned up this as a possible backslashreplace backport:
import codecs

def backslashescape_backport(ex):
    # The error handler receives the UnicodeDecodeError, which carries the
    # original byte string and the start/end indexes of the bad portion.
    bstr, start, end = ex.object, ex.start, ex.end
    # The return value is a tuple of the replacement Unicode string and the
    # index at which to continue conversion.
    # Note: iterating byte strings yields ints on 3.x but str on 2.x.
    return u''.join('\\x{:02x}'.format(c if isinstance(c, int) else ord(c)) for c in bstr[start:end]), end

codecs.register_error("backslashescape_backport", backslashescape_backport)
bad_term = u"Žluťoučký koníček".encode("latin2")  # create invalid UTF8
print(repr(bad_term))
# b'\xaelu\xbbou\xe8k\xfd kon\xed\xe8ek'
print(repr(bad_term.decode("utf-8", "backslashescape_backport")))
# u'\\xaelu\\xbbou\\xe8k\\xfd kon\\xed\\xe8ek'
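For context, a hypothetical sketch of how a handler registered this way could replace the latin1 fallback in the loader (the function name and structure are illustrative, not the actual _fasttext_bin.py code):

import logging

logger = logging.getLogger(__name__)

def _decode_word(word_bytes):
    # Hypothetical fallback: try strict UTF-8 first, then escape the broken
    # bytes with the backport registered above instead of guessing latin1.
    try:
        return word_bytes.decode('utf-8')
    except UnicodeDecodeError:
        word = word_bytes.decode('utf-8', 'backslashescape_backport')
        logger.error('failed to decode invalid unicode bytes %r; using %r instead', word_bytes, word)
        return word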
FB seems to operate on raw bytes, so they are able to ignore the problem altogether.
Fix #2402