avoid collisions when decoding bad unicode #2411

Merged (5 commits), Apr 6, 2019
Conversation

@mpenkov (Collaborator) commented Mar 10, 2019

Fix #2402

@mpenkov (Collaborator, Author) commented Mar 23, 2019

@piskvorky Please have a look at this when you have the chance. This is the last PR we need to merge before the next release (sorry I missed it in my earlier email).

@piskvorky (Owner) left a comment

How does FB itself deal with invalid UTF-8? IMO we should mimic that.

Or do they operate on raw bytes instead of characters?

word = word_bytes.decode('latin1')
logger.error(
    'failed to decode invalid unicode bytes %r; falling back to latin1, using %r; '
    'consider upgrading to Python 3 for better Unicode handling (and so much more)',
@piskvorky (Owner) Mar 23, 2019

Seems unrelated. If I understand correctly, the problem is with badly encoded strings in the model (FB's fault), or possibly our choice to accept and recover from such inputs, instead of failing explicitly.

There's nothing wrong with Python 2, and Python 3 won't fix the problem (just mask it in a different way). The recommendation that projects switch to a different Python major version because of some error in a FB model seems out of place.

@mpenkov (Collaborator, Author) Mar 24, 2019

Under Py2, you get a string of complete garbage. The term is essentially lost from the vocabulary. Under Py3, you get a mostly OK string with the broken characters escaped. The term remains in the vocabulary, and can contribute vectors to the model. See the unit tests for an example.

So under Py2, our handling of this edge case is worse than under Py3. We're not recommending that people switch to Py3; we're just telling them that switching will give them better edge-case handling. So, IMO the message is accurate.
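For concreteness, a minimal sketch of the two decodings described here, assuming the Py3 path escapes the broken bytes with the built-in backslashreplace handler (Python 3.5+) and the Py2 path falls back to latin1 as in the snippet above; the byte string is borrowed from the latin2 example further down:

# On Python 3.5+: broken bytes get escaped, readable substrings survive
bad = b'\xaelu\xbbou\xe8k\xfd kon\xed\xe8ek'    # latin-2 bytes, invalid as UTF-8
bad.decode('utf-8', errors='backslashreplace')  # -> '\\xaelu\\xbbou\\xe8k\\xfd kon\\xed\\xe8ek'

# On Python 2, the latin1 fallback never fails, because latin1 maps every byte
# to *some* character -- but the result is mojibake:
bad.decode('latin1')                            # -> u'®lu»ouèký konìèek' (garbage)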

@piskvorky (Owner) Mar 24, 2019

Both terms are garbage and lost from the vocabulary, I see no difference. No user query or input text will ever match them, whether it's latin1-garbage or backslash-escaped-garbage.

The only useful property might be substrings that still happen to be decoded correctly within that term (probably ~ASCII subset; true both for latin1 and backslash-escaped).

IMO the "pure", correct solution would be to "refuse to guess" and either:

  1. fail the model loading with an exception, in case valid unicode (via utf8?) is the contract of the FT format, aka "not our problem"; or
  2. work with bytes instead of text, in case there's no decoding contract (which I hope is not the case, because that's a lot of work, fixing all the interfaces that expect text!).

Which approach does Facebook's reference implementation use?

Either way, if we want to go down the "impure" road of patching the models on-the-fly, the difference of py2 vs py3 is moot. Anything you can do with unicode in python3, you can also do in python2. We'd just implement backslashreplace (looks relatively easy) and be consistent, instead of muddying the water by logging arcane py2/3 upgrade recommendations.

@piskvorky (Owner) Mar 24, 2019

We'd just implement backslashreplace (looks relatively easy)

Googling turned up this as a possible backslashreplace backport:

import codecs

def backslashescape_backport(ex):
    # The error handler receives the UnicodeDecodeError, which carries the offending
    # byte string and the start/end indexes of the bad portion.
    bstr, start, end = ex.object, ex.start, ex.end

    # The return value is a tuple of the replacement unicode string and the index at
    # which to continue decoding.
    # Note: iterating a byte string yields int on 3.x but str on 2.x.
    return u''.join('\\x{:02x}'.format(c if isinstance(c, int) else ord(c)) for c in bstr[start:end]), end

codecs.register_error("backslashescape_backport", backslashescape_backport)

bad_term = u"Žluťoučký koníček".encode("latin2")  # create invalid UTF-8
print(repr(bad_term))
# b'\xaelu\xbbou\xe8k\xfd kon\xed\xe8ek'

print(repr(bad_term.decode("utf-8", "backslashescape_backport")))
# u'\\xaelu\\xbbou\\xe8k\\xfd kon\\xed\\xe8ek'
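(For reference: on Python 3.5+ the built-in 'backslashreplace' error handler also works for decoding, so a backport like the above is only needed on Python 2.)

print(repr(bad_term.decode("utf-8", "backslashreplace")))
# '\\xaelu\\xbbou\\xe8k\\xfd kon\\xed\\xe8ek'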

@mpenkov (Collaborator, Author) commented Mar 24, 2019

FB seems to operate on raw bytes, so they are able to ignore the problem altogether.
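(A hypothetical illustration, not Facebook's actual code: if vocabulary keys stay as raw bytes, there is no decode step at all, and therefore nothing to collide.)

vocab = {}
vocab[b'\xaelu\xbbou\xe8k\xfd kon\xed\xe8ek'] = 0  # invalid UTF-8, but a perfectly good bytes key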

@mpenkov merged commit bd199aa into piskvorky:develop on Apr 6, 2019
@mpenkov deleted the fix-unicode branch on Apr 6, 2019, 08:34