avoid collisions when decoding bad unicode #2411
Conversation
@piskvorky Please have a look at this when you have the chance. This is the last PR we need to merge before the next release (sorry I missed it in my earlier email).
How does FB itself deal with invalid utf8? IMO we should mimic that.
Or do they operate on raw bytes, instead of characters?
gensim/models/_fasttext_bin.py (outdated diff)
word = word_bytes.decode('latin1')
logger.error(
    'failed to decode invalid unicode bytes %r; falling back to latin1, using %r; '
    'consider upgrading to Python 3 for better Unicode handling (and so much more)',
Seems unrelated. If I understand correctly, the problem is with badly encoded strings in the model (FB's fault), or possibly our choice to accept and recover from such inputs, instead of failing explicitly.
There's nothing wrong with Python 2, and Python 3 won't fix the problem (just mask it in a different way). The recommendation that projects switch to a different Python major version because of some error in a FB model seems out of place.
Under Py2, you get a string of complete garbage. The term is essentially lost from the vocabulary. Under Py3, you get a mostly OK string with the broken characters escaped. The term remains in the vocabulary, and can contribute vectors to the model. See the unit tests for an example.
So under Py2, our handling of this edge case is worse than under Py3. We're not recommending people switch to Py3, we're just telling them that switching will give them better edge case handling. So, IMO the message is accurate.
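For illustration only (not gensim's actual loader code, and assuming the Python 3 path escapes the broken bytes with a backslashreplace-style handler, as described above):

# -*- coding: utf-8 -*-
bad_bytes = u"Žluťoučký".encode("latin2")   # latin2-encoded bytes, not valid UTF-8

# Python 3: decodable substrings survive and only the broken bytes are escaped:
#   bad_bytes.decode("utf-8", "backslashreplace")  ->  u'\\xaelu\\xbbou\\xe8k\\xfd'

# Python 2 fallback from the diff above: latin1 never fails, but maps every byte
# to some latin-1 character, so the term no longer resembles the original word:
#   bad_bytes.decode("latin1")  ->  u'®lu»ouèký'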
Both terms are garbage and lost from the vocabulary, I see no difference. No user query or input text will ever match them, whether it's latin1-garbage or backslash-escaped-garbage.
The only useful property might be substrings that still happen to be decoded correctly within that term (probably ~ASCII subset; true both for latin1 and backslash-escaped).
IMO the "pure", correct solution would be to "refuse to guess" and either:
- fail the model loading with an exception (see the sketch below), in case valid unicode (via utf8?) is the contract of the FT format, aka "not our problem"; or
- work with bytes instead of text, in case there's no decoding contract (which I hope is not the case, because that's a lot of work, fixing all the interfaces that expect text!).
Which approach does Facebook's reference implementation use?
Either way, if we want to go down the "impure" road of patching the models on-the-fly, the difference of py2 vs py3 is moot. Anything you can do with unicode in python3, you can also do in python2. We'd just implement backslashreplace (looks relatively easy) and be consistent, instead of muddying the waters by logging arcane py2/3 upgrade recommendations.
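For reference, a minimal sketch of the fail-fast option (names are illustrative; this is not the actual gensim loader):

def read_word_strict(word_bytes):
    # Treat valid UTF-8 as the contract of the FT binary format: decode
    # strictly and surface a clear error instead of guessing an encoding.
    try:
        return word_bytes.decode('utf-8')
    except UnicodeDecodeError:
        raise ValueError('vocabulary entry is not valid UTF-8: %r' % (word_bytes,))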
We'd just implement backslashreplace (looks relatively easy)
Googling turned up this as a possible backslashreplace backport:
import codecs

def backslashescape_backport(ex):
    # The error handler receives the UnicodeDecodeError, which carries the
    # original byte string and the start/end indexes of the bad portion.
    bstr, start, end = ex.object, ex.start, ex.end
    # The return value is a tuple of the replacement Unicode string and the
    # index at which to continue conversion.
    # Note: iterating byte strings yields ints on 3.x but str on 2.x.
    return u''.join('\\x{:02x}'.format(c if isinstance(c, int) else ord(c)) for c in bstr[start:end]), end

codecs.register_error("backslashescape_backport", backslashescape_backport)
bad_term = u"Žluťoučký koníček".encode("latin2")  # create invalid UTF8
print(repr(bad_term))
# b'\xaelu\xbbou\xe8k\xfd kon\xed\xe8ek'
print(repr(bad_term.decode("utf-8", "backslashescape_backport")))
# u'\\xaelu\\xbbou\\xe8k\\xfd kon\\xed\\xe8ek'
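For context, a hypothetical sketch of how a handler registered this way could replace the latin1 fallback in the loader (the function name and structure are illustrative, not the actual _fasttext_bin.py code):

import logging

logger = logging.getLogger(__name__)

def _decode_word(word_bytes):
    # Hypothetical fallback: try strict UTF-8 first, then escape the broken
    # bytes with the backport registered above instead of guessing latin1.
    try:
        return word_bytes.decode('utf-8')
    except UnicodeDecodeError:
        word = word_bytes.decode('utf-8', 'backslashescape_backport')
        logger.error('failed to decode invalid unicode bytes %r; using %r instead', word_bytes, word)
        return word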
FB seems to operate on raw bytes, so they are able to ignore the problem altogether.
Fix #2402