
Python Implementation return 'utf-8' codec can't decode error #715

Closed
akakakakakaa opened this issue Jan 6, 2019 · 4 comments

Labels
bug Python related to python bindings

Comments

@akakakakakaa

akakakakakaa commented Jan 6, 2019

I used the pretrained Korean bin file 'cc.ko.300.bin'.

But when I run bin_to_vec.py, I get:

Traceback (most recent call last):
  File "bin_to_vec.py", line 30, in <module>
    words = f.get_words()
  File "/usr/local/lib/python3.5/dist-packages/fastText/FastText.py", line 170, in get_words
    pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

My Docker container's charset is C.UTF-8.
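
For reference, the failing call reduces to roughly this (a sketch of my own, assuming the old `fastText` package name from the traceback and the publicly distributed cc.ko.300.bin):

```python
# Sketch only: mirrors the get_words() call that bin_to_vec.py makes.
from fastText import load_model

f = load_model("cc.ko.300.bin")
words = f.get_words()  # raises UnicodeDecodeError on this model
```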

@EdouardGrave
Contributor

Hi @akakakakakaa,

Thank you for reporting this issue! This is probably due to some invalid utf-8 data that was not filtered out of the common crawl training data. We will try to fix this issue rapidly.

Best,
Edouard.

@EdouardGrave EdouardGrave added bug Python related to python bindings labels Jan 15, 2019
facebook-github-bot pushed a commit that referenced this issue Feb 21, 2019
Summary:
The issue was reported here : #715
Now, we can replace the line:
```
words = f.get_words()
```
by
```
words = f.get_words(on_unicode_error='replace')
```
in bin_to_vec.py

The behaviour is similar to Python's `decode` function: if there is an encoding issue, `strict` fails with an error, while `replace` silently replaces the malformed characters with the replacement character.

Reviewed By: EdouardGrave

Differential Revision: D14133996

fbshipit-source-id: 9c82fef69b6d5223e4e5d60516a53467d8786ffc
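
For comparison, this mirrors the behaviour of Python's built-in decode error handlers on raw bytes (standard-library behaviour, shown here for illustration):

```python
# The on_unicode_error values map onto Python's own decode error handlers.
data = b"\xed\xa0\x80abc"  # invalid UTF-8 (an encoded lone surrogate)

data.decode("utf-8", errors="replace")  # malformed bytes become U+FFFD, then 'abc'
data.decode("utf-8", errors="ignore")   # malformed bytes are dropped -> 'abc'
data.decode("utf-8", errors="strict")   # raises UnicodeDecodeError
```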
@Celebio
Member

Celebio commented Apr 17, 2019

Hi @akakakakakaa,
It looks like surrogate-pair characters can be treated as valid utf-8 by the fastText binary and by fastText's python bindings under Python 2.

With e13484b, you can now replace the line:

words = f.get_words()

by

words = f.get_words(on_unicode_error='replace')

in bin_to_vec.py. You can also use `ignore`; by default it is set to `strict`.
That should unblock you.

Thank you for reporting the issue.
Best regards,
Onur
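
A quick way to verify the fix against the model from the original report (a sketch; assumes a build that includes e13484b and the cc.ko.300.bin file):

```python
from fastText import load_model

f = load_model("cc.ko.300.bin")

# 'replace' substitutes malformed byte sequences with U+FFFD instead of raising;
# 'ignore' drops them; the default 'strict' raises UnicodeDecodeError.
words = f.get_words(on_unicode_error='replace')
print(len(words))
```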

@Celebio Celebio closed this as completed Apr 17, 2019
facebook-github-bot pushed a commit that referenced this issue Mar 25, 2020
…r. (#967)

Summary:
This [earlier commit](e13484b) fixed issue #715 by casting all strings to Python strings. However, this functionality was not added to getNN, and I was seeing the same error when querying nearest neighbors for the Japanese language. This commit simply adapts castToPythonString to the getNN function.
Pull Request resolved: #967

Reviewed By: EdouardGrave

Differential Revision: D19287807

Pulled By: Celebio

fbshipit-source-id: 31fb8b4d643848f3f22381ac06f2443eb70c0009
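
For illustration, the kind of call this commit fixes (a sketch; the model file and query word are placeholders, using the current `fasttext` module name):

```python
import fasttext

model = fasttext.load_model("cc.ja.300.bin")  # placeholder Japanese model

# Before this change, non-ASCII neighbour strings returned by getNN could hit
# the same UnicodeDecodeError when converted back to Python strings.
for score, word in model.get_nearest_neighbors("猫", k=5):
    print(score, word)
```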
adrianeboyd pushed a commit to adrianeboyd/fastText that referenced this issue Aug 4, 2021
@ahmadbass3l

I can still reproduce this issue; here is a sample text that causes it:
text = "\tThese photos!! 😲\ude32 \t\t Woman's Day "
and here are the results:

>>> model.predict(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fasttext/FastText.py", line 226, in predict
    predictions = self.f.predict(text, k, threshold, on_unicode_error)
TypeError: predict(): incompatible function arguments. The following argument types are supported:
    1. (self: fasttext_pybind.fasttext, arg0: str, arg1: int, arg2: float, arg3: str) -> List[Tuple[float, str]]

Invoked with: <fasttext_pybind.fasttext object at 0x7f67a90ee330>, "\tThese photos!! 😲\ude32 \t\t       Woman's Day        \n", 1, 0.0, 'strict'
>>> model.predict(text, on_unicode_error='ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fasttext/FastText.py", line 226, in predict
    predictions = self.f.predict(text, k, threshold, on_unicode_error)
TypeError: predict(): incompatible function arguments. The following argument types are supported:
    1. (self: fasttext_pybind.fasttext, arg0: str, arg1: int, arg2: float, arg3: str) -> List[Tuple[float, str]]

Invoked with: <fasttext_pybind.fasttext object at 0x7f67a90ee330>, "\tThese photos!! 😲\ude32 \t\t       Woman's Day        \n", 1, 0.0, 'ignore' 

Using replace instead of ignore causes the same issue as well.
The character causing this in the text above is '\ude32', which is not a valid surrogate (it is missing the other half of the pair):

>>> import surrogates
>>> surrogates.decode('\ude32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/.local/lib/python3.10/site-packages/surrogates/__init__.py", line 73, in decode
    raise DecodeError(_MESSAGE_LOW_SURROGATE_FIRST
surrogates.DecodeError: Low surrogate U+DE32 not preceded by a high surrogate
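
Note that plain CPython also refuses to encode such a lone surrogate, which is presumably why the binding layer reports "incompatible function arguments" instead of passing the string through (my interpretation of the traceback, not confirmed in the thread):

>>> '\ude32'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ude32' in position 0: surrogates not allowed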

@ahmadbass3l

ahmadbass3l commented Aug 24, 2022

(see previous comment)

I was able to work around the issue by encoding the text with surrogatepass and then turning it back into a Python string:

        try:
            prediction_results = model.predict(text)
        except TypeError as error:
            try:
                # Drop lone surrogates: surrogatepass lets them be encoded,
                # then decoding with 'ignore' removes the invalid bytes.
                text = text.encode('utf-8', 'surrogatepass').decode('utf-8', 'ignore')
                prediction_results = model.predict(text)
            except (TypeError, UnicodeError, UnicodeEncodeError, UnicodeDecodeError) as error:
                try:
                    # Last resort: keep the escaped raw bytes as a plain ASCII string.
                    text = str(text.encode('utf-8', 'surrogatepass'))
                    prediction_results = model.predict(text)
                except (TypeError, UnicodeError, UnicodeEncodeError, UnicodeDecodeError) as error:
                    ...

As you may notice, if the original text is not in English and the third try statement is triggered (text = str(text.encode('utf-8', 'surrogatepass'))), we end up with a mix of the original language and escaped byte sequences looking like \xc3\xb0\xc5\xb8\xe2\x80\x99\xc2\xaf\xed\xb2\xaf, which fastText will evaluate as English, so prediction quality suffers under these circumstances.

In addition to that, in a step before predicting, I am trying to "clean" the surrogate pairs (or rather, stray halves of pairs) using this:

import re

INVALID_UNICODE_CHARS = ['\uddf2', '\ude41', '\udd2b', '\udded', '\udd27', '\ude22', '\ude1a', '\ude0f', '\udc97',
                         '\ude95', '\uddf7', '\ude11', '\udc47', '\udea2', '\udf69', '\udd24', '\udff9', '\udc95',
                         '\xa0\ude4c', '\udca6', '\udff3', '\ude10', '\ude39', '\ude15', '\uded1', '\udf89', '\udd11',
                         '\udd12', '\ud83c', '\udd28', '\udc4a', '\ude28', '\udd90', '\uddfa', '\udd20', '\udfc1',
                         '\udc51', '\udd14', '\udd15', '\udd29', '\uddf8', '\udc4f', '\udd54', '\udd37', '\udc93',
                         '\udcf7', '\udd8b', '\udd23', '\udc8d', '\ude12', '\udc96', '\ude1f', '\udde3', '\udd71',
                         '\udd2c', '\udd8d', '\udd21', '\udc69', '\ude1e', '\udc81', '\ude02', '\udf33', '\udfe2',
                         '\ude23', '\udd95', '\ude00', '\udf7f', '\udc80', '\udca8', '\udffd', '\udd18', '\udd81',
                         '\ude09', '\udeab', '\udd4a', '\ude2d', '\ude2b', '\udd38', '\udd2a', '\ude0e', '\ude21',
                         '\udc46', '\ude29', '\udffb', '\udc4b', '\ude4a', '\udf4f', '\udc49', '\udc41', '\udc3e',
                         '\udf08', '\udf8a', '\udc4d', '\ude1c', '\udc8b', '\udc9e', '\udc42', '\ude2f', '\ude48',
                         '\udf39', '\udd42', '\udc83', '\ude37', '\udcaf', '\udc0b', '\udd1d', '\ude0d', '\udd25',
                         '\ude06', '\udc43', '\udca3', '\udc36', '\ude05', '\udd94', '\ude0a', '\ude03', '\udf99',
                         '\ude42', '\ude08', '\udc48', '\ude35', '\udd19', '\ude16', '\ude1b', '\udc13', '\udc7b',
                         '\udec5', '\udffe', '\udd80', '\udd26', '\uddd0', '\udd2d', '\ude24', '\udfff', '\ud83d',
                         '\ude01', '\ude2c', '\udc40', '\udca9', '\ude18', '\ude43', '\ude2a', '\udffc', '\ude2e',
                         '\ude44', '\ud83e', '\udd2e', '\ude20', '\ude14', '\udc4c', '\ude32']
replace_by_blank_symbols = re.compile("|".join(INVALID_UNICODE_CHARS))
text = replace_by_blank_symbols.sub('', text)  

Still, IMHO this is not a fix, since the encoded characters may still affect the prediction results.
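
A more general variant of the same cleaning step (my suggestion, not from the thread) is to strip the whole surrogate code-point range rather than maintaining an explicit list:

```python
import re

# Any lone surrogate code point (U+D800-U+DFFF); Python 3 strings can contain
# them even though they cannot be encoded as UTF-8.
SURROGATE_RE = re.compile('[\ud800-\udfff]')

def strip_surrogates(text: str) -> str:
    return SURROGATE_RE.sub('', text)

# strip_surrogates("These photos!! 😲\ude32") -> "These photos!! 😲"
```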
