
Python Implementation return 'utf-8' codec can't decode error #715

Closed
akakakakakaa opened this issue Jan 6, 2019 · 4 comments

Labels
bug Python related to python bindings

Comments

@akakakakakaa

akakakakakaa commented Jan 6, 2019

I used the pretrained Korean bin file 'cc.ko.300.bin'.

But when I run bin_to_vec.py, I get:

Traceback (most recent call last):
  File "bin_to_vec.py", line 30, in <module>
    words = f.get_words()
  File "/usr/local/lib/python3.5/dist-packages/fastText/FastText.py", line 170, in get_words
    pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

My Docker container's charset is C.UTF-8.
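
For reference, the failing call reduces to roughly this (a sketch of my own, assuming the old `fastText` package name from the traceback and the publicly distributed cc.ko.300.bin):

```python
# Sketch only: mirrors the get_words() call that bin_to_vec.py makes.
from fastText import load_model

f = load_model("cc.ko.300.bin")
words = f.get_words()  # raises UnicodeDecodeError on this model
```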

@EdouardGrave
Contributor

Hi @akakakakakaa,

Thank you for reporting this issue! This is probably due to some invalid utf-8 data that was not filtered out of the common crawl training data. We will try to fix this issue rapidly.

Best,
Edouard.

@EdouardGrave EdouardGrave added bug Python related to python bindings labels Jan 15, 2019
facebook-github-bot pushed a commit that referenced this issue Feb 21, 2019
Summary:
The issue was reported here : #715
Now, we can replace the line:
```
words = f.get_words()
```
by
```
words = f.get_words(on_unicode_error='replace')
```
in bin_to_vec.py

The behaviour is similar to Python's `decode` function: if there is an encoding issue, `strict` fails with an error, while `replace` silently replaces the malformed characters with the replacement character.

Reviewed By: EdouardGrave

Differential Revision: D14133996

fbshipit-source-id: 9c82fef69b6d5223e4e5d60516a53467d8786ffc
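
For comparison, this mirrors the behaviour of Python's built-in decode error handlers on raw bytes (standard-library behaviour, shown here for illustration):

```python
# The on_unicode_error values map onto Python's own decode error handlers.
data = b"\xed\xa0\x80abc"  # invalid UTF-8 (an encoded lone surrogate)

data.decode("utf-8", errors="replace")  # malformed bytes become U+FFFD, then 'abc'
data.decode("utf-8", errors="ignore")   # malformed bytes are dropped -> 'abc'
data.decode("utf-8", errors="strict")   # raises UnicodeDecodeError
```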
@Celebio
Member

Celebio commented Apr 17, 2019

Hi @akakakakakaa,
It looks like surrogate-pair characters can be treated as valid utf-8 by the fastText binary and by fastText's python bindings under Python 2.

With e13484b, you can now replace the line:

words = f.get_words()

by

words = f.get_words(on_unicode_error='replace')

in bin_to_vec.py. You can also use `ignore`; by default it is set to `strict`.
That should unblock you.

Thank you for reporting the issue.
Best regards,
Onur
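
A quick way to verify the fix against the model from the original report (a sketch; assumes a build that includes e13484b and the cc.ko.300.bin file):

```python
from fastText import load_model

f = load_model("cc.ko.300.bin")

# 'replace' substitutes malformed byte sequences with U+FFFD instead of raising;
# 'ignore' drops them; the default 'strict' raises UnicodeDecodeError.
words = f.get_words(on_unicode_error='replace')
print(len(words))
```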

@Celebio Celebio closed this as completed Apr 17, 2019
facebook-github-bot pushed a commit that referenced this issue Mar 25, 2020
…r. (#967)

Summary:
This [earlier commit](e13484b) fixed issue #715 by casting all strings to Python strings. However, this functionality was not added to getNN, and I was seeing the same error when querying nearest neighbors for the Japanese language. This commit simply adapts castToPythonString to the getNN function.
Pull Request resolved: #967

Reviewed By: EdouardGrave

Differential Revision: D19287807

Pulled By: Celebio

fbshipit-source-id: 31fb8b4d643848f3f22381ac06f2443eb70c0009
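
For illustration, the kind of call this commit fixes (a sketch; the model file and query word are placeholders, using the current `fasttext` module name):

```python
import fasttext

model = fasttext.load_model("cc.ja.300.bin")  # placeholder Japanese model

# Before this change, non-ASCII neighbour strings returned by getNN could hit
# the same UnicodeDecodeError when converted back to Python strings.
for score, word in model.get_nearest_neighbors("猫", k=5):
    print(score, word)
```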
adrianeboyd pushed a commit to adrianeboyd/fastText that referenced this issue Aug 4, 2021
@ahmadbass3l

I can still reproduce this issue; here is a sample text that causes it:
text = "\tThese photos!! 😲\ude32 \t\t Woman's Day "
and here are the results:

>>> model.predict(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fasttext/FastText.py", line 226, in predict
    predictions = self.f.predict(text, k, threshold, on_unicode_error)
TypeError: predict(): incompatible function arguments. The following argument types are supported:
    1. (self: fasttext_pybind.fasttext, arg0: str, arg1: int, arg2: float, arg3: str) -> List[Tuple[float, str]]

Invoked with: <fasttext_pybind.fasttext object at 0x7f67a90ee330>, "\tThese photos!! 😲\ude32 \t\t       Woman's Day        \n", 1, 0.0, 'strict'
>>> model.predict(text, on_unicode_error='ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/fasttext/FastText.py", line 226, in predict
    predictions = self.f.predict(text, k, threshold, on_unicode_error)
TypeError: predict(): incompatible function arguments. The following argument types are supported:
    1. (self: fasttext_pybind.fasttext, arg0: str, arg1: int, arg2: float, arg3: str) -> List[Tuple[float, str]]

Invoked with: <fasttext_pybind.fasttext object at 0x7f67a90ee330>, "\tThese photos!! 😲\ude32 \t\t       Woman's Day        \n", 1, 0.0, 'ignore' 

Using replace instead of ignore causes the same issue as well.
The character causing this in the text above is '\ude32', which is not a valid surrogate (it is missing the other half of the pair):

>>> import surrogates
>>> surrogates.decode('\ude32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/.local/lib/python3.10/site-packages/surrogates/__init__.py", line 73, in decode
    raise DecodeError(_MESSAGE_LOW_SURROGATE_FIRST
surrogates.DecodeError: Low surrogate U+DE32 not preceded by a high surrogate
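
Note that plain CPython also refuses to encode such a lone surrogate, which is presumably why the binding layer reports "incompatible function arguments" instead of passing the string through (my interpretation of the traceback, not confirmed in the thread):

>>> '\ude32'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ude32' in position 0: surrogates not allowed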

@ahmadbass3l

ahmadbass3l commented Aug 24, 2022

(see previous comment)

I was able to work around the issue by encoding the text with surrogatepass and then turning it back into a Python string:

        try:
            prediction_results = model.predict(text)
        except TypeError as error:
            try:
                # Drop lone surrogates: surrogatepass lets them be encoded,
                # then decoding with 'ignore' removes the invalid bytes.
                text = text.encode('utf-8', 'surrogatepass').decode('utf-8', 'ignore')
                prediction_results = model.predict(text)
            except (TypeError, UnicodeError, UnicodeEncodeError, UnicodeDecodeError) as error:
                try:
                    # Last resort: keep the escaped raw bytes as a plain ASCII string.
                    text = str(text.encode('utf-8', 'surrogatepass'))
                    prediction_results = model.predict(text)
                except (TypeError, UnicodeError, UnicodeEncodeError, UnicodeDecodeError) as error:
                    ...

As you may notice, if the original text is not in English and the third try statement is triggered (text = str(text.encode('utf-8', 'surrogatepass'))), we end up with a mix of the original language and escaped byte sequences looking like \xc3\xb0\xc5\xb8\xe2\x80\x99\xc2\xaf\xed\xb2\xaf, which fastText will evaluate as English, so prediction quality suffers under these circumstances.

In addition to that, in a step before predicting, I am trying to "clean" the surrogate pairs (or rather, stray halves of pairs) using this:

import re

INVALID_UNICODE_CHARS = ['\uddf2', '\ude41', '\udd2b', '\udded', '\udd27', '\ude22', '\ude1a', '\ude0f', '\udc97',
                         '\ude95', '\uddf7', '\ude11', '\udc47', '\udea2', '\udf69', '\udd24', '\udff9', '\udc95',
                         '\xa0\ude4c', '\udca6', '\udff3', '\ude10', '\ude39', '\ude15', '\uded1', '\udf89', '\udd11',
                         '\udd12', '\ud83c', '\udd28', '\udc4a', '\ude28', '\udd90', '\uddfa', '\udd20', '\udfc1',
                         '\udc51', '\udd14', '\udd15', '\udd29', '\uddf8', '\udc4f', '\udd54', '\udd37', '\udc93',
                         '\udcf7', '\udd8b', '\udd23', '\udc8d', '\ude12', '\udc96', '\ude1f', '\udde3', '\udd71',
                         '\udd2c', '\udd8d', '\udd21', '\udc69', '\ude1e', '\udc81', '\ude02', '\udf33', '\udfe2',
                         '\ude23', '\udd95', '\ude00', '\udf7f', '\udc80', '\udca8', '\udffd', '\udd18', '\udd81',
                         '\ude09', '\udeab', '\udd4a', '\ude2d', '\ude2b', '\udd38', '\udd2a', '\ude0e', '\ude21',
                         '\udc46', '\ude29', '\udffb', '\udc4b', '\ude4a', '\udf4f', '\udc49', '\udc41', '\udc3e',
                         '\udf08', '\udf8a', '\udc4d', '\ude1c', '\udc8b', '\udc9e', '\udc42', '\ude2f', '\ude48',
                         '\udf39', '\udd42', '\udc83', '\ude37', '\udcaf', '\udc0b', '\udd1d', '\ude0d', '\udd25',
                         '\ude06', '\udc43', '\udca3', '\udc36', '\ude05', '\udd94', '\ude0a', '\ude03', '\udf99',
                         '\ude42', '\ude08', '\udc48', '\ude35', '\udd19', '\ude16', '\ude1b', '\udc13', '\udc7b',
                         '\udec5', '\udffe', '\udd80', '\udd26', '\uddd0', '\udd2d', '\ude24', '\udfff', '\ud83d',
                         '\ude01', '\ude2c', '\udc40', '\udca9', '\ude18', '\ude43', '\ude2a', '\udffc', '\ude2e',
                         '\ude44', '\ud83e', '\udd2e', '\ude20', '\ude14', '\udc4c', '\ude32']
replace_by_blank_symbols = re.compile("|".join(INVALID_UNICODE_CHARS))
text = replace_by_blank_symbols.sub('', text)  

Still, IMHO this is not a fix, since the encoded characters may still affect the prediction results.
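
A more general variant of the same cleaning step (my suggestion, not from the thread) is to strip the whole surrogate code-point range rather than maintaining an explicit list:

```python
import re

# Any lone surrogate code point (U+D800-U+DFFF); Python 3 strings can contain
# them even though they cannot be encoded as UTF-8.
SURROGATE_RE = re.compile('[\ud800-\udfff]')

def strip_surrogates(text: str) -> str:
    return SURROGATE_RE.sub('', text)

# strip_surrogates("These photos!! 😲\ude32") -> "These photos!! 😲"
```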
