Python implementation returns 'utf-8' codec can't decode error #715
Comments
Hi @akakakakakaa, Thank you for reporting this issue! This is probably due to some invalid UTF-8 data that was not filtered out of the Common Crawl training data. We will try to fix this issue rapidly. Best,
Summary: The issue was reported here: #715

Now, we can replace the line:
```
words = f.get_words()
```
by
```
words = f.get_words(on_unicode_error='replace')
```
in bin_to_vec.py.

The behaviour is similar to Python's `decode` function: if there is an encoding issue, `strict` fails with an error, while `replace` silently replaces the malformed characters with the replacement character.

Reviewed By: EdouardGrave
Differential Revision: D14133996
fbshipit-source-id: 9c82fef69b6d5223e4e5d60516a53467d8786ffc
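To illustrate the `decode` semantics the commit refers to, here is a minimal, self-contained Python sketch (the byte sequence is an arbitrary example of invalid UTF-8, not taken from the actual training data):

```python
# b'\xed\xa0\x80' is a UTF-8-encoded lone surrogate, which a strict
# UTF-8 decoder rejects.
data = b'\xed\xa0\x80'

try:
    data.decode('utf-8')  # errors='strict' is the default
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xed in position 0: ...

# 'replace' substitutes the malformed bytes with U+FFFD instead of raising.
print(data.decode('utf-8', errors='replace'))
```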
Hi @akakakakakaa, With e13484b, you can now replace the line `words = f.get_words()` by `words = f.get_words(on_unicode_error='replace')` in bin_to_vec.py. Thank you for reporting the issue.
…r. (#967)

Summary: This [earlier commit](e13484b) fixed issue #715 by casting all strings to Python strings. However, this functionality was not added to getNN, and I was seeing the same error when querying nearest neighbors for the Japanese language. This commit simply adapts castToPythonString to the getNN function.

Pull Request resolved: #967
Reviewed By: EdouardGrave
Differential Revision: D19287807
Pulled By: Celebio
fbshipit-source-id: 31fb8b4d643848f3f22381ac06f2443eb70c0009
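For context, a hedged sketch of the call path this commit fixes, using the fasttext Python API (the model file name is just an example of a published pretrained model):

```python
import fasttext

# Load a pretrained vector model (example file name).
model = fasttext.load_model('cc.ja.300.bin')

# Before #967, querying neighbors of a non-ASCII word could raise
# UnicodeDecodeError, because the returned labels were not passed
# through castToPythonString; after the fix this returns cleanly.
print(model.get_nearest_neighbors('東京'))
```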
I can still reproduce this issue; here is a sample text that causes it:
(see previous comment) I was able to work around the issue by encoding the text, passing the surrogates through, and then transforming it back into a Python string:
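The commenter's snippet is not preserved in this thread; the following is a minimal sketch of the described round-trip, assuming the offending value is a Python str carrying lone surrogates (the function and variable names are hypothetical):

```python
def to_clean_python_string(text):
    # 'surrogatepass' lets lone surrogates through the UTF-8 encoder
    # instead of raising UnicodeEncodeError...
    raw = text.encode('utf-8', errors='surrogatepass')
    # ...and decoding with 'replace' turns those invalid byte sequences
    # into U+FFFD, yielding a valid Python string again.
    return raw.decode('utf-8', errors='replace')

# Example: a string containing a lone high surrogate.
print(to_clean_python_string('abc\ud800def'))  # -> 'abc\ufffd\ufffd\ufffddef'
```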
As you may notice, if the original text is not in English and the third try statement was triggered (…). In addition to that, and in a step prior to predicting, I am trying to "clean" the surrogate pairs (or, better said, the left halves of pairs) using this:
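Again, the original snippet is missing; here is a plausible sketch of such cleaning, assuming the goal is to strip lone high (left-half) surrogates before prediction (the regex range and names are assumptions):

```python
import re

# Lone high surrogates occupy U+D800..U+DBFF; a character class over
# that range matches them even though they are invalid in well-formed text.
_LONE_HIGH_SURROGATE = re.compile('[\ud800-\udbff]')

def strip_left_surrogates(text):
    # Drop the left halves of surrogate pairs that lost their partner.
    return _LONE_HIGH_SURROGATE.sub('', text)

print(strip_left_surrogates('abc\ud800def'))  # -> 'abcdef'
```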
Still, IMHO this is not a fix, since the encoded chars may affect the prediction results.
I used the pretrained Korean bin file 'cc.ko.300.bin'.
But when I test bin_to_vec.py, I get:
```
Traceback (most recent call last):
  File "bin_to_vec.py", line 30, in <module>
    words = f.get_words()
  File "/usr/local/lib/python3.5/dist-packages/fastText/FastText.py", line 170, in get_words
    pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
```
My Docker container's charset is C.UTF-8.
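For reference, a minimal reproduction along the lines of the bin_to_vec.py call in the traceback (the import follows the older `fastText` package name used in the report; the current package is `fasttext`):

```python
from fastText import load_model  # older package name, as in the traceback

f = load_model('cc.ko.300.bin')
words = f.get_words()  # raised UnicodeDecodeError before the fix;
                       # get_words(on_unicode_error='replace') avoids it
```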