Encoding UTF-8 error on Windows #308
Comments
Can you please show the file you've produced that gave you this error? The error appears to be early on in the file (position 11), so just the first few lines should be enough. Also show the command you executed and the full traceback. 0x88 typically appears as a continuation byte in UTF-8 multi-byte sequences.
`(annif-latest) D:\Programs\annif-latest>annif train tfidf-nl Annif-corpora\training\train-utf8.zip`
Ah, now I understand. You cannot train Annif directly from a zip file. (Please open an issue if you think such a feature would be useful!) You should instead give Annif the path to a directory with these .txt and .tsv files, like this: `annif train tfidf-nl Annif-corpora\training\train-utf8\`
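If the corpus only exists as a zip archive, it can be unpacked into a directory first. Below is a minimal sketch using Python's standard `zipfile` module; the function name `extract_corpus` and the paths in the usage note are illustrative and not part of Annif.

```python
import zipfile
from pathlib import Path


def extract_corpus(zip_path, target_dir):
    """Unpack a zipped corpus into target_dir and return that directory.

    extract_corpus is a hypothetical helper, not an Annif function.
    """
    target = Path(target_dir)
    # Create the target directory (and any parents) if it is missing.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `extract_corpus(r"Annif-corpora\training\train-utf8.zip", r"Annif-corpora\training\train-utf8")` would produce a directory whose path can then be passed to `annif train`.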
That seems to work, but I get another error (on Windows, but the same on Ubuntu):
Of course they don't contain only stopwords. I did not define any stopword list. Here is the traceback:
Some possible reasons:
I think number 3 is the most likely explanation here. I've edited the wiki section on the Extended subject file format to emphasize that you need the brackets. (It might make sense for Annif TSV parsing to be more flexible here and also accept plain URIs without brackets. Please open an issue if you feel so.) PS. The annif-users forum might be a better place for sorting out problems like this, when it's not clear that there's a bug in the software, though I understand that it can sometimes be hard to tell.
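When the .tsv files are generated by a script, the brackets can be added at write time. The sketch below assumes that "brackets" means angle brackets around the URI in the first tab-separated column (as in `<http://example.org/s1>`); the helper name `bracket_uri` is made up for illustration, so check the wiki page for the exact format before relying on it.

```python
def bracket_uri(line):
    """Return a TSV line whose first column (the URI) is wrapped in <...>.

    Assumption: the URI is the first tab-separated field of the line.
    """
    uri, sep, rest = line.rstrip("\n").partition("\t")
    uri = uri.strip()
    if not uri:
        # Leave blank lines or lines without a URI untouched.
        return line.rstrip("\n")
    if not (uri.startswith("<") and uri.endswith(">")):
        uri = f"<{uri}>"
    return uri + sep + rest
```

Lines that already carry the brackets pass through unchanged, so the function is safe to run over files in mixed states.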
Thanks for your patience. Yes, number three it must be. Tend to forget that when my programme produces the output. I did not know about the users forum. Will use it.
I get this error when running my training file.
UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0x88 in position 11: No mapping for the Unicode character exists in the target code page.
I have produced a set of txt files and tsv files, both encoded with utf-8, like this:
`with open(file, 'w+', encoding="utf-8")`
I have produced one archive file from both the txt and tsv files in zip or 7z format. I also tried it with no explicit encoding and with utf-16 encoding, running into the same encoding problems.
I am using Windows 10 and Python.
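To rule out the text files themselves as the source of a decode error, each one can be checked for valid UTF-8 before training. This is a minimal sketch; the function name `find_non_utf8` is hypothetical, and it only looks at `.txt` and `.tsv` files inside a directory.

```python
from pathlib import Path


def find_non_utf8(directory):
    """Return (path, byte_offset) pairs for .txt/.tsv files that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".txt", ".tsv"}:
            continue
        try:
            # Decode the raw bytes strictly; any invalid sequence raises.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte.
            bad.append((path, err.start))
    return bad
```

An empty result means every checked file decodes cleanly, which points the investigation away from the files and toward how they are being passed to the tool.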