Encoding UTF-8 error on Windows #308

hekl · 2019-08-08T10:38:05Z

I get this error when running my training file.

UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0x88 in position 11: No mapping for the Unicode character exists in the target code page.

I have produced a set of txt files and tsv files, both encoded as UTF-8, written like this:
with open(file, 'w+', encoding="utf-8"). I then packed both the txt and tsv files into a single archive, in zip or 7z format.
I also tried it with no explicit encoding and with UTF-16 encoding, and ran into the same encoding problems.
I am using Windows 10 and Python.


osma · 2019-08-08T10:48:24Z

Can you please show the file you've produced that gave you this error? The error appears to be early on in the file (position 11), so just the first few lines should be enough.

Also show the command you executed and the full traceback.

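0x88 is a continuation byte in UTF-8: it is only valid after a lead byte, so a decoder that meets it where a new character should start will reject it.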
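A quick way to see what sits at the failing position is to read the data as raw bytes and ask the decoder where it gives up. The following is only a sketch: the helper names `first_undecodable_offset` and `hex_head` are made up for illustration, and the sample bytes are invented rather than taken from the reporter's file.

```python
from typing import Optional


def first_undecodable_offset(data: bytes, encoding: str = "utf-8") -> Optional[int]:
    """Return the offset of the first byte that cannot be decoded, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first offending byte
        return err.start
    return None


def hex_head(data: bytes, count: int = 16) -> str:
    """Render the first `count` bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data[:count])


if __name__ == "__main__":
    # Invented sample: 11 ASCII bytes followed by a stray continuation byte.
    sample = b"hello world\x88"
    print(first_undecodable_offset(sample))
    print(hex_head(sample))
```

For a real file, the bytes could be obtained with `open(path, "rb").read(64)`. If the hex dump starts with `50 4b 03 04`, the file is a zip archive rather than text.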

hekl · 2019-08-08T11:14:57Z

`(annif-latest) D:\Programs\annif-latest>annif train tfidf-nl Annif-corpora\training\train-utf8.zip
creating vectorizer
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\Programs\annif-latest\Scripts\annif.exe\__main__.py", line 9, in <module>
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\flask\cli.py", line 586, in main
return super(FlaskGroup, self).main(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 717, in main
rv = self.invoke(ctx)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 555, in invoke
return callback(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\click\decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\flask\cli.py", line 426, in decorator
return __ctx.invoke(f, *args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 555, in invoke
return callback(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\annif\cli.py", line 154, in run_train
proj.train(documents)
File "d:\programs\annif-latest\lib\site-packages\annif\project.py", line 197, in train
self._create_vectorizer(corpus)
File "d:\programs\annif-latest\lib\site-packages\annif\project.py", line 186, in _create_vectorizer
self._vectorizer.fit((subj.text for subj in subjectcorpus.subjects))
File "d:\programs\annif-latest\lib\site-packages\annif\corpus\convert.py", line 19, in subjects
self._generate_corpus_from_documents()
File "d:\programs\annif-latest\lib\site-packages\annif\corpus\convert.py", line 49, in _generate_corpus_from_documents
for doc in self.documents:
File "d:\programs\annif-latest\lib\site-packages\annif\corpus\document.py", line 65, in documents
for line in tsvfile:
File "C:\Program Files\Python37\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0x88 in position 11: No mapping for the Unicode character exists in the target code page.
train-utf8.zip

`

osma · 2019-08-08T14:24:56Z

Ah, now I understand. You cannot train Annif directly from a zip file. (Please open an issue if you think such a feature would be useful!)

You should instead give Annif the path to a directory with these .txt and .tsv files, like this:

annif train tfidf-nl Annif-corpora\training\train-utf8\
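
If the corpus only exists as a zip archive, it can be unpacked into a directory first. The sketch below uses Python's standard `zipfile` module. The function name `extract_corpus` and the paths in the usage note are illustrative assumptions, and 7z archives are not covered because `zipfile` cannot read them.

```python
import zipfile
from pathlib import Path


def extract_corpus(archive_path, target_dir):
    """Unpack a zip archive and list the .txt/.tsv files it contained."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    if not zipfile.is_zipfile(archive_path):
        raise ValueError(f"{archive_path} is not a zip archive")
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # Report the corpus files so an empty or mis-packed archive is easy to spot
    return sorted(
        p.name for p in target.rglob("*") if p.suffix in {".txt", ".tsv"}
    )
```

Calling `extract_corpus("train-utf8.zip", "train-utf8")` (hypothetical paths) would leave a `train-utf8` directory that can then be passed to `annif train` as shown above.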

hekl · 2019-08-09T14:49:09Z

That seems to work, but I get another error (on Windows, but the same on Ubuntu):

ValueError: empty vocabulary; perhaps the documents only contain stop words

Of course they don't contain only stopwords. I did not define any stopword list.

Here is the traceback:

(annif-latest) user@computer:~/Programs/annif-latest$ bin/annif train tfidf-nl data/Annif-corpora/training/trainingset/
creating vectorizer
Traceback (most recent call last):
File "bin/annif", line 11, in <module>
sys.exit(cli())
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/flask/cli.py", line 586, in main
return super(FlaskGroup, self).main(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/flask/cli.py", line 426, in decorator
return __ctx.invoke(f, *args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/annif/cli.py", line 154, in run_train
proj.train(documents)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/annif/project.py", line 197, in train
self._create_vectorizer(corpus)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/annif/project.py", line 186, in _create_vectorizer
self._vectorizer.fit((subj.text for subj in subjectcorpus.subjects))
File "/home/Programs/annif-latest/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 1631, in fit
X = super().fit_transform(raw_documents)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 989, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

osma · 2019-08-12T06:28:03Z

Some possible reasons:

1. you are using .key files with subject labels only (Simple subject file format), and hit issue #309 ("Training from .key files with only subject labels fails"), which was fixed very recently on the master branch. Though this seems unlikely, since you had .tsv files with URIs in your zip archive.
2. you didn't load the vocabulary first with the loadvoc command
3. the URIs in your .tsv files need to be specified with brackets, i.e. <http://taxonomie.cbs.nl/vocab/?tema=4812> instead of just http://taxonomie.cbs.nl/vocab/?tema=4812

I think number 3 is the most likely explanation here. I've edited the wiki section on the Extended subject file format to emphasize that you need the brackets. (It might make sense for Annif TSV parsing to be more flexible here and also accept plain URIs without brackets. Please open an issue if you think so.)
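
When the .tsv files are generated by a script, the brackets can be added at write time. This is a minimal sketch, assuming each line holds the URI in the first tab-separated column followed by a label. That column layout, the helper names `bracket_uri` and `fix_subject_line`, and the label "Voorbeeld" are illustrative assumptions, not taken from Annif's code.

```python
def bracket_uri(uri: str) -> str:
    """Wrap a URI in angle brackets unless it already has them."""
    uri = uri.strip()
    if uri.startswith("<") and uri.endswith(">"):
        return uri
    return f"<{uri}>"


def fix_subject_line(line: str) -> str:
    """Bracket the URI in the first tab-separated column of one line."""
    # Assumed layout: URI, a tab, then the rest of the line (e.g. a label)
    uri, sep, rest = line.rstrip("\n").partition("\t")
    return bracket_uri(uri) + sep + rest


if __name__ == "__main__":
    # "Voorbeeld" is a placeholder label
    print(fix_subject_line("http://taxonomie.cbs.nl/vocab/?tema=4812\tVoorbeeld"))
```

Lines that are already bracketed pass through unchanged, so the function can be run over existing files without doubling the brackets.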

PS. The annif-users forum might be a better place for sorting out problems like this, when it's not clear that there's a bug in the software - though I understand that it can sometimes be hard to tell.

hekl · 2019-08-13T12:58:28Z

Thanks for your patience. Yes, number three it must be. Tend to forget that when my programme produces the output. I did not know about the users forum. Will use it.

osma added the question label Aug 8, 2019

hekl closed this as completed Aug 13, 2019
