Encoding UTF-8 error on Windows #308

Closed
hekl opened this issue Aug 8, 2019 · 6 comments

hekl commented Aug 8, 2019

I get this error when running my training file.

UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0x88 in position 11: No mapping for the Unicode character exists in the target code page.

I have produced a set of .txt files and .tsv files, both encoded as UTF-8, like this:
with open(file, 'w+', encoding="utf-8"). I then packed the .txt and .tsv files into a single archive, in zip or 7z format.
I also tried it with no explicit encoding and with UTF-16 encoding, and ran into the same encoding problems.
I am using Windows 10 and Python.

@osma osma added the question label Aug 8, 2019
Member

osma commented Aug 8, 2019

Can you please show the file you've produced that gave you this error? The error appears to be early on in the file (position 11), so just the first few lines should be enough.

Also show the command you executed and the full traceback.

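Byte 0x88 can only appear in UTF-8 as a continuation byte inside a multi-byte sequence, never on its own, so the data at that position may not be UTF-8 text at all.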
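
A minimal Python sketch of what the decoder is objecting to. The helper names and the sample character (U+2208) are chosen only for illustration and are not taken from the reporter's data:

```python
def is_continuation_byte(byte: int) -> bool:
    """Return True if the byte has the UTF-8 continuation form 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+2208 (ELEMENT OF) is encoded as the three bytes E2 88 88:
# one lead byte followed by two continuation bytes.
encoded = "\u2208".encode("utf-8")

print(is_continuation_byte(0x88))   # 0x88 is 10001000 in binary
print(decodes_as_utf8(encoded))     # 0x88 after a lead byte is fine
print(decodes_as_utf8(b"\x88"))     # 0x88 on its own is rejected
```

If a file that is supposed to be UTF-8 text fails this way within the first dozen bytes, it is worth checking whether the program is reading a binary file instead.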

Author

hekl commented Aug 8, 2019

`(annif-latest) D:\Programs\annif-latest>annif train tfidf-nl Annif-corpora\training\train-utf8.zip
creating vectorizer
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\Programs\annif-latest\Scripts\annif.exe\__main__.py", line 9, in <module>
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\flask\cli.py", line 586, in main
return super(FlaskGroup, self).main(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 717, in main
rv = self.invoke(ctx)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 555, in invoke
return callback(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\click\decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\flask\cli.py", line 426, in decorator
return __ctx.invoke(f, *args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\click\core.py", line 555, in invoke
return callback(*args, **kwargs)
File "d:\programs\annif-latest\lib\site-packages\annif\cli.py", line 154, in run_train
proj.train(documents)
File "d:\programs\annif-latest\lib\site-packages\annif\project.py", line 197, in train
self._create_vectorizer(corpus)
File "d:\programs\annif-latest\lib\site-packages\annif\project.py", line 186, in _create_vectorizer
self._vectorizer.fit((subj.text for subj in subjectcorpus.subjects))
File "d:\programs\annif-latest\lib\site-packages\annif\corpus\convert.py", line 19, in subjects
self._generate_corpus_from_documents()
File "d:\programs\annif-latest\lib\site-packages\annif\corpus\convert.py", line 49, in _generate_corpus_from_documents
for doc in self.documents:
File "d:\programs\annif-latest\lib\site-packages\annif\corpus\document.py", line 65, in documents
for line in tsvfile:
File "C:\Program Files\Python37\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0x88 in position 11: No mapping for the Unicode character exists in the target code page.
train-utf8.zip

`

Member

osma commented Aug 8, 2019

Ah, now I understand. You cannot train Annif directly from a zip file. (Please open an issue if you think such a feature would be useful!)

You should instead give Annif the path to a directory with these .txt and .tsv files, like this:

annif train tfidf-nl Annif-corpora\training\train-utf8\
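
If the corpus only exists as a zip archive, it can be unpacked into a directory before training. The following is a rough sketch using Python's standard `zipfile` module; `ensure_corpus_directory` is a hypothetical helper, not part of Annif, and it does not handle 7z archives:

```python
import zipfile
from pathlib import Path


def ensure_corpus_directory(path: str) -> Path:
    """Return a directory path; if `path` is a zip archive, extract it first.

    Hypothetical helper for illustration only, not an Annif function.
    """
    source = Path(path)
    if source.is_file() and zipfile.is_zipfile(source):
        # train-utf8.zip is extracted next to itself as train-utf8/
        target = source.with_suffix("")
        with zipfile.ZipFile(source) as archive:
            archive.extractall(target)
        return target
    # Already a directory (or not a zip file): pass it through unchanged.
    return source
```

The returned directory is what would then be passed to `annif train` in place of the archive.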

Author

hekl commented Aug 9, 2019

That seems to work, but I get another error (on Windows, but the same on Ubuntu):

ValueError: empty vocabulary; perhaps the documents only contain stop words

Of course they don't contain only stopwords. I did not define any stopword list.

Here is the traceback:

(annif-latest) user@computer:~/Programs/annif-latest$ bin/annif train tfidf-nl data/Annif-corpora/training/trainingset/
creating vectorizer
Traceback (most recent call last):
File "bin/annif", line 11, in <module>
sys.exit(cli())
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/flask/cli.py", line 586, in main
return super(FlaskGroup, self).main(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/flask/cli.py", line 426, in decorator
return __ctx.invoke(f, *args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/annif/cli.py", line 154, in run_train
proj.train(documents)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/annif/project.py", line 197, in train
self._create_vectorizer(corpus)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/annif/project.py", line 186, in _create_vectorizer
self._vectorizer.fit((subj.text for subj in subjectcorpus.subjects))
File "/home/Programs/annif-latest/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 1631, in fit
X = super().fit_transform(raw_documents)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "/home/Programs/annif-latest/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 989, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

Member

osma commented Aug 12, 2019

Some possible reasons:

  1. you are using .key files with subject labels only (Simple subject file format), and hit issue #309 (Training from .key files with only subject labels fails), which was fixed very recently on the master branch. This seems unlikely, though, since you had .tsv files with URIs in your zip archive.
  2. you didn't load the vocabulary first with the loadvoc command
  3. the URIs in your .tsv files need to be specified with brackets, i.e. <http://taxonomie.cbs.nl/vocab/?tema=4812> instead of just http://taxonomie.cbs.nl/vocab/?tema=4812

I think number 3 is the most likely explanation here. I've edited the wiki section on the Extended subject file format to emphasize that you need the brackets. (It might make sense for Annif TSV parsing to be more flexible here and also accept plain URIs without brackets. Please open an issue if you think so.)
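
For files generated by a script, the brackets can be added in a post-processing step. A minimal sketch, assuming each TSV line holds the document text, then a tab, then whitespace-separated subject URIs; `bracket_uris` is a hypothetical helper, not an Annif function:

```python
def bracket_uris(line: str) -> str:
    """Wrap every bare URI after the first tab in angle brackets.

    Assumes the layout "text<TAB>uri uri ..."; URIs that already have
    brackets are left unchanged.
    """
    text, sep, uris = line.rstrip("\n").partition("\t")
    if not sep:
        # No tab, so no subject column to fix.
        return text
    fixed = [
        uri if uri.startswith("<") and uri.endswith(">") else f"<{uri}>"
        for uri in uris.split()
    ]
    return f"{text}\t{' '.join(fixed)}"


print(bracket_uris("Some text\thttp://taxonomie.cbs.nl/vocab/?tema=4812"))
```

Applying this to every line of a .tsv file before training should produce the bracketed form described in point 3.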

PS. The annif-users forum might be a better place for sorting out problems like this, when it's not clear that there's a bug in the software - though I understand that it can sometimes be hard to tell.

Author

hekl commented Aug 13, 2019

Thanks for your patience. Yes, number three it must be. Tend to forget that when my programme produces the output. I did not know about the users forum. Will use it.

@hekl hekl closed this as completed Aug 13, 2019