Eval crash on TFIDF with multiple training files #332

Closed
juhoinkinen opened this issue Sep 24, 2019 · 2 comments · Fixed by #333

@juhoinkinen

When evaluating a TFIDF project trained on multiple files (CombinedCorpus) the eval crashes:

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train tfidf-fi yso-cicero-finna-fi-head-500-lines.tsv yso-cicero-finna-fi-tail-500-lines.tsv
creating vectorizer
warning: Unknown subject URI <http://www.yso.fi/onto/yso/p14645>
...
Backend tfidf: creating similarity index
(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif eval tfidf-fi ~/annif-projects/Annif-corpora/fulltext/kirjastonhoitaja/test/
warning: Unknown subject URI <http://www.yso.fi/onto/yso/p1997>
...
Traceback (most recent call last):
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/bin/annif", line 11, in <module>
    load_entry_point('annif', 'console_scripts', 'annif')()
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/flask/cli.py", line 586, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/flask/cli.py", line 426, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/annif/cli.py", line 276, in run_eval
    for metric, score in eval_batch.results().items():
  File "/home/local/jmminkin/git/Annif/annif/eval.py", line 143, in results
    y_true, y_pred, metrics)
  File "/home/local/jmminkin/git/Annif/annif/eval.py", line 93, in _evaluate_samples
    y_true, y_pred_binary, average='samples')
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1569, in precision_score
    sample_weight=sample_weight)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1415, in precision_recall_fscore_support
    pos_label)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1240, in _check_set_wise_labels
    present_labels = unique_labels(y_true, y_pred)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/utils/multiclass.py", line 88, in unique_labels
    raise ValueError("Multi-label binary indicator input with "
ValueError: Multi-label binary indicator input with different numbers of labels

Also suggest does not seem to work with such a project (although this could be unrelated):

$ echo testi tekstia tassa nain | annif suggest tfidf-fi
(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ 
(no results)

If a fasttext project is trained on multiple files like above, eval works and suggest produces results.

The eval crash may be caused by the TFIDF backend saving the _index created during training (whose size is multiplied by the number of input files) and then loading and using that same _index for predictions; this is not the case for a fasttext project. However, eval simply uses the [project's vocabulary](https://github.com/NatLibFi/Annif/blob/master/annif/cli.py#L266), which does not know about the multiplied size of the _index, leading to the size mismatch mentioned in the traceback.
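To illustrate the suspected size mismatch, here is a minimal sketch in plain Python. It is not Annif or scikit-learn code: the sizes, the helper `check_same_label_count` and the sample rows are all hypothetical, and the helper only mimics the kind of column-count check that produces the error in the traceback.

```python
# Hypothetical sizes, for illustration only (not taken from a real project).
VOCAB_SIZE = 4      # number of subjects the project vocabulary knows about
N_TRAIN_FILES = 2   # number of training files passed to `annif train`


def check_same_label_count(y_true, y_pred):
    """Raise if two indicator matrices have different numbers of columns.

    Hypothetical helper that imitates the consistency check seen in the
    traceback; it is not the scikit-learn implementation.
    """
    widths = {len(row) for matrix in (y_true, y_pred) for row in matrix}
    if len(widths) > 1:
        raise ValueError("Multi-label binary indicator input with "
                         "different numbers of labels")


# Gold-standard row built from the vocabulary: one column per subject.
y_true = [[1, 0, 0, 1]]
# Prediction row built from an index whose subject dimension was multiplied
# by the number of training files (the assumption made in this comment).
y_pred = [[0] * (VOCAB_SIZE * N_TRAIN_FILES)]

try:
    check_same_label_count(y_true, y_pred)
    error = None
except ValueError as err:
    error = str(err)

print(error)
```

If both matrices were built from the same subject list, the widths would agree and no error would be raised.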

@osma

osma commented Sep 24, 2019

Confirmed. As it happens, I just trained a tfidf project using the yso-cicero-finna-fi-* training data (all four of them - it took a while!) and I get the same error running eval and no results when using suggest.

I'm a bit surprised if CombinedCorpus turns out to be a problem here, because the backend should not be able to tell that the corpus is a combination of several files.

@osma

osma commented Sep 24, 2019

CombinedCorpus.subjects was behaving incorrectly. It was concatenating the subjects from the constituent corpora, when it should have been merging them.
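The difference between concatenating and merging can be sketched roughly as follows. This is only an illustration, not the actual change in #333: the `(uri, text)` tuples, the example URIs and both function names are assumptions made for the example, and Annif's real subject objects may look different.

```python
def concatenate_subjects(corpora):
    """Problematic behaviour: every corpus contributes its full subject list,
    so a subject present in two corpora appears twice in the result."""
    combined = []
    for subjects in corpora:
        combined.extend(subjects)
    return combined


def merge_subjects(corpora):
    """Merge by URI: one entry per subject, with the texts joined together."""
    merged = {}
    for subjects in corpora:
        for uri, text in subjects:
            merged.setdefault(uri, []).append(text)
    return [(uri, " ".join(texts)) for uri, texts in merged.items()]


# Two hypothetical corpora covering the same two subjects.
corpus_a = [("http://example.org/s1", "cats"), ("http://example.org/s2", "dogs")]
corpus_b = [("http://example.org/s1", "kittens"), ("http://example.org/s2", "puppies")]

concatenated = concatenate_subjects([corpus_a, corpus_b])
merged = merge_subjects([corpus_a, corpus_b])

print(len(concatenated))  # one entry per subject per corpus
print(len(merged))        # one entry per subject
```

With concatenation the number of subjects grows with the number of training files, which matches the multiplied index size described earlier in the thread; with merging it stays equal to the number of distinct subjects.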
