Eval crash on TFIDF with multiple training files #332

juhoinkinen · 2019-09-24T11:40:39Z

When evaluating a TFIDF project trained on multiple files (CombinedCorpus), the eval crashes:

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train tfidf-fi yso-cicero-finna-fi-head-500-lines.tsv yso-cicero-finna-fi-tail-500-lines.tsv
creating vectorizer
warning: Unknown subject URI <http://www.yso.fi/onto/yso/p14645>
...
Backend tfidf: creating similarity index
(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif eval tfidf-fi ~/annif-projects/Annif-corpora/fulltext/kirjastonhoitaja/test/
warning: Unknown subject URI <http://www.yso.fi/onto/yso/p1997>
...
Traceback (most recent call last):
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/bin/annif", line 11, in <module>
    load_entry_point('annif', 'console_scripts', 'annif')()
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/flask/cli.py", line 586, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/flask/cli.py", line 426, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/local/jmminkin/git/Annif/annif/cli.py", line 276, in run_eval
    for metric, score in eval_batch.results().items():
  File "/home/local/jmminkin/git/Annif/annif/eval.py", line 143, in results
    y_true, y_pred, metrics)
  File "/home/local/jmminkin/git/Annif/annif/eval.py", line 93, in _evaluate_samples
    y_true, y_pred_binary, average='samples')
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1569, in precision_score
    sample_weight=sample_weight)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1415, in precision_recall_fscore_support
    pos_label)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1240, in _check_set_wise_labels
    present_labels = unique_labels(y_true, y_pred)
  File "/home/jmminkin/.local/share/virtualenvs/Annif-b5vsMxU8/lib/python3.6/site-packages/sklearn/utils/multiclass.py", line 88, in unique_labels
    raise ValueError("Multi-label binary indicator input with "
ValueError: Multi-label binary indicator input with different numbers of labels

Also, suggest does not seem to work with such a project (although this could be unrelated):

$ echo testi tekstia tassa nain | annif suggest tfidf-fi
(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ 
(no results)

If a fasttext project is trained on multiple files like above, eval works and suggest produces results.

The eval crash might be caused by the TFIDF backend saving the _index created during training (whose size is multiplied by the number of input files) and then loading and using that same _index for predictions; this is not the case for a fasttext project. However, eval simply uses [the project's vocabulary](https://github.com/NatLibFi/Annif/blob/master/annif/cli.py#L266), which does not know about the multiplied size of the _index, leading to the size mismatch mentioned in the traceback.
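For illustration, the ValueError at the bottom of the traceback can be reproduced outside Annif. This is only a sketch: the helper name `indicator_mismatch_error` and the toy matrix sizes are made up here, and it assumes the predictions end up with one column per index entry while the gold standard has one column per vocabulary subject. The only real library call is `sklearn.utils.multiclass.unique_labels`, the function that raises in the traceback.

```python
import numpy as np
from sklearn.utils.multiclass import unique_labels


def indicator_mismatch_error(n_subjects=3, n_files=2):
    """Return the message raised for mismatched indicator widths, or None.

    Hypothetical helper for illustration only, not part of Annif.
    """
    # Gold-standard matrix: one column per subject in the vocabulary.
    y_true = np.zeros((2, n_subjects), dtype=int)
    y_true[0, 0] = 1
    y_true[1, 1] = 1

    # Prediction matrix: assumed to have one column per index entry,
    # i.e. n_subjects multiplied by the number of training files.
    y_pred = np.zeros((2, n_subjects * n_files), dtype=int)
    y_pred[0, 0] = 1
    y_pred[1, 2] = 1

    try:
        unique_labels(y_true, y_pred)
    except ValueError as err:
        return str(err)
    return None


if __name__ == "__main__":
    # Two training files: the column counts differ, so an error is reported.
    print(indicator_mismatch_error(n_files=2))
    # One training file: the column counts agree, so nothing is reported.
    print(indicator_mismatch_error(n_files=1))
```

With a single training file the two matrices have the same width and the check passes, which would be consistent with the crash only showing up when several files are given.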


osma · 2019-09-24T11:56:24Z

Confirmed. As it happens, I just trained a tfidf project using the yso-cicero-finna-fi-* training data (all four of them - it took a while!) and I get the same error running eval and no results when using suggest.

I'm a bit surprised if CombinedCorpus turns out to be a problem here, because the backend should not be able to tell that the corpus is a combination of several files.

osma · 2019-09-24T13:27:37Z

CombinedCorpus.subjects was behaving incorrectly. It concatenates the subjects from the constituent corpora, when it should be merging them.
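The difference between concatenating and merging can be sketched roughly as follows. This is not Annif's actual code: the `Subject` tuple and the functions `concatenated_subjects` and `merged_subjects` are hypothetical stand-ins, and merging by URI with the texts joined by a newline is an assumption about what "merging" means here.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in for a subject entry in a training corpus.
Subject = namedtuple("Subject", ["uri", "label", "text"])


def concatenated_subjects(corpora):
    """Chain the subjects of every corpus, so a URI may appear several times."""
    return list(chain.from_iterable(corpora))


def merged_subjects(corpora):
    """Keep one entry per URI, joining the texts from all corpora."""
    merged = {}
    for subject in chain.from_iterable(corpora):
        if subject.uri in merged:
            previous = merged[subject.uri]
            merged[subject.uri] = previous._replace(
                text=previous.text + "\n" + subject.text)
        else:
            merged[subject.uri] = subject
    return list(merged.values())


if __name__ == "__main__":
    # Two toy corpora that cover the same two subjects.
    corpus_a = [Subject("uri:1", "cats", "text about cats"),
                Subject("uri:2", "dogs", "text about dogs")]
    corpus_b = [Subject("uri:1", "cats", "more about cats"),
                Subject("uri:2", "dogs", "more about dogs")]

    print(len(concatenated_subjects([corpus_a, corpus_b])))
    print(len(merged_subjects([corpus_a, corpus_b])))
```

In this toy setup the concatenated list grows with the number of corpora while the merged list stays at one entry per subject, which matches the size of the vocabulary.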

juhoinkinen added the bug label Sep 24, 2019

osma added this to the 0.43 milestone Sep 24, 2019

osma self-assigned this Sep 24, 2019

osma added a commit that referenced this issue Sep 24, 2019

Merge subjects in CombinedCorpus instead of concatenating them. Fixes #332

e007c71

osma mentioned this issue Sep 24, 2019

Merge subjects in CombinedCorpus instead of concatenating them #333

Merged

osma closed this as completed in #333 Sep 24, 2019
