Issue318 handle missing or invalid path input for commands #322

juhoinkinen · 2019-08-27T06:59:19Z

Closes #318.

Adding the exists=True argument to type=click.Path() makes Click check that the given paths exist, for example:

$ annif train tfidf-en nonexistent.tsv
Usage: annif train [OPTIONS] PROJECT_ID [PATHS]...
Try "annif train --help" for help.

Error: Invalid value for "[PATHS]...": Path "nonexistent.tsv" does not exist.
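The same check can be reproduced with a small stand-alone command. This is a minimal sketch, not Annif's actual cli.py: the command and argument names are modelled on the usage line above, the command body is a placeholder, and `run_with_missing_path` is a hypothetical helper that drives the command through Click's test runner.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("project_id")
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
def train(project_id, paths):
    """Hypothetical stand-in for a training command."""
    click.echo("Training {} on {} path(s)".format(project_id, len(paths)))


def run_with_missing_path():
    """Invoke the command with a path that does not exist."""
    runner = CliRunner()
    # isolated_filesystem() switches to an empty temporary directory,
    # so "nonexistent.tsv" is guaranteed to be missing.
    with runner.isolated_filesystem():
        return runner.invoke(train, ["tfidf-en", "nonexistent.tsv"])


if __name__ == "__main__":
    result = run_with_missing_path()
    print(result.exit_code)
    print(result.output)
```

Click rejects the argument before the command body runs, so no Annif-side validation code is needed. The exact wording and quoting of the error message varies between Click versions.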

There are now tests for raising these errors for every command, but to me they actually seem a bit too much. After all, these tests do not exercise any actual Annif code. To not bloat the test code base, should the tests be removed or reduced in number?

The case of a missing training file now maps to /dev/null, but as noted, that doesn't work on Windows. This behaviour would still need

using a cross-platform null device file ~~checks on the running OS (easy actually)~~,
and implementing different behaviour for vw-multi compared to other backends (seems a bit complicated).

Could this behaviour for a missing training file be dropped for now? The use case seems quite rare, and a workaround (if one doesn't want to or can't use /dev/null) is to create an actual, empty training file and train on that.
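For the cross-platform null device, Python's standard library exposes `os.devnull`. The sketch below shows how a missing training file could fall back to it; `open_training_file` and `read_documents` are hypothetical helpers for illustration, not functions from Annif.

```python
import os


def open_training_file(path=None):
    """Open the given training file, or the null device if none was given.

    Hypothetical helper for illustration; not taken from Annif.
    """
    # os.devnull is "/dev/null" on POSIX systems and "nul" on Windows.
    if path is None:
        path = os.devnull
    return open(path, encoding="utf-8")


def read_documents(path=None):
    """Return the non-blank lines of the training input as a list."""
    with open_training_file(path) as infile:
        return [line.rstrip("\n") for line in infile if line.strip()]


if __name__ == "__main__":
    # Reading the null device yields no lines at all.
    print(read_documents())
```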

codecov · 2019-08-27T07:21:11Z

Codecov Report

Merging #322 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   99.49%   99.57%   +0.08%     
==========================================
  Files          56       56              
  Lines        2944     3041      +97     
==========================================
+ Hits         2929     3028      +99     
+ Misses         15       13       -2

| Impacted Files | Coverage Δ | |
|---|---|---|
| annif/project.py | `100% <100%> (ø)` | ⬆️ |
| annif/cli.py | `99.49% <100%> (+0.01%)` | ⬆️ |
| annif/backend/fasttext.py | `98.66% <100%> (+0.03%)` | ⬆️ |
| tests/test_cli.py | `100% <100%> (ø)` | ⬆️ |
| tests/test_backend_vw_multi.py | `100% <100%> (ø)` | ⬆️ |
| tests/test_corpus.py | `100% <100%> (ø)` | ⬆️ |
| annif/corpus/types.py | `100% <100%> (ø)` | ⬆️ |
| annif/backend/pav.py | `98.57% <100%> (+0.04%)` | ⬆️ |
| tests/test_project.py | `100% <100%> (ø)` | ⬆️ |
| tests/test_backend_fasttext.py | `100% <100%> (ø)` | ⬆️ |
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ac8431...af8c54d. Read the comment docs.

osma · 2019-09-02T13:49:40Z

> To not bloat the test code base, should the tests be removed or reduced in number?

I don't have a strong opinion, but in general, more tests are always better, and while the new tests are a bit redundant, they do clarify the intended behaviour. So I'd rather leave them in.

I tested the behavior of annif train without a file argument. With vw_multi the result is good: an empty model. With tfidf I get an error:

  File "/home/oisuomin/git/Annif/annif/project.py", line 199, in _create_vectorizer
    self._vectorizer.fit((subj.text for subj in subjectcorpus.subjects))
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1631, in fit
    X = super().fit_transform(raw_documents)
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 989, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

It's similar with fasttext:

  File "/home/oisuomin/git/Annif/annif/backend/fasttext.py", line 103, in _create_model
    self._model = fastText.train_supervised(trainpath, **params)
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/fastText/FastText.py", line 343, in train_supervised
    fasttext.train(ft.f, a)
ValueError: Empty vocabulary. Try a smaller -minCount value.

Both kinds of errors can also happen if the training data is too small, for example if it consists entirely of stop words (as noted in the tfidf error message), so they are not really specific to the empty file case.

Test coverage is a bit lacking: there is no test for the "Creating empty model" case. Other than that, this seems good for merging!

juhoinkinen · 2019-09-18T10:04:16Z

Oops, wait, not ready yet! I forgot to check the behaviour for the PAV backend.

juhoinkinen · 2019-09-18T10:30:28Z

I have now added the check for empty documents/corpus for PAV as well. The current behaviour for the different backends is shown below.

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train tfidf-en
warning: Reading empty file
Error: Not supported: using TfidfVectorizer with no documents

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train fasttext-en
warning: Reading empty file
Error: Not supported: training backend fasttext with no documents

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train pav-en
warning: Reading empty file
Error: Not supported: training backend pav with no documents

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train vw-multi-en
warning: Reading empty file
Backend vw_multi: creating VW model
Backend vw_multi: creating VW train file
Backend vw_multi: creating VW model (algorithm: oaa)
Num weight bits = 1
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/projects/vw-multi-en/vw-train.txt
num sources = 1

finished run
number of examples = 0
weighted example sum = 0.000000
weighted label sum = 0.000000
average loss = n.a.
total feature number = 0

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif learn vw-multi-en
warning: Reading empty file

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train vw-ensemble-en
warning: Reading empty file
Backend vw_ensemble: creating VW model
Backend vw_ensemble: creating VW train file
Num weight bits = 20
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/projects/vw-ensemble-en/vw-train.txt
num sources = 1

finished run
number of examples = 0
weighted example sum = 0.000000
weighted label sum = 0.000000
average loss = n.a.
total feature number = 0

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif learn vw-ensemble-en
warning: Reading empty file
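The "Not supported" errors in the output above follow one pattern: the backend refuses to train when the corpus has no documents. A minimal sketch of that pattern is given here, assuming a local stand-in exception class and hypothetical `train` and `run` functions; it is not Annif's actual backend code.

```python
class NotSupportedException(Exception):
    """Local stand-in for the exception named in the commit messages."""

    def __str__(self):
        return "Not supported: {}".format(self.args[0])


def train(backend_name, documents):
    """Hypothetical training entry point that refuses an empty corpus."""
    documents = list(documents)
    if not documents:
        raise NotSupportedException(
            "training backend {} with no documents".format(backend_name))
    return "trained {} on {} document(s)".format(backend_name, len(documents))


def run(backend_name, documents):
    """Turn the exception into a one-line error message, CLI style."""
    try:
        return train(backend_name, documents)
    except NotSupportedException as err:
        return "Error: {}".format(err)


if __name__ == "__main__":
    print(run("pav", []))
    print(run("pav", ["first document", "second document"]))
```

Raising a dedicated exception before the third-party library is called replaces the library-specific "empty vocabulary" tracebacks with one consistent message.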

osma · 2019-09-20T13:10:40Z

Looks good! A couple of minor issues:

I'd rename the method are_documents_empty to simply is_empty.
Scrutinizer complains that "The variable docs does not seem to be defined for all execution paths." for the open_documents function in cli.py, apparently because docs is only set within if blocks and Scrutinizer isn't smart enough to figure out that len(x) can never be negative... Maybe use an else block to make it clear that all cases are actually covered.
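Both suggestions can be sketched together. The code below is an illustration under assumptions, not the code in this PR: `DocumentCorpus` is a hypothetical wrapper whose `is_empty` peeks at a one-shot iterator, and this `open_documents` uses placeholder logic (the "documents" are just the path strings) to show the final `else` branch.

```python
import itertools


class DocumentCorpus:
    """Hypothetical corpus wrapping a one-shot iterator of documents."""

    def __init__(self, documents):
        self._documents = iter(documents)

    def is_empty(self):
        """Return True if the corpus yields no documents.

        Peeks at the first document and puts it back, so the check
        does not consume anything from the stream.
        """
        try:
            first = next(self._documents)
        except StopIteration:
            return True
        self._documents = itertools.chain([first], self._documents)
        return False

    def __iter__(self):
        return self._documents


def open_documents(paths):
    """Build a corpus from zero, one, or several paths (placeholder logic)."""
    if len(paths) == 0:
        docs = DocumentCorpus([])
    elif len(paths) == 1:
        docs = DocumentCorpus([paths[0]])
    else:
        # A plain else (rather than `elif len(paths) > 1`) lets a static
        # analyser see that docs is bound on every execution path.
        docs = DocumentCorpus(paths)
    return docs


if __name__ == "__main__":
    print(open_documents([]).is_empty())
    corpus = open_documents(["a.tsv", "b.tsv"])
    print(corpus.is_empty(), list(corpus))
```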

+1 for merging.

juhoinkinen added 4 commits August 26, 2019 14:04

Click existence checks for path arguments

c7e56ab

Tests for Click existence checks for path arguments

67840a9

Set type; corrects output from --help flag for the option: TEXT->PATH

0c61e33

Use /dev/null in case of missing training file; allows training an empty model

a78fafc

juhoinkinen added the enhancement label Aug 27, 2019

juhoinkinen added this to the Short term milestone Aug 27, 2019

Click existence check for projects.cfg path option (--projects)

f45683a

Use cross-platform null device file

7a3c53a

juhoinkinen modified the milestones: Short term, 0.43 Sep 3, 2019

juhoinkinen added 7 commits September 4, 2019 10:43

Merge branch 'master' into issue318-handle-missing-or-invalid-path-input-for-commands

7e56b6e

Emptiness check for documents generator raising necessary NotSupportedExcs

8939d6b

Test for emptiness check for documents generator

c2b3c28

Skip tests that use vw and fasttext if those packages are not available

f3fdf88

More general warning: applies also to learn method

a7f367e

Refactor and add tests

46cb529

Forgot to remove the VW test from CLI tests

dc27476

juhoinkinen marked this pull request as ready for review September 18, 2019 09:58

juhoinkinen added 2 commits September 18, 2019 13:21

Raise NotSupportedException on training PAV without documents

237532e

Test raise NotSupportedException on training PAV without documents

6887fbd

osma mentioned this pull request Sep 20, 2019

Issue250 support backend param option in train and learn commands #289

Merged

juhoinkinen added 2 commits September 23, 2019 16:28

Rename corpus emptiness check

08463a1

Avoid Scrutinizer issue about undef var for execution paths

af8c54d

juhoinkinen merged commit 81994c4 into master Sep 23, 2019

juhoinkinen deleted the issue318-handle-missing-or-invalid-path-input-for-commands branch September 23, 2019 14:22

juhoinkinen added a commit that referenced this pull request Sep 26, 2019

Adapt to PR #322

91136a6
