Issue318 handle missing or invalid path input for commands #322

Merged

juhoinkinen
Member

@juhoinkinen juhoinkinen commented Aug 27, 2019

Closes #318.

Adding the exists=True argument to type=click.Path() makes Click check that the given paths exist, for example:

$ annif train tfidf-en nonexistent.tsv
Usage: annif train [OPTIONS] PROJECT_ID [PATHS]...
Try "annif train --help" for help.

Error: Invalid value for "[PATHS]...": Path "nonexistent.tsv" does not exist.
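
As a minimal sketch of the mechanism described above, the following command declares its path arguments with click.Path(exists=True), so Click itself rejects missing files before the command body runs. The train function and the run_train helper are hypothetical stand-ins for illustration, not Annif's actual code, and the exact wording of the error message may vary between Click versions.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("project_id")
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
def train(project_id, paths):
    """Hypothetical stand-in for a training command (not Annif's code)."""
    # This body is only reached if every given path exists.
    click.echo("training {} on {} path(s)".format(project_id, len(paths)))


def run_train(args):
    """Invoke the command in-process and return (exit_code, output)."""
    result = CliRunner().invoke(train, args)
    return result.exit_code, result.output
```

Calling run_train(["tfidf-en", "nonexistent.tsv"]) yields a usage error from Click without the command body being executed, which is why tests for this case exercise Click rather than application code.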

There are now tests for raising these errors for every command, but to me they seem a bit excessive: they don't exercise any actual Annif code, only Click's own validation. To not bloat the test code base, should the test be removed or reduced in number?

The case of a missing training file now maps to /dev/null, but as noted, that doesn't work on Windows. To keep this behaviour, two things would still be needed:

  • cross-platform null device checks for the running OS (easy, actually),

  • and different behaviour for vw-multi compared to the other backends (seems a bit complicated).

Could this behaviour for a missing training file be dropped for now? The use case seems quite rare, and a workaround (if one doesn't want to or can't use /dev/null) is to create an actual, empty training file and train on that.
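
The first point, a cross-platform null device, could look roughly like the sketch below. It relies on the standard library constant os.devnull, which is '/dev/null' on POSIX systems and 'nul' on Windows. The helper names open_null_device and is_null_device are hypothetical and not part of Annif.

```python
import os


def open_null_device():
    """Open the platform's null device for reading.

    os.devnull resolves to the right name for the running OS, so no
    hard-coded '/dev/null' is needed.
    """
    return open(os.devnull, "r")


def is_null_device(path):
    """Return True if path names the null device of the running OS."""
    # normcase lower-cases the path on Windows and is a no-op on POSIX.
    return os.path.normcase(path) == os.path.normcase(os.devnull)
```

Reading from the handle returned by open_null_device() gives an empty string, which matches the "empty training file" case discussed here.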

@juhoinkinen juhoinkinen added this to the Short term milestone Aug 27, 2019
@codecov

codecov bot commented Aug 27, 2019

Codecov Report

Merging #322 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   99.49%   99.57%   +0.08%     
==========================================
  Files          56       56              
  Lines        2944     3041      +97     
==========================================
+ Hits         2929     3028      +99     
+ Misses         15       13       -2
Impacted Files Coverage Δ
annif/project.py 100% <100%> (ø) ⬆️
annif/cli.py 99.49% <100%> (+0.01%) ⬆️
annif/backend/fasttext.py 98.66% <100%> (+0.03%) ⬆️
tests/test_cli.py 100% <100%> (ø) ⬆️
tests/test_backend_vw_multi.py 100% <100%> (ø) ⬆️
tests/test_corpus.py 100% <100%> (ø) ⬆️
annif/corpus/types.py 100% <100%> (ø) ⬆️
annif/backend/pav.py 98.57% <100%> (+0.04%) ⬆️
tests/test_project.py 100% <100%> (ø) ⬆️
tests/test_backend_fasttext.py 100% <100%> (ø) ⬆️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6ac8431...af8c54d.

@osma
Member

osma commented Sep 2, 2019

> To not bloat the test code base, should the test be removed or reduced in number?

I don't have a strong opinion, but in general more tests are better, and while the new tests are a bit redundant, they do clarify the intended behaviour. So I'd rather leave them in.

I tested the behavior of annif train without a file argument. With vw_multi the result is good: an empty model. With tfidf I get an error:

  File "/home/oisuomin/git/Annif/annif/project.py", line 199, in _create_vectorizer
    self._vectorizer.fit((subj.text for subj in subjectcorpus.subjects))
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1631, in fit
    X = super().fit_transform(raw_documents)
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 989, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

It's similar with fasttext:

  File "/home/oisuomin/git/Annif/annif/backend/fasttext.py", line 103, in _create_model
    self._model = fastText.train_supervised(trainpath, **params)
  File "/home/oisuomin/.local/share/virtualenvs/Annif-OYFUWV2R/lib/python3.5/site-packages/fastText/FastText.py", line 343, in train_supervised
    fasttext.train(ft.f, a)
ValueError: Empty vocabulary. Try a smaller -minCount value.

Both kinds of error can also happen if the training data is too small, for example if it consists entirely of stop words (as noted in the tfidf error message), so they are not really specific to the empty file case.

Test coverage is a bit lacking: there is no test for the "Creating empty model" case. Other than that, this seems good for merging!

@juhoinkinen juhoinkinen modified the milestones: Short term, 0.43 Sep 3, 2019
@juhoinkinen juhoinkinen marked this pull request as ready for review September 18, 2019 09:58
@juhoinkinen
Member Author

Oops, wait, not ready yet! I forgot to check the behaviour for the PAV backend.

@juhoinkinen
Member Author

I have now added the check for an empty document corpus for PAV as well. The current behaviour for the different backends is shown below.

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train tfidf-en
warning: Reading empty file
Error: Not supported: using TfidfVectorizer with no documents

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train fasttext-en
warning: Reading empty file
Error: Not supported: training backend fasttext with no documents

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train pav-en
warning: Reading empty file
Error: Not supported: training backend pav with no documents

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train vw-multi-en
warning: Reading empty file
Backend vw_multi: creating VW model
Backend vw_multi: creating VW train file
Backend vw_multi: creating VW model (algorithm: oaa)
Num weight bits = 1
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/projects/vw-multi-en/vw-train.txt
num sources = 1

finished run
number of examples = 0
weighted example sum = 0.000000
weighted label sum = 0.000000
average loss = n.a.
total feature number = 0

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif learn vw-multi-en
warning: Reading empty file

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif train vw-ensemble-en
warning: Reading empty file
Backend vw_ensemble: creating VW model
Backend vw_ensemble: creating VW train file
Num weight bits = 20
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/projects/vw-ensemble-en/vw-train.txt
num sources = 1

finished run
number of examples = 0
weighted example sum = 0.000000
weighted label sum = 0.000000
average loss = n.a.
total feature number = 0

(Annif) jmminkin@lx8-9811-008:/home/local/jmminkin/git/Annif$ annif learn vw-ensemble-en
warning: Reading empty file
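
The guard behind the "Not supported: ... with no documents" messages above can be sketched as follows. This is only an illustration under assumptions: NotSupportedError, DocumentList and train are hypothetical stand-ins, not Annif's actual classes or functions, and only the message text is taken from the output above.

```python
class NotSupportedError(Exception):
    """Hypothetical stand-in for the error reported as 'Not supported: ...'."""


class DocumentList:
    """Hypothetical minimal corpus wrapper (not Annif's actual class)."""

    def __init__(self, documents):
        self._documents = list(documents)

    @property
    def documents(self):
        return iter(self._documents)

    def is_empty(self):
        """Return True if the corpus yields no documents at all."""
        for _ in self.documents:
            return False
        return True


def train(backend_name, corpus):
    """Refuse to train on an empty corpus; otherwise count the documents."""
    if corpus.is_empty():
        raise NotSupportedError(
            "training backend {} with no documents".format(backend_name))
    # Placeholder for real training: just report how many documents were seen.
    return sum(1 for _ in corpus.documents)
```

Checking for emptiness up front gives the user one consistent message, instead of the library-specific "empty vocabulary" errors that tfidf and fasttext raise on their own.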

@osma
Member

osma commented Sep 20, 2019

Looks good! A couple of minor issues:

  1. I'd rename the method are_documents_empty to simply is_empty
  2. Scrutinizer complains that "The variable docs does not seem to be defined for all execution paths." for the open_documents function in cli.py, apparently because docs is only set within an if block and Scrutinizer isn't smart enough to figure out that len(x) can never be negative... Maybe use an else block to make it clear that all cases are actually covered.
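
The second point can be illustrated with a small sketch. The function below is a hypothetical simplification of open_documents: the branch conditions and the placeholder string return values are assumptions for illustration, not the actual code in cli.py.

```python
def open_documents(paths):
    """Hypothetical sketch of choosing a corpus by the number of paths.

    The return values are placeholder strings, not real corpus objects.
    """
    if len(paths) == 0:
        docs = "empty corpus"
    elif len(paths) == 1:
        docs = "single corpus: {}".format(paths[0])
    else:
        # A final else (rather than `elif len(paths) > 1`) makes it evident
        # to static analysis that docs is assigned on every execution path.
        docs = "combined corpus of {} files".format(len(paths))
    return docs
```

With a chain of only if/elif branches, an analyser cannot rule out the case where none of the conditions holds; the trailing else removes that path.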

+1 for merging.

@juhoinkinen juhoinkinen merged commit 81994c4 into master Sep 23, 2019
@juhoinkinen juhoinkinen deleted the issue318-handle-missing-or-invalid-path-input-for-commands branch September 23, 2019 14:22
juhoinkinen added a commit that referenced this pull request Sep 26, 2019
Successfully merging this pull request may close these issues.

Handle missing or invalid path input for commands