
--cached option to reuse preprocessed training data #342

Closed
osma opened this issue Oct 28, 2019 · 0 comments · Fixed by #376
osma commented Oct 28, 2019

A large proportion of training time is typically spent on preprocessing the training data into a format suitable for the backend algorithm. This is particularly evident for trainable ensemble backends (pav, vw_ensemble, nn_ensemble), which have to pass all the training documents to their source projects.

We could support a --cached option that reuses the already preprocessed training data from the previous train run. This would make it a lot faster to experiment with different hyperparameters. For example:

annif train nn-ensemble-fi my-corpus/train/  # initial training run
annif eval nn-ensemble-fi my-corpus/test/  # evaluate on test documents
# not happy with the result, let's adjust the hyperparameters
$EDITOR projects.cfg
annif train --cached nn-ensemble-fi  # retrain using previous train data (note: no corpus given)
annif eval nn-ensemble-fi my-corpus/test/  # reevaluate
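
A rough sketch of how such a flag could be wired into a Click-based train command (Annif's CLI is built on Click). The get_project() helper, the DemoProject class and the train() signature below are placeholder assumptions for illustration, not the actual Annif code:

```python
# Illustrative only: a Click command with a --cached flag. The project
# lookup and train() signature are placeholders, not Annif internals.
import click


class DemoProject:
    """Stand-in for an Annif project object."""

    def train(self, corpus, cached=False):
        source = "cached training data" if cached else f"corpus {corpus!r}"
        click.echo(f"training from {source}")


def get_project(project_id):  # placeholder registry lookup
    return DemoProject()


@click.command("train")
@click.argument("project_id")
@click.argument("paths", nargs=-1, type=click.Path(exists=True))
@click.option("--cached", is_flag=True,
              help="reuse preprocessed training data from the previous run")
def run_train(project_id, paths, cached):
    if cached and paths:
        raise click.UsageError("corpus paths cannot be combined with --cached")
    project = get_project(project_id)
    project.train(paths or None, cached=cached)


if __name__ == "__main__":
    run_train()
```

The key behaviour is that --cached and corpus paths are mutually exclusive, matching the example above where no corpus is given on the retraining run.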

Implementing this requires changes to some backends:

  • fasttext, vw_multi and vw_ensemble already create training files that are stored in the project data directory; the --cached option would simply skip creating new ones before retraining
  • pav and nn_ensemble only collect the training data into NumPy arrays held in memory; they would need to store those arrays in the project data directory (e.g. as an LMDB database) so they can be reused later (see the sketch after this list)
  • tfidf also collects training data only in memory, but since it has no parameters to tune, there is probably no need for a --cached option (it could just give an error instead)
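
For the pav and nn_ensemble case, here is a rough sketch of what persisting the preprocessed arrays in an LMDB database under the project data directory could look like. The database name, keys and dtypes are illustrative assumptions rather than an existing Annif API:

```python
# Illustrative only: cache preprocessed training arrays in LMDB so a later
# `train --cached` run can reload them without touching the corpus.
import lmdb
import numpy as np

MAP_SIZE = 1024 * 1024 * 1024  # 1 GiB; adjust to the expected data size


def save_training_arrays(datadir, vectors, targets):
    """Store the preprocessed input vectors and target arrays."""
    env = lmdb.open(f"{datadir}/train-cache", map_size=MAP_SIZE)
    with env.begin(write=True) as txn:
        for name, arr in (("vectors", vectors), ("targets", targets)):
            txn.put(name.encode(), arr.astype(np.float32).tobytes())
            txn.put(f"{name}-shape".encode(),
                    np.array(arr.shape, dtype=np.int64).tobytes())
    env.close()


def load_training_arrays(datadir):
    """Reload the cached arrays for a --cached retraining run."""
    env = lmdb.open(f"{datadir}/train-cache", readonly=True)
    arrays = []
    with env.begin() as txn:
        for name in ("vectors", "targets"):
            shape = np.frombuffer(txn.get(f"{name}-shape".encode()),
                                  dtype=np.int64)
            data = np.frombuffer(txn.get(name.encode()), dtype=np.float32)
            arrays.append(data.reshape(tuple(shape)))
    env.close()
    return tuple(arrays)
```

On a --cached run the backend would call load_training_arrays() instead of re-vectorizing the corpus; without --cached it would rebuild the arrays and overwrite the cache.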