
--cached option to reuse preprocessed training data #342

Closed
osma opened this issue Oct 28, 2019 · 0 comments · Fixed by #376
osma commented Oct 28, 2019

A large proportion of training time is typically spent on preprocessing the training data into a format suitable for the backend algorithm. This is particularly evident for trainable ensemble backends (pav, vw_ensemble, nn_ensemble), which have to pass all the training documents to their source projects.

We could support a --cached option that reuses the already preprocessed training data from the previous train run. This would make it a lot faster to experiment with different hyperparameters. For example:

annif train nn-ensemble-fi my-corpus/train/  # initial training run
annif eval nn-ensemble-fi my-corpus/test/  # evaluate on test documents
# not happy with the result, let's adjust the hyperparameters
$EDITOR projects.cfg
annif train --cached nn-ensemble-fi  # retrain using previous train data (note: no corpus given)
annif eval nn-ensemble-fi my-corpus/test/  # reevaluate
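
A rough sketch of how such a flag could be wired into a Click-based train command (Annif's CLI is built on Click). The get_project() helper, the DemoProject class and the train() signature below are placeholder assumptions for illustration, not the actual Annif code:

```python
# Illustrative only: a Click command with a --cached flag. The project
# lookup and train() signature are placeholders, not Annif internals.
import click


class DemoProject:
    """Stand-in for an Annif project object."""

    def train(self, corpus, cached=False):
        source = "cached training data" if cached else f"corpus {corpus!r}"
        click.echo(f"training from {source}")


def get_project(project_id):  # placeholder registry lookup
    return DemoProject()


@click.command("train")
@click.argument("project_id")
@click.argument("paths", nargs=-1, type=click.Path(exists=True))
@click.option("--cached", is_flag=True,
              help="reuse preprocessed training data from the previous run")
def run_train(project_id, paths, cached):
    if cached and paths:
        raise click.UsageError("corpus paths cannot be combined with --cached")
    project = get_project(project_id)
    project.train(paths or None, cached=cached)


if __name__ == "__main__":
    run_train()
```

The key behaviour is that --cached and corpus paths are mutually exclusive, matching the example above where no corpus is given on the retraining run.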

Implementing this requires changes to some backends:

  • fasttext, vw_multi and vw_ensemble already create training files that are stored in the project data directory; the --cached option would simply skip creating new ones before retraining
  • pav and nn_ensemble only collect the training data into NumPy arrays held in memory; they would need to store those arrays in the project data directory (e.g. as an LMDB database) so they can be reused later (see the sketch after this list)
  • tfidf also collects training data only in memory, but since it has no parameters to tune, there is probably no need for a --cached option (it could just give an error instead)
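
For the pav and nn_ensemble case, here is a rough sketch of what persisting the preprocessed arrays in an LMDB database under the project data directory could look like. The database name, keys and dtypes are illustrative assumptions rather than an existing Annif API:

```python
# Illustrative only: cache preprocessed training arrays in LMDB so a later
# `train --cached` run can reload them without touching the corpus.
import lmdb
import numpy as np

MAP_SIZE = 1024 * 1024 * 1024  # 1 GiB; adjust to the expected data size


def save_training_arrays(datadir, vectors, targets):
    """Store the preprocessed input vectors and target arrays."""
    env = lmdb.open(f"{datadir}/train-cache", map_size=MAP_SIZE)
    with env.begin(write=True) as txn:
        for name, arr in (("vectors", vectors), ("targets", targets)):
            txn.put(name.encode(), arr.astype(np.float32).tobytes())
            txn.put(f"{name}-shape".encode(),
                    np.array(arr.shape, dtype=np.int64).tobytes())
    env.close()


def load_training_arrays(datadir):
    """Reload the cached arrays for a --cached retraining run."""
    env = lmdb.open(f"{datadir}/train-cache", readonly=True)
    arrays = []
    with env.begin() as txn:
        for name in ("vectors", "targets"):
            shape = np.frombuffer(txn.get(f"{name}-shape".encode()),
                                  dtype=np.int64)
            data = np.frombuffer(txn.get(name.encode()), dtype=np.float32)
            arrays.append(data.reshape(tuple(shape)))
    env.close()
    return tuple(arrays)
```

On a --cached run the backend would call load_training_arrays() instead of re-vectorizing the corpus; without --cached it would rebuild the arrays and overwrite the cache.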