A large proportion of training time is typically spent on preprocessing the training data into a format suitable for the backend algorithm. This is particularly evident for the trainable ensemble backends (pav, vw_ensemble, nn_ensemble), which have to pass all the training documents to their source projects.
We could support a --cached option that reuses the already preprocessed training data from the previous train run. This would make it a lot faster to experiment with different hyperparameters. For example:
```
annif train nn-ensemble-fi my-corpus/train/  # initial training run
annif eval nn-ensemble-fi my-corpus/test/    # evaluate on test documents
# not happy with the result, let's adjust the hyperparameters
$EDITOR projects.cfg
annif train --cached nn-ensemble-fi          # retrain using previous training data (note: no corpus given)
annif eval nn-ensemble-fi my-corpus/test/    # re-evaluate
```
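A minimal sketch of what the CLI side could look like, assuming a Click-based command similar to Annif's; the `run_train` function, its argument handling and the final `echo` are illustrative stand-ins, not the project's actual code:

```python
import click


@click.command("train")
@click.argument("project_id")
@click.argument("paths", nargs=-1, type=click.Path(exists=True))
@click.option("--cached", is_flag=True,
              help="reuse preprocessed training data from the previous run")
def run_train(project_id, paths, cached):
    """Train PROJECT_ID from the given corpus paths, or from cached data."""
    if cached and paths:
        raise click.UsageError("--cached cannot be combined with corpus paths")
    if not cached and not paths:
        raise click.UsageError("either corpus paths or --cached is required")
    # A real implementation would load the project here and call its train
    # method, passing the cached flag down to the backend.
    click.echo(f"training {project_id} (cached={cached})")
```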
Implementing this requires changes to some backends:
fasttext, vw_multi and vw_ensemble already create training files that are stored in the project data directory; the --cached option would simply skip regenerating these files before retraining
pav and nn_ensemble only collect the training data into NumPy arrays held in memory; they would need to store these arrays in the data directory (e.g. as an LMDB database) so they can be reused later, as in the sketch after this list
tfidf also collects training data only in memory, but since it has no parameters to tune, there is probably no need for a --cached option (it could just give an error instead)
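For pav and nn_ensemble, the caching could look roughly like this: a minimal sketch of storing and reloading named NumPy arrays in an LMDB database under the project data directory. The function names, the "train-cache" path and the key scheme are assumptions for illustration, not existing Annif code:

```python
import io
import os

import lmdb
import numpy as np


def save_train_arrays(datadir, arrays):
    """Persist named NumPy arrays (e.g. preprocessed training vectors)
    into an LMDB database inside the project data directory."""
    env = lmdb.open(os.path.join(datadir, "train-cache"), map_size=1 << 30)
    with env.begin(write=True) as txn:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            np.save(buf, arr)  # .npy serialization preserves dtype and shape
            txn.put(name.encode("utf-8"), buf.getvalue())
    env.close()


def load_train_arrays(datadir):
    """Reload the cached arrays for a --cached retraining run."""
    path = os.path.join(datadir, "train-cache")
    if not os.path.isdir(path):
        raise FileNotFoundError(
            "no cached training data found; run a full train first")
    env = lmdb.open(path, readonly=True)
    arrays = {}
    with env.begin() as txn:
        for key, value in txn.cursor():
            arrays[key.decode("utf-8")] = np.load(io.BytesIO(value))
    env.close()
    return arrays
```

Reloading the arrays this way would let a retraining run skip re-reading and re-analyzing the full document corpus, which is where most of the time currently goes.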