When the nn_ensemble backend is trained, it sends all documents through the source projects and aggregates their suggestion vectors in memory. This can take up a significant amount of RAM. For example, with three YSO-based source projects, I could only train an NN ensemble with 8k documents on a machine with 16GB RAM - any larger training set leads to an out-of-memory situation.
If we instead streamed the vectors to an LMDB database and then read them back from the LMDB in batches, the backend could scale to much larger training data sets. An additional benefit would be that the LMDB could be retained on disk, so that another training run could be made using the same documents but different hyperparameters (this would require implementing the --cached option - see #342). Since the documents would not have to be processed again, such a rerun would be much faster.
LMDB seems ideal for this, as it is very fast and supports streaming-style operations for both reading and writing. It would introduce an additional dependency, though.