
Use LMDB to store vectors in nn_ensemble #363

Closed
osma opened this issue Nov 29, 2019 · 0 comments · Fixed by #381
osma commented Nov 29, 2019

When the nn_ensemble backend is trained, it sends all documents through the source projects and aggregates their suggestion vectors in memory. This can consume a significant amount of RAM: for example, with three YSO-based source projects, I could only train an NN ensemble with 8k documents on a machine with 16GB RAM; any larger training set leads to an out-of-memory situation.

If we instead streamed the vectors into an LMDB database and then read them back from the LMDB in batches, the backend could scale to much larger training data sets. An additional benefit is that the LMDB file could be retained on disk, so another training run could be made with the same documents but different hyperparameters (this would require implementing the --cached option; see #342) without processing the documents again, which would be much faster.
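
A rough sketch of what the write path might look like with the py-lmdb package; `suggestion_vectors()` and `documents` are hypothetical placeholders for whatever currently produces one aggregated suggestion vector per training document:

```python
import lmdb
import numpy as np

# Sketch only: stream each suggestion vector into LMDB as it is
# produced, instead of accumulating all of them in RAM.
env = lmdb.open('nn-ensemble-train.mdb', map_size=1024 * 1024 * 1024)
with env.begin(write=True) as txn:
    # suggestion_vectors() is a hypothetical stand-in for the code that
    # sends documents through the source projects
    for idx, vector in enumerate(suggestion_vectors(documents)):
        key = '{:08d}'.format(idx).encode()  # zero-padded keys keep order
        txn.put(key, vector.astype(np.float32).tobytes())
```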

LMDB seems ideal for this: it is very fast and supports streaming-style operations for both reading and writing. It would introduce an additional dependency, though.
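
The read side could then be a simple generator that walks the database with a cursor and yields fixed-size batches, roughly like this (again just a sketch; in practice it would have to yield input/target pairs for the Keras training loop):

```python
import lmdb
import numpy as np

def vector_batches(db_path, batch_size=32):
    """Yield batches of float32 vectors from the LMDB in key order,
    without loading the whole training set into memory."""
    env = lmdb.open(db_path, readonly=True)
    with env.begin() as txn:
        batch = []
        for _key, value in txn.cursor():
            batch.append(np.frombuffer(value, dtype=np.float32))
            if len(batch) == batch_size:
                yield np.stack(batch)
                batch = []
        if batch:  # final partial batch
            yield np.stack(batch)
```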
