
Use LMDB to store vectors in nn_ensemble #363

Closed
osma opened this issue Nov 29, 2019 · 0 comments · Fixed by #381
osma commented Nov 29, 2019

When the nn_ensemble backend is trained, it sends all documents through the source projects and aggregates their suggestion vectors in memory. This can consume a significant amount of RAM: for example, with three YSO-based source projects, I could only train an NN ensemble with 8k documents on a machine with 16GB RAM; any larger training set leads to an out-of-memory situation.

If we instead streamed the vectors into an LMDB database and then read them back from the LMDB in batches, the backend could scale to much larger training data sets. An additional benefit is that the LMDB file could be retained on disk, so another training run could be made with the same documents but different hyperparameters (this would require implementing the --cached option; see #342) without processing the documents again, which would be much faster.
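
A rough sketch of what the write path might look like with the py-lmdb package; `suggestion_vectors()` and `documents` are hypothetical placeholders for whatever currently produces one aggregated suggestion vector per training document:

```python
import lmdb
import numpy as np

# Sketch only: stream each suggestion vector into LMDB as it is
# produced, instead of accumulating all of them in RAM.
env = lmdb.open('nn-ensemble-train.mdb', map_size=1024 * 1024 * 1024)
with env.begin(write=True) as txn:
    # suggestion_vectors() is a hypothetical stand-in for the code that
    # sends documents through the source projects
    for idx, vector in enumerate(suggestion_vectors(documents)):
        key = '{:08d}'.format(idx).encode()  # zero-padded keys keep order
        txn.put(key, vector.astype(np.float32).tobytes())
```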

LMDB seems ideal for this: it is very fast and supports streaming-style operations for both reading and writing. It would introduce an additional dependency, though.
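
The read side could then be a simple generator that walks the database with a cursor and yields fixed-size batches, roughly like this (again just a sketch; in practice it would have to yield input/target pairs for the Keras training loop):

```python
import lmdb
import numpy as np

def vector_batches(db_path, batch_size=32):
    """Yield batches of float32 vectors from the LMDB in key order,
    without loading the whole training set into memory."""
    env = lmdb.open(db_path, readonly=True)
    with env.begin() as txn:
        batch = []
        for _key, value in txn.cursor():
            batch.append(np.frombuffer(value, dtype=np.float32))
            if len(batch) == batch_size:
                yield np.stack(batch)
                batch = []
        if batch:  # final partial batch
            yield np.stack(batch)
```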
