Use LMDB and sparse vectors in nn_ensemble backend #381
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage   99.38%   99.39%   +<.01%
==========================================
  Files          59       59
  Lines        3738     3794      +56
==========================================
+ Hits         3715     3771      +56
  Misses         23       23

Continue to review full report at Codecov.
This pull request introduces 1 alert when merging 41d2953 into 21d92bf - view on LGTM.com.
Still need to make sure the LMDB dependencies are included in Docker images.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
Before this PR, the nn_ensemble backend collected score and label (target) vectors in memory, which consumed a lot of RAM when there are thousands of documents, multiple source projects and a large vocabulary such as YSO. This PR makes the nn_ensemble backend spool those vectors into an on-disk, memory-mapped LMDB database instead (implemented as a keras.utils.Sequence subclass). The vectors are also compressed by representing them as SciPy sparse vectors.
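Below is a minimal sketch of the idea, not the actual Annif implementation: a keras.utils.Sequence subclass that spools pickled SciPy sparse rows into a memory-mapped LMDB database and densifies them one batch at a time. The class name LMDBSequence, the db_path and batch_size parameters, and the integer key scheme are illustrative assumptions.

```python
# Minimal sketch, not Annif's actual code: spool (score, label) vector
# pairs into an on-disk, memory-mapped LMDB database and serve them to
# Keras batch by batch via a keras.utils.Sequence subclass.
import pickle

import lmdb
import numpy as np
from scipy.sparse import csr_matrix
from tensorflow.keras.utils import Sequence


class LMDBSequence(Sequence):
    """Keras Sequence that reads sparse training vectors from LMDB."""

    def __init__(self, db_path, batch_size, map_size=1024 ** 3):
        # writemap + map_size let LMDB grow the memory-mapped file on disk
        self._env = lmdb.open(db_path, map_size=map_size, writemap=True)
        self._batch_size = batch_size
        # resume the sample count from any data already in the database
        self._counter = self._env.stat()["entries"]

    def add_sample(self, scores, labels):
        # each sample is stored as two pickled SciPy sparse rows; sparse
        # storage keeps the mostly-zero score/label vectors small on disk
        value = pickle.dumps((csr_matrix(scores), csr_matrix(labels)))
        with self._env.begin(write=True) as txn:
            txn.put(str(self._counter).encode(), value)
        self._counter += 1

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(self._counter / self._batch_size))

    def __getitem__(self, idx):
        # densify only one batch at a time, keeping peak RAM usage low
        scores, labels = [], []
        start = idx * self._batch_size
        stop = min(start + self._batch_size, self._counter)
        with self._env.begin() as txn:
            for i in range(start, stop):
                sc, lb = pickle.loads(txn.get(str(i).encode()))
                scores.append(sc.toarray()[0])
                labels.append(lb.toarray()[0])
        return np.array(scores), np.array(labels)
```

Because only one batch is densified at a time, peak RAM usage stays roughly proportional to batch_size times the vocabulary size rather than to the size of the whole training set.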
The LMDB training data can also be reused in later training rounds via the recently introduced --cached option, although this functionality is not yet implemented in the initial commit.

Fixes #363