Use LMDB and sparse vectors in nn_ensemble backend #381

Merged: osma merged 12 commits into master from issue363-nn-ensemble-lmdb on Jan 29, 2020

osma (Member) commented Jan 28, 2020

Before this PR, the nn_ensemble backend collected score and label (target) vectors in memory, which took a lot of RAM when there were thousands of documents, multiple source projects and a large vocabulary such as YSO. This PR makes the nn_ensemble backend spool those vectors into an on-disk, memory-mapped LMDB database instead (implemented as a keras.utils.Sequence subclass). The vectors are also compressed by representing them as SciPy sparse vectors.
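
As a rough illustration of the idea described above (not the code in this PR), the sketch below spools sparse-encoded score and target vectors into an on-disk key-value store and reads them back in batches through the two methods a keras.utils.Sequence subclass provides, `__len__` and `__getitem__`. To stay dependency-free it swaps in the standard library's `dbm.dumb` for LMDB and a hand-rolled (index, value) encoding for SciPy sparse vectors; `SpooledSequence`, `encode_sparse` and `decode_sparse` are hypothetical names chosen for this example.

```python
import dbm.dumb
import os
import struct
import tempfile

HEADER = struct.Struct("<II")  # (dense length, number of non-zero entries)
PAIR = struct.Struct("<Id")    # (index, value) of one non-zero entry


def encode_sparse(vector):
    """Pack only the non-zero entries of a dense vector."""
    nonzero = [(i, v) for i, v in enumerate(vector) if v != 0.0]
    parts = [HEADER.pack(len(vector), len(nonzero))]
    parts.extend(PAIR.pack(i, v) for i, v in nonzero)
    return b"".join(parts)


def decode_sparse(buf):
    """Rebuild the dense vector from its packed non-zero entries."""
    length, count = HEADER.unpack_from(buf, 0)
    vector = [0.0] * length
    for n in range(count):
        i, v = PAIR.unpack_from(buf, HEADER.size + n * PAIR.size)
        vector[i] = v
    return vector


class SpooledSequence:
    """Hypothetical stand-in for a keras.utils.Sequence subclass.

    Samples live on disk (dbm.dumb here, LMDB in the PR) instead of in RAM.
    """

    def __init__(self, path, batch_size):
        self._db = dbm.dumb.open(path, "c")
        self._batch_size = batch_size
        self._count = 0

    def add_sample(self, scores, targets):
        key = b"%08d" % self._count
        packed_scores = encode_sparse(scores)
        # Store the length of the first blob so the pair can be split again.
        self._db[key] = (struct.pack("<I", len(packed_scores))
                         + packed_scores + encode_sparse(targets))
        self._count += 1

    def __len__(self):
        # Number of batches, rounded up.
        return (self._count + self._batch_size - 1) // self._batch_size

    def __getitem__(self, idx):
        start = idx * self._batch_size
        stop = min(start + self._batch_size, self._count)
        inputs, targets = [], []
        for n in range(start, stop):
            blob = self._db[b"%08d" % n]
            (split,) = struct.unpack_from("<I", blob, 0)
            inputs.append(decode_sparse(blob[4:4 + split]))
            targets.append(decode_sparse(blob[4 + split:]))
        return inputs, targets

    def close(self):
        self._db.close()


if __name__ == "__main__":
    seq = SpooledSequence(os.path.join(tempfile.mkdtemp(), "train"), batch_size=2)
    seq.add_sample([0.0, 0.9, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
    seq.add_sample([0.2, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
    print(len(seq), seq[0])
    seq.close()
```

Only one batch is decoded into dense form at a time, so peak memory depends on the batch size rather than on the number of documents. The encoding pays off when most entries are zero, which is the usual case for scores and targets over a large vocabulary.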

The LMDB training data can also be reused in later training rounds via the recently introduced --cached option, although that reuse is not yet implemented in the initial commit.

Fixes #363

@osma osma added this to the 0.46 milestone Jan 28, 2020

codecov bot commented Jan 28, 2020

Codecov Report

Merging #381 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage   99.38%   99.39%   +<.01%     
==========================================
  Files          59       59              
  Lines        3738     3794      +56     
==========================================
+ Hits         3715     3771      +56     
  Misses         23       23

Impacted Files                    | Coverage Δ
tests/test_backend_nn_ensemble.py | 100% <100%> (ø) ⬆️
annif/backend/nn_ensemble.py      | 100% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 21d92bf...7fa3dac.

lgtm-com bot commented Jan 28, 2020

This pull request introduces 1 alert when merging 41d2953 into 21d92bf.

new alerts:

  • 1 for Unused import

osma (Member, Author) commented Jan 29, 2020

Still need to make sure the LMDB dependencies are included in Docker images.

sonarcloud bot commented Jan 29, 2020

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A); Security Hotspots to review: 0
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@osma osma marked this pull request as ready for review January 29, 2020 12:06
@osma osma merged commit b5edc6d into master Jan 29, 2020
@osma osma deleted the issue363-nn-ensemble-lmdb branch January 29, 2020 15:20
Successfully merging this pull request may close these issues.

Use LMDB to store vectors in nn_ensemble