Use LMDB and sparse vectors in nn_ensemble backend #381
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage   99.38%   99.39%   +<.01%
==========================================
  Files          59       59
  Lines        3738     3794      +56
==========================================
+ Hits         3715     3771      +56
  Misses         23       23

Continue to review full report at Codecov.
This pull request introduces 1 alert when merging 41d2953 into 21d92bf - view on LGTM.com.
Still need to make sure the LMDB dependencies are included in Docker images.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
Before this PR, the nn_ensemble backend collected score and label (target) vectors in memory, which consumed a lot of RAM when there are thousands of documents, multiple source projects and a large vocabulary such as YSO. This PR makes the nn_ensemble backend spool those vectors into an on-disk, memory-mapped LMDB database instead (implemented as a keras.utils.Sequence subclass). The vectors are also compressed by representing them as SciPy sparse vectors.
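Below is a minimal sketch of the idea, not the actual Annif implementation: a keras.utils.Sequence subclass that spools pickled SciPy sparse rows into a memory-mapped LMDB database and densifies them one batch at a time. The class name LMDBSequence, the db_path and batch_size parameters, and the integer key scheme are illustrative assumptions.

```python
# Minimal sketch, not Annif's actual code: spool (score, label) vector
# pairs into an on-disk, memory-mapped LMDB database and serve them to
# Keras batch by batch via a keras.utils.Sequence subclass.
import pickle

import lmdb
import numpy as np
from scipy.sparse import csr_matrix
from tensorflow.keras.utils import Sequence


class LMDBSequence(Sequence):
    """Keras Sequence that reads sparse training vectors from LMDB."""

    def __init__(self, db_path, batch_size, map_size=1024 ** 3):
        # writemap + map_size let LMDB grow the memory-mapped file on disk
        self._env = lmdb.open(db_path, map_size=map_size, writemap=True)
        self._batch_size = batch_size
        # resume the sample count from any data already in the database
        self._counter = self._env.stat()["entries"]

    def add_sample(self, scores, labels):
        # each sample is stored as two pickled SciPy sparse rows; sparse
        # storage keeps the mostly-zero score/label vectors small on disk
        value = pickle.dumps((csr_matrix(scores), csr_matrix(labels)))
        with self._env.begin(write=True) as txn:
            txn.put(str(self._counter).encode(), value)
        self._counter += 1

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(self._counter / self._batch_size))

    def __getitem__(self, idx):
        # densify only one batch at a time, keeping peak RAM usage low
        scores, labels = [], []
        start = idx * self._batch_size
        stop = min(start + self._batch_size, self._counter)
        with self._env.begin() as txn:
            for i in range(start, stop):
                sc, lb = pickle.loads(txn.get(str(i).encode()))
                scores.append(sc.toarray()[0])
                labels.append(lb.toarray()[0])
        return np.array(scores), np.array(labels)
```

Because only one batch is densified at a time, peak RAM usage stays roughly proportional to batch_size times the vocabulary size rather than to the size of the whole training set.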
The LMDB training data can also be reused in later training rounds via the recently introduced --cached option, although this functionality is not yet implemented in the initial commit.

Fixes #363