[EXP] support LCA functionality for SqliteIndex + LineageDB_Sqlite databases. #1933
Codecov Report
@@ Coverage Diff @@
## add/sqlite_index #1933 +/- ##
====================================================
+ Coverage 83.25% 90.43% +7.17%
====================================================
Files 126 96 -30
Lines 14231 10243 -3988
Branches 1958 2005 +47
====================================================
- Hits 11848 9263 -2585
+ Misses 2093 673 -1420
- Partials 290 307 +17
The last command took:
so that's 3.5 minutes for a rankinfo on a GTDB-wide LCA database at scaled=1000. How do you like THEM 🍎 ??!!
SQL manifests
CSV manifests
The key thing here is that the SQL manifest is much faster and requires only 45 MB of RAM, while the CSV manifest is not only 3x slower (77s vs 25s) but requires 6.2 GB of RAM. 🎉 Note, this is an <ahem> fairly large collection 😁
Note: PR into #1930
This PR adds support for generic programmatic and command-line LCA_Database functionality based on sqlite databases for the specialized situation where you have the tables for the SqliteIndex and LineageDB_Sqlite databases in one file.
That means, for example, that you can do the following:
As a simpler version of the above, this PR also supports complete creation of a SQLite LCA database via the -F sql argument to sourmash lca index.
Note that at the moment there's actually nothing in the implementation that requires that the LineageDB be a SQLite database; it was just simple to hack it all into one file load, and it made subverting the LCA Database loading functions easy. We could support dynamic lineage loading via the MultiLineageDB.load(...) command easily enough at the programmatic level.
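To make the "both sets of tables in one file" idea concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not the sourmash schema or API: the table names (sketches, hashes, lineages), the helper names (build_demo_db, lca, lca_for_hash), and the toy data are all hypothetical, chosen only to show how a hash lookup and a lineage lookup can be served by one SQLite connection and combined into an LCA.

```python
import sqlite3


def build_demo_db(path=":memory:"):
    """Create a single SQLite database holding hash and lineage tables.

    The schema is hypothetical and for illustration only.
    """
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE sketches (id INTEGER PRIMARY KEY, ident TEXT UNIQUE);
        CREATE TABLE hashes (hashval INTEGER, sketch_id INTEGER);
        CREATE INDEX hashes_idx ON hashes (hashval);
        CREATE TABLE lineages (ident TEXT PRIMARY KEY, lineage TEXT);
        """
    )
    db.executemany(
        "INSERT INTO sketches (id, ident) VALUES (?, ?)",
        [(1, "genomeA"), (2, "genomeB"), (3, "genomeC")],
    )
    db.executemany(
        "INSERT INTO lineages (ident, lineage) VALUES (?, ?)",
        [
            ("genomeA", "Bacteria;Proteobacteria;Escherichia"),
            ("genomeB", "Bacteria;Proteobacteria;Salmonella"),
            ("genomeC", "Bacteria;Firmicutes;Bacillus"),
        ],
    )
    db.executemany(
        "INSERT INTO hashes (hashval, sketch_id) VALUES (?, ?)",
        [(10, 1), (10, 2), (20, 1), (20, 3), (30, 3)],
    )
    db.commit()
    return db


def lca(lineages):
    """Return the longest shared prefix of rank-ordered lineage tuples."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return tuple(common)


def lca_for_hash(db, hashval):
    """Find every lineage containing hashval and reduce them to their LCA."""
    rows = db.execute(
        "SELECT l.lineage FROM hashes h "
        "JOIN sketches s ON s.id = h.sketch_id "
        "JOIN lineages l ON l.ident = s.ident "
        "WHERE h.hashval = ?",
        (hashval,),
    ).fetchall()
    return lca([tuple(row[0].split(";")) for row in rows])


if __name__ == "__main__":
    demo = build_demo_db()
    for hv in (10, 20, 30, 99):
        print(hv, lca_for_hash(demo, hv))
```

Because the join runs inside SQLite against an indexed column, only the rows for the queried hash are pulled into Python; nothing in this sketch requires the lineage table to live in the same file, which mirrors the note above that the single-file layout is a convenience and not a hard requirement.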