[EXP] support LCA functionality for SqliteIndex + LineageDB_Sqlite databases. #1933

Merged: ctb merged 39 commits into add/sqlite_index from add/sqlite_index_lca on Apr 12, 2022

@ctb ctb commented Apr 6, 2022

Note: PR into #1930

This PR adds programmatic and command-line LCA_Database functionality backed by SQLite, for the specific case where the tables for the SqliteIndex and LineageDB_Sqlite databases live in the same file.

That means, for example, that you can do the following:

# build a SqliteIndex for a bunch of signatures
sourmash sig cat -k 31 podar-ref.zip -o podar-ref.sqldb
# add lineage/taxonomy information into that same database
sourmash tax prepare -F sql -o podar-ref.sqldb -t podar-ref/podar-lineage.csv 

# call an LCA command on it
sourmash lca rankinfo podar-ref.sqldb

As a simpler version of the above, this PR also supports creating a complete SQLite LCA database in one step via the -F sql argument to sourmash lca index.

Note that at the moment nothing in the implementation actually requires the LineageDB to be a SQLite database; it was just simple to hack it all into a single file load, and doing so made it easy to subvert the LCA Database loading functions. We could easily support dynamic lineage loading via MultiLineageDB.load(...) at the programmatic level.
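
To illustrate the "both sets of tables in one file" idea, here is a minimal sketch that lists the tables in a SQLite database and checks that a required set is present. It uses only the Python standard library. The table names `demo_sketches` and `demo_taxonomy` are hypothetical placeholders, not the names sourmash actually creates, and the helper functions are not part of the sourmash API.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def has_tables(conn, required):
    """Return True if every table name in `required` exists in the database."""
    present = set(list_tables(conn))
    return all(name in present for name in required)


# Demo on an in-memory database with two hypothetical tables standing in
# for an index table and a lineage table stored side by side.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_sketches (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE demo_taxonomy (ident TEXT PRIMARY KEY, species TEXT)")

print(list_tables(conn))
print(has_tables(conn, ["demo_sketches", "demo_taxonomy"]))
```

To inspect a real file, pass its path to `sqlite3.connect(...)` instead of `":memory:"`.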

codecov bot commented Apr 6, 2022

Codecov Report

Merging #1933 (50976f7) into add/sqlite_index (b311d36) will increase coverage by 7.17%.
The diff coverage is 77.96%.

@@                 Coverage Diff                  @@
##           add/sqlite_index    #1933      +/-   ##
====================================================
+ Coverage             83.25%   90.43%   +7.17%     
====================================================
  Files                   126       96      -30     
  Lines                 14231    10243    -3988     
  Branches               1958     2005      +47     
====================================================
- Hits                  11848     9263    -2585     
+ Misses                 2093      673    -1420     
- Partials                290      307      +17     
Flag     Coverage Δ
python   90.43% <77.96%> (-0.61%) ⬇️
rust     ?

Impacted Files                          Coverage Δ
src/sourmash/lca/command_index.py       88.55% <50.00%> (-0.95%) ⬇️
src/sourmash/manifest.py                91.16% <63.63%> (-3.94%) ⬇️
src/sourmash/index/sqlite_index.py      82.29% <75.24%> (-9.11%) ⬇️
src/sourmash/tax/tax_utils.py           97.43% <87.50%> (-0.35%) ⬇️
src/sourmash/sqlite_utils.py            90.90% <90.90%> (ø)
src/sourmash/lca/lca_db.py              91.94% <93.75%> (+0.14%) ⬆️
src/sourmash/cli/lca/index.py           100.00% <100.00%> (ø)
src/sourmash/lca/command_rankinfo.py    84.61% <100.00%> (ø)
src/sourmash/sig/__main__.py            93.55% <100.00%> (+0.21%) ⬆️
src/sourmash/sourmash_args.py           93.20% <100.00%> (ø)
... and 32 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ctb commented Apr 6, 2022

% sourmash sig flatten gtdb-rs202.genomic-reps.k31.zip -o gtdb-rs202.genomic-reps.k31.sqldb
...
% sourmash tax prepare -F sql -t gtdb-rs202.taxonomy.v2.csv -o gtdb-rs202.genomic-reps.k31.sqldb
...
% sourmash sig summarize gtdb-rs202.genomic-reps.k31.sqldb

== This is sourmash version 4.2.4.dev78+g31003c86. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs202.genomic-reps.k31.sqldb'
path filetype: SqliteIndex
location: gtdb-rs202.genomic-reps.k31.sqldb
is database? yes
has manifest? yes
num signatures: 47894
** examining manifest...
total hashes: 161364669
summary of sketches:
   47894 sketches with DNA, k=31, scaled=1000         161364669 total hashes

% sourmash lca rankinfo gtdb-rs202.genomic-reps.k31.sqldb

== This is sourmash version 4.2.4.dev78+g31003c86. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

{'SqliteIndex': '1.0', 'SqliteManifest': '1.0'}1.sqldb
setting ksize to 31
setting moltype to DNA
loaded 1 LCA databases. ksize=31, scaled=1000 moltype=DNA
superkingdom: 90414 (1.1%)
phylum: 33193 (0.4%)
class: 153022 (1.8%)
order: 73194 (0.9%)
family: 328131 (3.9%)
genus: 1919159 (22.8%)
species: 5825082 (69.2%)
strain: 0 (0.0%)

The last command took:

real    3m31.572s
user    3m25.825s
sys     0m5.111s

so that's 3.5 minutes for a rankinfo on a GTDB-wide LCA database at scaled=1000. How do you like THEM 🍎 ??!!
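
The percentages in the rankinfo output above are consistent with each rank's count divided by the sum of all counts. The sketch below reproduces them from the printed numbers; it is only a check on the arithmetic, not sourmash's own code, and `rank_percentages` is a hypothetical helper.

```python
# Counts copied from the rankinfo output above.
counts = {
    "superkingdom": 90414,
    "phylum": 33193,
    "class": 153022,
    "order": 73194,
    "family": 328131,
    "genus": 1919159,
    "species": 5825082,
    "strain": 0,
}


def rank_percentages(counts):
    """Return each rank's share of the total count, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return {rank: 0.0 for rank in counts}
    return {rank: 100.0 * n / total for rank, n in counts.items()}


percentages = rank_percentages(counts)
for rank, n in counts.items():
    print(f"{rank}: {n} ({percentages[rank]:.1f}%)")
```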

ctb added a commit to sourmash-bio/database-examples that referenced this pull request Apr 10, 2022
ctb commented Apr 11, 2022

SQL manifests

        Command being timed: "sourmash sig summarize entire.mar29.sqlmf"
        User time (seconds): 23.19
        System time (seconds): 1.64
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.00
        Maximum resident set size (kbytes): 45016

CSV manifests

        Command being timed: "sourmash sig summarize ../entire.mar29.csv"
        User time (seconds): 74.27
        System time (seconds): 3.84
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:17.47
        Maximum resident set size (kbytes): 6283508

The key thing here is that the SQL manifest is much faster and requires only 45 MB of RAM, while the CSV manifest is not only 3x slower (77s vs 25s) but requires about 6.3 GB of RAM. 🎉

Note, this is an <ahem> fairly large collection 😁

** loading from '../entire.mar29.csv'
path filetype: StandaloneManifestIndex
location: ../entire.mar29.csv
is database? yes
has manifest? yes
num signatures: 4731705
** examining manifest...
total hashes: 55575636699
summary of sketches:
   1577235 sketches with DNA, k=21, scaled=1000, abund 17430160184 total hashes
   1577235 sketches with DNA, k=31, scaled=1000, abund 18549305689 total hashes
   1577235 sketches with DNA, k=51, scaled=1000, abund 19596170826 total hashes
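
A per-(moltype, ksize, scaled) summary like the one above can be computed inside SQLite with a single GROUP BY, so that Python only ever receives one row per group rather than one row per sketch. The sketch below shows the idea on a toy table. Whether sourmash's SQL manifest works this way is an assumption, and the table name `demo_manifest` and its columns are hypothetical.

```python
import sqlite3

# Toy manifest: one row per sketch, with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_manifest ("
    "moltype TEXT, ksize INTEGER, scaled INTEGER, n_hashes INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_manifest VALUES (?, ?, ?, ?)",
    [
        ("DNA", 21, 1000, 100),
        ("DNA", 21, 1000, 150),
        ("DNA", 31, 1000, 120),
        ("DNA", 31, 1000, 180),
        ("DNA", 51, 1000, 200),
    ],
)


def summarize(conn):
    """Return (moltype, ksize, scaled, num_sketches, total_hashes) per group."""
    query = (
        "SELECT moltype, ksize, scaled, COUNT(*), SUM(n_hashes) "
        "FROM demo_manifest "
        "GROUP BY moltype, ksize, scaled "
        "ORDER BY ksize"
    )
    return list(conn.execute(query))


for moltype, ksize, scaled, num, total in summarize(conn):
    print(f"{num} sketches with {moltype}, k={ksize}, scaled={scaled}: {total} total hashes")
```

The aggregation runs in the database engine, so memory use in Python scales with the number of groups rather than the number of rows.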

@ctb ctb merged commit 50976f7 into add/sqlite_index Apr 12, 2022
@ctb ctb deleted the add/sqlite_index_lca branch April 12, 2022 13:44