note: these were calculated with sourmash 4.3.
these should probably go somewhere close to the database pages, perhaps when we put out a new release - ref #1941
note: we also have benchmarks on metagenomes against full GenBank here.
benchmarks - prefetch
gtdb genomic reps, 65k sigs, DNA, ksize=31, db scaled=1000
query was SRR606249 / podar-ref.
query scaled=10,000
863 matches total;
53.9k query hashes, 19.0k found in matches above threshold. sqldb here was produced via `sourmash sig flatten $zip -o $sqldb`, sbt.zip with `sourmash index $sbt $zip`.
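For concreteness, here is a rough sketch of how the databases could be built and the prefetch runs timed. The file names, the `/usr/bin/time` wrapper, and the exact prefetch flags are assumptions for illustration - not a record of the commands that produced the numbers below.

```bash
# Hypothetical reproduction sketch -- file names and the timing wrapper are
# assumptions; only the sourmash subcommands are the ones named above.

# build the sqlite and SBT databases from the zipfile collection
sourmash sig flatten gtdb-reps.zip -o gtdb-reps.sqldb
sourmash index gtdb-reps.sbt.zip gtdb-reps.zip

# run prefetch against each database, recording wall time and peak memory
for db in gtdb-reps.sqldb gtdb-reps.sbt.zip gtdb-reps.zip; do
    /usr/bin/time -v sourmash prefetch podar-ref.sig "$db" \
        -k 31 --scaled 10000 -o prefetch-results.csv
done
```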
| db format | db size | time   | memory |
| --------- | ------- | ------ | ------ |
| sqldb     | 15 GB   | 28.2s  | 2.6 GB |
| sbt       | 3.5 GB  | 2m 43s | 2.9 GB |
| zip       | 1.7 GB  | 5m 16s | 1.9 GB |
query scaled=1000
625 matches total;
374.6k query hashes, 189.1k found in matches above threshold
| db format | db size | time   | memory |
| --------- | ------- | ------ | ------ |
| sqldb     | 15 GB   | 3m 58s | 9.9 GB |
| sbt       | 3.5 GB  | 7m 33s | 2.6 GB |
| zip       | 1.7 GB  | 5m 53s | 2.0 GB |
thoughts
I was surprised by SBT being slower than the others, since it's pretty fast on simpler (single-genome) queries. I think it reflects a few different things - but it's mostly about the complex query, along with how much faster everything else has become. (I ran the benchmarks twice to make sure the numbers were legit!)
this is all single threaded; once we get multithreaded/rust-based searching of zip files in, zipfile search is gonna be smokin'.
sqldb is showing its value especially with higher scaled - since the scaled value is used as a constraint directly in the SQL query, we're searching a much smaller space of hashes (see the sketch after these notes). I was surprised to see the high memory usage, and it might be worth revisiting the code to see if that's coming from choices made in Python land (likely) or if that's internal to sqlite.
the extra on-disk size for sqldb is because the sqldb implementation has a lot of indices and doesn't seem to compress anything. I don't think we'll be distributing sqlite databases via download anytime soon 😆
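To make the scaled point concrete: a scaled value corresponds to a maximum-hash cutoff (roughly 2^64 / scaled), and pushing that cutoff into the WHERE clause means sqlite never scans hashes above it. The table/column names and stored hash encoding below are assumptions for illustration only, not the documented SqliteIndex schema.

```bash
# Illustration only -- table/column names (sourmash_hashes, hashval) and the
# on-disk hash encoding are assumptions; the real SqliteIndex schema may differ.
SCALED=10000
MAX_HASH=$(python3 -c "print(2**64 // $SCALED)")

# with the cutoff inside the WHERE clause, only hashes at or below the
# threshold are ever scanned -- a higher query scaled shrinks the search space
sqlite3 gtdb-reps.sqldb \
  "SELECT COUNT(*) FROM sourmash_hashes WHERE hashval <= $MAX_HASH;"
```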
benchmarks - LCA summarize
gtdb genomic reps, 65k sigs, DNA, ksize=31, db scaled=10,000
command:
query scaled=10,000
53.9k query hashes
thoughts
this is fast! and low memory!
as I wrote elsewhere, LCA-style queries into sqlite databases are one of the real pitches for #1808 - `SqliteIndex` itself is a nice proof of concept, but not compelling from a performance/disk space perspective. A fast on-disk approach will be nice! (`SqliteCollectionManifest` is also fantastic, FWIW.)
it's interesting to see the low memory for SQL here compared to the prefetch benchmarks. Makes me think that I'm doing something bad with memory in the `SqliteIndex.find(...)` code 🤔.
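As an illustration only (the database filename is a placeholder, and pointing `lca summarize` at a sqlite database is assumed from the #1808 work rather than recorded above), an LCA summarize run of this shape would look like:

```bash
# Illustrative sketch only -- filenames are placeholders, and sqlite support
# in `lca summarize` is assumed from the #1808 discussion above.
/usr/bin/time -v sourmash lca summarize \
    --db gtdb-reps.lca.sqldb \
    --query podar-ref.sig
```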