some simple benchmarking of sourmash gather on GTDB zipfiles/SBTs #1530
on the full SBT:
so 60 seconds and 1 GB to search 300k signatures?!
full genomic SBT is 15 GB for the 300k sigs.
ran
Results on Really Big files (15 GB .sbt.zip for ~280k all-GTDB)
I'll label this with FAQ and leave this here.
This was referenced Apr 29, 2022
integrated into docs with #2025.
While writing a blog post about the sourmash v4.1 release, I got curious about the practical implications of --linear/--no-linear and --prefetch/--no-prefetch, so I ran the following benchmark script and recorded the output. The benchmark script and raw output are at the bottom.

The query signature here was a merge of four signatures that were present in the database, so gather would do four iterations.
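For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a timing harness. It is not the script used for the numbers in this issue; the file names `query.sig` and `gtdb.zip` and the helper names `build_commands`, `time_command` and `run_benchmark` are made up for illustration. It assumes a Unix-like system (it relies on `os.wait4`) and Python 3.9 or later.

```python
import itertools
import os
import subprocess
import sys
import time


def build_commands(query, database):
    """Return one `sourmash gather` command per flag combination."""
    linear_flags = ["--linear", "--no-linear"]
    prefetch_flags = ["--prefetch", "--no-prefetch"]
    return [
        ["sourmash", "gather", query, database, linear, prefetch]
        for linear, prefetch in itertools.product(linear_flags, prefetch_flags)
    ]


def time_command(cmd):
    """Run cmd and return (wall-clock seconds, peak RSS, exit code).

    os.wait4 reports resource usage for this one child only.
    ru_maxrss is in kilobytes on Linux and in bytes on macOS.
    """
    start = time.perf_counter()
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    _, status, usage = os.wait4(proc.pid, 0)
    elapsed = time.perf_counter() - start
    # Record the exit code so Popen does not try to wait on the child again.
    proc.returncode = os.waitstatus_to_exitcode(status)
    return elapsed, usage.ru_maxrss, proc.returncode


def run_benchmark(query="query.sig", database="gtdb.zip"):
    """Time every flag combination (hypothetical file names)."""
    for cmd in build_commands(query, database):
        elapsed, peak, code = time_command(cmd)
        print(" ".join(cmd[4:]), f"{elapsed:.1f}s", f"maxrss={peak}", f"exit={code}")
```

Calling `run_benchmark()` with a real query signature and database prints one line per combination of the four flags. Each command runs once, so for anything beyond a rough comparison it would be worth repeating runs and watching out for filesystem caching between them.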
Summary:
Zipfile collection
Indexed zipfile (SBT):
conclusions
so I think I understand almost everything here, which is good, since I wrote a lot of the code 😆 -
--no-prefetch;
--no-linear and linear are identical;
linear is way slower than using the index, of course!

but the two weird results are for the SBT:
--linear than with --no-linear?
so my hypothesis (a theory we can test! :dora:) is that the SBT .signatures() method is keeping all the sigs in memory. The puzzling thing is that the memory usage is so high for that - maybe it's keeping the tree in memory, too, or something?

Anyway, the two big conclusions are the obvious ones and also reflect the defaults for sourmash:
--no-linear --prefetch is generally best;
--prefetch by default;

script and raw output
Raw output attached.
bench.txt
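On the hypothesis that .signatures() keeps all the sigs in memory: a toy sketch of what that effect looks like and how one might measure it. Nothing here touches sourmash. `fake_signatures` is a made-up stand-in that yields byte strings, and the sizes are arbitrary. Note that `tracemalloc` only sees allocations made through Python's allocator, so for the real SBT (where much of the work happens in the Rust core) process-level peak RSS is probably the more trustworthy number.

```python
import tracemalloc


def fake_signatures(n, size):
    """Lazily yield n placeholder 'signatures' (plain byte strings)."""
    for _ in range(n):
        yield bytes(size)


def peak_bytes(consume):
    """Return the peak Python-level allocation (bytes) while consume() runs."""
    tracemalloc.start()
    try:
        consume()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def scan_lazily():
    # Streaming: each item can be freed before the next one is created.
    return sum(len(sig) for sig in fake_signatures(200, 50_000))


def scan_eagerly():
    # Materializing: all 200 items are alive at once before the scan starts.
    sigs = list(fake_signatures(200, 50_000))
    return sum(len(sig) for sig in sigs)


lazy_peak = peak_bytes(scan_lazily)
eager_peak = peak_bytes(scan_eagerly)
print(f"lazy peak:  {lazy_peak} bytes")
print(f"eager peak: {eager_peak} bytes")
```

If a linear scan over the SBT behaves like `scan_eagerly`, peak memory would scale with the total size of the signatures rather than with the largest single one, which would be consistent with the hypothesis but would not by itself confirm it.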