Gather performance improvements discussion #838
Whoa, there's something bad going on all right :) Used the following code,
and got:
(This was with the podar-ref LCA database.) I didn't think we'd changed that code much between 3.0.1 and 2.3.1 but clearly I am mistaken :) |
Yes, my recollection is indeed ...vastly wrong. We merged in the new index ABC (#556) after 2.3.1 and then moved immediately to Rust and released 3.x. I suppose there's a lesson in there about our versioning, @luizirber! |
(nothing to do with rust!) |
OK, using py-spy and digging around a bit, I think the culprit is probably in the new interplay between |
I'm a fan of "one release per PR", but we also wanted to do #556 and #424 in 3.0.0, so... yeah. See #655 for more info. |
But both SBTs and LCA indices define their own
It's not ideal, but we can pass |
yes, I know. I'm digging. seems not as simple as I'd first thought, of
course :)
|
I'll dump here some other pointers to

Engineering

We can continue running

Research

SBT indexing is not very smart: it finds the next available position in the tree, and puts the signature there. Ideally we would want signatures to be clustered by some metric (probably similarity, but it could also just be common hashes), because the search algorithms (be it depth-first search like in

So, one direction to solve
Another direction, but depending on the first one: change the search strategy.
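(A minimal sketch of the first direction, for concreteness: order the signatures so that sketches sharing many hashes end up next to each other before they are inserted. Plain Python sets stand in for sketches here, and `greedy_order` is an invented helper, not part of sourmash.)

```python
# Hypothetical sketch: order sketches so that consecutive ones share many hashes,
# so similar signatures become neighboring leaves when inserted in this order.
def greedy_order(hash_sets):
    remaining = list(range(len(hash_sets)))
    order = [remaining.pop(0)]  # start from an arbitrary sketch
    while remaining:
        last = hash_sets[order[-1]]
        # pick the remaining sketch with the most hashes in common with the last one
        best = max(remaining, key=lambda i: len(last & hash_sets[i]))
        remaining.remove(best)
        order.append(best)
    return order

# toy example: three tiny "sketches"; the overlapping ones end up adjacent
sketches = [{1, 2, 3}, {10, 11, 12}, {2, 3, 4}]
print(greedy_order(sketches))  # -> [0, 2, 1]
```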
|
Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this. |
On Mon, Jan 13, 2020 at 04:33:48PM -0800, satish viswanatham wrote:
> Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.
One of the big reasons to move to Rust, as I understand it, is better
support for multithreading. Before, the C++ layer was a real pain in
terms of threading...
The algorithm is not trivial to parallelize, however.
|
Embarrassing realization: my earlier benchmarking of lca gather was off, because the v2.3.1 code modified the incoming query object and removed all the minhashes, so I was essentially comparing gather on an empty signature (in v2.3.1) to gather on a full signature (in v3.0.1). When I run a real comparison, the LCA gather is about 50% faster in v3 than in v2.3.1. OK, now that I have that figured out, going to take a look at the SBT performance. |
Well, at least one problem (mentioned above) is that neither gather implementation (LCA or SBT) was using thresholding in search. That's fixed in #843. My benchmarking is highly variable but I'm seeing what looks like consistent improvements in gather performance on SBTs. |
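(To make the thresholding idea concrete, here is a hypothetical sketch of pruning a tree search once a node's overlap with the query falls below the reporting threshold. It uses plain Python dicts and sets, not the actual SBT code from #843.)

```python
# Hypothetical sketch of threshold pruning in a depth-first search over a tree
# whose internal nodes hold a superset estimate of their leaves' hashes.
def search_with_threshold(node, query, threshold, results):
    overlap = len(query & node["hashes"]) / len(query)
    if overlap < threshold:
        return  # prune: nothing below this node can reach the threshold
    if not node["children"]:
        results.append((overlap, node["name"]))
        return
    for child in node["children"]:
        search_with_threshold(child, query, threshold, results)

# toy tree: a root with two leaves
tree = {"name": "root", "hashes": {1, 2, 3, 4}, "children": [
    {"name": "leafA", "hashes": {1, 2}, "children": []},
    {"name": "leafB", "hashes": {3, 4}, "children": []},
]}
results = []
search_with_threshold(tree, {1, 2, 9, 10}, 0.4, results)
print(results)  # -> [(0.5, 'leafA')]; leafB is dropped (overlap 0.0 < 0.4)
```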
Maybe, maybe... It's not trivial, but I've been thinking about doing something like this, using rayon: But that's still a bit far in the future, because #532 needs to land (3.1), and then I need to write the |
For some more context: we are running with a DB file (200 GB zipped) with over 100k genomes. Would sorting these genomes in the DB file help? As a last resort, we may consider chopping the large DB file into multiple DB files; even passing both a starting and an ending index into the DB file would be awesome, if you can support that kind of DB lookup. Happy to expand this further. Not sure I am making myself very clear here. cc @metajinomics |
hi @satishv, one thing you can do today would be to split the large DB into multiple files, as you suggest. If you can do so in such a way that related genomes are kept in the same database subset, that might improve things significantly. A rough pipeline for doing so might be to:
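(One possible version of that pipeline, sketched very roughly in Python; the grouping heuristic and file names are assumptions of mine, not an established recipe. It relies on `load_signatures`/`save_signatures` from the sourmash Python API, and each subset would then be indexed separately with `sourmash index`.)

```python
# Rough sketch only: split a big signature collection into subsets of related
# genomes by grouping on shared hashes, then index each subset separately.
# File names are made up; the grouping heuristic is just one possibility.
import sourmash

sigs = list(sourmash.load_signatures("all-genomes.sig"))

groups = []  # each entry: (hashes seen so far in the group, [signatures])
for sig in sigs:
    hashes = set(sig.minhash.get_mins())
    for seen, members in groups:
        if hashes & seen:       # overlaps an existing group: join it
            members.append(sig)
            seen.update(hashes)
            break
    else:                       # no overlap with any group: start a new one
        groups.append((hashes, [sig]))

for n, (_, members) in enumerate(groups):
    with open("subset-{}.sig".format(n), "wt") as fp:
        sourmash.save_signatures(members, fp)
    # then build a separate database per subset, e.g.:
    #   sourmash index subset-0 subset-0.sig
```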
best, |
@ctb - Thanks for the above info. Doing the above will certainly add to our timelines. We are hoping for small changes on our side at this point. It would help if you could also make changes on your end to speed things up as much as possible. |
hi satish,
the only big speed improvements I see coming in the near future are in the
rust-sigs branch, #532, which I see is ready for review and merge. I'll
probably get to a review tomorrow or over the weekend.
There are various ways to construct your databases so that they are smaller
and faster; I would suggest trying out different --scaled parameters, e.g.
10000 is much faster than 1000.
best,
--titus
|
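(A small illustration of the --scaled suggestion above: a sketch built with scaled=10000 keeps roughly a tenth as many hashes as one built with scaled=1000, so the resulting databases are smaller and faster to search. The random sequence is purely illustrative; the MinHash calls reflect the sourmash 3.x Python API as I understand it.)

```python
# Illustration of --scaled: larger scaled values keep fewer hashes per sketch,
# so databases built from them are smaller and faster to search.
import random
from sourmash import MinHash

random.seed(1)
seq = "".join(random.choice("ACGT") for _ in range(200000))  # toy sequence

mh_1k = MinHash(n=0, ksize=31, scaled=1000)
mh_10k = MinHash(n=0, ksize=31, scaled=10000)
mh_1k.add_sequence(seq)
mh_10k.add_sequence(seq)

# scaled=10000 retains roughly 1/10th the hashes that scaled=1000 does
print(len(mh_1k.get_mins()), len(mh_10k.get_mins()))
```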
thanks a lot! We are using sourmash-2.0.1 now. I assume no outputs or formats will be changed in 4.0 or in the above fix. I assume we will be able to easily swap 2.0.1 for your above fix, ideally without a lot of effort? cc @brendanwee |
On Wed, Jan 15, 2020 at 11:15:43AM -0800, satish viswanatham wrote:
> thanks a lot!
> we are using sourmash-2.0.1 now. I assume no outputs & format will be changed in the 4.0/or the above fix. we will be able to swap 2.0.1 with your above fix.
> so we have to move to cc @brendanwee
4.0 may change output and formats, but all of the near-term improvements will
be made in 3.x, which will be stable - c.f. semantic versioning.
|
thanks. looking forward to next week, hope to incorporate your changes as soon as ready. can not wait. |
@ctb - we request that the performance of gather be prioritized over compute performance. Let us know if you have any estimate on the release date. We totally appreciate this effort. Please note that we are not interested in lca gather at this point. cc @luizirber |
hi @satishv, I can't speak for @luizirber here, but I have no immediate plans to work on improving gather performance. The current performance is not really an obstacle for anything I want to do, and I'm usually much more interested in correctness, user experience, and memory usage than I am in performance. More generally, we are always happy to consider pull requests that implement your priorities; we could also help connect you with some consultants that might be able to do the work on your desired timeline. Since the basic gather functionality is well tested and (presumably) performance improvements wouldn't change the API or behavior, there is no obstacle to releasing a new version of sourmash as soon as a PR is merged. You should also note that there is a distinction between sourmash-gather-on-LCA-databases and |
more performance optimization stuff for gather on SBTs, here. basically, performance improvements for gather on an SBT from parallelization could come from --
#925 may also help by doing a better job of constructing the SBT so that less of the tree is sorted. |
Important note here: the compressed BFs are on DISK, not in MEMORY. Using larger BFs might incur more memory usage. To fix that:
|
ok, with the release of v3.3.0, things have improved dramatically - this is because #799 now permits direct loading of compressed nodegraphs without intermediate files. #648 also adds the convenience of keeping the entire database in one compressed .zip file. I think #925 is the next big optimization expected to land. |
re
since hashes from the best match (across all databases) are removed from the query, it might be possible to provide a hint to unload the path to that best match in a database. |
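(For context, a stripped-down sketch of the gather loop being discussed, using plain Python sets rather than real sourmash databases: each round finds the reference with the largest overlap with what remains of the query, reports it, and removes those hashes before the next round. The helper and the data are made up.)

```python
# Hypothetical, stripped-down gather loop over plain sets of hashes.
def minimal_gather(query_hashes, references):
    query = set(query_hashes)
    found = []
    while query:
        # best remaining match across all references/databases
        name, ref = max(references.items(), key=lambda kv: len(query & kv[1]))
        overlap = query & ref
        if not overlap:
            break
        found.append((name, len(overlap)))
        query -= overlap  # hashes already explained are removed before the next round
    return found

refs = {"genomeA": {1, 2, 3, 4}, "genomeB": {3, 4, 5, 6}, "genomeC": {7, 8}}
print(minimal_gather({1, 2, 3, 4, 5, 6, 9}, refs))
# -> [('genomeA', 4), ('genomeB', 2)]
```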
Good idea! But maybe avoid unloading the nodes near the root (4 levels, ~31 nodes?), because they will probably be queried every time anyway. Something else that can help: gather starts with |
yes! but of course it's more complicated - see #930 for a slightly rambling discussion of how this would have to work across multiple databases. |
remaining issues moved to #1110 |
Moving the conversation from #835 here.