
Gather performance improvements discussion #838

Closed · luizirber opened this issue Jan 11, 2020 · 33 comments · Milestone: pre-4.0
@luizirber (Member) commented Jan 11, 2020

Moving the conversation from #835 here.

@luizirber (Member, Author)

> We would love it if you could improve gather's performance.
>
> Here are some stats... it looks like v3 may be slower than v2?
>
> Should we file a task? We are happy to help. Perhaps in v4? Is a smaller release planned on top of the v3 codebase anytime soon?
>
> Thanks
>
>     sourmash compute -k 31 --scaled 5000 -o testv2.sig test.01M_R1_L001.fastq.gz
>     0m17.907s
>
>     sourmash gather -k 31 --scaled 5000 -o gatherv2 testv2.sig db/ecolidb.sbt.json
>     0m1.896s
>
>     sourmash v3.0.1
>     sourmash compute -k 31 --scaled 5000 -o testv3.sig test.01M_R1_L001.fastq.gz
>     0m42.243s
>
>     sourmash gather -k 31 --scaled 5000 -o gatherv3 testv3.sig db/ecolidb.sbt.json
>     0m4.429s

Originally posted by @satishv in #835 (comment)

@ctb (Contributor) commented Jan 13, 2020

Whoa, there's something bad going on all right :)

I used the following code:

#! /usr/bin/env python
import sourmash
import time

print('loaded from:', sourmash)
print('version:', sourmash.VERSION)

query = sourmash.load_one_signature('podar-ref/0.fa.sig', ksize=31)
db = sourmash.lca.lca_utils.load_single_database('xxx.lca.json')

# the expected database list format changed between 2.3.x and 3.x
if sourmash.VERSION.startswith('3.'):
    dblist = [db]
elif sourmash.VERSION.startswith('2.3.'):
    dblist = [(db[0], 'xxx', 'LCA'),]

# time 500 repeated gathers of the same query against the database
start = time.time()
for i in range(500):
    g = sourmash.search.gather_databases(query, dblist,
                                         threshold_bp=0,
                                         ignore_abundance=False)
    g = list(g)  # gather_databases is a generator; force full evaluation
end = time.time()

print(end - start)

and got:

loaded from: <module 'sourmash' from '/Users/t/dev/sourmash/sourmash/__init__.py'>
version: 3.0.2.dev6+g86e1105
5.426559925079346

---

loaded from: <module 'sourmash' from '/Users/t/dev/sourmash/sourmash/__init__.py'>
version: 2.3.1
0.07218098640441895

(This was with the podar-ref LCA database.)

I didn't think we'd changed that code much between 2.3.1 and 3.0.1, but clearly I am mistaken :)

@ctb (Contributor) commented Jan 13, 2020

Yes, my recollection is indeed ...vastly wrong. We merged in the new index ABC (#556) after 2.3.1 and then moved immediately to Rust and released 3.x. I suppose there's a lesson in there about our versioning, @luizirber!

@ctb (Contributor) commented Jan 13, 2020

commit ec8a00c, with index ABC:

version: 2.3.2.dev1+gec8a00c
5.255207061767578

previous commit 03e5269, no index ABC:

version: 2.3.1
0.11417484283447266

ok, so it was #556 that did the dirty deed.

@ctb (Contributor) commented Jan 13, 2020

(nothing to do with rust!)

@ctb (Contributor) commented Jan 13, 2020

OK, using py-spy and digging around a bit, I think the culprit is probably the new interplay between _find_best in search.py and the Index.gather implementations. _find_best does not pass on any hints about search thresholds, and the Index.gather implementations wouldn't take them into account anyway. 😿

@luizirber (Member, Author)

> Yes, my recollection is indeed ...vastly wrong. We merged in the new index ABC (#556) after 2.3.1 and then moved immediately to Rust and released 3.x. I suppose there's a lesson in there about our versioning, @luizirber!

I'm a fan of "one release per PR", but we also wanted to do #556 and #424 in 3.0.0, so... yeah. See #655 for more info.

@luizirber (Member, Author) commented Jan 13, 2020

> OK, using py-spy and digging around a bit, I think the culprit is probably the new interplay between _find_best in search.py and the Index.gather implementations.

But both SBTs and LCA indices define their own gather, and don't use Index.gather.

> _find_best does not pass on any hints about search thresholds, and the Index.gather implementations wouldn't take them into account anyway. 😿

It's not ideal, but we can pass the threshold as a kwarg, and define a default if it is not available (like search already does).
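
A minimal sketch of what that could look like, assuming a simplified driver loop; find_best, db.gather, and their parameters here are illustrative stand-ins rather than sourmash's actual API (the real fix landed in #843):

    def find_best(dblist, query, threshold_bp=0):
        # Hypothetical: forward the caller's bp threshold into each
        # database's gather() so the index can prune subtrees that
        # cannot possibly meet it.
        best_cont, best_match, best_db = 0.0, None, None
        for db in dblist:
            for cont, match in db.gather(query, threshold_bp=threshold_bp):
                # track the best containment seen across all databases
                if cont > best_cont:
                    best_cont, best_match, best_db = cont, match, db
        return best_cont, best_match, best_db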

@ctb (Contributor) commented Jan 13, 2020 via email

@luizirber (Member, Author) commented Jan 13, 2020

I'll dump some other pointers to gather improvements here; they should probably become separate issues eventually (with this one staying as a meta/tracking issue?). I'll break it into two sections: Engineering, for making what already exists faster, and Research, for deeper changes that involve rethinking data structures and methods.

Engineering

We can continue running py-spy and heaptrack to optimize specific routines or to lower overall memory consumption.

Research

SBT indexing is not very smart: it finds the next available position in the tree and puts the signature there. Ideally we would want signatures to be clustered by some metric (probably similarity, but it could also just be common hashes), because the search algorithms (be it depth-first search, as in gather, or breadth-first search, as in search) benefit strongly from pruning the search as early as possible (and avoiding loading data from disk).

So, one direction to solve gather issues: better organization of the SBT nodes. At least two ways to do it:

  • Make the insertion of a dataset into an index smarter. We could look for the best position to put the signature, instead of just putting it in the next available place. @phoenixAja showed interest in this approach, because it also makes [WIP] Add "knn" and "umap" commands #710 viable.
    • Benefit: this is also an online/streaming approach.
    • Possible problem: SBTs built with the current method are "dense", with the lowest possible depth. Keeping the tree dense with this approach would also involve rebalancing it (a self-balancing binary search tree is ideal), but rebalancing is expensive and might not work very well with the current internal implementation of SBTs (which enumerates the nodes instead of using pointers, so a rotation involves changing a lot of node positions).
  • Scaffold an index. Given a collection of signatures, we can cluster them by similarity and build a tree which is both dense and maximizes similarity between nodes under the same parent. This is what HowDeSBT does (with the howdesbt cluster command). There is some support for this in sourmash, hidden in the Rust crate.
    • Drawback: this is an offline approach, because it needs a batch of signatures to work.

Another direction, but depending on the first one: change the search strategy.

  • Use a best-first search to find matches in gather (see the sketch after this list). Eventually the DFS will find the same result, but best-first can reach it faster (by taking the path with the largest similarity first). This only works if the index is well-behaved, though...
    • Check the best_first branch for an implementation of this search strategy. It abuses the results dict a bit, so it breaks some tests with search functions that don't take extra kwargs.
  • Use a simple simulated-annealing-style approach: while descending the tree, also check some random signatures to see if there is a better score.
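
A minimal sketch of the best-first idea, assuming a generic tree node API (children, is_leaf, and a score(query) method returning an upper-bound similarity estimate for internal nodes); none of these names are sourmash's actual interface:

    import heapq
    import itertools

    def best_first_search(root, query):
        # Explore nodes in order of decreasing upper-bound score; once the
        # best score left on the heap cannot beat the best leaf found so
        # far, nothing remaining can, and we stop.
        best_score, best_leaf = 0.0, None
        tie = itertools.count()  # tiebreaker so nodes are never compared
        heap = [(-root.score(query), next(tie), root)]
        while heap:
            neg_score, _, node = heapq.heappop(heap)
            if -neg_score <= best_score:
                break  # no remaining subtree can do better
            if node.is_leaf:
                best_score, best_leaf = -neg_score, node
            else:
                for child in node.children:
                    heapq.heappush(heap, (-child.score(query), next(tie), child))
        return best_score, best_leaf

This ordering is admissible for an SBT because an internal node's Bloom filter holds the union of the leaves below it (and false positives only inflate the estimate), so an internal node's score never underestimates the best leaf underneath.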

@satishv commented Jan 14, 2020

Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.

@ctb (Contributor) commented Jan 14, 2020 via email

@ctb (Contributor) commented Jan 14, 2020

Embarrassing realization: my earlier benchmarking of lca gather was off, because the v2.3.1 code modified the incoming query object and removed all the minhashes, so I was essentially comparing gather on an empty signature (in v2.3.1) to gather on a full signature (in v3.0.1). When I run a real comparison, the LCA gather is about 50% faster in v3 than in v2.3.1.

OK, now that I have that figured out, going to take a look at the SBT performance.

@ctb (Contributor) commented Jan 14, 2020

Well, at least one problem (mentioned above) is that neither gather implementation (LCA or SBT) was using thresholding in search. That's fixed in #843. My benchmarking is highly variable but I'm seeing what looks like consistent improvements in gather performance on SBTs.

luizirber added this to the pre-4.0 milestone Jan 14, 2020
@luizirber (Member, Author)

> Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.

One of the big reasons to move to Rust, as I understand it, is better support for multithreading. Before, the C++ layer was a real pain in terms of threading... The algorithm is not trivial to parallelize, however.

Maybe, maybe... It's not trivial, but I've been thinking about doing something like this, using rayon:
https://github.com/oconnor663/bao/blob/a2fedd649487de63e4d0c0a9e45f5a22d1e46365/src/hash.rs#L224-L270

But that's still a bit far in the future, because #532 needs to land (3.1), and then I need to write the Index Python ABC/Rust trait bridge (3.3). I want to focus on improving compute (3.2, check #845) before doing the bridge.

@satishv commented Jan 15, 2020

For some more context: we are running with a DB file (200 GB zipped) with over 100k genomes. Would sorting the genomes in the DB file help? As a last resort, we may consider chopping the large DB file into multiple DB files; being able to pass a start and end index into the DB file would also be awesome, if you can support that kind of DB lookup. Happy to expand on this further. Not sure I am making myself very clear here. cc @metajinomics

@ctb (Contributor) commented Jan 15, 2020

hi @satishv, one thing you can do today would be to split the large DB into multiple files, as you suggest. If you can do so in such a way that related genomes are kept in the same database subset, that might improve things significantly.

A rough pipeline for doing so might be to:

  • take something like the GTDB taxonomy database that I just posted on my blog
  • run sourmash lca classify on all your genome signatures (should take ~3-6 hours)
  • pick a taxonomic level that groups your genomes into conveniently sized chunks of 20k or less
  • make databases for each of those.
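
A hypothetical sketch of the grouping step, assuming the classify output is a CSV with an ID column plus one column per taxonomic rank (the column names here are assumptions, not sourmash's documented format):

    import csv
    from collections import defaultdict

    MAX_CHUNK = 20000
    RANK = 'phylum'  # pick the level that yields conveniently sized groups

    # group genome IDs by their classification at the chosen rank
    groups = defaultdict(list)
    with open('classifications.csv') as fp:
        for row in csv.DictReader(fp):
            groups[row[RANK]].append(row['ID'])

    # pack whole groups into chunks of at most MAX_CHUNK genomes, so
    # related genomes end up in the same database subset
    chunks, current = [], []
    for genomes in groups.values():
        if current and len(current) + len(genomes) > MAX_CHUNK:
            chunks.append(current)
            current = []
        current.extend(genomes)
    if current:
        chunks.append(current)

    for i, chunk in enumerate(chunks):
        print(f'database {i}: {len(chunk)} genomes')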

best,
--titus

@satishv commented Jan 15, 2020

@ctb - thanks for the above info. Doing the above will certainly add to our timelines, and we are hoping for only small changes on our side at this point. It would help if you also make changes on your end to speed things up as much as possible.

@ctb (Contributor) commented Jan 15, 2020 via email

@satishv commented Jan 15, 2020

thanks a lot!

We are using sourmash 2.0.1 now. I assume no outputs or formats will change in 4.0 or in the above fix? I also assume we will be able to easily swap 2.0.1 for the fixed version, ideally without a lot of effort?

cc @brendanwee

@ctb (Contributor) commented Jan 15, 2020 via email

@satishv commented Jan 17, 2020

Thanks. Looking forward to next week; we hope to incorporate your changes as soon as they are ready. Can't wait.

@ctb (Contributor) commented Jan 19, 2020

re improving SBT organization, I wanted to link in #756, which is a very nice discussion of the issue, and also #545.

@satishv commented Jan 25, 2020

@ctb - we request that gather performance be prioritized over compute performance. Let us know if you have any estimate of the release date. We totally appreciate this effort. Please note that we are not interested in lca gather at this point. cc @luizirber

@ctb (Contributor) commented Jan 25, 2020

hi @satishv, I can't speak for @luizirber here, but I have no immediate plans to work on improving gather performance. The current performance is not really an obstacle for anything I want to do, and I'm usually much more interested in correctness, user experience, and memory usage than in performance.

More generally, we are always happy to consider pull requests that implement your priorities; we could also help connect you with some consultants that might be able to do the work on your desired timeline. Since the basic gather functionality is well tested and (presumably) performance improvements wouldn't change the API or behavior, there is no obstacle to releasing a new version of sourmash as soon as a PR is merged.

You should also note that there is a distinction between sourmash-gather-on-LCA-databases and sourmash lca gather. The former should return ~identical results given the same collection of sequences (but has very different performance characteristics) compared to sourmash-gather-on-SBT-databases; the latter has somewhat different functionality. I'm pretty sure (but not 100% positive...) that you can give sourmash lca index an empty spreadsheet to build an LCA database with no taxonomy; you might give that a try if you haven't already.

@ctb (Contributor) commented Apr 14, 2020

Some more performance optimization notes for gather on SBTs.

basically, performance improvements for gather on an SBT from parallelization could come from --

  • on multiple databases, searching each database independently and then selecting the best containment (sketched below) - easy-ish if we can do the search entirely in Rust.
  • parallelizing the tree search WITHIN an SBT, by e.g. queueing up a list of parts of the tree to search in parallel - hard-ish.

#925 may also help, by doing a better job of constructing the SBT so that less of the tree needs to be searched.
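
A minimal sketch of the first option, in Python for illustration; load_one_signature and load_sbt_index are real sourmash loaders, but the search() call and its parameters here sketch the Index API rather than quoting it exactly:

    from concurrent.futures import ProcessPoolExecutor

    import sourmash

    def best_containment(db_path, query_path):
        # each worker process loads its own database and reports the
        # best containment it finds
        query = sourmash.load_one_signature(query_path, ksize=31)
        db = sourmash.load_sbt_index(db_path)
        results = db.search(query, threshold=0.0, do_containment=True,
                            best_only=True)
        best = max((score for score, match, name in results), default=0.0)
        return best, db_path

    def best_across_databases(db_paths, query_path):
        # databases are independent, so search them in parallel and take
        # the single best containment across all of them
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(best_containment, p, query_path)
                       for p in db_paths]
            return max(f.result() for f in futures)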

@ctb (Contributor) commented May 3, 2020

with the new compressed loading of Bloom filters in #648, I think we can explore using larger Bloom filters in SBTs as another performance optimization; ref #304.

@luizirber (Member, Author) commented May 4, 2020

> with the new compressed loading of Bloom filters in #648, I think we can explore using larger Bloom filters in SBTs as another performance optimization; ref #304.

Important note here: the BFs are compressed on DISK, not in MEMORY. Using larger BFs might incur more memory usage. To fix that:

  • Better caching and eviction of internal nodes. For search, the nodes are being unloaded after they are traversed (as per Expose an unload method for SBT nodes #784 (comment)), but for gather simply unloading them is too aggressive (since they might be traversed again). Having a cache that calls .unload() when a node is evicted would be perfect, but I didn't find any available with a good API for that; we will probably have to write our own (see the sketch after this list). Calling .unload() is fine, because the nodes know how to reload the data, but it does mean reading data from storage again, which is slower.
  • Succinct representation in memory. This could probably be RRR encoding, like the original SBT. Pointers: https://alexbowe.com/rrr/
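
A minimal sketch of such a cache, assuming only that nodes expose the .unload() method discussed in #784; the class and its parameters are otherwise hypothetical:

    from collections import OrderedDict

    class UnloadingLRU:
        """Fixed-size LRU that calls .unload() on evicted SBT nodes."""

        def __init__(self, capacity=256):
            self.capacity = capacity
            self._nodes = OrderedDict()

        def touch(self, pos, node):
            # mark the node as most recently used
            self._nodes.pop(pos, None)
            self._nodes[pos] = node
            # evict the least recently used node, releasing its data;
            # the node can reload from storage if traversed again
            if len(self._nodes) > self.capacity:
                _, evicted = self._nodes.popitem(last=False)
                evicted.unload()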

@ctb (Contributor) commented May 4, 2020

ok, with the release of v3.3.0, things have improved dramatically -

[screenshot: benchmark results, May 4 2020]

this is because #799 now permits direct loading of compressed nodegraphs without intermediate files. #648 also adds the convenience of keeping the entire database in one compressed .zip file.

I think #925 is the next big optimization expected to land.

@ctb (Contributor) commented May 5, 2020

re

> Better caching and eviction of internal nodes. For search, the nodes are being unloaded after they are traversed (as per #784 (comment)), but for gather simply unloading them is too aggressive (since they might be traversed again).

since hashes from the best match (across all databases) are removed from the query, it might be possible to provide a hint to unload the path to that best match in a database.

sourmash-bio deleted a comment from ctb May 5, 2020
@luizirber (Member, Author)

> since hashes from the best match (across all databases) are removed from the query, it might be possible to provide a hint to unload the path to that best match in a database.

Good idea! But maybe avoid unloading the nodes near the root (4 levels, ~31 nodes?), because they will probably be queried every time anyway.

Something else that can help: gather starts with best_similarity = 0, but after the first search we have more results (not only the best_match) that we could check first in the second round, to start with a better score. That will also probably help limit the number of backtracked paths that need to be checked. To avoid this growing unbounded, maybe keep the top-20 leaves from each search for seeding the next round?
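
A rough sketch of that seeding idea, assuming the same illustrative node API as the best-first sketch above (score() as a similarity estimate); index.find_best and all parameter names here are hypothetical:

    import heapq

    def seeded_round(index, query, previous_leaves, keep=20):
        # re-score the best leaves from the previous round against the
        # reduced query; the best of them seeds the starting threshold
        seeds = heapq.nlargest(keep, previous_leaves,
                               key=lambda leaf: leaf.score(query))
        best_similarity = max((leaf.score(query) for leaf in seeds),
                              default=0.0)
        # subtrees whose upper bound is <= best_similarity can be pruned
        return index.find_best(query, threshold=best_similarity)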

@ctb (Contributor) commented May 5, 2020

> Something else that can help: gather starts with best_similarity = 0, but after the first search we have more results (not only the best_match) that we could check first in the second round, to start with a better score. That will also probably help limit the number of backtracked paths that need to be checked.

yes! but of course it's more complicated - see #930 for a slightly rambling discussion of how this would have to work across multiple databases.

@ctb (Contributor) commented Jul 18, 2020

remaining issues moved to #1110

ctb closed this as completed Jul 18, 2020