How does sourmash gather time scale with reads (and can this be reduced with multithreading)? #2827
Comments
hi @rmcolq, thanks for asking! I have a couple of thoughts for you!

First, we have a multithreaded gather implemented over in pyo3_branchwater. We tend to use it in two stages: run fastgather, and then use the fastgather results as a picklist to run a regular gather for more complete reporting. Happy to describe this in more detail if you try it out and need some help; the hardest part is getting it installed, I'm afraid, but once that's done it should be smooth sailing. @bluegenes recently gave a lab meeting on branchwater; see slides here. In particular, the fastmultigather command might be what you need for many samples: it loads the database once and then processes all of the samples in parallel.

Second, we rarely hear "sure, hold 80 GB in memory!" so I have to admit I'm not 100% sure what to recommend 😆 But what you're seeing is what I'd expect: gather does a pass across the full database (the prefetch stage) before reporting matches. So if you are using (e.g.) the entire GTDB, sourmash does a pass across the entire database, loading it from disk as it goes. Hence the ~5 minutes.

Third, there are a couple of indexed databases that might be faster. My first suggestion is to try out the sqlite index (

A few other thoughts for us sourmash devs to think about -
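The two-stage workflow described above might be sketched roughly like this (a sketch only: the signature and database file names are placeholders, and the exact picklist column for fastgather output is an assumption — check `sourmash scripts fastgather --help` and the branchwater plugin docs for current option names and CSV columns):

```shell
# Stage 1: multithreaded fastgather, provided by the
# sourmash_plugin_branchwater plugin (installed separately).
sourmash scripts fastgather sample.sig.zip database.zip \
    -k 31 --scaled 1000 -o fastgather-results.csv

# Stage 2: regular sourmash gather, restricted via a picklist to the
# matches fastgather found, to get the full gather reporting.
# (The picklist column name "match_name" is an assumption here.)
sourmash gather sample.sig.zip database.zip \
    --picklist fastgather-results.csv:match_name:ident \
    -o gather-results.csv
```

For many samples at once, `sourmash scripts fastmultigather` takes the same database argument plus a list of query signatures and runs them in parallel against the once-loaded database.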
Hi @ctb Thank you for this in-depth answer! It sounds like fastmultigather gets a long way towards what we are looking for, so I'll definitely check it out. I have been using the largest databases on the sourmash downloads page, although that 6 minutes was just with the viral, fungi, and protozoa files for the test data (I was planning to use bacteria too in the main run). Makes sense that the focus would be on getting the RAM use down. The fact that most of that time is in database loading is pretty reassuring, because it means that something like the multithreaded implementation (or a server version of it, if that were to happen down the line) would be a good solution.
I've written up some short benchmarks:
note: as of sourmash_plugin_branchwater v0.9.5 (link), the results from … I haven't done the appropriate benchmarking yet, but I'll get to it eventually :).
Amazing, thanks for keeping this updated with the developments!
Here are the benchmarks - in brief, …
I'd really like to use sourmash for metagenomic classification as in Portik et al. I have been trying it out on small datasets, and I've noticed that the gather step in particular seems to take a long time (e.g. it took 6.5 mins to process 40k long reads, whereas kraken2 run with a server took 4.5 s on the same data). I will be running classification in an environment with plenty of RAM, so holding an ~80 GB database in memory is not a problem, but as we anticipate having to process lots of samples, we need the time to be as short as possible.
- Do you plan to introduce multithreading to sourmash gather?
- Does this time scale with the number of reads, or is it mostly the time to load the database?
- Is there something I could change, e.g. the database or threshold_bp, which would speed things up?
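On the last question, one knob worth mentioning: gather's detection threshold. Raising it means fewer low-overlap candidates survive the prefetch pass, so the iterative gather stage has less work to do, at the cost of missing low-abundance matches. A hedged sketch (the file names are placeholders; as of recent sourmash releases the default is 50,000 bp, but check `sourmash gather --help` for your version):

```shell
# Raise the minimum estimated overlap from the 50,000 bp default to
# 100,000 bp, trading sensitivity for speed.
sourmash gather reads.sig.zip database.zip \
    --threshold-bp 100000 -o results.csv
```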