How does sourmash gather time scale with reads (and can this be reduced with multithreading)? #2827

Open
rmcolq opened this issue Oct 30, 2023 · 6 comments

rmcolq commented Oct 30, 2023

I'd really like to use sourmash for metagenomic classification, as in Portik et al. I have been trying it out on small datasets, and I've noticed that the gather step in particular takes a long time (e.g. it took 6.5 minutes to process 40k long reads, whilst kraken2 run with a server took 4.5 s on the same data). I will be running classification in an environment with plenty of RAM, so holding an ~80 GB database in memory is not a problem, but as we anticipate having to process lots of samples, we need the run time to be as short as possible.

Do you plan to introduce multithreading to sourmash gather?
Does this time scale with the number of reads, or is it mostly the time to load the database?
Is there something I could change e.g. database or threshold_bps which would speed things up?

Contributor

ctb commented Oct 30, 2023

hi @rmcolq, thanks for asking!

I have a couple of thoughts for you!

First, we have a multithreaded gather implemented over in pyo3_branchwater. We tend to use it in two stages - run fastgather, and then use the fastgather results as a picklist to run a regular gather to do more complete reporting. Happy to describe this more if you try it out and need some help; the hardest part is getting it installed, I'm afraid, but once that's done it should be smooth sailing.

@bluegenes recently gave a lab meeting on branchwater; see slides here.

In particular, the fastmultigather command might be what you need for many samples. It loads the database once and then processes all of the samples in parallel.
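The "load once, then process all samples in parallel" pattern can be sketched in a few lines of Python. This is only an illustration of the structure, not how fastmultigather is implemented: `load_database`, `best_match`, and `classify_all` are hypothetical names, the "database" is a toy dictionary of hash sets, and Python threads will not give the speedup a compiled multithreaded implementation would.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Set


def load_database() -> Dict[str, Set[int]]:
    """Stand-in for an expensive load from disk (hypothetical toy data)."""
    return {
        "genome_A": {1, 2, 3, 4, 5},
        "genome_B": {4, 5, 6, 7},
    }


def best_match(sample_hashes: Set[int], database: Dict[str, Set[int]]) -> str:
    """Return the database entry sharing the most hashes with the sample.

    Ties (including zero overlap) are broken alphabetically by name.
    """
    return max(sorted(database), key=lambda name: len(database[name] & sample_hashes))


def classify_all(samples: Dict[str, Set[int]], workers: int = 4) -> Dict[str, str]:
    """Load the database once, then classify every sample with a worker pool."""
    database = load_database()  # loading cost is paid once, not once per sample
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The database is only read, so sharing it between workers is safe.
        futures = {
            name: pool.submit(best_match, hashes, database)
            for name, hashes in samples.items()
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    toy_samples = {"sample_1": {1, 2, 3}, "sample_2": {6, 7, 9}}
    print(classify_all(toy_samples))
```

The point of the design is that the per-sample work no longer includes the database load, so adding samples only adds the (much smaller) comparison cost.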


Second, we rarely hear "sure, hold 80 GB in memory!" so I have to admit I'm not 100% sure what to recommend 😆

But what you're seeing is what I'd expect: gather does a pass across the full database (the prefetch stage) before reporting matches. So if you are using (e.g.) the entire GTDB, sourmash does a pass across the entire database, loading it from disk as it goes. Hence the ~5 minutes.
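As a rough illustration of why the run time tracks database size rather than read count, here is a toy Python sketch of a prefetch pass followed by greedy match reporting. It is a simplified model under stated assumptions, not sourmash's actual code: sketches are plain sets of integers, and the `prefetch` and `gather` functions and the `threshold` parameter are hypothetical stand-ins.

```python
from typing import Dict, List, Set, Tuple


def prefetch(query: Set[int], database: Dict[str, Set[int]],
             threshold: int) -> Dict[str, Set[int]]:
    """Scan every database entry once; keep those sharing >= threshold hashes."""
    candidates = {}
    for name, hashes in database.items():  # full pass: cost grows with database size
        overlap = query & hashes
        if len(overlap) >= threshold:
            candidates[name] = overlap
    return candidates


def gather(query: Set[int], database: Dict[str, Set[int]],
           threshold: int = 1) -> List[Tuple[str, int]]:
    """Greedily report matches, each credited only with hashes not yet explained."""
    candidates = prefetch(query, database, threshold)
    remaining = set(query)
    results = []
    while candidates:
        # Pick the candidate covering the most unexplained hashes
        # (ties broken alphabetically by name).
        best = max(sorted(candidates), key=lambda n: len(candidates[n] & remaining))
        found = candidates.pop(best) & remaining
        if len(found) < threshold:
            break  # no remaining candidate can reach the threshold
        results.append((best, len(found)))
        remaining -= found
    return results


if __name__ == "__main__":
    toy_db = {
        "genome_A": {1, 2, 3, 4, 5},
        "genome_B": {4, 5, 6, 7},
        "genome_C": {100, 101},
    }
    toy_query = {1, 2, 3, 4, 5, 6, 7, 8}
    print(gather(toy_query, toy_db))
```

In this model the prefetch loop touches every database entry regardless of how small the query is, while the greedy loop only works on the (usually far smaller) candidate set.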


Third, there are a couple of indexed databases that might be faster. My first suggestion is to try out the sqlite index (sourmash sig cat database -o database.sqldb); my second suggestion is to try out the SBT index (generated with sourmash index). Neither will get you to 5 seconds, though. The mastiff index (in branchwater) might, but if you are using branchwater, I think fastmultigather is your best bet.


A few other thoughts for us sourmash devs to think about -

  • if you're using a server, kraken is pre-loading the database, presumably. So that's one reason why it's so darn fast. We don't have that implemented in sourmash directly because we have focused on keeping memory usage low (that's more often a limiting factor).
  • sourmash can deal with much larger databases than kraken, which is one reason why we've focused on keeping RAM low (at expense of time).

Author

rmcolq commented Oct 30, 2023

Hi @ctb

Thank you for this in-depth answer! It sounds like fastmultigather gets a long way towards what we are looking for, so I'll definitely check it out. I have been using the largest databases on the sourmash downloads page, although that 6 minutes was with just the viral, fungi and protozoa files for the test data (I was planning to use bacteria too in the main). It makes sense that the focus would be on getting the RAM use down.

The fact that most of that time is spent loading the database is pretty reassuring, because it means that something like the multithreaded implementation (or a server implementation of it, if one were to appear down the line) would be a good solution.

Contributor

ctb commented Mar 22, 2024

I've written a short fastmultigather quickstart here: #3095. This should dramatically decrease loading time as well as memory usage.

benchmarks:

Contributor

ctb commented Jun 20, 2024

Note: as of sourmash_plugin_branchwater v0.9.5, the results from fastgather and fastmultigather are identical to those from sourmash gather. So we can just recommend using fastmultigather directly.

I haven't done the appropriate benchmarking yet, but I'll get to it eventually :).

Author

rmcolq commented Jun 20, 2024

Amazing, thanks for keeping this updated with the developments!

Contributor

ctb commented Jun 30, 2024

Here are the benchmarks. In brief, fastgather is quite fast, while the rocksdb implementation uses very little memory:

#3232
