How does sourmash gather time scale with reads (and can this be reduced with multithreading)? #2827
Comments
hi @rmcolq, thanks for asking! I have a couple of thoughts for you!

First, we have a multithreaded gather implemented over in pyo3_branchwater. We tend to use it in two stages: run fastgather, and then use the fastgather results as a picklist to run a regular gather for more complete reporting. Happy to describe this in more detail if you try it out and need some help; the hardest part is getting it installed, I'm afraid, but once that's done it should be smooth sailing. @bluegenes recently gave a lab meeting on branchwater; see slides here. In particular, the fastmultigather command might be what you need for many samples: it loads the database once and then processes all of the samples in parallel.

Second, we rarely hear "sure, hold 80 GB in memory!" so I have to admit I'm not 100% sure what to recommend 😆 But what you're seeing is what I'd expect: gather does a pass across the full database (the prefetch stage) before reporting matches. So if you are using (e.g.) the entire GTDB, sourmash does a pass across the entire database, loading it from disk as it goes. Hence the ~5 minutes.

Third, there are a couple of indexed databases that might be faster. My first suggestion is to try out the sqlite index (

A few other thoughts for us sourmash devs to think about -
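The two-stage workflow described above might be sketched roughly like this (a sketch only: the signature and database file names are placeholders, and the exact picklist column for fastgather output is an assumption — check `sourmash scripts fastgather --help` and the branchwater plugin docs for current option names and CSV columns):

```shell
# Stage 1: multithreaded fastgather, provided by the
# sourmash_plugin_branchwater plugin (installed separately).
sourmash scripts fastgather sample.sig.zip database.zip \
    -k 31 --scaled 1000 -o fastgather-results.csv

# Stage 2: regular sourmash gather, restricted via a picklist to the
# matches fastgather found, to get the full gather reporting.
# (The picklist column name "match_name" is an assumption here.)
sourmash gather sample.sig.zip database.zip \
    --picklist fastgather-results.csv:match_name:ident \
    -o gather-results.csv
```

For many samples at once, `sourmash scripts fastmultigather` takes the same database argument plus a list of query signatures and runs them in parallel against the once-loaded database.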
Hi @ctb Thank you for this in-depth answer! It sounds like fastmultigather gets a long way towards what we are looking for, so I'll definitely check it out. I have been using the largest databases on the sourmash downloads page, although that 6 minutes was just with the viral, fungi, and protozoa files for the test data (I was planning to use bacteria too in the main run). Makes sense that the focus would be on getting the RAM use down. The fact that most of that time is in database loading is pretty reassuring, because it means that something like the multithreaded implementation (or a server version of it, if that were to happen down the line) would be a good solution.
I've written up some short benchmarks:
note: as of sourmash_plugin_branchwater v0.9.5 (link), the results from … I haven't done the appropriate benchmarking yet, but I'll get to it eventually :).
Amazing, thanks for keeping this updated with the developments!
Here are the benchmarks - in brief, …
I'd really like to use sourmash for metagenomic classification as in Portik et al. I have been trying it out on small datasets, and I've noticed that the gather step in particular seems to take a long time (e.g. it took 6.5 mins to process 40k long reads, whereas kraken2 run with a server took 4.5 s on the same data). I will be running classification in an environment with plenty of RAM, so holding an ~80 GB database in memory is not a problem, but as we anticipate having to process lots of samples, we need the time to be as short as possible.
- Do you plan to introduce multithreading to sourmash gather?
- Does this time scale with the number of reads, or is it mostly the time to load the database?
- Is there something I could change, e.g. the database or threshold_bp, which would speed things up?
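On the last question, one knob worth mentioning: gather's detection threshold. Raising it means fewer low-overlap candidates survive the prefetch pass, so the iterative gather stage has less work to do, at the cost of missing low-abundance matches. A hedged sketch (the file names are placeholders; as of recent sourmash releases the default is 50,000 bp, but check `sourmash gather --help` for your version):

```shell
# Raise the minimum estimated overlap from the 50,000 bp default to
# 100,000 bp, trading sensitivity for speed.
sourmash gather reads.sig.zip database.zip \
    --threshold-bp 100000 -o results.csv
```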