
Suggestion to speed up gather for metagenomic samples #431

Closed
alexmsalmeida opened this issue Mar 7, 2018 · 13 comments

@alexmsalmeida

Hi,

I am analysing a few thousand metagenomes (FASTQ.GZ files) against a custom-made reference database of ~ 10K genomes. I want to assess how much of each reference genome is present in the raw reads of the query. I am using "gather" for this, but it is taking > 1 day to run each analysis. For building the signatures (both query and reference) I did:

sourmash compute -k 31 --scaled 1000 --track-abundance --name-from-first -o out.sig in.fasta/fastq.gz

For indexing the database:

sourmash index -k 31 ref_name *.sig

Then running gather:

sourmash gather -k 31 query.sig refname.sbt.json -o query.csv > query.sm

The samples have a quite high sequencing depth (10 million reads on average), so is this expected or might I be doing something wrong? Do you have any tips on how I could speed things up?

Many thanks in advance,
Alex

@ctb (Contributor) commented Mar 7, 2018 via email

@ctb (Contributor) commented Mar 7, 2018

OH! I failed to read your question properly.

Gather should not be that slow against a small custom database. @luizirber do you have suggestions?

If you want to make a custom LCA database, we have instructions in the command line docs and also in the lca tutorial; happy to help work through it with you. You'd need to have taxonomy information; again, happy to help work through how to do that with you.

@alexmsalmeida (Author)

Hi Titus,

Thanks a lot for the quick reply, as always.

I am a bit hesitant about using the LCA function because I want to get an assessment of read classification by individual genome, instead of grouping by species or genus (for some of the references I don't have taxonomic information beyond family). The gather function was exactly what I was looking for. I couldn't find any other real alternatives out there for reference-based mapping that don't need any associated taxonomic information.

I imagine the main issue is probably the number of reads in the query. I guess if I increased the --scaled value for both reference and query (currently at 1000) it would probably speed things up, right? Then again, it might also lower the sensitivity.
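To make the --scaled trade-off concrete, here is a minimal Python sketch of how a "scaled" sketch behaves (an illustration only, not sourmash's actual implementation, which uses MurmurHash on canonical k-mers): a hash is kept iff it falls below 2**64 / scaled, so raising --scaled from 1000 to 10000 retains roughly 10x fewer hashes, which shrinks the signatures and speeds up comparisons at the cost of resolution.

```python
import hashlib
import random

MAX_HASH = 2 ** 64

def hash_kmer(kmer: str) -> int:
    # stand-in 64-bit hash; sourmash itself uses MurmurHash
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "little")

def scaled_sketch(kmers, scaled):
    # keep only hashes below MAX_HASH / scaled ("bottom" fraction)
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers) if h < threshold}

random.seed(0)
kmers = ["".join(random.choice("ACGT") for _ in range(31)) for _ in range(200_000)]
s1000 = scaled_sketch(kmers, 1000)
s10000 = scaled_sketch(kmers, 10000)
# the scaled=10000 sketch is a subset of the scaled=1000 sketch, ~10x smaller
print(len(s1000), len(s10000))
```

Because the retention rule is a fixed hash threshold, a higher-scaled sketch is always a strict subset of a lower-scaled one from the same data, which is why sketches can be downsampled after the fact without recomputing from reads.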

I will leave it running for now and will see how much it gets done in the next couple of days.

Alex

@ctb (Contributor) commented Mar 7, 2018 via email

@alexmsalmeida (Author)

Ah, that sounds great actually. Will have a look at the tutorials and give it a go then. Thanks a lot!

@ctb (Contributor) commented Mar 7, 2018 via email

@alexmsalmeida (Author)

Wow, just finished running one test sample with lca gather against a subset of 2.5K reference genomes.

CPU time: 208.19 sec. Max memory: 1270 MB

Amazing! What a difference. It leaves me wondering though, is there a particular loss in sensitivity/specificity in using this method as opposed to just gather? It seems too good to be true.

One additional question if you don't mind: when indexing the LCA database I got a warning for 40 signatures saying that they were duplicated. Does that mean that 40 of my reference genomes are almost identical to one another?

Again thanks a lot, this will make things much more manageable now.

Alex

@ctb (Contributor) commented Mar 7, 2018 via email

@alexmsalmeida (Author) commented Mar 7, 2018

Cool! I will compare some of the results I got with gather in relation to lca gather and will let you know if I notice any big discrepancies.

Regarding the duplicate signatures, I am getting this warning:
WARNING: in file hgr/18048_2_57.sig, duplicate md5sum: fa802c32c13c4860f0c1d5e02a0bfc7e; skipping
And then at the end:
WARNING: 40 duplicate signatures.
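A duplicate md5sum means two signatures retained exactly the same set of hashes at this scaled value, which can happen when two reference genomes are near-identical. A hypothetical sketch of how such a digest could collide (the exact fields sourmash hashes may differ; this only illustrates the order-independence):

```python
import hashlib

def sig_md5(hashes):
    # digest over the sorted hash values: content, not insertion order,
    # determines the md5, so identical sketches always collide
    m = hashlib.md5()
    for h in sorted(hashes):
        m.update(str(h).encode())
    return m.hexdigest()

genome_a = {10, 20, 30}
genome_b = {30, 10, 20}   # same retained hashes, different order
print(sig_md5(genome_a) == sig_md5(genome_b))  # True
```

At scaled=1000, only ~1 in 1000 k-mers is retained, so two genomes differing by a small number of k-mers can still produce byte-identical sketches.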

I might try playing around with --scaled, see if it makes any changes. Want to have as much resolution as possible.

Thanks again!

@luizirber luizirber added the sbt label Jun 7, 2018
@mytluo commented Aug 8, 2018

Hi! Sorry for tagging onto an old thread, but I'm having the same issue with sourmash gather taking a long time as well. @ctb, you mentioned lca gather does not go down to strain level, so if I need strain level analysis, I would need to use sourmash gather instead, correct?

@ctb (Contributor) commented Sep 5, 2018

Yes, the current lca gather database does not go down to strain level. It's also awfully incomplete, as it turns out; we're working on better databases here: #537.

@ctb (Contributor) commented Dec 13, 2018

A few other thoughts, spurred by work on #533 --

  • what if we up-front removed all the hashes that have no match in any database? Might not be possible (or quick) for SBTs, though.
  • we could also potentially focus on rare hashes: hashes that occur only once (or a few times) in the database
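Both ideas above amount to a pre-filter on the query's hashes before the expensive search. A hypothetical sketch of what that could look like (not sourmash code; `prefilter` and its parameters are invented for illustration):

```python
from collections import Counter

def prefilter(query_hashes, db_signatures, max_occurrences=None):
    # count how many reference signatures contain each hash
    counts = Counter()
    for sig in db_signatures:
        counts.update(sig)
    # idea 1: drop query hashes absent from every database signature
    kept = {h for h in query_hashes if h in counts}
    # idea 2: optionally keep only "rare" hashes, seen in few references
    if max_occurrences is not None:
        kept = {h for h in kept if counts[h] <= max_occurrences}
    return kept

db = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
query = {2, 3, 7, 9}
print(prefilter(query, db))     # {2, 3}: hashes 7 and 9 match nothing
print(prefilter(query, db, 1))  # set(): 2 and 3 each occur in several genomes
```

For an LCA-style database this counting is cheap because the hash-to-signature mapping already exists; for an SBT it would require a full traversal, which is the caveat noted above.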

@ctb (Contributor) commented Jan 11, 2019

#615 should significantly speed up gather (17h -> 17 min for one analysis...)

@ctb ctb closed this as completed Jan 11, 2019

4 participants