
Suggestion to speed up gather for metagenomic samples #431

Closed
alexmsalmeida opened this issue Mar 7, 2018 · 13 comments

@alexmsalmeida

Hi,

I am analysing a few thousand metagenomes (FASTQ.GZ files) against a custom-made reference database of ~ 10K genomes. I want to assess how much of each reference genome is present in the raw reads of the query. I am using "gather" for this, but it is taking > 1 day to run each analysis. For building the signatures (both query and reference) I did:

sourmash compute -k 31 --scaled 1000 --track-abundance --name-from-first -o out.sig in.fasta/fastq.gz

For indexing the database:

sourmash index -k 31 ref_name *.sig

Then running gather:

sourmash gather -k 31 query.sig refname.sbt.json -o query.csv > query.sm

The samples have a quite high sequencing depth (10 million reads on average), so is this expected or might I be doing something wrong? Do you have any tips on how I could speed things up?

Many thanks in advance,
Alex

@ctb (Contributor) commented Mar 7, 2018 via email

@ctb (Contributor) commented Mar 7, 2018

OH! I failed to read your question properly.

Gather should not be that slow against a small custom database. @luizirber do you have suggestions?

If you want to make a custom LCA database, we have instructions in the command line docs and also in the lca tutorial; happy to help work through it with you. You'd need to have taxonomy information; again, happy to help work through how to do that with you.

@alexmsalmeida (Author)

Hi Titus,

Thanks a lot for the quick reply, as always.

I am a bit hesitant about using the LCA function because I want to get an assessment of read classification by individual genome, instead of grouping by species or genus (for some of the references I don't have taxonomic information beyond family). The gather function was exactly what I was looking for. I couldn't find any other real alternatives out there for reference-based mapping that don't need any associated taxonomic information.

I imagine the main issue is probably the number of reads in the query. I guess if I increased the --scaled value for both reference and query (currently at 1000) it would probably speed things up, right? Then again, it might also lower the sensitivity.
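To make the --scaled trade-off concrete, here is a minimal Python sketch of how a "scaled" sketch behaves (an illustration only, not sourmash's actual implementation, which uses MurmurHash on canonical k-mers): a hash is kept iff it falls below 2**64 / scaled, so raising --scaled from 1000 to 10000 retains roughly 10x fewer hashes, which shrinks the signatures and speeds up comparisons at the cost of resolution.

```python
import hashlib
import random

MAX_HASH = 2 ** 64

def hash_kmer(kmer: str) -> int:
    # stand-in 64-bit hash; sourmash itself uses MurmurHash
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "little")

def scaled_sketch(kmers, scaled):
    # keep only hashes below MAX_HASH / scaled ("bottom" fraction)
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers) if h < threshold}

random.seed(0)
kmers = ["".join(random.choice("ACGT") for _ in range(31)) for _ in range(200_000)]
s1000 = scaled_sketch(kmers, 1000)
s10000 = scaled_sketch(kmers, 10000)
# the scaled=10000 sketch is a subset of the scaled=1000 sketch, ~10x smaller
print(len(s1000), len(s10000))
```

Because the retention rule is a fixed hash threshold, a higher-scaled sketch is always a strict subset of a lower-scaled one from the same data, which is why sketches can be downsampled after the fact without recomputing from reads.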

I will leave it running for now and will see how much it gets done in the next couple of days.

Alex

@ctb (Contributor) commented Mar 7, 2018 via email

@alexmsalmeida (Author)

Ah, that sounds great actually. Will have a look at the tutorials and give it a go then. Thanks a lot!

@ctb (Contributor) commented Mar 7, 2018 via email

@alexmsalmeida (Author)

Wow, just finished running one test sample with lca gather against a subset of 2.5K reference genomes.

CPU time: 208.19 sec. Max memory: 1270 MB

Amazing! What a difference. It leaves me wondering though, is there a particular loss in sensitivity/specificity in using this method as opposed to just gather? It seems too good to be true.

One additional question if you don't mind: when indexing the LCA database I got a warning for 40 signatures saying that they were duplicated. Does that mean that 40 of my reference genomes are almost identical to one another?

Again thanks a lot, this will make things much more manageable now.

Alex

@ctb (Contributor) commented Mar 7, 2018 via email

@alexmsalmeida (Author) commented Mar 7, 2018

Cool! I will compare some of the results I got with gather in relation to lca gather and will let you know if I notice any big discrepancies.

Regarding the duplicate signatures, I am getting this warning:
WARNING: in file hgr/18048_2_57.sig, duplicate md5sum: fa802c32c13c4860f0c1d5e02a0bfc7e; skipping
And then at the end:
WARNING: 40 duplicate signatures.
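A duplicate md5sum means two signatures retained exactly the same set of hashes at this scaled value, which can happen when two reference genomes are near-identical. A hypothetical sketch of how such a digest could collide (the exact fields sourmash hashes may differ; this only illustrates the order-independence):

```python
import hashlib

def sig_md5(hashes):
    # digest over the sorted hash values: content, not insertion order,
    # determines the md5, so identical sketches always collide
    m = hashlib.md5()
    for h in sorted(hashes):
        m.update(str(h).encode())
    return m.hexdigest()

genome_a = {10, 20, 30}
genome_b = {30, 10, 20}   # same retained hashes, different order
print(sig_md5(genome_a) == sig_md5(genome_b))  # True
```

At scaled=1000, only ~1 in 1000 k-mers is retained, so two genomes differing by a small number of k-mers can still produce byte-identical sketches.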

I might try playing around with --scaled, see if it makes any changes. Want to have as much resolution as possible.

Thanks again!

@luizirber luizirber added the sbt label Jun 7, 2018
@mytluo commented Aug 8, 2018

Hi! Sorry for tagging onto an old thread, but I'm having the same issue with sourmash gather taking a long time as well. @ctb, you mentioned lca gather does not go down to strain level, so if I need strain level analysis, I would need to use sourmash gather instead, correct?

@ctb (Contributor) commented Sep 5, 2018

Yes, the current lca gather database does not go down to strain level. It's also awfully incomplete, as it turns out; we're working on better databases here: #537.

@ctb (Contributor) commented Dec 13, 2018

A few other thoughts, spurred by work on #533 --

  • what if we up-front removed all the hashes that have no match in any database? Might not be possible (or quick) for SBTs, though.
  • we could also potentially focus on rare hashes: hashes that occur only once (or a few times) in the database
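Both ideas above amount to a pre-filter on the query's hashes before the expensive search. A hypothetical sketch of what that could look like (not sourmash code; `prefilter` and its parameters are invented for illustration):

```python
from collections import Counter

def prefilter(query_hashes, db_signatures, max_occurrences=None):
    # count how many reference signatures contain each hash
    counts = Counter()
    for sig in db_signatures:
        counts.update(sig)
    # idea 1: drop query hashes absent from every database signature
    kept = {h for h in query_hashes if h in counts}
    # idea 2: optionally keep only "rare" hashes, seen in few references
    if max_occurrences is not None:
        kept = {h for h in kept if counts[h] <= max_occurrences}
    return kept

db = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
query = {2, 3, 7, 9}
print(prefilter(query, db))     # {2, 3}: hashes 7 and 9 match nothing
print(prefilter(query, db, 1))  # set(): 2 and 3 each occur in several genomes
```

For an LCA-style database this counting is cheap because the hash-to-signature mapping already exists; for an SBT it would require a full traversal, which is the caveat noted above.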

@ctb (Contributor) commented Jan 11, 2019

#615 should significantly speed up gather (17h -> 17 min for one analysis...)

@ctb ctb closed this as completed Jan 11, 2019

4 participants