Suggestion to speed up gather for metagenomic samples #431
Hi Alex,
(ok, first, "a few thousand metagenomes"? wow.)
Anyway, back to your question:
No, it's quite slow at the moment :(. We have a few open issues about this
(#300 in particular) but haven't fixed it yet.
One suggestion - can you try 'sourmash lca gather'?
Briefly,
1. download genbank-k31.lca.json from https://osf.io/zskb9/download
(see https://sourmash.readthedocs.io/en/latest/tutorials-lca.html for the full tutorial)
2. run 'sourmash lca gather sample.sig genbank-k31.lca.json'
(https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-gather for docs)
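Putting steps 1 and 2 together as shell commands (a minimal sketch; the exact filenames, and the assumption that the OSF link serves plain JSON, are mine):

# fetch the prepared GenBank LCA database, then classify the sample;
# sample.sig is assumed to be a k=31 scaled signature of the metagenome
curl -L -o genbank-k31.lca.json https://osf.io/zskb9/download
sourmash lca gather sample.sig genbank-k31.lca.json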
And you can read more about 'gather' vs 'search' vs ... here,
https://sourmash.readthedocs.io/en/latest/classifying-signatures.html
There are a few caveats to lca gather at the moment:
* it's super fast but also memory intensive (~10-12 GB of RAM)
* it doesn't yet go down to strain level, unlike 'sourmash gather'
* it uses a higher scaled value (10000) than the genbank SBT (2000), so it
might be less sensitive to small contributions.
so please do let us know how it goes!
OH! I failed to read your question properly. Gather should not be that slow against a small custom database. @luizirber, do you have suggestions? If you want to make a custom LCA database, we have instructions in the command-line docs and in the lca tutorial; happy to help work through it with you. You'd also need taxonomy information; again, happy to help with that.
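For reference, the taxonomy spreadsheet that 'sourmash lca index' expects looks roughly like the sketch below. The example row is hypothetical; the exact column layout is documented in the lca tutorial, and the identifiers in the first column need to match the signature names:

# write a minimal taxonomy file with one ranked lineage per reference genome
cat > taxonomy.csv <<'EOF'
accession,superkingdom,phylum,class,order,family,genus,species
GCF_000005845,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
EOF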
Hi Titus,
Thanks a lot for the quick reply, as always. I am a bit hesitant about using the LCA function because I want an assessment of read classification by individual genome, rather than grouping by species or genus (for some of the references I don't have taxonomic information beyond family). The gather function was exactly what I was looking for; I couldn't find any other real alternatives out there for reference-based mapping that don't need associated taxonomic information.
I imagine the main issue is probably the number of reads in the query. I guess if I increased the --scaled value for both reference and query (currently at 1000) it would probably speed things up, right? But then again, it might also lower the sensitivity. I will leave it running for now and see how much it gets done in the next couple of days.
Alex
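One note on the --scaled idea above: 'sourmash gather' can downsample the query at run time via its --scaled flag, so a coarser resolution can be tested without recomputing any signatures. A sketch, with hypothetical filenames:

# downsample the query from scaled=1000 to scaled=10000 at search time;
# faster, but small contributions from individual genomes may be missed
sourmash gather -k 31 --scaled 10000 query.sig refname.sbt.json -o query.csv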
welcome :)
'lca gather' doesn't actually use LCA (...I know, I know, terminology), it just
uses the LCA database, so it has "full" resolution!
You can make an LCA database at a scaled value of 1000, and it should be very,
very fast with lca gather. It might take more memory, but for 10k genomes it
will be under 10 GB.
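Concretely, building that database might look like the following sketch (hypothetical filenames; taxonomy.csv as in the earlier example):

# index ~10k reference signatures into an LCA database at full resolution
sourmash lca index taxonomy.csv custom-k31.lca.json refs/*.sig -k 31 --scaled 1000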
Ah, that sounds great actually. Will have a look at the tutorials and give it a go then. Thanks a lot!
welcome & please do let us know how it goes :)
Wow, I just finished running one test sample with lca gather against a subset of 2.5k reference genomes.
Amazing! What a difference. It leaves me wondering, though: is there a particular loss in sensitivity/specificity in using this method as opposed to plain gather? It seems too good to be true.
One additional question, if you don't mind: when indexing the LCA database I got a warning for 40 signatures saying that they were duplicated. Does that mean that 40 of my reference genomes are almost identical to another?
Again, thanks a lot; this will make things much more manageable now.
Alex
:)
there *should* be no loss in sens/spec with it, other than questions
of scaled value. If there is, it's a bug.
re duplicate signatures, can you post some of the messages? it *should*
mean that the signatures are identical, which would mean that (at that
scaled value) there is absolutely no difference between them.
--titus
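One way to double-check a suspected duplicate pair is 'sourmash compare' (a sketch with hypothetical filenames; a reported similarity of 1.0 means the two sketches are indistinguishable at that scaled value):

# pairwise similarity between the two reference signatures at k=31
sourmash compare -k 31 genomeA.sig genomeB.sig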
Cool! I will compare some of the results I got with gather against lca gather and will let you know if I notice any big discrepancies. Regarding the duplicate signatures, I am getting this warning: I might try playing around with --scaled and see if it makes any difference; I want to have as much resolution as possible. Thanks again!
Hi! Sorry for tagging onto an old thread, but I'm having the same issue with
yes, the current lca gather database does not go down to strain level. It's awfully incomplete too, as it turns out; working on better databases here: #537.
A few other thoughts, spurred by work on #533 --
#615 should significantly speed up gather (17h -> 17 min for one analysis...)
Hi,
I am analysing a few thousand metagenomes (FASTQ.GZ files) against a custom-made reference database of ~10K genomes. I want to assess how much of each reference genome is present in the raw reads of the query. I am using "gather" for this, but it is taking >1 day to run each analysis. For building the signatures (both query and reference) I did:
sourmash compute -k 31 --scaled 1000 --track-abundance --name-from-first -o out.sig in.fasta/fastq.gz
For indexing the database:
sourmash index -k 31 ref_name *.sig
Then running gather:
sourmash gather -k 31 query.sig refname.sbt.json -o query.csv > query.sm
The samples have quite a high sequencing depth (10 million reads on average), so is this expected, or might I be doing something wrong? Do you have any tips on how I could speed things up?
Many thanks in advance,
Alex