sourmash vs aligners, & comparison with other tools #3079
Comments
will draft longer answer in a bit, but can I confirm what 'coverage' means here?
do you mean fraction of genome covered by at least one read ('breadth' in inStrain, or 'detection' in anvi'o), or something else? |
The best reference to use is Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets, Portik et al., which shows that sourmash has very few false positives at a taxonomic level. This does suggest relatively few false positives at a genomic level as well. The underlying algorithm that does the actual assignment is implemented as But... we don't have anything written as to why it works well. I have unpublished analyses and a general understanding of what's going on, and there's information scattered around the repo (see #2360 specifically). The short answer appears to be that the combinatorial nature of sourmash matching with FracMinHash and containment makes matches as low as 3 hashes (a
We always like to hear that ;)
Sure - I think biobakery uses mapping underneath, right? Happy to chat more here about such things; if we need to move to e-mail for detailed stuff, drop me a note at ctbrown@ucdavis.edu (but I'd be happiest to keep as much public as possible). In general, sourmash has the goal of rapid and extremely large scale analysis against all available genomes, including private genomes; while other tools focus more on a curated set of high quality genomes, usually those available from NCBI or GTDB. |
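For readers unfamiliar with the FracMinHash and containment terms used above, here is a minimal toy sketch of the idea. It is not sourmash's implementation: the function names (`frac_minhash`, `containment`), the use of SHA-256 as the hash, and the tiny `k` and `scaled` values are all assumptions chosen for illustration. sourmash itself uses a different hash function and handles reverse complements, none of which is reproduced here.

```python
import hashlib

MAX_HASH = 2 ** 64  # hashes are treated as 64-bit unsigned integers


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def frac_minhash(seq, k=4, scaled=2):
    """Toy FracMinHash: keep each k-mer hash that falls below MAX_HASH / scaled.

    On average about 1/scaled of the distinct k-mers are retained, and the
    same k-mer is always retained (or dropped) regardless of which sequence
    it comes from.
    """
    threshold = MAX_HASH // scaled
    sketch = set()
    for kmer in kmers(seq, k):
        digest = hashlib.sha256(kmer.encode()).digest()
        h = int.from_bytes(digest[:8], "big")  # stand-in 64-bit hash (assumption)
        if h < threshold:
            sketch.add(h)
    return sketch


def containment(query, reference):
    """Fraction of the query sketch's hashes that also occur in the reference sketch."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


genome = frac_minhash("ACGTACGTTAGC", k=4, scaled=1)
fragment = frac_minhash("ACGTACGT", k=4, scaled=1)
print(containment(fragment, genome))
```

Because the retention rule depends only on the hash value, a sketch of a fragment is always a subset of the sketch of the sequence it came from, which is what makes containment between sketches meaningful.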
The coverage I meant here was specifically the fraction of the genome covered by at least one read (the breadth) |
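To make the "breadth" definition concrete, here is a small sketch that computes the fraction of a genome covered by at least one read from a list of mapped read intervals. The function name and the interval representation (0-based, half-open `(start, end)` pairs with non-negative starts) are assumptions for illustration only; this is not how inStrain, anvi'o, or sourmash compute the value.

```python
def breadth_of_coverage(genome_length, read_intervals):
    """Return the fraction of genome positions covered by at least one read.

    read_intervals: iterable of (start, end) pairs, 0-based and half-open.
    Overlapping reads are only counted once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_end = 0  # rightmost position already counted as covered
    for start, end in sorted(read_intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, genome_length)    # ignore anything past the genome end
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


# Two overlapping reads and one separate read on a 100 bp genome.
print(breadth_of_coverage(100, [(0, 10), (5, 15), (20, 30)]))  # 0.25
```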
Dear Titus, biobakery uses its own genome set to generate results, and we think its so-called biomarkers are derived from that manually curated set. We, on the other hand, sketched the GTDB-Tk representative genomes and some de novo assembled genomes with sourmash, and used this custom database to generate abundance tables from paired-end raw reads. Overall, the phylum- and genus-level composition, and even the alpha diversity, are comparable between the two. We also ran DESeq2 or MaAsLin2 to look for differences between sample groups. Some species showed higher abundance within a specific group, and these results appeared with both biobakery and your method. We don't fully trust the biobakery results, so we wanted to use sourmash to validate or supplement any results of interest to our hypothesis. This project is a large cohort study of Asian smokers and non-smokers, and it includes not only quantification analysis but also genomes assembled from the samples. The project is complicated, so forgive me for not being able to describe it more clearly. We are big fans of yours, because our pipelines rely heavily on your tool, and we look forward to any improvements in your updates. If we have any good news, we will let you know and bring it here to discuss with you. |
Great! Yes, in that case, our experience is that sourmash is accurate at the species level down to about 3 hashes, or |
Excellent, really glad to hear this - please post questions and so on as you have them! |
Hi @ctb! I'd like to slide into this discussion, hopefully without straying too far from the subject. @yuzie0314 is comparing (I assume) MetaPhlAn results with sourmash, which I've also been doing (along with kraken/bracken using conservative confidence cutoffs, as suggested by a great paper that I wish had also considered sourmash). My question is: would it be accurate to say that sourmash gather reports sequence abundance, and not taxonomic abundance? I'm referring to the distinction made by this paper, which explains that k-mer based methods (again kraken/bracken) report sequence abundance, which is not genome-length normalised, as opposed to marker-gene methods. Are we getting something conceptually similar with sourmash, or is there some form of genome-length normalisation going on in the abundance estimation? Looking forward to your reply, |
You are correct - no genome-length normalization is going on! (Thanks for the references, too :) I've been exploring this a little bit with @bryshalm who is looking at sourmash for OTU-style processing, and my hot take is that failing to normalize for genome length is similar to our general disregard for 16S/SSU rRNA copy number - you just need to be aware that it's a thing :). I also do not think, even in principle, that you can properly normalize for genome length - there are too many unknowns around pangenomic content in metagenomes vs what's in the reference database - but happy to discuss further! |
I agree, the important thing is to acknowledge this. Thanks for the quick reply! |
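The distinction discussed above can be illustrated with a toy calculation. The sketch below contrasts sequence abundance (share of assigned base pairs) with a naive genome-length-normalised abundance (share of per-genome depth). All names and numbers are hypothetical, and the normalisation shown is only the textbook idea; it is not something sourmash does, and as noted in the reply above it may not be properly achievable on real metagenomes.

```python
def relative(values):
    """Scale a dict of non-negative numbers so that they sum to 1."""
    total = sum(values.values())
    return {name: v / total for name, v in values.items()}


def sequence_abundance(bp_assigned):
    """Share of assigned base pairs per genome (no length normalisation)."""
    return relative(bp_assigned)


def length_normalised_abundance(bp_assigned, genome_lengths):
    """Share of per-genome depth (base pairs divided by genome length)."""
    depth = {name: bp / genome_lengths[name] for name, bp in bp_assigned.items()}
    return relative(depth)


# Hypothetical example: two genomes present at the same depth (10x),
# but genome_B is three times longer than genome_A.
lengths = {"genome_A": 2_000_000, "genome_B": 6_000_000}
assigned = {"genome_A": 20_000_000, "genome_B": 60_000_000}

print(sequence_abundance(assigned))                    # genome_B dominates
print(length_normalised_abundance(assigned, lengths))  # both genomes equal
```

In this toy case the longer genome appears three times more abundant by sequence, even though both genomes are present at the same depth.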
(copied from #461 (comment))
Hi Titus @ctb, I don't know whether it is appropriate to ask the following questions under this issue, but anyway...
We've used your tool to quantify the abundance of a set number of reference genomes against raw reads (query fastq).
In addition, following your suggestion here, we used median_abund to determine which genomes are differentially abundant among sample groups. However, we had some doubts about this value. Forgive me for hijacking this discussion.
Yuzie
Originally posted by @yuzie0314 in #461 (comment)