lca summarize abundance with outfile #1833
Comments
I would like to take advantage of this issue to ask a question relative to this analysis. When building a custom database ( |
hi @jsgounot thanks for the question! Are you using |
Since we are interested in strain-specific matching, we generally use one signature for each strain. I have to go dig to see where the output can take advantage of that, but at the very least you should get appropriate summarization at all the levels above strain! More soon. |
OK, dug into the code - the percentage in the text output to the screen is calculated as the count for that lineage divided by the total counts in the query sketch. The It would be straightforward to add this to the output format for CSV as a new column, |
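The calculation described above (count for a lineage divided by the total counts in the query sketch) can be sketched in plain Python. This is only an illustration and not sourmash's actual code: the function name `lineage_fractions`, the dictionary layout, and the example counts are all hypothetical.

```python
# Minimal sketch, assuming per-lineage counts are already available.
# `lineage_fractions` and the sample data are hypothetical, not part of sourmash.

def lineage_fractions(lineage_counts, total_query_count):
    """Return count / total for each lineage, as a fraction in [0, 1]."""
    if total_query_count <= 0:
        raise ValueError("total_query_count must be positive")
    return {
        lineage: count / total_query_count
        for lineage, count in lineage_counts.items()
    }


# Made-up counts for two lineages and a query sketch with 400 total counts.
counts = {"Bacteria;Proteobacteria": 150, "Bacteria;Firmicutes": 50}
fractions = lineage_fractions(counts, total_query_count=400)

for lineage, fraction in fractions.items():
    # Print as a percentage, similar in spirit to the on-screen summary.
    print(f"{fraction:.1%}  {lineage}")
```

With only the per-lineage count in a CSV, the same fraction cannot be recovered unless the total for the query sketch is known as well, which is why a separate column would be needed.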
Yes I am.
Thanks for this answer, makes sense.
This would be critical for me. I also need to know the proportion of my reads that does not 'match' any of the references, but this could be calculated from the number above. I guess, however, that the results are biased by variation in reference length, unless you're able to retrieve the sequences' original sizes from the signatures (and provided no sequences were merged into one signature).
OK! I note two places where this info should be added - one is in the output of
Right, so you'd be using the fraction of total k-mers that is not matched as a proxy for the fraction of total reads that do not match. We are also planning to add estimation of total k-mers in reference at some point soon; see #1798 for motivation. I'm totally on board with adding all of this to sourmash and supporting your use case with
There's some slightly outdated documentation on this here. We just got the preprint out so at some point we'll revise the sourmash documentation appropriately :). Last but not least, note that the genome-grist software will use sourmash to find the relevant references to a metagenome, download them for you from GenBank, and then actually do the mapping for you. This should get you to the actual read numbers. genome-grist is less mature than sourmash, though, so a bit of buyer beware applies! We'd love feedback if this looks interesting and needs some specific additions to meet your needs. |
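The proxy discussed above, the fraction of total k-mers that is not matched standing in for the fraction of unmatched reads, might be computed roughly as follows. This is a sketch under stated assumptions: the function `unmatched_fraction` and the example numbers are hypothetical, and it assumes the counts passed in come from a single top-level rank so that no k-mer is counted twice.

```python
# Minimal sketch, assuming counts at one top-level rank do not overlap.
# `unmatched_fraction` and the sample data are hypothetical, not part of sourmash.

def unmatched_fraction(top_rank_counts, total_query_count):
    """Fraction of the query's total counts not assigned to any lineage."""
    if total_query_count <= 0:
        raise ValueError("total_query_count must be positive")
    matched = sum(top_rank_counts.values())
    if matched > total_query_count:
        raise ValueError("matched counts exceed the query total")
    return (total_query_count - matched) / total_query_count


# Made-up example: 250 of 400 total counts are assigned at the top rank.
top_rank = {"Bacteria": 200, "Archaea": 50}
print(f"unmatched: {unmatched_fraction(top_rank, 400):.1%}")
```

As noted in the thread, this is a k-mer-level estimate; it only approximates the fraction of reads that do not map, and actual read numbers would come from a mapping step.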
Hi,
Correct. Concerning the rest of your nice answer, I do have another question: Does Thank you very much for your time. |
👍 !!
It should work fine in that case, too.
No, the gather method is the winner-take-all method (that then became the min set cov/minimum metagenome cover discussed in the preprint for sourmash gather). We've mostly switched to gather from lca ourselves as a taxonomic analysis solution, because lca doesn't provide strain level resolution. But it's definitely a forward leaning position that we're still exploring and explaining!
You're very welcome! Thank you for all the questions! I will see if I can get the updates to lca summarize in this weekend, although it might be a few more weeks before it's available in a release. |
#1833 should fix Still need to create an issue about maybe updating the output of |
#1833 has been merged into |
#1833 is now available in sourmash v4.3.0. |
Hi,
thanks for this software. I'm trying sourmash as an alternative for abundance estimation. When I run the
sourmash lca summarize
command, it works fine and reports in stdout what seem to be abundance values (at least a percentage). However, when I define an output file with the same query and database, this column does not appear and I only have the count. Sorry if I missed something here.
sourmash 4.2.4
Regards,
JSG