Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lca summarize abundance with outfile #1833

Closed
jsgounot opened this issue Feb 8, 2022 · 11 comments · Fixed by #1882
Closed

lca summarize abundance with outfile #1833

jsgounot opened this issue Feb 8, 2022 · 11 comments · Fixed by #1882

Comments

@jsgounot
Copy link

jsgounot commented Feb 8, 2022

Hi,

thanks for this software. I'm trying sourmash as an alternative for abundance estimation. When I run the sourmash lca summarize command, it works fine and reports in stdout what seems to be abundance values (at least a percentage). However, when I define an output file with the same query and database , this column does not appear and I only have count.

Sorry if I missed something here.
sourmash 4.2.4

Regards,
JSG

@jsgounot
Copy link
Author

jsgounot commented Feb 8, 2022

I would like to take advantage of this issue to ask a question relative to this analysis. When building a custom database (sourmash lca index), what is the best way to manage multiple strains from the same species? Should I generate a single signature for all strains? Or is it ok to provide multiple signatures with the same taxonomic levels?

@ctb
Copy link
Contributor

ctb commented Feb 8, 2022

Hi,

thanks for this software. I'm trying sourmash as an alternative for abundance estimation. When I run the sourmash lca summarize command, it works fine and reports in stdout what seems to be abundance values (at least a percentage). However, when I define an output file with the same query and database , this column does not appear and I only have count.

Sorry if I missed something here. sourmash 4.2.4

hi @jsgounot thanks for the question! Are you using sourmash lca summarize -o <output> to do this?

@ctb
Copy link
Contributor

ctb commented Feb 8, 2022

I would like to take advantage of this issue to ask a question relative to this analysis. When building a custom database (sourmash lca index), what is the best way to manage multiple strains from the same species? Should I generate a single signature for all strains? Or is it ok to provide multiple signatures with the same taxonomic levels?

Since we are interested in strain-specific matching, we generally use one signature for each strain. I have to go dig to see where the output can take advantage of that, but at the very least you should get appropriate summarization at all the levels above strain! More soon.

@ctb
Copy link
Contributor

ctb commented Feb 8, 2022

Hi,
thanks for this software. I'm trying sourmash as an alternative for abundance estimation. When I run the sourmash lca summarize command, it works fine and reports in stdout what seems to be abundance values (at least a percentage). However, when I define an output file with the same query and database , this column does not appear and I only have count.
Sorry if I missed something here. sourmash 4.2.4

hi @jsgounot thanks for the question! Are you using sourmash lca summarize -o <output> to do this?

OK, dug into the code - the percentage in the text output to the screen is calculated as the count for that lineage divided by the total counts in the query sketch. The count column in the CSV output is the numerator used for the percentage, but the denominator is not provided anywhere as far as I can see. It's a constant number so it doesn't change the relative abudances, but I can see why you would want it :)

It would be straightforward to add this to the output format for CSV as a new column, total_counts. Shall we do so?

@jsgounot
Copy link
Author

jsgounot commented Feb 9, 2022

Are you using sourmash lca summarize -o to do this?

Yes I am.

Since we are interested in strain-specific matching, we generally use one signature for each strain. I have to go dig to see where the output can take advantage of that, but at the very least you should get appropriate summarization at all the levels above strain! More soon.

Thanks for this answer, make sense.

It would be straightforward to add this to the output format for CSV as a new column, total_counts. Shall we do so?

This would be critical for me. I also need to know the proportion of my reads which does not 'match' with any of the reference, but this could be calculated with the above number. I guess however that results are biased by references length variation, unless you're able to retrieve sequences original size from signatures (and that no sequences were merged for one signature).

@ctb
Copy link
Contributor

ctb commented Feb 9, 2022

It would be straightforward to add this to the output format for CSV as a new column, total_counts. Shall we do so?

This would be critical for me.

OK! I note two places where this info should be added - one is in the output of lca summarize -o <csv>, the other is in sourmash sig describe, which right now just gives the number of distinct hashes but does not give the sum.

I also need to know the proportion of my reads which does not 'match' with any of the reference, but this could be calculated with the above number. I guess however that results are biased by references length variation, unless you're able to retrieve sequences original size from signatures (and that no sequences were merged for one signature).

Right, so you'd be using the fraction of total k-mers that is not matched as a proxy for the fraction of total reads that do not match.

We are also planning to add estimation of total k-mers in reference at some point soon; see #1798 for motivation.

I'm totally on board with adding all of this to sourmash and supporting your use case with lca summarize, but I also wanted to point you in another possible direction: using sourmash gather for some of this. You can read more about it in this preprint and also see the genome-grist software but shorn of all of the extra details:

  1. use sourmash gather with abundance-weighted queries against a database of genome references: sourmash gather query.sig db1 [ db2 ... ] -o output.csv
  2. Use the f_unique_weighted column to get your fractions - the sum of this number across all rows will give you the weighted fraction of all k-mers that match to anything in the database, and for each row it will give you the amount that matches to the specific genome.
  3. You can also use sourmash tax to take the output.csv and add taxonomic information to it if you like.

There's some slightly outdated documentation on this here. We just got the preprint out so at some point we'll revise the sourmash documentation appropriately :).

Last but not least, note that the genome-grist software will use sourmash to find the relevant references to a metagenome, download them for you from GenBank, and then actually do the mapping for you. This should get you to the actual read numbers. genome-grist is less mature than sourmash, though, so a bit of buyer beware applies! We'd love feedback if this looks interesting and needs some specific additions to meet your needs.

@jsgounot
Copy link
Author

Hi,

Right, so you'd be using the fraction of total k-mers that is not matched as a proxy for the fraction of total reads that do not match.

Correct.

Concerning the rest of your nice answer, sourmash gather sounds an interesting tool but might not be adapted in my case where I 1) have a complete database with most of the diversity and 2) want to add some custom additional genomes easily. Am I wrong here?

I do have another question: Does lca method include the winner take all as described here?

Thank you very much for your time.

@ctb
Copy link
Contributor

ctb commented Feb 11, 2022

Hi,

Right, so you'd be using the fraction of total k-mers that is not matched as a proxy for the fraction of total reads that do not match.

Correct.

👍 !!

Concerning the rest of your nice answer, sourmash gather sounds an interesting tool but might not be adapted in my case where I 1) have a complete database with most of the diversity and 2) want to add some custom additional genomes easily. Am I wrong here?

It should work fine in that case, too.

I do have another question: Does lca method include the winner take all as described here?

No, the gather method is the winner-take-all method (that then became the min set cov/minimum metagenome cover discussed in the preprint for sourmash gather).

We've mostly switched to gather from lca ourselves as a taxonomic analysis solution, because lca doesn't provide strain level resolution. But it's definitely a forward leaning position that we're still exploring and explaining!

Thank you very much for your time.

You're very welcome! Thank you for all the questions!

I will see if I can get the updates to lca summarize in this weekend, although it might be a few more weeks before it's available in a release.

@ctb
Copy link
Contributor

ctb commented Feb 12, 2022

#1833 should fix lca summarize output.

Still need to create an issue about maybe updating the output of sourmash sig describe.

@ctb
Copy link
Contributor

ctb commented Feb 14, 2022

#1833 has been merged into latest. Will ping you again when it's available in a sourmash release; let me know if you would like help trying it out in the meantime!

@ctb
Copy link
Contributor

ctb commented Mar 12, 2022

#1833 is now available in sourmash v4.3.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants