lca summarize abundance with outfile #1833

jsgounot · 2022-02-08T11:46:14Z

Hi,

thanks for this software. I'm trying sourmash as an alternative for abundance estimation. When I run the sourmash lca summarize command, it works fine and reports in stdout what seems to be abundance values (at least a percentage). However, when I define an output file with the same query and database , this column does not appear and I only have count.

Sorry if I missed something here.
sourmash 4.2.4

Regards,
JSG

The text was updated successfully, but these errors were encountered:

jsgounot · 2022-02-08T11:49:53Z

I would like to take advantage of this issue to ask a question relative to this analysis. When building a custom database (sourmash lca index), what is the best way to manage multiple strains from the same species? Should I generate a single signature for all strains? Or is it ok to provide multiple signatures with the same taxonomic levels?

ctb · 2022-02-08T14:02:01Z

Hi,

thanks for this software. I'm trying sourmash as an alternative for abundance estimation. When I run the sourmash lca summarize command, it works fine and reports in stdout what seems to be abundance values (at least a percentage). However, when I define an output file with the same query and database , this column does not appear and I only have count.

Sorry if I missed something here. sourmash 4.2.4

hi @jsgounot thanks for the question! Are you using sourmash lca summarize -o <output> to do this?

ctb · 2022-02-08T14:03:42Z

I would like to take advantage of this issue to ask a question relative to this analysis. When building a custom database (sourmash lca index), what is the best way to manage multiple strains from the same species? Should I generate a single signature for all strains? Or is it ok to provide multiple signatures with the same taxonomic levels?

Since we are interested in strain-specific matching, we generally use one signature for each strain. I have to go dig to see where the output can take advantage of that, but at the very least you should get appropriate summarization at all the levels above strain! More soon.

ctb · 2022-02-08T14:16:59Z

Hi,
thanks for this software. I'm trying sourmash as an alternative for abundance estimation. When I run the sourmash lca summarize command, it works fine and reports in stdout what seems to be abundance values (at least a percentage). However, when I define an output file with the same query and database , this column does not appear and I only have count.
Sorry if I missed something here. sourmash 4.2.4

hi @jsgounot thanks for the question! Are you using sourmash lca summarize -o <output> to do this?

OK, dug into the code - the percentage in the text output to the screen is calculated as the count for that lineage divided by the total counts in the query sketch. The count column in the CSV output is the numerator used for the percentage, but the denominator is not provided anywhere as far as I can see. It's a constant number so it doesn't change the relative abudances, but I can see why you would want it :)

It would be straightforward to add this to the output format for CSV as a new column, total_counts. Shall we do so?

jsgounot · 2022-02-09T01:44:09Z

Are you using sourmash lca summarize -o to do this?

Yes I am.

Since we are interested in strain-specific matching, we generally use one signature for each strain. I have to go dig to see where the output can take advantage of that, but at the very least you should get appropriate summarization at all the levels above strain! More soon.

Thanks for this answer, make sense.

It would be straightforward to add this to the output format for CSV as a new column, total_counts. Shall we do so?

This would be critical for me. I also need to know the proportion of my reads which does not 'match' with any of the reference, but this could be calculated with the above number. I guess however that results are biased by references length variation, unless you're able to retrieve sequences original size from signatures (and that no sequences were merged for one signature).

ctb · 2022-02-09T15:07:49Z

It would be straightforward to add this to the output format for CSV as a new column, total_counts. Shall we do so?

This would be critical for me.

OK! I note two places where this info should be added - one is in the output of lca summarize -o <csv>, the other is in sourmash sig describe, which right now just gives the number of distinct hashes but does not give the sum.

I also need to know the proportion of my reads which does not 'match' with any of the reference, but this could be calculated with the above number. I guess however that results are biased by references length variation, unless you're able to retrieve sequences original size from signatures (and that no sequences were merged for one signature).

Right, so you'd be using the fraction of total k-mers that is not matched as a proxy for the fraction of total reads that do not match.

We are also planning to add estimation of total k-mers in reference at some point soon; see #1798 for motivation.

I'm totally on board with adding all of this to sourmash and supporting your use case with lca summarize, but I also wanted to point you in another possible direction: using sourmash gather for some of this. You can read more about it in this preprint and also see the genome-grist software but shorn of all of the extra details:

use sourmash gather with abundance-weighted queries against a database of genome references: sourmash gather query.sig db1 [ db2 ... ] -o output.csv
Use the f_unique_weighted column to get your fractions - the sum of this number across all rows will give you the weighted fraction of all k-mers that match to anything in the database, and for each row it will give you the amount that matches to the specific genome.
You can also use sourmash tax to take the output.csv and add taxonomic information to it if you like.

There's some slightly outdated documentation on this here. We just got the preprint out so at some point we'll revise the sourmash documentation appropriately :).

Last but not least, note that the genome-grist software will use sourmash to find the relevant references to a metagenome, download them for you from GenBank, and then actually do the mapping for you. This should get you to the actual read numbers. genome-grist is less mature than sourmash, though, so a bit of buyer beware applies! We'd love feedback if this looks interesting and needs some specific additions to meet your needs.

jsgounot · 2022-02-11T06:45:16Z

Hi,

Right, so you'd be using the fraction of total k-mers that is not matched as a proxy for the fraction of total reads that do not match.

Correct.

Concerning the rest of your nice answer, sourmash gather sounds an interesting tool but might not be adapted in my case where I 1) have a complete database with most of the diversity and 2) want to add some custom additional genomes easily. Am I wrong here?

I do have another question: Does lca method include the winner take all as described here?

Thank you very much for your time.

ctb · 2022-02-11T13:31:18Z

Hi,

Right, so you'd be using the fraction of total k-mers that is not matched as a proxy for the fraction of total reads that do not match.

Correct.

👍 !!

Concerning the rest of your nice answer, sourmash gather sounds an interesting tool but might not be adapted in my case where I 1) have a complete database with most of the diversity and 2) want to add some custom additional genomes easily. Am I wrong here?

It should work fine in that case, too.

I do have another question: Does lca method include the winner take all as described here?

No, the gather method is the winner-take-all method (that then became the min set cov/minimum metagenome cover discussed in the preprint for sourmash gather).

We've mostly switched to gather from lca ourselves as a taxonomic analysis solution, because lca doesn't provide strain level resolution. But it's definitely a forward leaning position that we're still exploring and explaining!

Thank you very much for your time.

You're very welcome! Thank you for all the questions!

I will see if I can get the updates to lca summarize in this weekend, although it might be a few more weeks before it's available in a release.

ctb · 2022-02-12T19:59:06Z

#1833 should fix lca summarize output.

Still need to create an issue about maybe updating the output of sourmash sig describe.

ctb · 2022-02-14T16:25:49Z

#1833 has been merged into latest. Will ping you again when it's available in a sourmash release; let me know if you would like help trying it out in the meantime!

ctb · 2022-03-12T04:18:57Z

#1833 is now available in sourmash v4.3.0.

ctb mentioned this issue Feb 12, 2022

[MRG] update lca summarize output to output total counts #1838

Merged

ctb mentioned this issue Feb 16, 2022

does weighted fraction matching k-mers =~ % reads that map? dib-lab/genome-grist#167

Open

This was referenced Mar 12, 2022

describe sourmash estimates and their correlations with read mapping rates #1881

Closed

[MRG] add sum_hashes to sourmash sig describe output. #1882

Merged

ctb closed this as completed in #1882 Mar 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lca summarize abundance with outfile #1833

lca summarize abundance with outfile #1833

jsgounot commented Feb 8, 2022

jsgounot commented Feb 8, 2022

ctb commented Feb 8, 2022

ctb commented Feb 8, 2022

ctb commented Feb 8, 2022

jsgounot commented Feb 9, 2022 •

edited

Loading

ctb commented Feb 9, 2022

jsgounot commented Feb 11, 2022

ctb commented Feb 11, 2022

ctb commented Feb 12, 2022

ctb commented Feb 14, 2022

ctb commented Mar 12, 2022

lca summarize abundance with outfile #1833

lca summarize abundance with outfile #1833

Comments

jsgounot commented Feb 8, 2022

jsgounot commented Feb 8, 2022

ctb commented Feb 8, 2022

ctb commented Feb 8, 2022

ctb commented Feb 8, 2022

jsgounot commented Feb 9, 2022 • edited Loading

ctb commented Feb 9, 2022

jsgounot commented Feb 11, 2022

ctb commented Feb 11, 2022

ctb commented Feb 12, 2022

ctb commented Feb 14, 2022

ctb commented Mar 12, 2022

jsgounot commented Feb 9, 2022 •

edited

Loading