add script that computes and prints the three different ways to make … #2

dkoslicki · 2022-05-16T03:00:23Z

…ANI/distance/containment matrices

Basically, we have Mahmudur's code, Tessa's code and Titus's code which all purport to compute a matrix of ANI (or distance, easily converted) values. Unfortunately, as this script demonstrates, all three lead to different results.

For example, for one pair of genomes, these approaches gave respectively ANI values of: 0.6489, 0.0, and 0.3244.

…ANI/distance/containment matrices

bluegenes · 2022-05-16T16:12:57Z

I see at least one of the issues -- by default sourmash compare uses jaccard, not containment, so adding --ani to the default command uses jaccard --> ANI. To run containment, we need to add --containment to the compare --ani command. Using --containment will produce different values for the different comparison directions. I was waiting for your thoughts on my average containment approach before adding it as an option to compare.

As for the 0 -- that might be happening as a result of zeroing out the ANI, per sourmash-bio/sourmash#2003

will continue looking...

ctb · 2022-05-16T17:24:18Z

as an aside, we might want to come back to the sourmash code to automatically enable --containment (with a message ;) when --ani is specified.

dkoslicki · 2022-05-16T19:03:14Z

@bluegenes that does make sense with jaccard vs containment. Apologies if I overlooked something, but remind me when/where you asked for my thoughts on average containment?

as an aside, we might want to come back to the sourmash code to automatically enable --containment (with a message ;) when --ani is specified.

Agreed!

bluegenes · 2022-05-16T20:35:17Z

@bluegenes that does make sense with jaccard vs containment. Apologies if I overlooked something, but remind me when/where you asked for my thoughts on average containment?

Not written anywhere -- it was during our last meeting. I mentioned that I was using it for all my comparisons to mapping ANI values, and wanted to hear any caveats/issues you could think of around using it more widely /returning it directly from sourmash.

The deeper question was on how to handle confidence intervals (if at all) for average containment. Since I'm taking the average of the containment ANI values, I'm not generating confidence intervals for avg containment.

as an aside, we might want to come back to the sourmash code to automatically enable --containment (with a message ;) when --ani is specified.

With sourmash-bio/sourmash#2056, I enabled--avg-containment for compare as well. Since it would produce a symmetrical matrix (as does jaccard), would it be a better compare default than ANI from --containment?

bluegenes · 2022-05-16T22:26:08Z

The last major difference I see is that I didn't actually change our scaled containment to use the bias factor yet, see sourmash-bio/sourmash#1798 (comment).

Using dkoslicki#1, the matrices look much more alike, though there are a couple cases with large discrepancies. I've tracked the big discrepancies down, and they're a result of --avg-containment ANI averaging a larger ANI with a 0.0 (due to the hard coded thresholds you pointed out). These thresholds were in the original equations from the mutation-rate-ci-calculator, but I've removed them in sourmash-bio/sourmash#2060 since they seem to be causing issues.

One thing I could do that would not involve an issue with semantic versioning -- I could use the containment bias factor just during ANI estimation, and continue ignoring it for non-ani containment for now. Testing in sourmash-bio/sourmash#2057.

mamba env for installations; use compare --avg-containment --ani

dkoslicki · 2022-05-31T23:05:13Z

And @mahmudhera it might make sense to move the discussion here instead of Tessa's PR on my fork. That way the trail is easier to follow

mahmudhera · 2022-06-01T01:38:53Z

You are correct. To make it easier to track things:

The issue has been solved partly by switching to sourmash latest branch (after Tessa's update to use avg-containment in sourmash), which resulted in an agreement between sourmash ANI estimate, and Tessa's ANI estimate.

The minor disagreement between my code in this repository and sourmash compare (reported here by Tessa) was solved by using the same seed (previously, we were using different seeds).

Currently, the script compare_matrices.py uses different seeds to compute ANI, and prints the largest % of the relative difference, which is ~4.5%. I think previously, we had > 50% relative difference in some cases because (a) Jaccard->ANI, (b) using hardcoded thresholds in sourmash (which are dropped for now).

add script that computes and prints the three different ways to make …

b8cb042

…ANI/distance/containment matrices

dkoslicki mentioned this pull request May 16, 2022

compare and visualize the similarity of many genomes using Average Nucleotide Identity (ANI) sourmash-bio/sourmash-examples#16

Open

mamba env for installations; use compare --avg-containment --ani

4f6c699

Merge pull request #1 from bluegenes/main

77196cd

mamba env for installations; use compare --avg-containment --ani

bluegenes mentioned this pull request May 19, 2022

[MRG] fix thresholds in distance_utils sourmash-bio/sourmash#2060

Merged

mahmudhera merged commit 5ebbe13 into mahmudhera:main May 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add script that computes and prints the three different ways to make … #2

add script that computes and prints the three different ways to make … #2

dkoslicki commented May 16, 2022

bluegenes commented May 16, 2022 •

edited

Loading

ctb commented May 16, 2022

dkoslicki commented May 16, 2022

bluegenes commented May 16, 2022 •

edited

Loading

bluegenes commented May 16, 2022 •

edited

Loading

dkoslicki commented May 31, 2022

mahmudhera commented Jun 1, 2022

add script that computes and prints the three different ways to make … #2

add script that computes and prints the three different ways to make … #2

Conversation

dkoslicki commented May 16, 2022

bluegenes commented May 16, 2022 • edited Loading

ctb commented May 16, 2022

dkoslicki commented May 16, 2022

bluegenes commented May 16, 2022 • edited Loading

bluegenes commented May 16, 2022 • edited Loading

dkoslicki commented May 31, 2022

mahmudhera commented Jun 1, 2022

bluegenes commented May 16, 2022 •

edited

Loading

bluegenes commented May 16, 2022 •

edited

Loading

bluegenes commented May 16, 2022 •

edited

Loading