Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add script that computes and prints the three different ways to make … #2

Merged
merged 3 commits into from
May 26, 2022

Conversation

dkoslicki
Copy link
Contributor

…ANI/distance/containment matrices

Basically, we have Mahmudur's code, Tessa's code and Titus's code which all purport to compute a matrix of ANI (or distance, easily converted) values. Unfortunately, as this script demonstrates, all three lead to different results.

For example, for one pair of genomes, these approaches gave respectively ANI values of: 0.6489, 0.0, and 0.3244.

@bluegenes
Copy link
Contributor

bluegenes commented May 16, 2022

I see at least one of the issues -- by default sourmash compare uses jaccard, not containment, so adding --ani to the default command uses jaccard --> ANI. To run containment, we need to add --containment to the compare --ani command. Using --containment will produce different values for the different comparison directions. I was waiting for your thoughts on my average containment approach before adding it as an option to compare.

As for the 0 -- that might be happening as a result of zeroing out the ANI, per sourmash-bio/sourmash#2003

will continue looking...

@ctb
Copy link

ctb commented May 16, 2022

as an aside, we might want to come back to the sourmash code to automatically enable --containment (with a message ;) when --ani is specified.

@dkoslicki
Copy link
Contributor Author

@bluegenes that does make sense with jaccard vs containment. Apologies if I overlooked something, but remind me when/where you asked for my thoughts on average containment?

as an aside, we might want to come back to the sourmash code to automatically enable --containment (with a message ;) when --ani is specified.

Agreed!

@bluegenes
Copy link
Contributor

bluegenes commented May 16, 2022

@bluegenes that does make sense with jaccard vs containment. Apologies if I overlooked something, but remind me when/where you asked for my thoughts on average containment?

Not written anywhere -- it was during our last meeting. I mentioned that I was using it for all my comparisons to mapping ANI values, and wanted to hear any caveats/issues you could think of around using it more widely /returning it directly from sourmash.

The deeper question was on how to handle confidence intervals (if at all) for average containment. Since I'm taking the average of the containment ANI values, I'm not generating confidence intervals for avg containment.

as an aside, we might want to come back to the sourmash code to automatically enable --containment (with a message ;) when --ani is specified.

With sourmash-bio/sourmash#2056, I enabled--avg-containment for compare as well. Since it would produce a symmetrical matrix (as does jaccard), would it be a better compare default than ANI from --containment?

@bluegenes
Copy link
Contributor

bluegenes commented May 16, 2022

The last major difference I see is that I didn't actually change our scaled containment to use the bias factor yet, see sourmash-bio/sourmash#1798 (comment).

Using dkoslicki#1, the matrices look much more alike, though there are a couple cases with large discrepancies. I've tracked the big discrepancies down, and they're a result of --avg-containment ANI averaging a larger ANI with a 0.0 (due to the hard coded thresholds you pointed out). These thresholds were in the original equations from the mutation-rate-ci-calculator, but I've removed them in sourmash-bio/sourmash#2060 since they seem to be causing issues.

One thing I could do that would not involve an issue with semantic versioning -- I could use the containment bias factor just during ANI estimation, and continue ignoring it for non-ani containment for now. Testing in sourmash-bio/sourmash#2057.

mamba env for installations; use compare --avg-containment --ani
@dkoslicki
Copy link
Contributor Author

And @mahmudhera it might make sense to move the discussion here instead of Tessa's PR on my fork. That way the trail is easier to follow

@mahmudhera
Copy link
Owner

You are correct. To make it easier to track things:

The issue has been solved partly by switching to sourmash latest branch (after Tessa's update to use avg-containment in sourmash), which resulted in an agreement between sourmash ANI estimate, and Tessa's ANI estimate.

The minor disagreement between my code in this repository and sourmash compare (reported here by Tessa) was solved by using the same seed (previously, we were using different seeds).

Currently, the script compare_matrices.py uses different seeds to compute ANI, and prints the largest % of the relative difference, which is ~4.5%. I think previously, we had > 50% relative difference in some cases because (a) Jaccard->ANI, (b) using hardcoded thresholds in sourmash (which are dropped for now).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants