Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FracMinHash containment to ANI conversion #1859

Open
bluegenes opened this issue Mar 2, 2022 · 4 comments
Open

FracMinHash containment to ANI conversion #1859

bluegenes opened this issue Mar 2, 2022 · 4 comments
Labels
faq things to add to an FAQ or docs

Comments

@bluegenes
Copy link
Contributor

bluegenes commented Mar 2, 2022

ref ANI estimation PR #1788

I've been using our forthcoming ANI utilities to estimate pairwise ANI between GTDB genomes. From these data, we can examine the average containment --> ANI relationship for a given kmer length. Note that since the number of unique k-mers in each comparison also impacts the ANI estimate, so we do not expect a single ANI value per containment value.

I've currently run family-level comparisons using k=21, scaled=1000. I'm using the average of the directional containment values ("average containment") to estimate ANI. Average containment for these comparisons ranges from 0-1, and ANI estimates range from 80%-100%.

image

@ctb suggested binning containment values so we can develop a feel for containment --> ANI. Here I've binned containment by 0.05 intervals (containment ranges from 0-1).

Note: count is the total number of pairwise genome comparisons in each bin.
image

csv version of this table attached.
mean-containment-k21-to-ANI.csv

@ctb
Copy link
Contributor

ctb commented Mar 2, 2022

and here's some code:

import csv

class ContainmentToANI_Converter:
    def __init__(self, tablefile):
        with open(tablefile, newline="") as fp:
            r = csv.DictReader(fp)
            vals = []
            for row in r:
                highbound = float(row['highContainment'])
                minANI = float(row['minANI'])
                vals.append((highbound, minANI))
            self.vals = vals

    def convert_containment(self, c):
        biggest_cont = 0
        biggest_ani = 0
        for cont, ani in self.vals:
            if c >= cont and cont > biggest_cont:
                print(c, cont, biggest_cont, biggest_ani)
                biggest_cont = cont
                biggest_ani = ani
        return biggest_ani

It relies on having a column highContainment that contains the right side of the interval in @bluegenes CSV file, so you'll need to create that (or we can beg @bluegenes to make it for us with her next export :).

Outputs:

assert x.convert_containment(0.5) == 0.95
assert x.convert_containment(0.05) == 0.8

@ctb
Copy link
Contributor

ctb commented Mar 2, 2022

(here's the file for k=21 suitably modified for the code above)
mean-containment-k21-to-ANI.csv

@ctb ctb changed the title FracMinHash containment to ANI FracMinHash containment to ANI conversion Mar 3, 2022
@bluegenes
Copy link
Contributor Author

bluegenes commented Mar 8, 2022

k31, scaled 1000

These are results from gtdb representatives vs. all of gtdb. The counts are a little misleading, as there are certainly some duplicates (sigA --> sigB and sigB --> sig A are currently counted independently, will fix later). This does not affect binning, since we're just binning by 0.5 containment increments.

Note that k21 ANI values are closer to mapping-based ANI; k31 is a bit more sensitive.

gtdb-rs202 nucleotide-k31-scaled1000 containmentANI

image

mean-containment-k31-to-ANI.csv

@ctb
Copy link
Contributor

ctb commented May 3, 2022

can / should this issue be closed? @bluegenes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
faq things to add to an FAQ or docs
Projects
None yet
Development

No branches or pull requests

2 participants