FracMinHash containment to ANI conversion #1859

bluegenes · 2022-03-02T18:47:37Z

ref ANI estimation PR #1788

I've been using our forthcoming ANI utilities to estimate pairwise ANI between GTDB genomes. From these data, we can examine the average containment --> ANI relationship for a given k-mer length. Note that because the number of unique k-mers in each comparison also affects the ANI estimate, we do not expect a single ANI value per containment value.
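For rough intuition only, here is a minimal sketch of the simplest containment --> ANI point estimate I know of: under a simple independent-mutation model, a k-mer survives with probability ANI^k, so ANI ≈ containment^(1/k). This is an assumption for illustration, not necessarily what the ANI utilities do; in particular it ignores the number of unique k-mers, so it gives exactly one ANI per containment value. The function name is made up.

```python
def containment_to_ani_point_estimate(containment, ksize):
    """Simplified point estimate: ANI ~= containment ** (1 / ksize).

    Assumes each base mutates independently, so a k-mer is shared
    with probability ANI ** ksize. Ignores the number of unique k-mers.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)


if __name__ == "__main__":
    for c in (0.05, 0.5, 0.95):
        print(c, round(containment_to_ani_point_estimate(c, 21), 4))
```

With this model, the same containment maps to a higher ANI at k=31 than at k=21, since longer k-mers are more easily broken by a single mutation.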

I've currently run family-level comparisons using k=21, scaled=1000. I'm using the average of the directional containment values ("average containment") to estimate ANI. Average containment for these comparisons ranges from 0-1, and ANI estimates range from 80%-100%.
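To make "average containment" concrete, here is a small sketch that works on plain Python sets of hashes. The function names are hypothetical and this does not use the sourmash API; it only shows the arithmetic of averaging the two directional containment values.

```python
def containment(query_hashes, subject_hashes):
    """Fraction of the query's hashes that are also found in the subject."""
    query_hashes = set(query_hashes)
    if not query_hashes:
        return 0.0
    return len(query_hashes & set(subject_hashes)) / len(query_hashes)


def average_containment(a_hashes, b_hashes):
    """Mean of the two directional containments, C(a in b) and C(b in a)."""
    return (containment(a_hashes, b_hashes) + containment(b_hashes, a_hashes)) / 2


if __name__ == "__main__":
    a = {1, 2, 3, 4}
    b = {3, 4, 5, 6, 7, 8, 9, 10}
    # C(a in b) = 2/4, C(b in a) = 2/8
    print(average_containment(a, b))
```

Unlike the directional values, the average is symmetric: swapping the two genomes gives the same number.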

@ctb suggested binning containment values so we can develop a feel for containment --> ANI. Here I've binned containment by 0.05 intervals (containment ranges from 0-1).

Note: count is the total number of pairwise genome comparisons in each bin.

csv version of this table attached.
mean-containment-k21-to-ANI.csv
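For anyone who wants to reproduce the binning on their own comparisons, here is a minimal sketch. It assumes bins are closed on the right, i.e. (low, high], and that each comparison is an (average containment, ANI) pair; the function names and the sample values are made up for illustration and are not taken from the attached CSV.

```python
import math
from collections import defaultdict

BIN_WIDTH = 0.05


def bin_upper_bound(containment, width=BIN_WIDTH):
    """Right edge of the (low, high] bin that holds this containment."""
    # Round the ratio first to absorb float noise before taking the ceiling.
    idx = max(1, math.ceil(round(containment / width, 9)))
    return round(idx * width, 2)


def summarize(pairs, width=BIN_WIDTH):
    """Group (containment, ani) pairs by bin; report count and minimum ANI."""
    bins = defaultdict(list)
    for containment, ani in pairs:
        bins[bin_upper_bound(containment, width)].append(ani)
    return {
        high: {"count": len(anis), "minANI": min(anis)}
        for high, anis in sorted(bins.items())
    }


if __name__ == "__main__":
    # Invented example values, not real results.
    example = [(0.47, 0.95), (0.50, 0.96), (0.03, 0.81)]
    for high, stats in summarize(example).items():
        print(high, stats)
```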

ctb · 2022-03-02T20:04:03Z

and here's some code:

import csv

class ContainmentToANI_Converter:
    """Look up a minimum ANI for a containment value from a binned CSV table."""

    def __init__(self, tablefile):
        # Each row needs 'highContainment' (right edge of the bin) and 'minANI'.
        with open(tablefile, newline="") as fp:
            r = csv.DictReader(fp)
            vals = []
            for row in r:
                highbound = float(row['highContainment'])
                minANI = float(row['minANI'])
                vals.append((highbound, minANI))
            self.vals = vals

    def convert_containment(self, c):
        # Find the largest bin edge that is <= c and return its minANI.
        # Returns 0 if c is below every bin edge.
        biggest_cont = 0
        biggest_ani = 0
        for cont, ani in self.vals:
            if c >= cont and cont > biggest_cont:
                biggest_cont = cont
                biggest_ani = ani
        return biggest_ani

It relies on having a column highContainment that contains the right side of the interval in @bluegenes CSV file, so you'll need to create that (or we can beg @bluegenes to make it for us with her next export :).

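Example usage, with the k=21 table loaded:

x = ContainmentToANI_Converter("mean-containment-k21-to-ANI.csv")
assert x.convert_containment(0.5) == 0.95
assert x.convert_containment(0.05) == 0.8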
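If you want to try the lookup without the attached file, here is a self-contained sketch that uses the same rule (largest highContainment that is <= c, else 0) but with `bisect` on a sorted table. The in-memory CSV is a stand-in: the 0.05 and 0.50 rows mirror the two asserts above, the 1.00 row is invented, and `load_table` and `lookup` are hypothetical names.

```python
import bisect
import csv
import io

# Stand-in table for illustration; NOT the contents of the attached CSV.
EXAMPLE_CSV = """highContainment,minANI
0.05,0.80
0.50,0.95
1.00,0.99
"""


def load_table(fp):
    """Read (highContainment, minANI) rows and return them as sorted lists."""
    rows = sorted(
        (float(row["highContainment"]), float(row["minANI"]))
        for row in csv.DictReader(fp)
    )
    bounds = [high for high, _ in rows]
    anis = [ani for _, ani in rows]
    return bounds, anis


def lookup(bounds, anis, c):
    """minANI for the largest bound <= c, or 0 if c is below every bound."""
    idx = bisect.bisect_right(bounds, c) - 1
    return anis[idx] if idx >= 0 else 0


if __name__ == "__main__":
    bounds, anis = load_table(io.StringIO(EXAMPLE_CSV))
    for c in (0.01, 0.05, 0.47, 0.5):
        print(c, lookup(bounds, anis, c))
```

Note that with this rule a value that falls between two bounds (e.g. 0.47 here) gets the minANI of the bound below it, so the answer errs on the low side.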

ctb · 2022-03-02T20:45:18Z

(here's the file for k=21 suitably modified for the code above)
mean-containment-k21-to-ANI.csv

bluegenes · 2022-03-08T01:21:39Z

k31, scaled 1000

These are results from GTDB representatives vs. all of GTDB. The counts are a little misleading, as there are certainly some duplicates (sigA --> sigB and sigB --> sigA are currently counted independently; will fix later). This does not affect binning, since we're just binning by 0.05 containment increments.
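One way the duplicate counting could be fixed is to key each comparison on the unordered pair of names. This is only a sketch under the assumption that each comparison is a (name_a, name_b, average_containment) tuple; the function name is hypothetical. Because average containment is symmetric, it keeps whichever direction it sees first.

```python
def dedupe_pairs(comparisons):
    """Keep one record per unordered pair of signature names."""
    seen = set()
    unique = []
    for name_a, name_b, value in comparisons:
        # Sort the names so (A, B) and (B, A) share the same key.
        key = tuple(sorted((name_a, name_b)))
        if key in seen:
            continue
        seen.add(key)
        unique.append((name_a, name_b, value))
    return unique


if __name__ == "__main__":
    # Invented example records.
    records = [("sigA", "sigB", 0.42), ("sigB", "sigA", 0.42), ("sigA", "sigC", 0.10)]
    print(dedupe_pairs(records))
```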

Note that k21 ANI values are closer to mapping-based ANI; k31 is a bit more sensitive.

mean-containment-k31-to-ANI.csv

ctb · 2022-05-03T13:19:44Z

can / should this issue be closed? @bluegenes

ctb changed the title ~~FracMinHash containment to ANI~~ FracMinHash containment to ANI conversion Mar 3, 2022

This was referenced May 14, 2022

update docs with more about ANI #2052

Open

compare and visualize the similarity of many genomes using Average Nucleotide Identity (ANI) sourmash-bio/sourmash-examples#16

Open

ctb mentioned this issue Sep 2, 2022

add 2022 JGI Petabyte Scale Sequence Search workshop slideshow into docs #2252

Open

ctb added the faq things to add to an FAQ or docs label Jan 28, 2024
