compare and visualize the similarity of many genomes using Average Nucleotide Identity (ANI) #16

ctb · 2022-05-14T15:19:26Z

As of v4.4, sourmash can estimate Average Nucleotide Identity (ANI) between genomes: this is the fraction of bases that would be identical in a pairwise alignment. However, sourmash estimates ANI from k-mers rather than from alignments. This is fast and lightweight, and doesn't need access to the full genomes!
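To make the k-mer idea concrete, here is a minimal sketch in plain Python. It is not sourmash's implementation: it uses every k-mer instead of a FracMinHash sketch, ignores reverse complements, and all function names (`kmers`, `containment`, `ani_from_containment`, `mutate`) are made up for illustration. It assumes the point estimate ANI ≈ C^(1/k), where C is the fraction of shared k-mers, which holds under a simple model of independent point mutations.

```python
import random


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, subject):
    """Fraction of the query's k-mers that also occur in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


def ani_from_containment(c, k):
    """Point estimate of ANI from containment (assumed model: ANI ~ c ** (1/k))."""
    return c ** (1.0 / k)


def mutate(seq, rate, rng):
    """Replace each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


# Toy data: a random 20 kbp "genome" and a copy with ~2% substitutions.
rng = random.Random(42)
k = 21
genome = "".join(rng.choice("ACGT") for _ in range(20000))
variant = mutate(genome, 0.02, rng)

c = containment(kmers(genome, k), kmers(variant, k))
print(f"containment={c:.3f}  estimated ANI={ani_from_containment(c, k):.4f}")
```

With about 2% of bases substituted, the estimate should land close to 0.98 even though no alignment is computed.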

Let's use ANI with sourmash compare and sourmash plot to look at the relationship between 12 E. coli genomes.

First, you'll need to download the GTDB genomic-reps database, containing 66k genomes, as in #13.

Then, run compare, selecting just the Escherichia genomes:

sourmash compare --include Escherichia gtdb-rs207.genomic-reps.dna.k31.zip --ani -o ecoli-ani.cmp

This creates a numpy comparison matrix in ecoli-ani.cmp (you can also generate a CSV output with --csv).
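If you would rather inspect the numbers than the plot, the CSV output is easy to read with the Python standard library. The sketch below assumes the CSV has a header row of genome names followed by one row of values per genome, and that the matrix is symmetric; check your own file before relying on that. The function names and the file name `ecoli-ani.csv` are hypothetical.

```python
import csv


def load_compare_csv(path):
    """Read a square matrix: a header row of labels, then N rows of N values."""
    with open(path, newline="") as fp:
        rows = [row for row in csv.reader(fp) if row]  # skip blank lines
    labels = rows[0]
    matrix = [[float(x) for x in row] for row in rows[1:]]
    if len(matrix) != len(labels) or any(len(r) != len(labels) for r in matrix):
        raise ValueError("expected a square matrix matching the header row")
    return labels, matrix


def top_pairs(labels, matrix, n=5):
    """Return the n most similar distinct pairs as (value, label_a, label_b).

    Only the upper triangle is read, so this assumes a symmetric matrix.
    """
    pairs = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    ]
    return sorted(pairs, reverse=True)[:n]
```

For example, after adding `--csv ecoli-ani.csv` to the compare command, `top_pairs(*load_compare_csv("ecoli-ani.csv"))` lists the closest pairs of genomes.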

Now, use plot to quickly visualize the comparison matrix:

sourmash plot ecoli-ani.cmp

This will produce an image, ecoli-ani.cmp.matrix.png.

This plot shows the clade structure of the 12 E. coli genomes, including three that have no detectable ANI similarity to any of the other genomes.

The genome names corresponding to the label numbers can be found in the ecoli-ani.cmp.labels.txt file.

Caveats and details

ANI can be estimated by sourmash from ~85% to 99%, depending on the k-mer size used. See sourmash-bio/sourmash#1859 for some numbers for k=21 and k=31.
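One rough way to see why the usable range depends on k: if each base matches independently with probability equal to the ANI, the expected fraction of shared k-mers is about ANI^k, which shrinks quickly as k grows. The snippet below is back-of-the-envelope arithmetic under that assumed model, not numbers taken from sourmash-bio/sourmash#1859, and `expected_containment` is a hypothetical helper.

```python
def expected_containment(ani, k):
    """Expected fraction of shared k-mers if each base matches with probability ani."""
    return ani ** k


for k in (21, 31):
    for ani in (0.99, 0.95, 0.85):
        c = expected_containment(ani, k)
        print(f"k={k}  ANI={ani:.2f}  expected containment={c:.4f}")
```

At 85% ANI this model leaves roughly 3% of k-mers shared at k=21 and under 1% at k=31, so there is much less signal for a sketch to pick up at the larger k-mer size.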

ANI estimates are somewhat dependent on the software and parameters used to calculate them. We are working on a systematic comparison of sourmash's ANI estimates with other ANI software!

sourmash ANI estimates are only available for scaled signatures (the default, when signatures are generated with sourmash sketch).

In order to accurately estimate ANI, sourmash signatures need to have enough hashes for the calculation; this is dependent on both the size of the genome(s) and the scaled factor used to generate the signatures. sourmash will output warnings to stderr when the sketches are too small to accurately estimate ANI.
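As a rough guide to sketch size, a scaled signature keeps about one hash per `scaled` distinct k-mers, so the expected number of hashes is roughly the genome size divided by the scaled value. The sketch below is only that arithmetic; `expected_hashes` is a hypothetical helper, the genome sizes are illustrative, and it does not reproduce whatever threshold sourmash uses for its warnings.

```python
def expected_hashes(num_kmers, scaled):
    """Approximate number of hashes kept when about 1 in `scaled` k-mers is retained."""
    return num_kmers / scaled


# Illustrative sizes: a ~5 Mbp bacterial genome and a ~10 kbp viral genome.
for name, size in (("5 Mbp genome", 5_000_000), ("10 kbp genome", 10_000)):
    for scaled in (100, 1000, 10000):
        n = expected_hashes(size, scaled)
        print(f"{name}, scaled={scaled}: ~{n:.0f} hashes")
```

A bacterial genome at scaled=1000 yields thousands of hashes, while a small genome at the same setting yields only a handful, which is the situation the warnings are meant to flag.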

Some details on how sourmash estimates ANI are in the sourmash docs.

Citations

The analytical work underlying the ANI calculations is introduced in Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances, Hera et al., 2022.

ctb · 2022-05-14T15:31:32Z

@bluegenes @dkoslicki thoughts, concerns, updates?

dkoslicki · 2022-05-16T03:01:16Z

@ctb I'm looking into some strange discrepancies between the different ways ANI is being calculated (see here). But in spirit, I think this is a great example!

ctb added the sourmash-v4.4 (examples using sourmash v4.4 functionality), ani (using Average Nucleotide Identity (ANI)), and plotting (plots and other output visualizations) labels May 14, 2022

dkoslicki mentioned this issue May 16, 2022

add script that computes and prints the three different ways to make … mahmudhera/phylogenetic-tree-using-fracminhash#2

Merged
