Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compare and visualize the similarity of many genomes using Average Nucleotide Identity (ANI) #16

Open
ctb opened this issue May 14, 2022 · 2 comments
Labels
ani using Average Nucleotide Identity (ANI) plotting plots and other output visualizations sourmash-v4.4 examples using sourmash v4.4 functionality

Comments

@ctb
Copy link
Contributor

ctb commented May 14, 2022

with sourmash v4.4, sourmash can now estimate Average Nucleotide Identity between genomes: this is the fraction of bases that would be identical in a pairwise alignment. However, sourmash estimates this based on k-mers, instead of using alignments. This is fast and lightweight, and doesn't need access to the full genomes!

Let's use ANI with sourmash compare and sourmash plot to look at the relationship between 12 E. coli genomes.

First, you'll need to download the GTDB genomic-reps database, containing 66k genomes, as in #13.

Then, run compare, selecting just the Escherichia genomes:

sourmash compare --include Escherichia gtdb-rs207.genomic-reps.dna.k31.zip --ani -o ecoli-ani.cmp

This creates a numpy comparison matrix in ecoli-ani.cmp (you can also generate a CSV output with --csv).

Now, use plot to quickly visualize the comparison matrix:

sourmash plot ecoli-ani.cmp

This will produce an image ecoli-ani.cmp/matrix.png:

ecoli-ani cmp matrix

This plot shows the clade structure of the 12 E. coli genomes, including three that have no detectable ANI similarity to any other ANI genomes.

The genome names corresponding to the label numbers can be found in the ecoli-ani.cmp.labels.txt file.

Caveats and details

ANI can be estimated by sourmash from ~85% to 99%, depending on the k-mer size used. See sourmash-bio/sourmash#1859 for some numbers for k=21 and k=31.

ANI estimates are somewhat dependent on the software and parameters used to calculate them. We are working on a systematic comparison of sourmash's ANI estimates with other ANI software!

sourmash ANI estimates are only available for scaled signatures (the default, when signatures are generated with sourmash sketch).

In order to accurately estimate ANI, sourmash signatures need to have enough hashes for the calculation; this is dependent on both the size of the genome(s) and the scaled factor used to generate the signatures. sourmash will output warnings to stderr when the sketches are too small to accurately estimate ANI.

Some details on how sourmash estimates ANI are in the sourmash docs.

Citations

The analytical work underlying the ANI calculations is introduced in Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances, Hera et al., 2022.

@ctb ctb added sourmash-v4.4 examples using sourmash v4.4 functionality ani using Average Nucleotide Identity (ANI) plotting plots and other output visualizations labels May 14, 2022
@ctb
Copy link
Contributor Author

ctb commented May 14, 2022

@bluegenes @dkoslicki thoughts, concerns, updates?

@dkoslicki
Copy link

@ctb I'm looking into some strange discrepancies between the different ways ANI is being calculated (see here). But in spirit, I think this is a great example!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ani using Average Nucleotide Identity (ANI) plotting plots and other output visualizations sourmash-v4.4 examples using sourmash v4.4 functionality
Projects
None yet
Development

No branches or pull requests

2 participants