unicity distances for genomes #1

ctb · 2022-11-18T15:28:49Z

Q: How many sourmash hashes does it take to classify a genome uniquely?

If the answer is 1, then there is a single hash that identifies this specific genome! 🎉

If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.

This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!

Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.

GCF_000814905.1 1 434 0.2%
GCA_007116955.1 1 247 0.4%
GCF_017948435.1 1 608 0.2%
GCA_017995835.1 1 317 0.3%
GCF_001981135.1 5 723 0.7%
GCF_900111215.1 1 223 0.4%
GCF_001084125.1 153 204 75.0%
GCF_002224665.1 1 506 0.2%
GCA_900758335.1 1 363 0.3%
GCF_008326505.1 1 202 0.5%
GCF_001204095.1 1 266 0.4%
GCA_018665005.1 1 231 0.4%
GCA_900478005.2 108 405 26.7%
GCF_003812625.1 28 525 5.3%
GCF_002088895.1 3 314 1.0%
GCA_002397945.1 1 122 0.8%
GCA_004116405.1 2 517 0.4%
GCF_003117635.1 2 493 0.4%
GCA_902789845.1 1 210 0.5%
GCF_016628485.1 1 255 0.4%

The text was updated successfully, but these errors were encountered:

taylorreiter · 2022-11-18T15:37:15Z

THIS IS SO COOL.

Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.

Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?

ctb · 2022-11-18T15:42:51Z

THIS IS SO COOL.

:) it does seem to nicely wrap a bunch of concerns up into a single number!

Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.

yeah, exactly. working on it - I'm trying to figure out what a similar number might look like for a taxonomy.

Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?

unicity distance is formally defined as the minimum, so there is no shorter list of k-mers (...see the connection to min-set-cov? 😆 )

there may be multiple collections of k-mers that can be used, tho. maybe a way to phrase that question would be "what number of the hashes in the genome can uniquely identify the genome?" but this only makes sense in the case where the unicity distance is 1 anyway.

right now the easiest/simplest interpretations are when unicity is 1 or infinite - 1 breaks LCA-style approaches for genome identification, infinite breaks gather-style approaches for genome identification.

ctb · 2022-11-18T15:43:24Z

p.s. currently calculating estimated unicity for all genomes in GTDB rs207.

widdowquinn · 2022-11-18T17:39:08Z

That's fascinating!

I see that GCF_001084125.1 is Neisseria gonorrhoeae, which is both an important and interesting case study for taxonomic distinction, and Neisseria are often considered challenging because they're highly recombinogenic, very closely-related to other Neisseria species, and tend to co-inhabit the same body space (the back of your nose, IIRC). I've used these when trying to convince people of the value of ANI in the past… some slides from an old talk…

There may be something more recent than this paper but it's a good place to start on the why of them being a challenging group.

ctb · 2022-11-18T18:30:59Z

very cool - I suspect that this unicity number is going to be a good, simple way to find all the challenging genomes 😆

shokrof · 2022-11-18T20:58:20Z

Thanks for sharing this number for me! I am curious to see if there is a correlation between the unicity number and the genotyping accuracy!

ctb · 2022-11-19T14:10:18Z

First pass x GTDB rs207 - 15.3% of genomes have unicity distance of 1; 29.2% have an infinite unicity distance (k=31, scaled=1000).

Note that the 29.2% is not an estimate, but the 15.3% is a lower bound - there may be more genomes that have a unicity distance of 1.

ctb · 2022-11-19T14:25:38Z

notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb

top 20 species with infinite unicity (k=31, scaled=1000)

	species
s__Staphylococcus aureus	8367
s__Salmonella enterica	7602
s__Escherichia coli	6324
s__Streptococcus pneumoniae	5513
s__Mycobacterium tuberculosis	5485
s__Klebsiella pneumoniae	4428
s__Acinetobacter baumannii	2519
s__Pseudomonas aeruginosa	1873
s__Streptococcus pyogenes	1399
s__Listeria monocytogenes	1207
s__Mycobacterium abscessus	1139
s__Listeria monocytogenes_B	1097
s__Clostridioides difficile	958
s__Burkholderia mallei	937
s__Neisseria meningitidis	893
s__Streptococcus suis	890
s__Wolbachia pipientis	869
s__Pseudomonas_E viridiflava	855
s__Vibrio cholerae	854
s__Enterococcus_B faecium	851

top 20 genera with infinite unicity (k=31, scaled=1000)

	genus
g__Streptococcus	9682
g__Staphylococcus	9476
g__Salmonella	7700
g__Mycobacterium	6875
g__Escherichia	6383
g__Klebsiella	5032
g__Acinetobacter	2772
g__Listeria	2456
g__Pseudomonas	1900
g__Pseudomonas_E	1875
g__Vibrio	1794
g__Burkholderia	1654
g__Neisseria	1397
g__Campylobacter_D	1389
g__Enterococcus_B	1178
g__Clostridioides	960
g__Bordetella	938
g__Francisella	906
g__Wolbachia	905
g__Enterococcus	787

taylorreiter · 2022-11-19T17:35:07Z

oooh very cool. I think this shows either species (or genera) with a ton of genomes in GenBank or GTDB (like e. coli, p. arg, etc.) or genomes with high swappiness (Neisseria, Wolbachia)

bavinatzer · 2022-11-20T12:19:35Z

Q: How many sourmash hashes does it take to classify a genome uniquely?

If the answer is 1, then there is a single hash that identifies this specific genome! 🎉

If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.

This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!

Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.
GCF_000814905.1 1 434 0.2%
GCA_007116955.1 1 247 0.4%
GCF_017948435.1 1 608 0.2%
GCA_017995835.1 1 317 0.3%
GCF_001981135.1 5 723 0.7%
GCF_900111215.1 1 223 0.4%
GCF_001084125.1 153 204 75.0%
GCF_002224665.1 1 506 0.2%
GCA_900758335.1 1 363 0.3%
GCF_008326505.1 1 202 0.5%
GCF_001204095.1 1 266 0.4%
GCA_018665005.1 1 231 0.4%
GCA_900478005.2 108 405 26.7%
GCF_003812625.1 28 525 5.3%
GCF_002088895.1 3 314 1.0%
GCA_002397945.1 1 122 0.8%
GCA_004116405.1 2 517 0.4%
GCF_003117635.1 2 493 0.4%
GCA_902789845.1 1 210 0.5%
GCF_016628485.1 1 255 0.4%

One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.

bavinatzer · 2022-11-20T12:24:22Z

notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb

top 20 species with infinite unicity (k=31, scaled=1000)

species
s__Staphylococcus aureus 8367
s__Salmonella enterica 7602
s__Escherichia coli 6324
s__Streptococcus pneumoniae 5513
s__Mycobacterium tuberculosis 5485
s__Klebsiella pneumoniae 4428
s__Acinetobacter baumannii 2519
s__Pseudomonas aeruginosa 1873
s__Streptococcus pyogenes 1399
s__Listeria monocytogenes 1207
s__Mycobacterium abscessus 1139
s__Listeria monocytogenes_B 1097
s__Clostridioides difficile 958
s__Burkholderia mallei 937
s__Neisseria meningitidis 893
s__Streptococcus suis 890
s__Wolbachia pipientis 869
s__Pseudomonas_E viridiflava 855
s__Vibrio cholerae 854
s__Enterococcus_B faecium 851

top 20 genera with infinite unicity (k=31, scaled=1000)

genus
g__Streptococcus 9682
g__Staphylococcus 9476
g__Salmonella 7700
g__Mycobacterium 6875
g__Escherichia 6383
g__Klebsiella 5032
g__Acinetobacter 2772
g__Listeria 2456
g__Pseudomonas 1900
g__Pseudomonas_E 1875
g__Vibrio 1794
g__Burkholderia 1654
g__Neisseria 1397
g__Campylobacter_D 1389
g__Enterococcus_B 1178
g__Clostridioides 960
g__Bordetella 938
g__Francisella 906
g__Wolbachia 905
g__Enterococcus 787

Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?

ctb · 2022-11-20T14:16:48Z

One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.

My fault - It's an oddity of the way I'm doing the analysis and reporting. If the analysis reaches the end of the hashes without uniquely identifying the genome, it has infinite unicity distance. If it takes 50% of the hashes to get there, that number of hashes is the unicity distance. A unicity distance of 0 isn't possible, a unicity distance of 1 means a single hash has been found that uniquely identifies this genome.

Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?

Here we're dealing only with genomes - not species or genus. Species and genus are taxonomic terms and I haven't figured out how to calculate unicity for them (I'm trying out a different measure in #2).

So, yes, infinite unicity distance means that the genome does not have any single hash or combination of hashes in it that is unique to that genome. This would be a consequence of sourmash's hashing of course - clearly there is at least some difference between the genomes otherwise they wouldn't be different genomes - but sourmash isn't finding anything with k=31 and 1000.

And a unicity distance of 1 means that there is at least one individiual hash (there may be many individual hashes!) that can identify that genome uniquely.

Thanks for the questions!

ctb · 2022-11-20T14:17:05Z

(I'll improve the way I'm talking about this to be more correct ;)

bavinatzer · 2022-11-20T14:28:45Z

Perfect! I got it now! Thank you, Titus!

ctb mentioned this issue Nov 18, 2022

adjust default thresholding for prefetch and gather? sourmash-bio/sourmash#2360

Open

ctb mentioned this issue Nov 19, 2022

taxonomic entropy for k-mers #2

Open

This was referenced Nov 21, 2022

playing around with unicity "groups" - can we localize infinite unicity taxonomically for Staph aureus? #3

Open

explore unicity distance of individual contigs in database dib-lab/charcoal#230

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicity distances for genomes #1

unicity distances for genomes #1

ctb commented Nov 18, 2022 •

edited

Loading

taylorreiter commented Nov 18, 2022

ctb commented Nov 18, 2022

ctb commented Nov 18, 2022

widdowquinn commented Nov 18, 2022 •

edited

Loading

ctb commented Nov 18, 2022

shokrof commented Nov 18, 2022

ctb commented Nov 19, 2022

ctb commented Nov 19, 2022 •

edited

Loading

taylorreiter commented Nov 19, 2022

bavinatzer commented Nov 20, 2022

bavinatzer commented Nov 20, 2022 •

edited

Loading

top 20 species with infinite unicity (k=31, scaled=1000)

top 20 genera with infinite unicity (k=31, scaled=1000)

ctb commented Nov 20, 2022

ctb commented Nov 20, 2022

bavinatzer commented Nov 20, 2022

unicity distances for genomes #1

unicity distances for genomes #1

Comments

ctb commented Nov 18, 2022 • edited Loading

taylorreiter commented Nov 18, 2022

ctb commented Nov 18, 2022

ctb commented Nov 18, 2022

widdowquinn commented Nov 18, 2022 • edited Loading

ctb commented Nov 18, 2022

shokrof commented Nov 18, 2022

ctb commented Nov 19, 2022

ctb commented Nov 19, 2022 • edited Loading

top 20 species with infinite unicity (k=31, scaled=1000)

top 20 genera with infinite unicity (k=31, scaled=1000)

taylorreiter commented Nov 19, 2022

bavinatzer commented Nov 20, 2022

bavinatzer commented Nov 20, 2022 • edited Loading

top 20 species with infinite unicity (k=31, scaled=1000)

top 20 genera with infinite unicity (k=31, scaled=1000)

ctb commented Nov 20, 2022

ctb commented Nov 20, 2022

bavinatzer commented Nov 20, 2022

ctb commented Nov 18, 2022 •

edited

Loading

widdowquinn commented Nov 18, 2022 •

edited

Loading

ctb commented Nov 19, 2022 •

edited

Loading

bavinatzer commented Nov 20, 2022 •

edited

Loading