Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicity distances for genomes #1

Open
ctb opened this issue Nov 18, 2022 · 14 comments
Open

unicity distances for genomes #1

ctb opened this issue Nov 18, 2022 · 14 comments

Comments

@ctb
Copy link
Owner

ctb commented Nov 18, 2022

Q: How many sourmash hashes does it take to classify a genome uniquely?

If the answer is 1, then there is a single hash that identifies this specific genome! 🎉

If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.

This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!

Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.

GCF_000814905.1 1 434 0.2%
GCA_007116955.1 1 247 0.4%
GCF_017948435.1 1 608 0.2%
GCA_017995835.1 1 317 0.3%
GCF_001981135.1 5 723 0.7%
GCF_900111215.1 1 223 0.4%
GCF_001084125.1 153 204 75.0%
GCF_002224665.1 1 506 0.2%
GCA_900758335.1 1 363 0.3%
GCF_008326505.1 1 202 0.5%
GCF_001204095.1 1 266 0.4%
GCA_018665005.1 1 231 0.4%
GCA_900478005.2 108 405 26.7%
GCF_003812625.1 28 525 5.3%
GCF_002088895.1 3 314 1.0%
GCA_002397945.1 1 122 0.8%
GCA_004116405.1 2 517 0.4%
GCF_003117635.1 2 493 0.4%
GCA_902789845.1 1 210 0.5%
GCF_016628485.1 1 255 0.4%
@taylorreiter
Copy link

THIS IS SO COOL.

Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.

Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?

@ctb
Copy link
Owner Author

ctb commented Nov 18, 2022

THIS IS SO COOL.

:) it does seem to nicely wrap a bunch of concerns up into a single number!

Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes.

yeah, exactly. working on it - I'm trying to figure out what a similar number might look like for a taxonomy.

Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome?

unicity distance is formally defined as the minimum, so there is no shorter list of k-mers (...see the connection to min-set-cov? 😆 )

there may be multiple collections of k-mers that can be used, tho. maybe a way to phrase that question would be "what number of the hashes in the genome can uniquely identify the genome?" but this only makes sense in the case where the unicity distance is 1 anyway.

right now the easiest/simplest interpretations are when unicity is 1 or infinite - 1 breaks LCA-style approaches for genome identification, infinite breaks gather-style approaches for genome identification.

@ctb
Copy link
Owner Author

ctb commented Nov 18, 2022

p.s. currently calculating estimated unicity for all genomes in GTDB rs207.

@widdowquinn
Copy link

widdowquinn commented Nov 18, 2022

That's fascinating!

I see that GCF_001084125.1 is Neisseria gonorrhoeae, which is both an important and interesting case study for taxonomic distinction, and Neisseria are often considered challenging because they're highly recombinogenic, very closely-related to other Neisseria species, and tend to co-inhabit the same body space (the back of your nose, IIRC). I've used these when trying to convince people of the value of ANI in the past… some slides from an old talk…

Screenshot 2022-11-18 at 17 35 06

Screenshot 2022-11-18 at 17 35 22

Screenshot 2022-11-18 at 17 35 36

There may be something more recent than this paper but it's a good place to start on the why of them being a challenging group.

@ctb
Copy link
Owner Author

ctb commented Nov 18, 2022

very cool - I suspect that this unicity number is going to be a good, simple way to find all the challenging genomes 😆

@shokrof
Copy link

shokrof commented Nov 18, 2022

Thanks for sharing this number for me! I am curious to see if there is a correlation between the unicity number and the genotyping accuracy!

@ctb
Copy link
Owner Author

ctb commented Nov 19, 2022

First pass x GTDB rs207 - 15.3% of genomes have unicity distance of 1; 29.2% have an infinite unicity distance (k=31, scaled=1000).

Note that the 29.2% is not an estimate, but the 15.3% is a lower bound - there may be more genomes that have a unicity distance of 1.

@ctb
Copy link
Owner Author

ctb commented Nov 19, 2022

notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb

top 20 species with infinite unicity (k=31, scaled=1000)

species
s__Staphylococcus aureus 8367
s__Salmonella enterica 7602
s__Escherichia coli 6324
s__Streptococcus pneumoniae 5513
s__Mycobacterium tuberculosis 5485
s__Klebsiella pneumoniae 4428
s__Acinetobacter baumannii 2519
s__Pseudomonas aeruginosa 1873
s__Streptococcus pyogenes 1399
s__Listeria monocytogenes 1207
s__Mycobacterium abscessus 1139
s__Listeria monocytogenes_B 1097
s__Clostridioides difficile 958
s__Burkholderia mallei 937
s__Neisseria meningitidis 893
s__Streptococcus suis 890
s__Wolbachia pipientis 869
s__Pseudomonas_E viridiflava 855
s__Vibrio cholerae 854
s__Enterococcus_B faecium 851

top 20 genera with infinite unicity (k=31, scaled=1000)

genus
g__Streptococcus 9682
g__Staphylococcus 9476
g__Salmonella 7700
g__Mycobacterium 6875
g__Escherichia 6383
g__Klebsiella 5032
g__Acinetobacter 2772
g__Listeria 2456
g__Pseudomonas 1900
g__Pseudomonas_E 1875
g__Vibrio 1794
g__Burkholderia 1654
g__Neisseria 1397
g__Campylobacter_D 1389
g__Enterococcus_B 1178
g__Clostridioides 960
g__Bordetella 938
g__Francisella 906
g__Wolbachia 905
g__Enterococcus 787

@taylorreiter
Copy link

oooh very cool. I think this shows either species (or genera) with a ton of genomes in GenBank or GTDB (like e. coli, p. arg, etc.) or genomes with high swappiness (Neisseria, Wolbachia)

@bavinatzer
Copy link

Q: How many sourmash hashes does it take to classify a genome uniquely?

If the answer is 1, then there is a single hash that identifies this specific genome! 🎉

If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.

This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!

Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.

GCF_000814905.1 1 434 0.2%
GCA_007116955.1 1 247 0.4%
GCF_017948435.1 1 608 0.2%
GCA_017995835.1 1 317 0.3%
GCF_001981135.1 5 723 0.7%
GCF_900111215.1 1 223 0.4%
GCF_001084125.1 153 204 75.0%
GCF_002224665.1 1 506 0.2%
GCA_900758335.1 1 363 0.3%
GCF_008326505.1 1 202 0.5%
GCF_001204095.1 1 266 0.4%
GCA_018665005.1 1 231 0.4%
GCA_900478005.2 108 405 26.7%
GCF_003812625.1 28 525 5.3%
GCF_002088895.1 3 314 1.0%
GCA_002397945.1 1 122 0.8%
GCA_004116405.1 2 517 0.4%
GCF_003117635.1 2 493 0.4%
GCA_902789845.1 1 210 0.5%
GCF_016628485.1 1 255 0.4%

One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.

@bavinatzer
Copy link

bavinatzer commented Nov 20, 2022

notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb

top 20 species with infinite unicity (k=31, scaled=1000)

species
s__Staphylococcus aureus 8367
s__Salmonella enterica 7602
s__Escherichia coli 6324
s__Streptococcus pneumoniae 5513
s__Mycobacterium tuberculosis 5485
s__Klebsiella pneumoniae 4428
s__Acinetobacter baumannii 2519
s__Pseudomonas aeruginosa 1873
s__Streptococcus pyogenes 1399
s__Listeria monocytogenes 1207
s__Mycobacterium abscessus 1139
s__Listeria monocytogenes_B 1097
s__Clostridioides difficile 958
s__Burkholderia mallei 937
s__Neisseria meningitidis 893
s__Streptococcus suis 890
s__Wolbachia pipientis 869
s__Pseudomonas_E viridiflava 855
s__Vibrio cholerae 854
s__Enterococcus_B faecium 851

top 20 genera with infinite unicity (k=31, scaled=1000)

genus
g__Streptococcus 9682
g__Staphylococcus 9476
g__Salmonella 7700
g__Mycobacterium 6875
g__Escherichia 6383
g__Klebsiella 5032
g__Acinetobacter 2772
g__Listeria 2456
g__Pseudomonas 1900
g__Pseudomonas_E 1875
g__Vibrio 1794
g__Burkholderia 1654
g__Neisseria 1397
g__Campylobacter_D 1389
g__Enterococcus_B 1178
g__Clostridioides 960
g__Bordetella 938
g__Francisella 906
g__Wolbachia 905
g__Enterococcus 787

Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?

@ctb
Copy link
Owner Author

ctb commented Nov 20, 2022

One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you.

My fault - It's an oddity of the way I'm doing the analysis and reporting. If the analysis reaches the end of the hashes without uniquely identifying the genome, it has infinite unicity distance. If it takes 50% of the hashes to get there, that number of hashes is the unicity distance. A unicity distance of 0 isn't possible, a unicity distance of 1 means a single hash has been found that uniquely identifies this genome.

Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus?

Here we're dealing only with genomes - not species or genus. Species and genus are taxonomic terms and I haven't figured out how to calculate unicity for them (I'm trying out a different measure in #2).

So, yes, infinite unicity distance means that the genome does not have any single hash or combination of hashes in it that is unique to that genome. This would be a consequence of sourmash's hashing of course - clearly there is at least some difference between the genomes otherwise they wouldn't be different genomes - but sourmash isn't finding anything with k=31 and 1000.

And a unicity distance of 1 means that there is at least one individiual hash (there may be many individual hashes!) that can identify that genome uniquely.

Thanks for the questions!

@ctb
Copy link
Owner Author

ctb commented Nov 20, 2022

(I'll improve the way I'm talking about this to be more correct ;)

@bavinatzer
Copy link

Perfect! I got it now! Thank you, Titus!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants