-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unicity distances for genomes #1
Comments
THIS IS SO COOL. Can you add taxonomy and number of genomes for a given species in GTDB as columns to help contextualize? like i bet unicity is higher for e. coli because we have so many e. coli genomes. Also, can you comment on whether it has to be a specific k-mer or if any of the k-mers will do? or rather, of the number of k-mers in the genome, how many fit the unicity criteria and could be used as the 1 kmer to identify the specific genome? |
:) it does seem to nicely wrap a bunch of concerns up into a single number!
yeah, exactly. working on it - I'm trying to figure out what a similar number might look like for a taxonomy.
unicity distance is formally defined as the minimum, so there is no shorter list of k-mers (...see the connection to min-set-cov? 😆 ) there may be multiple collections of k-mers that can be used, tho. maybe a way to phrase that question would be "what number of the hashes in the genome can uniquely identify the genome?" but this only makes sense in the case where the unicity distance is 1 anyway. right now the easiest/simplest interpretations are when unicity is 1 or infinite - 1 breaks LCA-style approaches for genome identification, infinite breaks gather-style approaches for genome identification. |
p.s. currently calculating estimated unicity for all genomes in GTDB rs207. |
That's fascinating! I see that GCF_001084125.1 is Neisseria gonorrhoeae, which is both an important and interesting case study for taxonomic distinction, and Neisseria are often considered challenging because they're highly recombinogenic, very closely-related to other Neisseria species, and tend to co-inhabit the same body space (the back of your nose, IIRC). I've used these when trying to convince people of the value of ANI in the past… some slides from an old talk… There may be something more recent than this paper but it's a good place to start on the why of them being a challenging group. |
very cool - I suspect that this unicity number is going to be a good, simple way to find all the challenging genomes 😆 |
Thanks for sharing this number for me! I am curious to see if there is a correlation between the unicity number and the genotyping accuracy! |
First pass x GTDB rs207 - 15.3% of genomes have unicity distance of 1; 29.2% have an infinite unicity distance (k=31, scaled=1000). Note that the 29.2% is not an estimate, but the 15.3% is a lower bound - there may be more genomes that have a unicity distance of 1. |
notebook: https://github.com/ctb/2022-sourmash-sens-spec/blob/main/explore-unicity.ipynb top 20 species with infinite unicity (k=31, scaled=1000)
top 20 genera with infinite unicity (k=31, scaled=1000)
|
oooh very cool. I think this shows either species (or genera) with a ton of genomes in GenBank or GTDB (like e. coli, p. arg, etc.) or genomes with high swappiness (Neisseria, Wolbachia) |
One question to help me understand this. You say "100% would mean that it cannot be resolved by sourmash". The way I understand it, it would be the opposite: 0% (that is as I understand it: 0 unique hashes would mean that sourmash cannot identify that genome. Please explain. Thank you. |
Again, to see if I get it right: infinity unicity means that the genome, species, or genus, does not have a single hash that is unique to it (present in the genome, species or genus, but not in any other genome). And, if the unicity distance of a species or genus were to be one, does that mean that this one hash is present in every single genome of the species or genus? |
My fault - It's an oddity of the way I'm doing the analysis and reporting. If the analysis reaches the end of the hashes without uniquely identifying the genome, it has infinite unicity distance. If it takes 50% of the hashes to get there, that number of hashes is the unicity distance. A unicity distance of 0 isn't possible, a unicity distance of 1 means a single hash has been found that uniquely identifies this genome.
Here we're dealing only with genomes - not species or genus. Species and genus are taxonomic terms and I haven't figured out how to calculate unicity for them (I'm trying out a different measure in #2). So, yes, infinite unicity distance means that the genome does not have any single hash or combination of hashes in it that is unique to that genome. This would be a consequence of sourmash's hashing of course - clearly there is at least some difference between the genomes otherwise they wouldn't be different genomes - but sourmash isn't finding anything with k=31 and 1000. And a unicity distance of 1 means that there is at least one individiual hash (there may be many individual hashes!) that can identify that genome uniquely. Thanks for the questions! |
(I'll improve the way I'm talking about this to be more correct ;) |
Perfect! I got it now! Thank you, Titus! |
Q: How many sourmash hashes does it take to classify a genome uniquely?
If the answer is 1, then there is a single hash that identifies this specific genome! 🎉
If the answer is “you cannot”, then you are looking at genomes that cannot be resolved with sourmash gather.
This number seems to be something called the unicity distance, and it is relatively straightforward to calculate!
Here are some unicity distances for a few GTDB genome sketches - the first column is the identifier, the second is the unicity distance, the third is the number of hashes in the sketch, and the fourth is the percentage of hashes that is the unicity distance; 100% would mean that it cannot be resolved by sourmash.
The text was updated successfully, but these errors were encountered: