How does "gather" work? #296
On Mon, Jul 24, 2017 at 09:18:24PM -0700, Adrian Viehweger wrote:
> Looking into the code for `sourmash gather` I built the following intuition about how it works algorithmically, and would like to know if I am correct:
> Given a query and a database, it finds the best match. Then, the corresponding hashes/k-mers are removed from the query and we repeat this until either no query is left or no matches are found.
> Does this mean that if two organisms have more or less identical copies of x % of their genome (such as closely related strains), gather will report only one of the strains and not the other?
Yes, if there are no k-mers that differentiate the two strains at the
calculated scaled value, then only one will be reported. The speed of
gather relies on quickly finding a best match in the SBT (Sequence Bloom
Tree), which is one of the main reasons we report only one. However, it
also kind of makes sense biologically, I think?
Sorry for the long silence - the ANGUS workshops just finished, and now I'm traveling.
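The greedy loop described in this exchange can be sketched in a few lines of Python. This is only a toy illustration, not sourmash's actual implementation: signatures are modelled as plain Python sets of integer hashes, the database is an ordinary dict searched linearly rather than an SBT, and the names `greedy_gather`, `strain_a`, `strain_b`, and `strain_a_twin` are hypothetical. Ties between equally good matches are broken by insertion order here, which is an assumption made purely for the example.

```python
def greedy_gather(query, database):
    """Toy greedy decomposition of a query hash set (not the sourmash API).

    query:    set of integer hashes
    database: dict mapping a reference name to its set of hashes
    Returns a list of (name, number_of_query_hashes_explained) tuples.
    """
    remaining = set(query)
    results = []
    while remaining:
        # Find the reference that shares the most hashes with what is left.
        best_name, best_overlap = None, set()
        for name, hashes in database.items():
            overlap = remaining & hashes
            if len(overlap) > len(best_overlap):  # strict: first match wins ties
                best_name, best_overlap = name, overlap
        if best_name is None:
            break  # nothing in the database matches the leftover hashes
        results.append((best_name, len(best_overlap)))
        remaining -= best_overlap  # remove the explained hashes and repeat
    return results


# Hypothetical toy data: two strains sharing 90 of their 100 hashes.
strain_a = set(range(0, 100))
strain_b = set(range(0, 90)) | set(range(200, 210))

# Both strains are reported, because strain_b has 10 distinguishing hashes.
print(greedy_gather(strain_a | strain_b, {"strain_a": strain_a, "strain_b": strain_b}))
# [('strain_a', 100), ('strain_b', 10)]

# With no distinguishing hashes at this resolution, only one strain is reported.
strain_a_twin = set(strain_a)
print(greedy_gather(strain_a, {"strain_a": strain_a, "strain_a_twin": strain_a_twin}))
# [('strain_a', 100)]
```

In the first case the second strain is still reported, but only for the 10 hashes that the first match did not already explain. In the second case the twin never appears, because nothing is left in the query after the first match is subtracted.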
This is partly addressed in #393, and will be addressed further in documentation for 2.0.
Fixed in #938 and available at https://sourmash.readthedocs.io/en/latest/classifying-signatures.html
Looking into the code for `sourmash gather` I built the following intuition about how it works algorithmically, and would like to know if I am correct:
Given a query and a database, it finds the best match. Then, the corresponding hashes/k-mers are removed from the query and we repeat this until either no query is left or no matches are found.
Does this mean that if two organisms with more or less identical copies of x % of their genome (such as closely related strains) are present in a metagenomic sample, gather will report only one of the strains and not the other?
Thanks.