[MRG] add some docs on search, gather, and lca methods #393
Merged
11 commits
ed02147 Bump version and add note pointing to stable docs (luizirber)
9ff705c add some docs on search, gather, and lca (ctb)
e5e3c9f updated some text (ctb)
65893b2 Fix formatting error (luizirber)
d376a07 update sourmash lca output (ctb)
acf1107 upgrade the documentation a bunch (ctb)
0885b57 Merge branch 'docs/improvements' of github.com:dib-lab/sourmash into … (ctb)
5de8255 cleanup foo (ctb)
b10b490 address @taylorreiter comments (ctb)
be3660d link to LCA databases (ctb)
131e3b4 provide construction details (ctb)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# Classifying signatures: `search`, `gather`, and `lca` methods | ||
|
||
sourmash provides several techniques for classifying signatures and | ||
breaking them down into their components. | ||
|
||
## Searching for similar samples with `search` | ||
|
||
The `sourmash search` command is most useful when you are looking for | ||
high-similarity matches to other signatures; this is the most basic use | ||
case for MinHash searching. The command takes a query signature and one | ||
or more signatures to search against, and reports every match above a | ||
given threshold. | ||
|
||
By default, `search` finds matches with high [*Jaccard | ||
similarity*](https://en.wikipedia.org/wiki/Jaccard_index), which | ||
considers all of the k-mers in the union of the two samples. | ||
In practice, this means that you will only find a match if there is | ||
both high overlap between the samples *and* relatively few k-mers that | ||
are unique to one sample or the other. This is effective for finding genomes | ||
or transcriptomes that are similar, but it rarely works well for samples | ||
of vastly different sizes. | ||
|
||
One useful modification to `search` is to calculate containment with | ||
`--containment` instead of the (default) similarity; this finds | ||
matches where the query is contained within the subject, even if the | ||
subject has many other k-mers in it. For example, if you are using | ||
a plasmid as a query, you would use `--containment` to find genomes | ||
that contain that plasmid. | ||
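
To make the difference between similarity and containment concrete, here is a minimal sketch using plain Python sets of integers as stand-ins for hashed k-mers. This is an illustration only, not sourmash's implementation: sourmash estimates these quantities from MinHash sketches, and the function names `jaccard` and `containment` below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: shared items divided by items in the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, subject):
    """Fraction of the query's items that are also found in the subject."""
    query, subject = set(query), set(subject)
    return len(query & subject) / len(query) if query else 0.0


# Toy data (made up): a small "plasmid" fully contained in a larger "genome".
plasmid = set(range(10))
genome = set(range(100))

plasmid_vs_genome_jaccard = jaccard(plasmid, genome)
plasmid_in_genome = containment(plasmid, genome)
genome_in_plasmid = containment(genome, plasmid)
```

With this toy data the Jaccard similarity is low because the union is dominated by the genome, while the containment of the plasmid in the genome is complete. Note that containment is asymmetric: swapping query and subject gives a different answer.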
|
||
See [the main sourmash | ||
tutorial](http://sourmash.readthedocs.io/en/latest/tutorials.html#make-and-search-a-database-quickly) | ||
for information on using `search` with and without `--containment`. | ||
|
||
## Breaking down metagenomic samples with `gather` and `lca` | ||
|
||
Neither search option (similarity or containment) is effective when | ||
comparing or searching with metagenomes, which typically have a | ||
mixture of many different genomes. While you might use containment to | ||
see if a query genome is present in one or more metagenomes, a common | ||
question to ask is the reverse: **what genomes are in my metagenome?** | ||
|
||
We have implemented two algorithms in sourmash to do this. | ||
|
||
One algorithm uses taxonomic information from, e.g., GenBank to classify | ||
individual k-mers, and then infers the taxonomic distribution of | ||
metagenome contents from the presence of these individual | ||
k-mers. (This is the approach pioneered by | ||
[Kraken](https://ccb.jhu.edu/software/kraken/) and used by many other tools.) | ||
`sourmash lca` can be used to classify individual genome bins with | ||
`classify`, or to summarize metagenome taxonomy with `summarize`. The | ||
[sourmash lca tutorial](http://sourmash.readthedocs.io/en/latest/tutorials-lca.html) | ||
shows how to use the `lca classify` and `lca summarize` commands, and also | ||
provides guidance on building your own database. | ||
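
The k-mer classification idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the code behind `sourmash lca`: integers stand in for hashed k-mers, lineages are tuples of names, the example genomes and their hashes are made up, and the helper names (`common_prefix`, `build_lca_db`, `summarize`) are hypothetical.

```python
from collections import Counter


def common_prefix(lineages):
    """Lowest common ancestor of several lineages: their shared leading ranks."""
    prefix = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return tuple(prefix)


def build_lca_db(genomes):
    """Map each hash to the LCA of every lineage in which it occurs.

    genomes: iterable of (lineage_tuple, set_of_hashes) pairs.
    """
    seen = {}
    for lineage, hashes in genomes:
        for h in hashes:
            seen.setdefault(h, []).append(lineage)
    return {h: common_prefix(ls) for h, ls in seen.items()}


def summarize(query_hashes, lca_db):
    """Count query hashes at each lineage, propagating counts up to parents."""
    counts = Counter()
    for h in query_hashes:
        lineage = lca_db.get(h)
        if lineage is None:
            continue  # hash not in the database: no taxonomic information
        for i in range(1, len(lineage) + 1):
            counts[lineage[:i]] += 1
    return counts


# Toy database (made-up hashes): hash 3 is shared by all three genomes.
genomes = [
    (("Bacteria", "Proteobacteria", "Escherichia coli"), {1, 2, 3}),
    (("Bacteria", "Proteobacteria", "Shewanella baltica"), {3, 4}),
    (("Bacteria", "Firmicutes", "Bacillus subtilis"), {3, 5}),
]
lca_db = build_lca_db(genomes)
counts = summarize({1, 3, 4, 99}, lca_db)
```

In this sketch, a hash found in only one genome keeps that genome's full lineage, while a hash shared across genomes is only informative down to the rank where their lineages diverge. Hashes absent from the database (99 here) contribute nothing.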
|
||
The other approach, `gather`, breaks a metagenome down into individual | ||
genomes based on greedy partitioning. Essentially, it takes a query | ||
metagenome and searches the database for the most highly contained | ||
genome; it then subtracts that match from the metagenome, and repeats. | ||
At the end it reports how much of the metagenome remains unknown. The | ||
[basic sourmash | ||
tutorial](http://sourmash.readthedocs.io/en/latest/tutorials.html#what-s-in-my-metagenome) | ||
has some sample output from using `gather` with GenBank. | ||
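
The greedy loop described above can be sketched as follows. This is a toy version under stated assumptions, not sourmash's implementation: sets of integers stand in for signatures, the "best" match in each round is simply the database entry sharing the most hashes with what remains of the query, and the function name `greedy_gather` and the `min_overlap` parameter are hypothetical.

```python
def greedy_gather(query, database, min_overlap=1):
    """Greedily decompose a query into database matches.

    query: set of hashes; database: dict mapping name -> set of hashes.
    Returns (matches, unknown_fraction), where matches is a list of
    (name, fraction_of_original_query_explained) in the order found.
    """
    query = set(query)
    remaining = set(query)
    matches = []
    while remaining:
        best_name, best_overlap = None, set()
        for name, hashes in database.items():
            overlap = remaining & hashes
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        if not best_overlap or len(best_overlap) < min_overlap:
            break  # nothing left in the database explains the remainder
        matches.append((best_name, len(best_overlap) / len(query)))
        remaining -= best_overlap  # subtract the match, then repeat
    unknown = len(remaining) / len(query) if query else 0.0
    return matches, unknown


# Toy metagenome (made-up hashes) and a three-genome database.
metagenome = set(range(100))
database = {
    "genomeA": set(range(0, 60)),
    "genomeB": set(range(40, 80)),
    "genomeC": set(range(200, 300)),
}
matches, unknown = greedy_gather(metagenome, database)
```

Because each match is subtracted before the next round, the hashes that `genomeB` shares with `genomeA` are only credited once, to `genomeA`. Whatever is left after the loop ends is the unknown fraction.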
|
||
Our preliminary benchmarking suggests that `gather` is the most accurate | ||
method available for doing strain-level resolution of genomes. More on that | ||
as we move forward! | ||
|
||
## To do taxonomy, or not to do taxonomy? | ||
|
||
By default, there is no structured taxonomic information available in | ||
sourmash signatures or SBT databases of signatures. In practice, | ||
this means that with the non-`lca` commands you will have to provide your | ||
own mapping from a match to some taxonomic hierarchy. This is typically | ||
the situation when you are working with many genomes that have no | ||
taxonomic information. | ||
|
||
The `lca` subcommands, however, work with LCA databases, which contain | ||
taxonomic information by construction. This is one of the main | ||
differences between the `sourmash lca` subcommands and the basic | ||
`sourmash search` functionality. As a result, the `lca` subcommands generally | ||
output structured taxonomic information, and they are what you should use | ||
if you are interested in doing classification. | ||
|
||
It's important to note that taxonomy based on k-mers is very, very | ||
specific: if you get a match, it's pretty reliable. Conversely, | ||
k-mer identification is very brittle with respect | ||
to evolutionary divergence, so if you don't get a match, it may only mean | ||
that the particular species isn't known. | ||
|
||
## What commands should I use? | ||
|
||
It's not always easy to figure that out, we know! We're thinking about | ||
better tutorials and documentation constantly. | ||
|
||
We suggest the following approach: | ||
|
||
* build some signatures and do some searches, to get some basic familiarity | ||
with sourmash; | ||
|
||
* explore the available databases; | ||
|
||
* then ask questions [via the issue tracker](https://github.com/dib-lab/sourmash/issues) and we will do our best to help you out! | ||
|
||
This helps us figure out what people are actually interested in doing, and | ||
any help we provide via the issue tracker will eventually be added into the | ||
documentation. |
Does this refer to the output of gather? (Line 72-74)
No, it refers to all of the non-LCA stuff. This is just going to have to be confusing, given the addition of lca gather :).