From ed0214719390c86d2645146356e7d7091503fb20 Mon Sep 17 00:00:00 2001 From: Luiz Irber Date: Tue, 31 Oct 2017 10:42:29 -0700 Subject: [PATCH 01/10] Bump version and add note pointing to stable docs --- doc/conf.py | 4 ++-- doc/index.rst | 6 ++++++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/doc/conf.py b/doc/conf.py index d907ae5c00..bb9df84b3c 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -61,9 +61,9 @@ # built documents. # # The short X.Y version. -version = '1.0' +version = '2.0' # The full version, including alpha/beta/rc tags. -release = '1.0' +release = '2.0.0a1' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. diff --git a/doc/index.rst b/doc/index.rst index deab315f75..9a72e3082c 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -3,6 +3,12 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. +.. note:: + We are working on releasing sourmash 2.0, + and this documentation reflects changes not released officially yet. + If you want to see the documentation for sourmash 1.0 check + the `stable version `__. + Welcome to sourmash! ==================== From 9ff705ca547067c1474e5f82a1e685e0f5ed2c18 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 11 Feb 2018 15:24:15 -0800 Subject: [PATCH 02/10] add some docs on search, gather, and lca --- doc/classifying-signatures.md | 101 ++++++++++++++++++++++++++++++++++ doc/index.rst | 1 + 2 files changed, 102 insertions(+) create mode 100644 doc/classifying-signatures.md diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md new file mode 100644 index 0000000000..e6563a6bde --- /dev/null +++ b/doc/classifying-signatures.md @@ -0,0 +1,101 @@ +# Classifying signatures: `search`, `gather`, and `lca` methods. + +sourmash provides several different techniques for doing +classification and breakdown of signatures. + +## Searching for similar samples with `search`. + +The `sourmash search` command is most useful when you are looking for +high similarity matches to other signatures; this is the most basic use +case for MinHash searching. The command takes a query signature and one +or more search signatures, and finds all the matches it can above a particular +threshold. + +By default `search` will find matches with high [*Jaccard +similarity*](https://en.wikipedia.org/wiki/Jaccard_index), which will +consider all of the k-mers in the union of the two samples. +Practically, this means that you will only find matches if there is +both high overlap between the samples *and* relatively few k-mers that +are disjoint between the samples. This is effective for finding genomes +or transcriptomes that are similar but rarely works well for samples +of vastly different sizes. + +One useful modification to `search` is to calculate containment with +`--containment` instead of the (default) similarity; this will find +matches where the query is contained within the subject, but the +subject may have many other k-mers in it. For example, if you are using +a plasmid as a query, you would use `--containment` to find genomes +that contained that plasmid. + +See [the main sourmash +tutorial](http://sourmash.readthedocs.io/en/latest/tutorials.html#make-and-search-a-database-quickly) +for information on using `search` with and without `--containment`. + +## Breaking down metagenomic samples with `gather` and `lca` + +Neither search option (similarity or containment) is effective when +comparing or searching with metagenomes, which typically have a +mixture of many different genomes. While you might use containment to +see if a query genome is present in one or more metagenomes, a common +question to ask is the reverse: **what genomes are in my metagenome?* + +We have implemented two algorithms in sourmash to do this. + +One, approaches based on lowest common ancestor ("LCA"), uses +taxonomic information from e.g. GenBank to classify individual k-mers, +and then infers taxonomic distributions of metagenome contents from +the presence of these individual k-mers. (This is the approach +pioneered by [Kraken](https://ccb.jhu.edu/software/kraken/) and many +other tools.) `sourmash lca` can be used to classify individual +genome bins with `classify`, or summarize metagenome taxonomy with +`summarize`. The [sourmash lca +tutorial](http://sourmash.readthedocs.io/en/latest/tutorials-lca.html) +shows how to use the `lca classify` and `summarize` commands, and also +provides guidance on building your own database. + +The other approach, `gather`, breaks a metagenome down into individual +genomes based on greedy partitioning. Essentially, it takes a query +metagenome and searches the database for the most highly contained +genome; it then subtracts that match from the metagenome, and repeats. +At the end it reports how much of the metagenome remains unknown. The +[basic sourmash +tutorial](http://sourmash.readthedocs.io/en/latest/tutorials.html#what-s-in-my-metagenome) +has some sample output from using gather with GenBank. + +Our preliminary benchmarking suggests that `gather` is the most accurate +method available for doing strain-level resolution of genomes. More on that +as we move forward! + +## To do taxonomy, or not to do taxonomy? + +By default, there is no taxonomic information provided in sourmash +signatures or SBT databases of signatures. Generally what this means +is that you will have to provide your own mapping from a match to some +taxonomic hierarchy. This is generally appropriate when you are working +with lots of genomes that have no taxonomic information. + +The `lca` subcommands, however, work with LCA databases, which can only +be built for genomes for which taxonomic information is present. This is +one of the main differences between the `sourmash lca` subcommands and the +basic `sourmash search` functionality. + +We're still thinking about how to combine taxonomy and search, so feedback +is welcome. + +## What commands should I use? + +It's not always easy to figure that out, we know! We're thinking about +better tutorials and documentation constantly. + +We suggest the following approach: + +* build some signatures and do some searches, to get some basic familiarity + with sourmash; + +* explore the available databases; + +* then ask questions [via the issue tracker](https://github.com/dib-lab/sourmash/issues) and we will do our best to help you out! + +This helps us figure out what people are actually interested in doing, and +any help we provide via the issue tracker will eventually be added into the +documentation. diff --git a/doc/index.rst b/doc/index.rst index d7cc252978..4ac686a11c 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -65,6 +65,7 @@ Contents: command-line tutorials tutorials-lca + classifying-signatures databases api api-example From e5e3c9f132a31323ed7686a308a3c95d83d8793f Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 11 Feb 2018 19:44:03 -0800 Subject: [PATCH 03/10] updated some text --- doc/classifying-signatures.md | 32 +++++++++++++++++++------------- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index e6563a6bde..46ee611891 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -68,19 +68,25 @@ as we move forward! ## To do taxonomy, or not to do taxonomy? -By default, there is no taxonomic information provided in sourmash -signatures or SBT databases of signatures. Generally what this means -is that you will have to provide your own mapping from a match to some -taxonomic hierarchy. This is generally appropriate when you are working -with lots of genomes that have no taxonomic information. - -The `lca` subcommands, however, work with LCA databases, which can only -be built for genomes for which taxonomic information is present. This is -one of the main differences between the `sourmash lca` subcommands and the -basic `sourmash search` functionality. - -We're still thinking about how to combine taxonomy and search, so feedback -is welcome. +By default, there is no structured taxonomic information available in +sourmash signatures or SBT databases of signatures. Generally what +this means is that you will have to provide your own mapping from a +match to some taxonomic hierarchy. This is generally the case when +you are working with lots of genomes that have no taxonomic +information. + +The `lca` subcommands, however, work with LCA databases, which contain +taxonomic information by construction. This is one of the main +differences between the `sourmash lca` subcommands and the basic +`sourmash search` functionality. So the `lca` subcommands will generally +output structured taxonomic information, and these are what you should look +to if you are interested in doing classification. + +It's important to note that taxonomy based on k-mers is very, very +specific and if you get a match, it's pretty reliable. On the +converse, however, k-mer identification is very brittle with respect +to evolutionary divergence, so if you don't get a match it may only mean +that the particular species isn't known. ## What commands should I use? From 65893b2f767e814de4e74ce5928f772d9a298f0a Mon Sep 17 00:00:00 2001 From: Luiz Irber Date: Mon, 12 Feb 2018 15:53:15 -0800 Subject: [PATCH 04/10] Fix formatting error --- doc/classifying-signatures.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index 46ee611891..e1702a825b 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -37,7 +37,7 @@ Neither search option (similarity or containment) is effective when comparing or searching with metagenomes, which typically have a mixture of many different genomes. While you might use containment to see if a query genome is present in one or more metagenomes, a common -question to ask is the reverse: **what genomes are in my metagenome?* +question to ask is the reverse: **what genomes are in my metagenome?** We have implemented two algorithms in sourmash to do this. From d376a07b70a6ca0c931d1c172732f2b61706fbd8 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 18 Feb 2018 07:41:09 -0800 Subject: [PATCH 05/10] update sourmash lca output --- sourmash_lib/lca/__main__.py | 39 +++++++++++++++++++++--------------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/sourmash_lib/lca/__main__.py b/sourmash_lib/lca/__main__.py index 2a248e31ff..f97b5a0db9 100644 --- a/sourmash_lib/lca/__main__.py +++ b/sourmash_lib/lca/__main__.py @@ -9,6 +9,23 @@ from .command_compare_csv import compare_csv from sourmash_lib.logging import set_quiet, error +usage=''' +sourmash lca [] - work with taxonomic information. + +** Commands can be: + +index - create LCA database +classify --db --query - classify genomes +gather - classify metagenomes +summarize --db --query - summarize mixture +rankinfo - database rank info +compare_csv - compare spreadsheets + +** Use '-h' to get subcommand-specific help, e.g. + +sourmash lca index -h +''' + def main(sysv_args): set_quiet(False) @@ -19,24 +36,14 @@ def main(sysv_args): 'compare_csv': compare_csv} parser = argparse.ArgumentParser( - description='lowest-common ancestor (LCA) utilities', - usage='''sourmash lca [] - -Commands can be: - -index - create LCA database -classify --db --query - classify genomes -summarize --db --query - summarize mixture -rankinfo - database rank info -compare_csv - compare spreadsheets + description='lowest-common ancestor (LCA) utilities', usage=usage) + parser.add_argument('lca_command', nargs='?') + args = parser.parse_args(sysv_args[0:1]) -Use '-h' to get subcommand-specific help, e.g. + if not args.lca_command: + print(usage) + sys.exit(1) -sourmash lca index -h -. -''') - parser.add_argument('lca_command') - args = parser.parse_args(sysv_args[0:1]) if args.lca_command not in commands: error('Unrecognized command: {}', args.lca_command) parser.print_help() From acf11077cdba497d1e24dc0653a9d36eadf2f301 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 18 Feb 2018 08:55:55 -0800 Subject: [PATCH 06/10] upgrade the documentation a bunch --- doc/command-line.md | 256 ++++++++++++++++++++++++++++++++++++--- doc/index.rst | 90 +++++++++++--- doc/tutorials-lca.md | 4 +- sourmash_lib/__main__.py | 61 ++++++---- 4 files changed, 349 insertions(+), 62 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index b04c7dfd78..88fd9247a0 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1,14 +1,14 @@ # Using sourmash from the command line +From the command line, sourmash can be used to compute +[MinHash sketches][0] from DNA sequences, compare them to each other, +and plot the results; these sketches are saved into "signature files". +These signatures allow you to estimate sequence similarity quickly and +accurately in large collections, among other capabilities. -From the command line, sourmash can be used to compute [MinHash sketches][0] from DNA -sequences, compare them to each other, and plot the results. This -allows you to estimate sequence similarity quickly and accurately. - -Please see the [mash software][1] - and the [mash paper (Ondov et al., 2016)][2] -for background -information on how and why MinHash sketches work. +Please see the [mash software][1] and the +[mash paper (Ondov et al., 2016)][2] for background information on +how and why MinHash sketches work. ______ @@ -18,7 +18,6 @@ taken. ## An example - Grab three bacterial genomes from NCBI: ``` curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz @@ -43,17 +42,46 @@ This will output two files, `cmp.dendro.png` and `cmp.matrix.png`, containing a clustering & dendrogram of the sequences, as well as a similarity matrix and heatmap. -## The `sourmash` command and its subcommands +![E. coli comparison plot](_static/cmp.dendro.png) +![E. coli comparison plot](_static/cmp.matrix.png) +## The `sourmash` command and its subcommands To get a list of subcommands, run `sourmash` without any arguments. -There are three main subcommands: `compute`, `compare`, and `plot`. +There are five main subcommands: `compute`, `compare`, `plot`, +`search`, and `gather`. See [the tutorial](tutorials.html) for a +walkthrough of these commands. -### `sourmash compute` +* `compute` creates signatures. +* `compare` compares signatures and builds a distance matrix. +* `plot` plots distance matrices created by `compare`. +* `search` finds matches to a query signature in a collection of signatures. +* `gather` finds non-overlapping matches to a metagenome in a collection of signatures. +There are also a number of commands that work with taxonomic +information; these are grouped under the `sourmash lca` +subcommand. See [the LCA tutorial](tutorials-lca.html) for a +walkthrough of these commands. + +* `lca classify` classifies many signatures against an LCA database. +* `lca summarize` summarizes the content of a metagenome using an LCA database. +* `lca gather` finds non-overlapping matches to a metagenome in an LCA database. +* `lca index` creates a database for use with LCA subcommands. +* `lca rankinfo` summarizes the content of a database. +* `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`. + +Finally, there are a number of utility and information commands: + +* `info` shows version and software information. +* `index` indexes many signatures using a Sequence Bloom Tree (SBT). +* `sbt_combine` combines multiple SBTs. +* `categorize` is an experimental command to categorize many signatures. +* `watch` is an experimental command to classify a stream of sequencing data. + +### `sourmash compute` -The `compute` subcommand computes and saves MinHash sketches for +The `compute` subcommand computes and saves signatures for each sequence in one or more sequence files. It takes as input FASTA or FASTQ files, and these files can be uncompressed or compressed with gzip or bzip2. The output will be one or more JSON signature files @@ -68,6 +96,12 @@ Optional arguments: --ksizes K1[,K2,K3] -- one or more k-mer sizes to use; default is 31 --force -- recompute existing signatures; convert non-DNA characters to N --output -- save all the signatures to this file; can be '-' for stdout. +--track-abundance -- compute and save k-mer abundances. +--name-from-first -- name the signature based on the first sequence in the file +--singleton -- instead of computing a single signature for each input file, + compute one for each sequence +--merged -- compute a single signature for all of the input files, + naming it ``` ### `sourmash compare` @@ -78,7 +112,9 @@ The `compare` subcommand compares one or more signature files is a text display of a similarity matrix where each entry `[i, j]` contains the estimated Jaccard index between input signature `i` and input signature `j`. The output matrix can be saved to a file -with `--output` and used with the `sourmash plot` subcommand. +with `--output` and used with the `sourmash plot` subcommand (or loaded +with `numpy.load(...)`. Using `--csv` will output a CSV file that can +be loaded into other languages than Python, such as R. Usage: ``` @@ -93,7 +129,6 @@ Options: ### `sourmash plot` - The `plot` subcommand produces two plots -- a dendrogram and a dendrogram+matrix -- from a distance matrix computed by `sourmash compare --output `. The default output is two PNG files. @@ -109,16 +144,199 @@ Options: --labels -- display the signature names (by default, the filenames) on the plot --indices -- turn off index display on the plot. --vmax -- maximum value (default 1.0) for heatmap. ---vmin -- minimum value (deafult 0.0) for heatmap. +--vmin -- minimum value (default 0.0) for heatmap. --subsample= -- plot a maximum of samples, randomly chosen. --subsample-seed= -- seed for pseudorandom number generator. ``` -Example figures: - +Example output: + +![An E. coli comparison plot](_static/ecoli_cmp.matrix.png) + +### `sourmash search` + +The `search` subcommand searches a collection of signatures or SBTs for +matches to the query signature. It can search for matches with either +high [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) +or containment; the default is to use Jaccard similarity, unless +`--containment` if specified. `-o/--output` will create a CSV file +containing the matches. + +`search` will load all of provided signatures into memory, which can +be slow and somewhat memory intensive for large collections. You can +use `sourmash index` to create a Sequence Bloom Tree (SBT) that can +be quickly searched on disk; this is [the same format in which we provide +GenBank and other databases](databases.html). + +Usage: +``` +sourmash search query.sig [ list of signatures or SBTs ] +``` + +Example output: + +``` +49 matches; showing first 20: +similarity match +---------- ----- + 75.4% NZ_JMGW01000001.1 Escherichia coli 1-176-05_S4_C2 e117605... + 72.2% NZ_GG774190.1 Escherichia coli MS 196-1 Scfld2538, whole ... + 71.4% NZ_JMGU01000001.1 Escherichia coli 2-011-08_S3_C2 e201108... + 70.1% NZ_JHRU01000001.1 Escherichia coli strain 100854 100854_1... + 69.0% NZ_JH659569.1 Escherichia coli M919 supercont2.1, whole g... +... +``` + +### `sourmash gather` + +The `gather` subcommand finds all non-overlapping matches to the +query. This is specifically meant for metagenome and genome bin +analysis. (See [Classifying Signatures](classifying-signatures.html) +for more information on the different approaches that can be used +here.) + +If the input signature was computed with `--track-abundance`, output +will be abundance weighted (unless `--ignore-abundances` is +specified). `-o/--output` will create a CSV file containing the +matches. + +`gather`, like `search`, will load all of provided signatures into +memory. You can use `sourmash index` to create a Sequence Bloom Tree +(SBT) that can be quickly searched on disk; this is +[the same format in which we provide GenBank and other databases](databases.html). + +Usage: +``` +sourmash gather query.sig [ list of signatures or SBTs ] +``` + +Example output: +``` +overlap p_query p_match +--------- ------- -------- +1.4 Mbp 11.0% 58.0% JANA01000001.1 Fusobacterium sp. OBRC... +1.0 Mbp 7.7% 25.9% CP001957.1 Haloferax volcanii DS2 pla... +0.9 Mbp 7.4% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, c... +0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum rumi... +0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris sub... +``` + +Note: + +Use `sourmash gather` to classify a metagenome against a collection of +genomes with no (or incomplete) taxonomic information. Use `sourmash +lca summarize` and `sourmash lca gather` to classify a metagenome +using a collection of genomes with taxonomic information. + +## `sourmash lca` subcommands + +These commands use LCA databases (created with `index`, below, or +prepared databases such as +[genbank-k31.lca.json.gz, from the LCA tutorial](tutorials-lca.html). + +### `sourmash lca classify` + +`sourmash lca classify` classifies one or more signatures using the given +list of LCA DBs. It is meant for classifying metagenome-assembled genome +bins (MAGs) and single-cell genomes (SAGs). + +Usage: + +``` +sourmash lca --query query.sig [query2.sig ...] --db [ ...] +``` + +Example output: + +``` +ID,status,superkingdom,phylum,class,order,family,genus,species +"NC_009665.1 Shewanella baltica OS185, complete genome",found,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,Shewanellaceae,Shewanella,Shewanella baltica +``` + +### `sourmash lca summarize` + +`sourmash lca summarize` produces a Kraken-style summary of the +combined contents of the given query signatures. It is meant for +exploring metagenomes and metagenome-assembled genome bins. + +Usage: + +``` +sourmash lca summarize --query query.sig [query2.sig ...] + --db [ ...] +``` + +Example output: + +``` +100.0% 792 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica +100.0% 792 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella +100.0% 792 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae +100.0% 792 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales +100.0% 792 Bacteria;Proteobacteria;Gammaproteobacteria +100.0% 792 Bacteria;Proteobacteria +100.0% 792 Bacteria +``` + +### `sourmash lca gather` + +The `sourmash lca gather` command classifies finds all non-overlapping +matches to the query, similar to the `sourmash gather` command. This +is specifically meant for metagenome and genome bin analysis. (See +[Classifying Signatures](classifying-signatures.html) for more +information on the different approaches that can be used here.) + +Usage: + +``` +sourmash lca gather query.sig [ ...] +``` + +Example output: + +``` +overlap p_query p_match +--------- ------- -------- +1.8 Mbp 14.6% 9.1% Fusobacterium nucleatum +1.0 Mbp 7.8% 16.3% Proteiniclasticum ruminis +1.0 Mbp 7.7% 25.9% Haloferax volcanii +0.9 Mbp 7.4% 11.8% Nostoc sp. PCC 7120 +0.9 Mbp 7.0% 5.8% Shewanella baltica +0.8 Mbp 6.0% 8.6% Desulfovibrio vulgaris +0.6 Mbp 4.9% 12.6% Thermus thermophilus +``` + +### `sourmash lca index` + +The `sourmash lca index` command creates an LCA database from +a lineage spreadsheet and a collection of signatures. This can be used +to create LCA databases from private collections of genomes, and can +also be used to create databases for e.g. subsets of GenBank. + +See [the `sourmash lca` tutorial](tutorials-lca.html) and the blog +post +[Why are taxonomic assignments so different for Tara bins?](http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html) +for some use cases. + +If you are interested in preparing lineage spreadsheets from GenBank +genomes (or building off of NCBI taxonomies more generally), please +see +[the NCBI lineage repository](https://github.com/dib-lab/2018-ncbi-lineages). + +### `sourmash lca rankinfo` + +The `sourmash lca rankinfo` command displays k-mer specificity +information for one or more LCA databases. See the blog post +[How specific are k-mers for taxonomic assignment of microbes, anyway?](http://ivory.idyll.org/blog/2017-how-specific-kmers.html) for example output. - +### `sourmash lca compare_csv` +The `sourmash lca compare_csv` command compares two lineage +spreadsheets (such as those output by `sourmash lca classify` or taken +as input by `sourmash lca index`) and summarizes their +agreement/disagreement. Please see the blog post +[Why are taxonomic assignments so different for Tara bins?](http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html) +for an example use case. [0]:https://en.wikipedia.org/wiki/MinHash [1]:http://mash.readthedocs.io/en/latest/__ diff --git a/doc/index.rst b/doc/index.rst index 4ac686a11c..5ee186b42b 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -1,8 +1,3 @@ -.. sourmash documentation master file, created by - sphinx-quickstart on Sat Jun 4 16:35:43 2016. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - Welcome to sourmash! ==================== @@ -13,39 +8,61 @@ This allows you to estimate sequence similarity between even very large data sets quickly and accurately. sourmash can also be used to quickly search large databases of genomes -for matches to query genomes and metagenomes; see `our list of available -databases `__. +for matches to query genomes and metagenomes; see `our list of +available databases `__. + +sourmash also includes k-mer based taxonomic exploration and +classification routines for genome and metagenome analysis. These +routines can use the NCBI taxonomy but do not depend on it in any way. Please see the `mash `__ software and the `mash paper (Ondov et al., 2016) `__ for background information on how and why MinHash sketches work. +We have two tutorials available: one on `genome and metagenome searching `__, and one on `taxonomic analysis `__. + ---- To use sourmash, you must be comfortable with the UNIX command line; programmers may find the Python library and API useful as well. -In brief, +sourmash in brief +----------------- + +sourmash uses MinHash sketching to create "signatures", compressed +representations of DNA/RNA sequence. These signatures can then +be stored, searched, explored, and taxonomically annotated. * ``sourmash`` provides command line utilities for creating, comparing, - and searching MinHash sketches, as well as plotting and clustering - sketches by distance (see `the command-line docs `__). + and searching signatures, as well as plotting and clustering + signatures by similarity (see `the command-line docs `__). + +* ``sourmash`` can **search very large collections of signatures** to find matches + to a query. -* ``sourmash`` supports saving, loading, and communication of MinHash - sketches via `JSON `__, a ~human-readable & +* ``sourmash`` can also **identify parts of metagenomes that match known genomes**, + and can **taxonomically classify genomes and metagenomes** against databases + of known species. + +* ``sourmash`` can be used to **search databases of public sequences** + (e.g. all of GenBank) and can also be used to create and search databases + of **private sequencing data**. + +* ``sourmash`` supports saving, loading, and communication of signatures + via `JSON `__, a ~human-readable & editable format. -* ``sourmash`` also has a simple Python API for interacting with sketches, - including support for online updating and querying of sketches +* ``sourmash`` also has a simple Python API for interacting with signatures, + including support for online updating and querying of signatures (see `the API docs `__). -* ``sourmash`` isn't terribly slow, and relies on an underlying CPython +* ``sourmash`` isn't terribly slow, and relies on an underlying Cython module. * ``sourmash`` is developed `on GitHub - `__ and is freely and openly - available under the BSD 3-clause license. Please see `the README + `__ and is **freely and openly + available** under the BSD 3-clause license. Please see `the README `__ for more information on development, support, and contributing. @@ -56,6 +73,45 @@ and experiment with it yourself `interactively with a binder `__ at `mybinder.org `__. +Memory and speed +---------------- + +sourmash has relatively small disk and memory requirements compared to +many other software programs used for genome search and taxonomic +classification. + +First, `mash` beats sourmash in speed and memory, so if you can use mash, +more power to you :) + +`sourmash search` and `sourmash gather` can be used to search all +genbank microbial genomes ([using our prepared +databases](databases.html)) with about 20 GB of disk and in under 1 GB +of RAM. Typically a search for a single genome takes about 30 seconds +on a laptop. + +`sourmash lca` can be used to search/classify against all genbank +microbial genomes with about 200 MB of disk space and about 10 GB of +RAM. Typically a metagenome classification takes about 1 minute on a +laptop. + +Limitations +----------- + +**sourmash cannot find matches across large evolutionary distances.** + +sourmash seems to work well to search and compare data sets for +matches at the species and genus level, but does not have much +sensitivity beyond that. (It seems to be particularly good at +strain-level analysis.) You should use protein-based analyses +to do searches across larger evolutionary distances. + +**sourmash signatures can be very large.** + +We use a modification of the MinHash sketch approach that allows us +to search the contents of metagenomes and large genomes with no loss +of sensitivity, but there is a tradeoff: there is no guaranteed limit +to signature size when using 'scaled' signatures. + Contents: --------- diff --git a/doc/tutorials-lca.md b/doc/tutorials-lca.md index 1dc0a71e38..b3d0b2617d 100644 --- a/doc/tutorials-lca.md +++ b/doc/tutorials-lca.md @@ -9,9 +9,9 @@ links and details. (These `sourmash lca classify` and `sourmash lca summarize` steps require about 4 GB of RAM when using the genbank database, as below.) -First, install sourmash from the LCA branch: +First, install sourmash 2.0a4 or later. ``` -pip install -U https://github.com/dib-lab/sourmash/archive/add/lca.zip +pip install -U https://github.com/dib-lab/sourmash/archive/master.zip ``` Next, download a genbank LCA database for k=31: diff --git a/sourmash_lib/__main__.py b/sourmash_lib/__main__.py index 7fa278adbf..e2f54632f5 100644 --- a/sourmash_lib/__main__.py +++ b/sourmash_lib/__main__.py @@ -12,51 +12,64 @@ plot, watch, info, storage) from .lca import main as lca_main +usage=''' +sourmash [] -def main(): - set_quiet(False) - - commands = {'search': search, 'compute': compute, - 'compare': compare, 'plot': plot, - 'import_csv': import_csv, 'dump': dump, - 'index': index, - 'categorize': categorize, 'gather': gather, - 'watch': watch, - 'sbt_combine': sbt_combine, 'info': info, - 'storage': storage, - 'lca': lca_main} - parser = argparse.ArgumentParser( - description='work with compressed sequence representations', - usage='''sourmash [] - -Commands can be: +** Commands include: compute Compute MinHash signatures for sequences in files. compare Compute similarity matrix for multiple signatures. search Search a signature against a list of signatures. plot Plot a distance matrix made by 'compare'. +gather Search a metagenome signature for multiple + non-overlapping matches. -Sequence Bloom Tree (SBT) utilities: +** Taxonomic classification utilities: + + Run 'sourmash lca' for the taxonomic classification routines. + +** Sequence Bloom Tree (SBT) utilities: index Index a collection of signatures for fast searching. sbt_combine Combine multiple SBTs into a new one. categorize Identify best matches for many signatures using an SBT. -gather Search a metagenome signature for multiple - non-overlapping matches in the SBT. watch Classify a stream of sequences. +** Other information: + info Sourmash version and other information. Use '-h' to get subcommand-specific help, e.g. sourmash compute -h -. -''') - parser.add_argument('command') + +** Documentation is available at https://sourmash.readthedocs.io/ +''' + +def main(): + set_quiet(False) + + commands = {'search': search, 'compute': compute, + 'compare': compare, 'plot': plot, + 'import_csv': import_csv, 'dump': dump, + 'index': index, + 'categorize': categorize, 'gather': gather, + 'watch': watch, + 'sbt_combine': sbt_combine, 'info': info, + 'storage': storage, + 'lca': lca_main} + parser = argparse.ArgumentParser( + description='work with compressed sequence representations') + parser.add_argument('command', nargs='?') args = parser.parse_args(sys.argv[1:2]) + + if not args.command: + print(usage) + sys.exit(1) + if args.command not in commands: error('Unrecognized command') - parser.print_help() + print(usage) sys.exit(1) cmd = commands.get(args.command) From 5de82551308fad40d5884b7f56f962ad2f23b1f9 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 18 Feb 2018 09:08:13 -0800 Subject: [PATCH 07/10] cleanup foo --- doc/command-line.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 88fd9247a0..43c685170f 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -42,8 +42,9 @@ This will output two files, `cmp.dendro.png` and `cmp.matrix.png`, containing a clustering & dendrogram of the sequences, as well as a similarity matrix and heatmap. -![E. coli comparison plot](_static/cmp.dendro.png) -![E. coli comparison plot](_static/cmp.matrix.png) +Matrix: + +![Matrix](_static/cmp.matrix.png) ## The `sourmash` command and its subcommands @@ -159,7 +160,7 @@ The `search` subcommand searches a collection of signatures or SBTs for matches to the query signature. It can search for matches with either high [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) or containment; the default is to use Jaccard similarity, unless -`--containment` if specified. `-o/--output` will create a CSV file +`--containment` is specified. `-o/--output` will create a CSV file containing the matches. `search` will load all of provided signatures into memory, which can From b10b49050f9e9b215ff9f710e184597d1760c784 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 18 Feb 2018 09:37:15 -0800 Subject: [PATCH 08/10] address @taylorreiter comments --- doc/classifying-signatures.md | 17 ++++++++--------- doc/command-line.md | 4 ++-- 2 files changed, 10 insertions(+), 11 deletions(-) diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index e1702a825b..63b92dfed9 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -41,15 +41,14 @@ question to ask is the reverse: **what genomes are in my metagenome?** We have implemented two algorithms in sourmash to do this. -One, approaches based on lowest common ancestor ("LCA"), uses -taxonomic information from e.g. GenBank to classify individual k-mers, -and then infers taxonomic distributions of metagenome contents from -the presence of these individual k-mers. (This is the approach -pioneered by [Kraken](https://ccb.jhu.edu/software/kraken/) and many -other tools.) `sourmash lca` can be used to classify individual -genome bins with `classify`, or summarize metagenome taxonomy with -`summarize`. The [sourmash lca -tutorial](http://sourmash.readthedocs.io/en/latest/tutorials-lca.html) +One algorithm uses taxonomic information from e.g. GenBank to classify +individual k-mers, and then infers taxonomic distributions of +metagenome contents from the presence of these individual +k-mers. (This is the approach pioneered by +[Kraken](https://ccb.jhu.edu/software/kraken/) and many other tools.) +`sourmash lca` can be used to classify individual genome bins with +`classify`, or summarize metagenome taxonomy with `summarize`. The +[sourmash lca tutorial](http://sourmash.readthedocs.io/en/latest/tutorials-lca.html) shows how to use the `lca classify` and `summarize` commands, and also provides guidance on building your own database. diff --git a/doc/command-line.md b/doc/command-line.md index 43c685170f..46d832aee4 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -231,7 +231,7 @@ using a collection of genomes with taxonomic information. ## `sourmash lca` subcommands -These commands use LCA databases (created with `index`, below, or +These commands use LCA databases (created with `lca index`, below, or prepared databases such as [genbank-k31.lca.json.gz, from the LCA tutorial](tutorials-lca.html). @@ -281,7 +281,7 @@ Example output: ### `sourmash lca gather` -The `sourmash lca gather` command classifies finds all non-overlapping +The `sourmash lca gather` command finds all non-overlapping matches to the query, similar to the `sourmash gather` command. This is specifically meant for metagenome and genome bin analysis. (See [Classifying Signatures](classifying-signatures.html) for more From be3660de4f26970a0aa0e9445f4735f3cb2dffb0 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 18 Feb 2018 09:50:14 -0800 Subject: [PATCH 09/10] link to LCA databases --- doc/databases.md | 23 +++++++++++++++++------ 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/doc/databases.md b/doc/databases.md index 747bd1562d..54cb44159a 100644 --- a/doc/databases.md +++ b/doc/databases.md @@ -1,9 +1,9 @@ # Prepared search databases -## RefSeq microbial genomes +## RefSeq microbial genomes - SBT -These database are formatted for use with `sourmash sbt_search` and -`sourmash sbt_gather`. +These database are formatted for use with `sourmash search` and +`sourmash gather`. Approximately 60,000 microbial genomes (including viral and fungal) from NCBI RefSeq. @@ -12,10 +12,10 @@ from NCBI RefSeq. * [RefSeq k=31, 2017.05.09][1] - 3.5 GB * [RefSeq k=51, 2017.05.09][2] - 3.5 GB -## Genbank microbial genomes +## Genbank microbial genomes - SBT -These database are formatted for use with `sourmash sbt_search` and -`sourmash sbt_gather`. +These database are formatted for use with `sourmash search` and +`sourmash gather`. Approximately 100,000 microbial genomes (including viral and fungal) from NCBI Genbank. @@ -49,3 +49,14 @@ sourmash compute -k 21,31,51 \ [3]:https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k21-2017.05.09.tar.gz [4]:https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz [5]:https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k51-2017.05.09.tar.gz + +## Genbank LCA Database + +These databases are formatted for use with `sourmash lca`. + +Approximately 87,000 microbial genomes (including viral and fungal) +from NCBI Genbank. + +* [Genbank k=21, 2017.11.07](https://osf.io/s3jx8/download), 105 MB +* [Genbank k=31, 2017.11.07](https://osf.io/zskb9/download), 118 MB +* [Genbank k=51, 2017.11.07](https://osf.io/md2nt/download), 123 MB From 131e3b4f24a5306182ef0ad4f349ff4be302a264 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 18 Feb 2018 09:54:36 -0800 Subject: [PATCH 10/10] provide construction details --- doc/databases.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/doc/databases.md b/doc/databases.md index 54cb44159a..5dd629c64f 100644 --- a/doc/databases.md +++ b/doc/databases.md @@ -60,3 +60,17 @@ from NCBI Genbank. * [Genbank k=21, 2017.11.07](https://osf.io/s3jx8/download), 105 MB * [Genbank k=31, 2017.11.07](https://osf.io/zskb9/download), 118 MB * [Genbank k=51, 2017.11.07](https://osf.io/md2nt/download), 123 MB + +### Details + +The above LCA databases were calculated as follows: + +``` +sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \ + genbank-k21.lca.json.gz -k 21 --scaled=10000 \ + -f --traverse-directory .sbt.genbank-k21 --split-identifiers +``` + +See +[github.com/dib-lab/2018-ncbi-lineages](https://github.com/dib-lab/2018-ncbi-lineages) +for information on preparing the genbank-genomes-taxonomy file.