[MRG] add 'sourmash signature' signature manipulation utilities. (#587)
Provides merge, flatten, rename, intersect, extract, downsample, subtract, import, export, and overlap utilities via sourmash signature subcommands.
ctb authored Jan 8, 2019
1 parent 7f00dfe commit 1d1f61b
Showing 31 changed files with 2,284 additions and 123 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -46,7 +46,7 @@ You can also use pip to install the pre-release like so:
```
pip install --pre sourmash
```

A quickstart tutorial [is available](https://sourmash.readthedocs.io/en/latest/tutorials.html).

### Requirements
13 changes: 9 additions & 4 deletions doc/api-example.md
@@ -321,10 +321,12 @@ The hashing function used is identical between num and scaled signatures,
so the hash values themselves are compatible - it's the comparison between
collections of them that doesn't work.

But, in some circumstances, num signatures can be
extracted from scaled signatures, and vice versa. We haven't yet
implemented nice shortcuts for this in sourmash, but you can hack it
together yourself quite easily :).
But, in some circumstances, num signatures can be extracted from
scaled signatures, and vice versa. We haven't yet implemented a
Python API for this in sourmash, but you can hack it together yourself
quite easily, and a conversion utility is implemented through the command
line in `sourmash signature downsample`.


To extract a num MinHash object from a scaled MinHash, first create or load
your MinHash, and then extract the hash values:
@@ -367,6 +369,9 @@ more hash values in the scaled MinHash than num.
Yoda sayeth: *When understand these two sentences you can, use this code may
you.*

(You can also take a look at the logic in `sourmash signature
downsample` if you are interested.)

## Working with fast search trees (Sequence Bloom Trees, or SBTs)

Suppose we have a number of signatures calculated with `--scaled`, like so:
191 changes: 188 additions & 3 deletions doc/command-line.md
@@ -233,7 +233,7 @@ genomes with no (or incomplete) taxonomic information. Use `sourmash
lca summarize` and `sourmash lca gather` to classify a metagenome
using a collection of genomes with taxonomic information.

## `sourmash lca` subcommands
## `sourmash lca` subcommands for taxonomic classification

These commands use LCA databases (created with `lca index`, below, or
prepared databases such as
@@ -288,7 +288,7 @@ signature were classified as family *Shewanellaceae*, genus
*Shewanella*, or species *Shewanella baltica*. Then the lowest
compatible node (here species *Shewanella baltica*) would be reported,
and the status of the classification would be `found`. However, if a
number of additional k-mers in the input signaturer were classified as
number of additional k-mers in the input signature were classified as
*Shewanella oneidensis*, sourmash would be unable to resolve the
taxonomic assignment below genus *Shewanella* and it would report
a status of `disagree` with the genus-level assignment of *Shewanella*;
@@ -343,7 +343,7 @@ and the following example summarize output to stdout:
```

The output is space-separated and consists of three columns: the
perrcentage of total k-mers that have this classification; the number of
percentage of total k-mers that have this classification; the number of
k-mers that have this classification; and the lineage classification.
K-mer classifications are reported hierarchically, so the percentages
and totals contain all assignments that are at a lower taxonomic level -
@@ -420,3 +420,188 @@ for an example use case.
[1]:http://mash.readthedocs.io/en/latest/__
[2]:http://biorxiv.org/content/early/2015/10/26/029827
[3]:https://en.wikipedia.org/wiki/Jaccard_index

## `sourmash signature` subcommands for signature manipulation

These commands manipulate signatures from the command line. Currently
supported subcommands are `merge`, `rename`, `intersect`,
`extract`, `downsample`, `subtract`, `import`, `export`, `info`,
`flatten`, and `overlap`.

All of the signature commands work only on compatible signatures, where
the k-mer size and nucleotide/protein sequences match. If working directly
with the hash values (e.g. `merge`, `intersect`, `subtract`) then the
scaled values must also match; you can use `downsample` to convert a bunch
of samples to the same scaled value.

If there are multiple signatures in a file with different ksizes and/or
from nucleotide and protein sequences, you can choose amongst them with
`-k/--ksize` and `--dna` or `--protein`, as with other sourmash commands
such as `search`, `gather`, and `compare`.

Note, you can use `sourmash sig` as shorthand for all of these commands.

### `sourmash signature merge`

Merge two (or more) signatures.

For example,
```
sourmash signature merge file1.sig file2.sig -o merged.sig
```
will output the union of all the hashes in `file1.sig` and `file2.sig`
to `merged.sig`.

The signatures passed to merge must either all have been computed
with `--track-abundance`, or all without it. If they have
`track_abundance` on, then the merged signature will have the sum of
all abundances across the individual signatures. The `--flatten` flag
will override this behavior and allow merging of mixtures by removing
all abundances.
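
The merge semantics described above can be sketched in plain Python. This is only an illustration, not sourmash's implementation: signatures are modeled as dicts mapping hash value to abundance, and the function names `merge_flat` and `merge_abund` are hypothetical.

```python
from collections import Counter


def merge_flat(*signatures):
    """Union of hash values, ignoring abundances (the --flatten case)."""
    merged = set()
    for sig in signatures:
        merged |= set(sig)  # iterating a dict yields its keys (the hashes)
    return merged


def merge_abund(*signatures):
    """Union of hash values, summing the abundance of each hash."""
    total = Counter()
    for sig in signatures:
        total.update(sig)  # Counter.update adds counts per key
    return dict(total)


# Toy signatures: hash value -> abundance (made-up numbers).
sig1 = {10: 2, 20: 1}
sig2 = {20: 3, 30: 5}

print(sorted(merge_flat(sig1, sig2)))
print(merge_abund(sig1, sig2))
```

Hash 20 occurs in both inputs, so its abundance is summed in `merge_abund` and it appears once in `merge_flat`.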

### `sourmash signature rename`

Rename the display name for a signature - this is the name output for matches
in `compare`, `search`, `gather`, etc.

For example,
```
sourmash signature rename file1.sig "new name" -o renamed.sig
```
will place a renamed copy of the hashes in `file1.sig` in the file
`renamed.sig`.

### `sourmash signature subtract`

Remove from one signature all of the hash values that are present in
one or more of the others.

For example,

```
sourmash signature subtract file1.sig file2.sig file3.sig -o subtracted.sig
```
will subtract all of the hashes in `file2.sig` and `file3.sig` from
`file1.sig`, and save the new signature to `subtracted.sig`.

To use `subtract` on signatures calculated with
`--track-abundance`, you must specify `--flatten`.
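
Conceptually, `subtract` is a set difference on hash values. A minimal sketch, assuming each flattened signature can be treated as a plain Python set of integers (the helper name `subtract_hashes` is hypothetical and not part of sourmash):

```python
def subtract_hashes(first, *others):
    """Return the hashes in `first` that appear in none of `others`."""
    remaining = set(first)
    for other in others:
        remaining -= set(other)
    return remaining


# Toy hash sets standing in for file1.sig, file2.sig and file3.sig.
file1 = {1, 2, 3, 4, 5}
file2 = {2, 3}
file3 = {5, 6}

print(sorted(subtract_hashes(file1, file2, file3)))
```

Abundances have no obvious meaning under a set difference, which is consistent with the requirement to pass `--flatten` for abundance-tracking signatures.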

### `sourmash signature intersect`

Output the intersection of the hash values in multiple signature files.

For example,

```
sourmash signature intersect file1.sig file2.sig file3.sig -o intersect.sig
```
will output the intersection of all the hashes in those three files to
`intersect.sig`.

The `intersect` command flattens all signatures, i.e. the abundances
in any signatures will be ignored and the output signature will have
`track_abundance` turned off.
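
As a rough illustration of the behavior described above, the sketch below models abundance-tracking signatures as dicts (hash value to abundance) and returns only the shared hash values. The name `intersect_hashes` is hypothetical; sourmash's own code may be organized differently.

```python
def intersect_hashes(first, *others):
    """Return the hash values common to all inputs, dropping abundances."""
    common = set(first)  # set() of a dict keeps only the keys (the hashes)
    for other in others:
        common &= set(other)
    return common


# Toy inputs: one with abundances (dict), two without (sets).
file1 = {1: 4, 2: 1, 3: 7}
file2 = {2, 3, 4}
file3 = {3, 2, 9}

print(sorted(intersect_hashes(file1, file2, file3)))
```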

### `sourmash signature downsample`

Downsample one or more signatures.

With `downsample`, you can --

* increase the `--scaled` value for a signature computed with `--scaled`, shrinking it in size;
* decrease the `num` value for a traditional num MinHash, shrinking it in size;
* try to convert a `--scaled` signature to a `num` signature;
* try to convert a `num` signature to a `--scaled` signature.

For example,
```
sourmash signature downsample file1.sig file2.sig --scaled 100000 -o downsampled.sig
```
will output each signature, downsampled to a scaled value of 100000, to
`downsampled.sig`; and
```
sourmash signature downsample --num 500 scaled_file.sig -o downsampled.sig
```
will try to convert a scaled MinHash to a num MinHash.
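
The idea behind these conversions can be sketched as follows. The sketch assumes 64-bit hash values, that a scaled signature keeps every hash below roughly `2**64 / scaled`, and that a num signature keeps the `num` smallest hashes. The exact threshold rounding used by sourmash is not shown in this document, so treat the arithmetic here as an approximation; all function names are hypothetical.

```python
HASH_SPACE = 2**64  # assumption: 64-bit hash values


def downsample_scaled(hashes, new_scaled):
    """Keep only hashes below the threshold implied by `new_scaled`."""
    threshold = HASH_SPACE // new_scaled  # rounding is an assumption
    return {h for h in hashes if h < threshold}


def downsample_num(hashes, new_num):
    """Keep only the `new_num` smallest hashes."""
    return set(sorted(hashes)[:new_num])


def scaled_to_num(hashes, num):
    """Try to build a num sketch from a scaled one; fail if too few hashes."""
    if len(hashes) < num:
        raise ValueError("not enough hashes to build a num=%d sketch" % num)
    return downsample_num(hashes, num)


toy = {1, 2**60, 2**63}
print(sorted(downsample_scaled(toy, 4)))  # threshold is 2**62
print(sorted(downsample_num(toy, 2)))
```

The "try to" wording in the list above reflects the failure case: a scaled signature can only yield a num signature if it contains at least `num` hashes.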

### `sourmash signature extract`

Extract the specified signature(s) from a collection of signatures.

For example,
```
sourmash signature extract *.sig -k 21 --dna -o extracted.sig
```
will extract all nucleotide signatures calculated at k=21 from all
.sig files in the current directory.

There are currently two other useful selectors for `extract`: you can specify
(part of) an md5sum, as output in the CSVs produced by `search` and `gather`;
and you can specify (part of) a name.

For example,
```
sourmash signature extract tests/test-data/*.fa.sig --md5 09a0869
```
will extract the signature from `47.fa.sig` which has an md5sum of
`09a08691ce52952152f0e866a59f6261`; and
```
sourmash signature extract tests/test-data/*.fa.sig --name NC_009665
```
will extract the same signature, which has an accession number of
`NC_009665.1`.
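
The two selectors can be pictured as simple string filters. In this sketch the `Sig` record and the `select` function are hypothetical, and both selectors are treated as substring matches, which is an assumption; the text above only says "(part of)". The md5sum is the one quoted above, while the names are made up for illustration.

```python
from collections import namedtuple

# Hypothetical minimal record; real sourmash signatures carry more fields.
Sig = namedtuple("Sig", ["name", "md5sum"])


def select(sigs, md5=None, name=None):
    """Return signatures whose md5sum and name contain the given fragments."""
    chosen = []
    for sig in sigs:
        if md5 is not None and md5 not in sig.md5sum:
            continue
        if name is not None and name not in sig.name:
            continue
        chosen.append(sig)
    return chosen


sigs = [
    Sig("NC_009665.1 example genome", "09a08691ce52952152f0e866a59f6261"),
    Sig("another example genome", "ffffffffffffffffffffffffffffffff"),
]

print(select(sigs, md5="09a0869"))
print(select(sigs, name="NC_009665"))
```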

### `sourmash signature flatten`

Flatten the specified signature(s), removing abundances and setting
track_abundance to False.

For example,
```
sourmash signature flatten *.sig -o flattened.sig
```
will remove all abundances from all of the .sig files in the current
directory.

The `flatten` command accepts the same selectors as `extract`.

### `sourmash signature import`

Import signatures into sourmash format. Currently only supports mash,
and can import mash sketches output by `mash info -d <filename.msh>`.

For example,
```
sourmash signature import filename.msh.json -o imported.sig
```
will import the contents of `filename.msh.json` into `imported.sig`.

### `sourmash signature export`

Export signatures from sourmash format. Currently only supports
mash dump format.

For example,
```
sourmash signature export filename.sig -o filename.sig.msh.json
```

### `sourmash signature overlap`

Display a detailed comparison of two signatures. This computes the
Jaccard similarity (as in `sourmash compare` or `sourmash search`) and
the Jaccard containment in both directions (as with `--containment`).
It also displays the number of hash values in the union and
intersection of the two signatures, as well as the number of disjoint
hash values in each signature.

This command has two uses - first, it is helpful for understanding how
similarity and containment are calculated, and second, it is useful for
analyzing signatures with very small overlaps, where the similarity
and/or containment might be very close to zero.

For example,
```
sourmash signature overlap file1.sig file2.sig
```
will display the detailed comparison of `file1.sig` and `file2.sig`.
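
The quantities reported by `overlap` follow from basic set operations on the two collections of hash values. The sketch below uses a hypothetical `overlap_report` function and toy data; it does not reproduce the actual output format of the command.

```python
def overlap_report(a, b):
    """Compute similarity, containment and set sizes for two hash sets."""
    a, b = set(a), set(b)
    inter = a & b
    union = a | b
    return {
        # Jaccard similarity: |A ∩ B| / |A ∪ B|
        "similarity": len(inter) / len(union) if union else 0.0,
        # Containment of A in B: |A ∩ B| / |A|
        "a_in_b": len(inter) / len(a) if a else 0.0,
        # Containment of B in A: |A ∩ B| / |B|
        "b_in_a": len(inter) / len(b) if b else 0.0,
        "union": len(union),
        "intersection": len(inter),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


report = overlap_report({1, 2, 3, 4}, {3, 4, 5})
for key, value in report.items():
    print(key, value)
```

Similarity is symmetric, while the two containment values differ whenever the signatures differ in size.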
10 changes: 7 additions & 3 deletions sourmash/__main__.py
@@ -11,6 +11,7 @@
gather, index, sbt_combine, search,
plot, watch, info, storage, migrate, multigather)
from .lca import main as lca_main
from .sig import main as sig_main

usage='''
sourmash <command> [<args>]
@@ -35,9 +36,10 @@
categorize Identify best matches for many signatures using an SBT.
watch Classify a stream of sequences.
** Other information:
** Other commands:
info Sourmash version and other information.
info Display sourmash version and other information.
signature Sourmash signature manipulation utilities.
Use '-h' to get subcommand-specific help, e.g.
@@ -59,7 +61,9 @@ def main():
'storage': storage,
'lca': lca_main,
'migrate': migrate,
'multigather': multigather}
'multigather': multigather,
'sig': sig_main,
'signature': sig_main}
parser = argparse.ArgumentParser(
description='work with compressed sequence representations')
parser.add_argument('command', nargs='?')
2 changes: 1 addition & 1 deletion sourmash/commands.py
@@ -1006,7 +1006,7 @@ def gather(args):
sourmash_args.add_moltype_args(parser)

args = parser.parse_args(args)
set_quiet(args.quiet)
set_quiet(args.quiet, args.debug)
moltype = sourmash_args.calculate_moltype(args)

# load the query signature & figure out all the things
12 changes: 7 additions & 5 deletions sourmash/lca/command_classify.py
@@ -9,9 +9,9 @@
from collections import Counter

from .. import sourmash_args, load_signatures
from ..logging import notify, error
from ..logging import notify, error, debug, set_quiet
from . import lca_utils
from .lca_utils import debug, set_debug, check_files_exist
from .lca_utils import check_files_exist

DEFAULT_THRESHOLD=5 # how many counts of a taxid at min

@@ -88,7 +88,10 @@ def classify(args):
p.add_argument('--scaled', type=float)
p.add_argument('--traverse-directory', action='store_true',
help='load all signatures underneath directories.')
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-q', '--quiet', action='store_true',
help='suppress non-error output')
p.add_argument('-d', '--debug', action='store_true',
help='output debugging output')
args = p.parse_args(args)

if not args.db:
@@ -99,8 +102,7 @@ def classify(args):
error('Error! must specify at least one query signature with --query')
sys.exit(-1)

if args.debug:
set_debug(args.debug)
set_quiet(args.quiet, args.debug)

# flatten --db and --query
args.db = [item for sublist in args.db for item in sublist]
12 changes: 7 additions & 5 deletions sourmash/lca/command_compare_csv.py
@@ -8,17 +8,20 @@
from collections import defaultdict

from .. import sourmash_args
from ..logging import notify, error, print_results
from ..logging import notify, error, print_results, set_quiet
from . import lca_utils
from .lca_utils import debug, set_debug, zip_lineage
from .lca_utils import zip_lineage
from .command_index import load_taxonomy_assignments


def compare_csv(args):
p = argparse.ArgumentParser(prog="sourmash lca compare_csv")
p.add_argument('csv1', help='taxonomy spreadsheet output by classify')
p.add_argument('csv2', help='custom taxonomy spreadsheet')
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-q', '--quiet', action='store_true',
help='suppress non-error output')
p.add_argument('-d', '--debug', action='store_true',
help='output debugging output')
p.add_argument('-C', '--start-column', default=2, type=int,
help='column at which taxonomic assignments start')
p.add_argument('--tabs', action='store_true',
@@ -32,8 +35,7 @@ def compare_csv(args):
error('error, --start-column cannot be less than 2')
sys.exit(-1)

if args.debug:
set_debug(args.debug)
set_quiet(args.quiet, args.debug)

# first, load classify-style spreadsheet
notify('loading classify output from: {}', args.csv1)
12 changes: 7 additions & 5 deletions sourmash/lca/command_gather.py
@@ -11,9 +11,9 @@
from collections import Counter, defaultdict, namedtuple

from .. import sourmash_args, save_signatures, SourmashSignature
from ..logging import notify, error, print_results
from ..logging import notify, error, print_results, set_quiet, debug
from . import lca_utils
from .lca_utils import debug, set_debug, check_files_exist
from .lca_utils import check_files_exist
from ..search import format_bp

LCAGatherResult = namedtuple('LCAGatherResult',
@@ -186,17 +186,19 @@ def gather_main(args):
p = argparse.ArgumentParser(prog="sourmash lca gather")
p.add_argument('query')
p.add_argument('db', nargs='+')
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-o', '--output', type=argparse.FileType('wt'),
help='output CSV containing matches to this file')
p.add_argument('--output-unassigned', type=argparse.FileType('wt'),
help='output unassigned portions of the query as a signature to this file')
p.add_argument('--ignore-abundance', action='store_true',
help='do NOT use k-mer abundances if present')
p.add_argument('-q', '--quiet', action='store_true',
help='suppress non-error output')
p.add_argument('-d', '--debug', action='store_true',
help='output debugging output')
args = p.parse_args(args)

if args.debug:
set_debug(args.debug)
set_quiet(args.quiet, args.debug)

if not check_files_exist(args.query, *args.db):
sys.exit(-1)