[MRG] add 'sourmash signature' signature manipulation utilities. #587

Merged: 52 commits, Jan 8, 2019
e48358c
initial setup for 'sourmash signature' command line
ctb Dec 26, 2018
9c1c5f1
fix module name sourmash_lib -> sourmash
ctb Dec 26, 2018
e02ebbc
fix module name sourmash_lib -> sourmash
ctb Dec 26, 2018
97bd691
fix module name sourmash_lib -> sourmash
ctb Dec 26, 2018
58df721
fix module name sourmash_lib -> sourmash
ctb Dec 26, 2018
d146e4d
adjust tests and test filenames to be explicit
ctb Dec 26, 2018
f044e2a
initial merge and intersect functionality
ctb Dec 26, 2018
d3abd2b
add merge, intersect, and rename
ctb Dec 26, 2018
86f5163
added subtract, extract, and downsample
ctb Dec 26, 2018
22b74ac
added moltype and ksize
ctb Dec 26, 2018
f535085
add -o options
ctb Dec 26, 2018
30cb9f9
added documentation, refactored sourmash sig extract a bit.
ctb Dec 27, 2018
98137c3
spell check new documentation :)
ctb Dec 27, 2018
100ec12
add --md5 and --name selectors to sourmash sig extract; update tests,…
ctb Dec 27, 2018
58c5e89
Merge branch 'master' of github.com:dib-lab/sourmash into add/signatu…
ctb Dec 27, 2018
87c27ee
add 'info' command
ctb Dec 27, 2018
45d1c13
Merge branch 'master' into add/signature_subcommand_2
ctb Dec 27, 2018
25ff113
Merge branch 'master' into add/signature_subcommand_2
ctb Dec 28, 2018
ed77e0d
adjust tests to ensure that full signatures are identical
ctb Dec 28, 2018
457daec
start thinking about abundance stuff
ctb Dec 29, 2018
5b5fe65
Merge branch 'master' of github.com:dib-lab/sourmash into add/signatu…
ctb Dec 30, 2018
364d2ff
fix merge/intersect behavior with respect to abund, add some tests
ctb Dec 30, 2018
bef4b98
add a flatten command
ctb Dec 30, 2018
2f3425b
make sure subtract breaks on --track-abundance signatures
ctb Dec 30, 2018
8ee1aad
add import/export
ctb Dec 31, 2018
1ef65e8
add py27 compat print foo
ctb Dec 31, 2018
904446e
add --num to downsample
ctb Dec 31, 2018
fe5e019
update sourmash main doc
ctb Jan 1, 2019
550b10d
num to scaled conversion
ctb Jan 1, 2019
b924e89
add conversion from scaled to num
ctb Jan 1, 2019
a432a6d
make downsample work on multisig files
ctb Jan 1, 2019
b65c341
fix headings in sourmash sig docs
ctb Jan 3, 2019
ced81e7
add sourmash signature overlap
ctb Jan 3, 2019
c4e8db8
Merge branch 'master' of github.com:dib-lab/sourmash into add/signatu…
ctb Jan 3, 2019
a4b17ba
Merge branch 'master' into add/signature_subcommand_2
ctb Jan 4, 2019
b7608c3
Merge branch 'master' of github.com:dib-lab/sourmash into add/signatu…
ctb Jan 5, 2019
ca51600
add missing subcommands to docs, output
ctb Jan 5, 2019
462390f
fix typo
taylorreiter Jan 5, 2019
123f091
make sure 'sourmash sig info' prints to stdout not stderr
ctb Jan 7, 2019
5e9ffb8
minor comment update
ctb Jan 7, 2019
22dd21a
merge now supports multiple signatures per file; add merge --flatten …
ctb Jan 7, 2019
3ff4aed
added support for multiple signatures in one file to both intersect a…
ctb Jan 7, 2019
a594d17
added multisig tests for everything but flatten
ctb Jan 7, 2019
2661d87
added more multisig tests, and added --flatten to subtract
ctb Jan 7, 2019
343145f
Merge branch 'master' of github.com:dib-lab/sourmash into add/signatu…
ctb Jan 7, 2019
a92289b
adjust test to use last_result
ctb Jan 7, 2019
597f167
revamp --quiet and --debug handling in lca submodule
ctb Jan 7, 2019
fda4e9c
fix py2 division bug **caught by tests and CI**
ctb Jan 7, 2019
0b74aaa
fix errors introduced by debug revamp in lca :)
ctb Jan 7, 2019
18b2f18
rename sig info to sig describe
ctb Jan 8, 2019
2b06374
deal more nicely with signature loading errors in describe()
ctb Jan 8, 2019
c08b81a
isolate black magic functionality
ctb Jan 8, 2019
2 changes: 1 addition & 1 deletion README.md
@@ -46,7 +46,7 @@ You can also use pip to install the pre-release like so:
```
pip install --pre sourmash
```

A quickstart tutorial [is available](https://sourmash.readthedocs.io/en/latest/tutorials.html).

### Requirements
13 changes: 9 additions & 4 deletions doc/api-example.md
@@ -321,10 +321,12 @@ The hashing function used is identical between num and scaled signatures,
so the hash values themselves are compatible - it's the comparison between
collections of them that doesn't work.

But, in some circumstances, num signatures can be
extracted from scaled signatures, and vice versa. We haven't yet
implemented nice shortcuts for this in sourmash, but you can hack it
together yourself quite easily :).
But, in some circumstances, num signatures can be extracted from
scaled signatures, and vice versa. We haven't yet implemented a
Python API for this in sourmash, but you can hack it together yourself
quite easily, and a conversion utility is implemented through the command
line in `sourmash signature downsample`.


To extract a num MinHash object from a scaled MinHash, first create or load
your MinHash, and then extract the hash values:
@@ -367,6 +369,9 @@ more hash values in the scaled MinHash than num.
Yoda sayeth: *When understand these two sentences you can, use this code may
you.*

(You can also take a look at the logic in `sourmash signature
downsample` if you are interested.)

## Working with fast search trees (Sequence Bloom Trees, or SBTs)

Suppose we have a number of signatures calculated with `--scaled`, like so:
191 changes: 188 additions & 3 deletions doc/command-line.md
@@ -233,7 +233,7 @@ genomes with no (or incomplete) taxonomic information. Use `sourmash
lca summarize` and `sourmash lca gather` to classify a metagenome
using a collection of genomes with taxonomic information.

## `sourmash lca` subcommands
## `sourmash lca` subcommands for taxonomic classification

These commands use LCA databases (created with `lca index`, below, or
prepared databases such as
@@ -288,7 +288,7 @@ signature were classified as family *Shewanellaceae*, genus
*Shewanella*, or species *Shewanella baltica*. Then the lowest
compatible node (here species *Shewanella baltica*) would be reported,
and the status of the classification would be `found`. However, if a
number of additional k-mers in the input signaturer were classified as
number of additional k-mers in the input signature were classified as
*Shewanella oneidensis*, sourmash would be unable to resolve the
taxonomic assignment below genus *Shewanella* and it would report
a status of `disagree` with the genus-level assignment of *Shewanella*;
@@ -343,7 +343,7 @@ and the following example summarize output to stdout:
```

The output is space-separated and consists of three columns: the
perrcentage of total k-mers that have this classification; the number of
percentage of total k-mers that have this classification; the number of
k-mers that have this classification; and the lineage classification.
K-mer classifications are reported hierarchically, so the percentages
and totals contain all assignments that are at a lower taxonomic level -
@@ -420,3 +420,188 @@ for an example use case.
[1]:http://mash.readthedocs.io/en/latest/__
[2]:http://biorxiv.org/content/early/2015/10/26/029827
[3]:https://en.wikipedia.org/wiki/Jaccard_index

## `sourmash signature` subcommands for signature manipulation

These commands manipulate signatures from the command line. Currently
supported subcommands are `merge`, `rename`, `intersect`,
`extract`, `downsample`, `subtract`, `import`, `export`, `info`, and
`flatten`.

All of the signature commands work only on compatible signatures, where
the k-mer size and nucleotide/protein sequences match. If working directly
with the hash values (e.g. `merge`, `intersect`, `subtract`) then the
scaled values must also match; you can use `downsample` to convert a bunch
of samples to the same scaled value.

If there are multiple signatures in a file with different ksizes and/or
from nucleotide and protein sequences, you can choose amongst them with
`-k/--ksize` and `--dna` or `--protein`, as with other sourmash commands
such as `search`, `gather`, and `compare`.

Note, you can use `sourmash sig` as shorthand for all of these commands.

### `sourmash signature merge`

Merge two (or more) signatures.

For example,
```
sourmash signature merge file1.sig file2.sig -o merged.sig
```
will output the union of all the hashes in `file1.sig` and `file2.sig`
to `merged.sig`.

All of the signatures passed to merge must either have been computed
with `--track-abundance`, or not. If they have `track_abundance` on,
then the merged signature will have the sum of all abundances across
the individual signatures. The `--flatten` flag will override this
behavior and allow merging of mixtures by removing all abundances.
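
As a rough illustration of the merge rules described above, here is a minimal sketch that models a signature as either a plain `set` of hash values (no abundances) or a `dict` mapping hash value to abundance. The function name `merge_hashes` and this data model are assumptions made for the example; this is not the sourmash implementation.

```python
from collections import Counter


def merge_hashes(signatures, flatten=False):
    """Union the hashes of several toy 'signatures'.

    Each signature is a set of hash values, or a dict of
    hash value -> abundance (hypothetical model, for illustration only).
    """
    has_abund = [isinstance(sig, dict) for sig in signatures]

    if flatten:
        # Drop all abundances, then take the plain union.
        merged = set()
        for sig in signatures:
            merged.update(sig)  # iterating a dict yields its keys
        return merged

    if any(has_abund) and not all(has_abund):
        raise ValueError("cannot merge a mixture of abundance and "
                         "non-abundance signatures without flatten=True")

    if signatures and all(has_abund):
        # Sum abundances across all inputs.
        total = Counter()
        for sig in signatures:
            total.update(sig)  # Counter.update adds counts
        return dict(total)

    merged = set()
    for sig in signatures:
        merged.update(sig)
    return merged
```

With this model, `merge_hashes([{1: 2, 5: 1}, {1: 3}])` sums the abundance of hash `1`, while passing `flatten=True` allows a mixture of the two kinds of input and returns a plain set.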

### `sourmash signature rename`

Rename the display name for a signature - this is the name output for matches
in `compare`, `search`, `gather`, etc.

For example,
```
sourmash signature rename file1.sig "new name" -o renamed.sig
```
will place a renamed copy of the hashes in `file1.sig` in the file
`renamed.sig`.

### `sourmash signature subtract`

Subtract all of the hash values from one signature that are in one or more
of the others.

For example,

```
sourmash signature subtract file1.sig file2.sig file3.sig -o subtracted.sig
```
will subtract all of the hashes in `file2.sig` and `file3.sig` from
`file1.sig`, and save the new signature to `subtracted.sig`.

To use `subtract` on signatures calculated with
`--track-abundance`, you must specify `--flatten`.
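
The subtraction rule can be sketched in a few lines of Python. As an assumption for this example only, a signature is modeled as a `set` of hash values, or a `dict` of hash value to abundance when abundances are tracked; `subtract_hashes` is a hypothetical helper, not a sourmash function.

```python
def subtract_hashes(first, others, flatten=False):
    """Remove from `first` every hash that appears in any of `others`."""
    tracks_abund = isinstance(first, dict) or any(
        isinstance(other, dict) for other in others)
    if tracks_abund and not flatten:
        # Mirrors the documented requirement to pass --flatten.
        raise ValueError("abundance signatures require flatten=True")

    remaining = set(first)  # a dict contributes only its keys
    for other in others:
        remaining -= set(other)
    return remaining
```

For instance, `subtract_hashes({1, 2, 3, 4}, [{2}, {4, 5}])` leaves the hashes `1` and `3`.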

### `sourmash signature intersect`

Output the intersection of the hash values in multiple signature files.

For example,

```
sourmash signature intersect file1.sig file2.sig file3.sig -o intersect.sig
```
will output the intersection of all the hashes in those three files to
`intersect.sig`.

The `intersect` command flattens all signatures, i.e. the abundances
in any signatures will be ignored and the output signature will have
`track_abundance` turned off.
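
A minimal sketch of this behavior, again modeling a signature as a `set` of hashes or a `dict` of hash to abundance (an assumption for illustration; `intersect_hashes` is a made-up name):

```python
def intersect_hashes(signatures):
    """Return the hashes common to every input, ignoring abundances."""
    if not signatures:
        return set()
    common = set(signatures[0])  # a dict contributes only its keys
    for sig in signatures[1:]:
        common &= set(sig)
    return common
```

The result is always a plain set, which corresponds to an output signature with `track_abundance` turned off.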

### `sourmash signature downsample`

Downsample one or more signatures.

With `downsample`, you can --

* increase the `--scaled` value for a signature computed with `--scaled`, shrinking it in size;
* decrease the `num` value for a traditional num MinHash, shrinking it in size;
* try to convert a `--scaled` signature to a `num` signature;
* try to convert a `num` signature to a `--scaled` signature.

For example,
```
sourmash signature downsample file1.sig file2.sig --scaled 100000 -o downsampled.sig
```
will output each signature, downsampled to a scaled value of 100000, to
`downsampled.sig`; and
```
sourmash signature downsample --num 500 scaled_file.sig -o downsampled.sig
```
will try to convert a scaled MinHash to a num MinHash.
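
To make the four cases above more concrete, here is a toy sketch of the underlying idea. It assumes that hashes are 64-bit integers, that a scaled signature keeps every hash below roughly `2**64 / scaled`, and that a num signature keeps the `num` smallest hashes. All function names are hypothetical, and the exact rounding and boundary handling in sourmash may differ.

```python
HASH_SPACE = 2 ** 64  # assumption: hashes are 64-bit unsigned integers


def max_hash_for_scaled(scaled):
    """Upper bound on retained hashes for a scaled value (toy model)."""
    return HASH_SPACE // scaled


def downsample_scaled(hashes, new_scaled):
    """Keep only the hashes under the bound implied by new_scaled."""
    limit = max_hash_for_scaled(new_scaled)
    return {h for h in hashes if h < limit}


def scaled_to_num(hashes, num):
    """Keep the `num` smallest hashes; fail if there are too few."""
    if len(hashes) < num:
        raise ValueError("not enough hashes for num={}".format(num))
    return set(sorted(hashes)[:num])


def num_to_scaled(hashes, num, new_scaled):
    """Convert a num sketch to a scaled one, when that is safe."""
    limit = max_hash_for_scaled(new_scaled)
    # A full num sketch whose largest hash is still under the limit may
    # have discarded other hashes under the limit, so refuse to convert.
    if hashes and len(hashes) >= num and max(hashes) < limit:
        raise ValueError("num sketch does not cover the scaled range")
    return {h for h in hashes if h < limit}
```

This is why the conversions are described as something `downsample` will "try" to do: depending on how many hashes the input holds, the target signature may not be recoverable.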

### `sourmash signature extract`

Extract the specified signature(s) from a collection of signatures.

For example,
```
sourmash signature extract *.sig -k 21 --dna -o extracted.sig
```
will extract all nucleotide signatures calculated at k=21 from all
.sig files in the current directory.

There are currently two other useful selectors for `extract`: you can specify
(part of) an md5sum, as output in the CSVs produced by `search` and `gather`;
and you can specify (part of) a name.

For example,
```
sourmash signature extract tests/test-data/*.fa.sig --md5 09a0869
```
will extract the signature from `47.fa.sig` which has an md5sum of
`09a08691ce52952152f0e866a59f6261`; and
```
sourmash signature extract tests/test-data/*.fa.sig --name NC_009665
```
will extract the same signature, which has an accession number of
`NC_009665.1`.
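
The two selectors can be pictured as simple string filters. The sketch below assumes that `--md5` matches a prefix of the md5sum and `--name` matches a substring of the name; the record layout, the function name `select_signatures`, and the second record are made up for illustration.

```python
def select_signatures(records, md5=None, name=None):
    """Return the records that match every selector that was given."""
    selected = []
    for rec in records:
        if md5 is not None and not rec["md5"].startswith(md5):
            continue
        if name is not None and name not in rec["name"]:
            continue
        selected.append(rec)
    return selected


records = [
    {"name": "NC_009665.1", "md5": "09a08691ce52952152f0e866a59f6261"},
    # hypothetical second record, not a real signature
    {"name": "made-up example", "md5": "0123456789abcdef0123456789abcdef"},
]

by_md5 = select_signatures(records, md5="09a0869")
by_name = select_signatures(records, name="NC_009665")
```

Both calls pick out the same single record, matching the two command-line examples above.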

### `sourmash signature flatten`

Flatten the specified signature(s), removing abundances and setting
track_abundance to False.

For example,
```
sourmash signature flatten *.sig -o flattened.sig
```
will remove all abundances from all of the .sig files in the current
directory.

The `flatten` command accepts the same selectors as `extract`.

### `sourmash signature import`

Import signatures into sourmash format. Currently only supports mash,
and can import mash sketches output by `mash info -d <filename.msh>`.

For example,
```
sourmash signature import filename.msh.json -o imported.sig
```
will import the contents of `filename.msh.json` into `imported.sig`.

### `sourmash signature export`

Export signatures from sourmash format. Currently only supports
mash dump format.

For example,
```
sourmash signature export filename.sig -o filename.sig.msh.json
```

### `sourmash signature overlap`

Display a detailed comparison of two signatures. This computes the
Jaccard similarity (as in `sourmash compare` or `sourmash search`) and
the Jaccard containment in both directions (as with `--containment`).
It also displays the number of hash values in the union and
intersection of the two signatures, as well as the number of disjoint
hash values in each signature.

This command has two uses - first, it is helpful for understanding how
similarity and containment are calculated, and second, it is useful for
analyzing signatures with very small overlaps, where the similarity
and/or containment might be very close to zero.

For example,
```
sourmash signature overlap file1.sig file2.sig
```
will display the detailed comparison of `file1.sig` and `file2.sig`.
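
The quantities listed above follow from basic set arithmetic. As a sketch, assuming each signature is reduced to a plain `set` of hash values (the function name `overlap_report` and the dictionary keys are hypothetical, and this is not the code behind `sourmash signature overlap`):

```python
def overlap_report(a, b):
    """Compute similarity, containment, and hash counts for two hash sets."""
    a, b = set(a), set(b)
    shared = a & b
    union = a | b
    return {
        # Jaccard similarity: shared hashes over all distinct hashes
        "similarity": len(shared) / len(union) if union else 0.0,
        # Containment: fraction of one signature found in the other
        "containment_a_in_b": len(shared) / len(a) if a else 0.0,
        "containment_b_in_a": len(shared) / len(b) if b else 0.0,
        "intersection": len(shared),
        "union": len(union),
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
    }


report = overlap_report({1, 2, 3, 4}, {3, 4, 5})
```

Here the two sets share 2 of 5 distinct hashes, so the similarity is 2/5, while the containment of the first set in the second is 2/4 and of the second in the first is 2/3.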
10 changes: 7 additions & 3 deletions sourmash/__main__.py
@@ -11,6 +11,7 @@
gather, index, sbt_combine, search,
plot, watch, info, storage, migrate, multigather)
from .lca import main as lca_main
from .sig import main as sig_main

usage='''
sourmash <command> [<args>]
@@ -35,9 +36,10 @@
categorize Identify best matches for many signatures using an SBT.
watch Classify a stream of sequences.

** Other information:
** Other commands:

info Sourmash version and other information.
info Display sourmash version and other information.
signature Sourmash signature manipulation utilities.

Use '-h' to get subcommand-specific help, e.g.

@@ -59,7 +61,9 @@ def main():
'storage': storage,
'lca': lca_main,
'migrate': migrate,
'multigather': multigather}
'multigather': multigather,
'sig': sig_main,
'signature': sig_main}
parser = argparse.ArgumentParser(
description='work with compressed sequence representations')
parser.add_argument('command', nargs='?')
2 changes: 1 addition & 1 deletion sourmash/commands.py
@@ -1006,7 +1006,7 @@ def gather(args):
sourmash_args.add_moltype_args(parser)

args = parser.parse_args(args)
set_quiet(args.quiet)
set_quiet(args.quiet, args.debug)
moltype = sourmash_args.calculate_moltype(args)

# load the query signature & figure out all the things
12 changes: 7 additions & 5 deletions sourmash/lca/command_classify.py
@@ -9,9 +9,9 @@
from collections import Counter

from .. import sourmash_args, load_signatures
from ..logging import notify, error
from ..logging import notify, error, debug, set_quiet
from . import lca_utils
from .lca_utils import debug, set_debug, check_files_exist
from .lca_utils import check_files_exist

DEFAULT_THRESHOLD=5 # how many counts of a taxid at min

@@ -88,7 +88,10 @@ def classify(args):
p.add_argument('--scaled', type=float)
p.add_argument('--traverse-directory', action='store_true',
help='load all signatures underneath directories.')
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-q', '--quiet', action='store_true',
help='suppress non-error output')
p.add_argument('-d', '--debug', action='store_true',
help='output debugging output')
args = p.parse_args(args)

if not args.db:
@@ -99,8 +102,7 @@ def classify(args):
error('Error! must specify at least one query signature with --query')
sys.exit(-1)

if args.debug:
set_debug(args.debug)
set_quiet(args.quiet, args.debug)

# flatten --db and --query
args.db = [item for sublist in args.db for item in sublist]
12 changes: 7 additions & 5 deletions sourmash/lca/command_compare_csv.py
@@ -8,17 +8,20 @@
from collections import defaultdict

from .. import sourmash_args
from ..logging import notify, error, print_results
from ..logging import notify, error, print_results, set_quiet
from . import lca_utils
from .lca_utils import debug, set_debug, zip_lineage
from .lca_utils import zip_lineage
from .command_index import load_taxonomy_assignments


def compare_csv(args):
p = argparse.ArgumentParser(prog="sourmash lca compare_csv")
p.add_argument('csv1', help='taxonomy spreadsheet output by classify')
p.add_argument('csv2', help='custom taxonomy spreadsheet')
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-q', '--quiet', action='store_true',
help='suppress non-error output')
p.add_argument('-d', '--debug', action='store_true',
help='output debugging output')
p.add_argument('-C', '--start-column', default=2, type=int,
help='column at which taxonomic assignments start')
p.add_argument('--tabs', action='store_true',
@@ -32,8 +35,7 @@ def compare_csv(args):
error('error, --start-column cannot be less than 2')
sys.exit(-1)

if args.debug:
set_debug(args.debug)
set_quiet(args.quiet, args.debug)

# first, load classify-style spreadsheet
notify('loading classify output from: {}', args.csv1)
12 changes: 7 additions & 5 deletions sourmash/lca/command_gather.py
@@ -11,9 +11,9 @@
from collections import Counter, defaultdict, namedtuple

from .. import sourmash_args, save_signatures, SourmashSignature
from ..logging import notify, error, print_results
from ..logging import notify, error, print_results, set_quiet, debug
from . import lca_utils
from .lca_utils import debug, set_debug, check_files_exist
from .lca_utils import check_files_exist
from ..search import format_bp

LCAGatherResult = namedtuple('LCAGatherResult',
@@ -186,17 +186,19 @@ def gather_main(args):
p = argparse.ArgumentParser(prog="sourmash lca gather")
p.add_argument('query')
p.add_argument('db', nargs='+')
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-o', '--output', type=argparse.FileType('wt'),
help='output CSV containing matches to this file')
p.add_argument('--output-unassigned', type=argparse.FileType('wt'),
help='output unassigned portions of the query as a signature to this file')
p.add_argument('--ignore-abundance', action='store_true',
help='do NOT use k-mer abundances if present')
p.add_argument('-q', '--quiet', action='store_true',
help='suppress non-error output')
p.add_argument('-d', '--debug', action='store_true',
help='output debugging output')
args = p.parse_args(args)

if args.debug:
set_debug(args.debug)
set_quiet(args.quiet, args.debug)

if not check_files_exist(args.query, *args.db):
sys.exit(-1)