[MRG] implement sourmash sketch fromfile #1885

Merged on Apr 1, 2022 (155 commits)
cc0db3a
add -d/--debug to various commands
ctb Jan 4, 2022
1fbaaa0
upgrade ComputerParameters with __repr__ and __eq__
ctb Mar 13, 2022
931e499
cleanup and refactor
ctb Mar 13, 2022
ce37811
add tests for new behavior
ctb Mar 13, 2022
465e4c3
finish tests
ctb Mar 13, 2022
8663509
initial 'sourmash sketch fromfile' implementation
ctb Mar 13, 2022
f803f74
add ComputeParameters.to_param_str
ctb Mar 13, 2022
e9b91f8
add ComputeParameters.to_param_str
ctb Mar 13, 2022
9b314dd
output sourmash commands, hackity hack hack
ctb Mar 14, 2022
eb57c0f
fix spelling
ctb Mar 14, 2022
ebbe87a
Merge branch 'add/sketchfrom_support' into add/sketch_fromfile
ctb Mar 14, 2022
736fd8a
a fix, and some tests
ctb Mar 14, 2022
07637c5
add tests for to_param_str
ctb Mar 14, 2022
ef97a5c
Merge branch 'add/sketchfrom_support' into add/sketch_fromfile
ctb Mar 14, 2022
e77aa2b
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 14, 2022
4834026
add ComputeParameters.from_manifest_row
ctb Mar 14, 2022
dc2a149
switch to using ComputeParameters.from_manifest_row
ctb Mar 14, 2022
6d7b375
add ComputeParameters.from_manifest_row
ctb Mar 14, 2022
23f56f1
Merge branch 'update/more_sketch_fromfile_support' into add/sketch_fr…
ctb Mar 14, 2022
744b734
moar
ctb Mar 14, 2022
94357a8
add tests for ComputeParameters.from_manifest_row
ctb Mar 14, 2022
6bf8af7
whoops, add param string into sourmash cmd output
ctb Mar 14, 2022
00a6f12
Merge branch 'update/more_sketch_fromfile_support' into add/sketch_fr…
ctb Mar 14, 2022
edabb71
initial tests
ctb Mar 15, 2022
c9a6e4f
more tests
ctb Mar 15, 2022
c5fcd12
clean up help
ctb Mar 15, 2022
458512f
tests and more tests
ctb Mar 15, 2022
ad7f229
enable and test CSV output from fromfile
ctb Mar 15, 2022
b1bdbe3
fix multiple param strs output
ctb Mar 16, 2022
5fe66af
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 19, 2022
4cadd8f
simplify/report on missing
ctb Mar 20, 2022
1528fee
test --already done
ctb Mar 21, 2022
906ef0b
initial implementation of StandaloneManifestIndex
ctb Mar 22, 2022
72e9523
support prefix if not abspath
ctb Mar 22, 2022
0d79fb6
clean up
ctb Mar 23, 2022
b65e428
some standalone manifests tests - incl CLI
ctb Mar 23, 2022
56a31ad
iterate over internal locations instead
ctb Mar 23, 2022
1cfaab8
switch to picklist API
ctb Mar 23, 2022
da27a1b
aaaaand swap out for load_file_as_index :tada:
ctb Mar 23, 2022
a031d57
remove unnecessary spaces
ctb Mar 23, 2022
9a939a1
more tests
ctb Mar 23, 2022
47b1f40
more better prefix test
ctb Mar 23, 2022
a75815e
remove unnec space
ctb Mar 24, 2022
38c86ad
remove sourmash sketch command output
ctb Mar 24, 2022
9e8b792
add --output-manifest
ctb Mar 24, 2022
756d15e
do sketch summarization
ctb Mar 24, 2022
8a77943
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 25, 2022
e1a1975
upgrade output error messages
ctb Mar 25, 2022
da682b2
Merge branch 'add/debug' into add/manifestindex
ctb Mar 25, 2022
6104a18
fix SBT subdir loading error
ctb Mar 25, 2022
d9d3bff
add message about using --debug
ctb Mar 25, 2022
112dd3b
Merge branch 'add/debug' into add/manifestindex
ctb Mar 25, 2022
e34bd1a
Merge branch 'add/test_sbt_load_fail' into add/manifestindex
ctb Mar 25, 2022
87b72b8
doc etc
ctb Mar 25, 2022
f4546de
rationalize _signatures_with_internal
ctb Mar 25, 2022
cd9e670
test describe and fileinfo on manifests
ctb Mar 25, 2022
5147dc4
think through more manifest stuff
ctb Mar 25, 2022
50e87c2
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 25, 2022
59cfe9f
fix descr
ctb Mar 25, 2022
84b200b
rationalize _signatures_with_internal
ctb Mar 25, 2022
bdff48b
Merge branch 'refactor/mf_internal' into add/manifestindex
ctb Mar 25, 2022
3b591b8
fix docstring
ctb Mar 25, 2022
7e6caa9
add heading anchors config; fix napoleon package ref
ctb Mar 25, 2022
785e7c9
pin versions for doc building
ctb Mar 25, 2022
3e6872a
fix internal refs
ctb Mar 25, 2022
e8763b9
fix one last ref target
ctb Mar 25, 2022
6ead927
add docs
ctb Mar 25, 2022
85b2c12
clarify language
ctb Mar 25, 2022
de0b7b2
add docs
ctb Mar 25, 2022
b03ba2f
add more/better tests for lazy loading
ctb Mar 25, 2022
c1ada69
clarify
ctb Mar 25, 2022
1bd133d
a few more tests
ctb Mar 25, 2022
b9c0124
Merge branch 'add/manifestindex' into add/sketch_fromfile
ctb Mar 25, 2022
f70b210
fixes
ctb Mar 25, 2022
ab882cc
Merge branch 'fix/docs' into add/manifestindex
ctb Mar 26, 2022
c6a7e24
update docs
ctb Mar 26, 2022
38ece63
cleanup and comment on index code
ctb Mar 26, 2022
2924ede
minor improvements to Index tests
ctb Mar 26, 2022
cc1598e
add explicit test for lazy-loading prefetch on StandaloneManifestIndex
ctb Mar 26, 2022
38593b6
add explicit test for lazy-loading prefetch on StandaloneManifestIndex
ctb Mar 26, 2022
8899d2c
Merge branch 'fix/docs' into add/sketch_fromfile
ctb Mar 26, 2022
403795d
Merge branch 'add/manifestindex' into add/sketch_fromfile
ctb Mar 26, 2022
646ba4b
Merge branch 'cleanup/index' into add/sketch_fromfile
ctb Mar 26, 2022
2ca9340
Merge branch 'cleanup/index_2' into add/sketch_fromfile
ctb Mar 26, 2022
d699fca
start adding sig check
ctb Mar 26, 2022
1deadbd
add some more tests
ctb Mar 26, 2022
6494028
some cleanup and refactor
ctb Mar 26, 2022
896fca9
cleanup
ctb Mar 26, 2022
47f9e34
write some manifest utility methods
ctb Mar 26, 2022
c77e1d8
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 26, 2022
899cfec
updated output messages
ctb Mar 26, 2022
99a14b8
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 27, 2022
c126437
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 27, 2022
a463fd0
update comments/docstrings
ctb Mar 27, 2022
d89bc40
Merge branch 'add/manifestindex' into cleanup/index
ctb Mar 27, 2022
f2aca85
Merge branch 'cleanup/index' into cleanup/index_2
ctb Mar 27, 2022
1313e1b
Merge branch 'cleanup/index_2' into add/sketch_fromfile
ctb Mar 27, 2022
aaa822d
get non-standard column names work with sig check -o
ctb Mar 27, 2022
00d7b4a
add an exclude test
ctb Mar 27, 2022
997741a
fix comments
ctb Mar 27, 2022
480288a
test many different coltypes
ctb Mar 27, 2022
da093e3
Update doc/command-line.md
ctb Mar 28, 2022
26e919b
update comments/docstrings
ctb Mar 28, 2022
e299017
Merge branch 'add/manifestindex' into add/sketch_fromfile
ctb Mar 28, 2022
841ebf1
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 28, 2022
d1d7042
Merge branch 'cleanup/index' into cleanup/index_2
ctb Mar 28, 2022
03e3f39
Merge branch 'cleanup/index' into add/sketch_fromfile
ctb Mar 28, 2022
8bafbf7
Merge branch 'cleanup/index_2' into add/sketch_fromfile
ctb Mar 28, 2022
93d38ef
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 29, 2022
68c7edb
Apply suggestions from code review
ctb Mar 29, 2022
d6fc369
Merge branch 'cleanup/index_2' into add/sketch_fromfile
ctb Mar 29, 2022
507aee0
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 29, 2022
147ea22
command line docs for sig check
ctb Mar 29, 2022
19b6ccc
adjust command line, add usage etc
ctb Mar 29, 2022
d7eba5a
copy sig check code over
ctb Mar 29, 2022
cf654d0
Merge branch 'add/sig_check' into add/sketch_fromfile
ctb Mar 29, 2022
7de11e7
add --fail to sig check
ctb Mar 29, 2022
9f56e72
require manifests etc
ctb Mar 29, 2022
5f6b399
test no --picklist
ctb Mar 29, 2022
cfefd4c
test nosave manifest
ctb Mar 29, 2022
0d21bf4
test __iadd__ for manifests
ctb Mar 29, 2022
2faa51a
remove 'add_to_found'
ctb Mar 29, 2022
4b582ac
revert
ctb Mar 29, 2022
a90a1b4
simplify per @bluegenes
ctb Mar 30, 2022
51fe2dd
Merge branch 'add/sig_check' into add/sketch_fromfile
ctb Mar 30, 2022
fe58eb7
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 30, 2022
713bd6c
test --force-output
ctb Mar 30, 2022
83f7ef4
test --output-manifest-matching
ctb Mar 30, 2022
e82ebc1
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 30, 2022
965caad
add some missing-name tests
ctb Mar 30, 2022
2cd2be3
catch/error out on missing names.
ctb Mar 30, 2022
ae17dcf
seed, license tests
ctb Mar 31, 2022
26e765e
add some minimal how-to-docs docs
ctb Mar 31, 2022
cc81271
add docs
ctb Mar 31, 2022
f61c21f
other columns will be ignored
ctb Mar 31, 2022
5467446
adjust --check-sequence usage and add initial test
ctb Mar 31, 2022
dcda994
add tests; still failing
ctb Mar 31, 2022
a3a682d
add more tests, fix bad DNA check
ctb Mar 31, 2022
8b95b09
allow multiple CSVs to fromfile
ctb Mar 31, 2022
fac98c7
consume multiple CSVs; check for dup names
ctb Mar 31, 2022
3c67bf1
annotate places for further work/coverage
ctb Mar 31, 2022
d863a08
test --ignore-missing
ctb Mar 31, 2022
9911c8c
add test of partially done
ctb Mar 31, 2022
37c2beb
adjust comments
ctb Mar 31, 2022
ed2fa57
catch empty files/files with no data
ctb Mar 31, 2022
06a0ccb
tighten code
ctb Apr 1, 2022
93d07e1
test multiple -p
ctb Apr 1, 2022
4556155
refactor
ctb Apr 1, 2022
85b08c1
more refactor
ctb Apr 1, 2022
1398155
more refactor
ctb Apr 1, 2022
1ef1c00
warning messages about missing genomes/proteomes
ctb Apr 1, 2022
9327a12
some more tidying
ctb Apr 1, 2022
231a599
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Apr 1, 2022
14d09b4
resolve most of the review issues
ctb Apr 1, 2022
4f43d9e
add test for bad ComputeParameters
ctb Apr 1, 2022
18 changes: 18 additions & 0 deletions doc/README.md
@@ -0,0 +1,18 @@
# Documentation on the docs

We use
[MyST](https://myst-parser.readthedocs.io/en/latest/sphinx/intro.html)
to generate Sphinx doc output from Markdown input.

## Useful tips and tricks:

### Linking internally between sections in the docs

For linking within the sourmash docs, you should use the
[auto-generated header anchors](https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#auto-generated-header-anchors)
provided by MyST.

You can generate a list of these for a given document with:
```
myst-anchors -l 3 command-line.md
```
18 changes: 13 additions & 5 deletions doc/command-line.md
@@ -119,12 +119,13 @@ information for each command.

Most of the commands in sourmash work with **signatures**, which contain information about genomic or proteomic sequences. Each signature contains one or more **sketches**, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways.

To create a signature with one or more sketches, you use the `sourmash sketch` command. There are three main commands:
To create a signature with one or more sketches, you use the `sourmash sketch` command. There are four main commands:

```
sourmash sketch dna
sourmash sketch protein
sourmash sketch translate
sourmash sketch fromfile
```

The `sketch dna` command reads in **DNA sequences** and outputs **DNA sketches**.
@@ -133,10 +134,14 @@ The `sketch protein` command reads in **protein sequences** and outputs **protei

The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**.

`sourmash sketch` takes FASTA or FASTQ sequences as input; input data can be
uncompressed, compressed with gzip, or compressed with bzip2. The output
will be one or more JSON signature files that can be used with the other
sourmash commands.
The `sketch fromfile` command takes in a CSV file containing the
locations of genomes and proteomes, and outputs all of the requested
sketches. It is primarily intended for large-scale database construction.

All of the `sourmash sketch` commands take FASTA or FASTQ sequences as
input; input data can be uncompressed, compressed with gzip, or
compressed with bzip2. The output will be one or more signature files
that can be used by other sourmash commands.

Please see
[the `sourmash sketch` documentation page](sourmash-sketch.md) for
@@ -1585,6 +1590,9 @@ to stdout.

All of these save formats can be loaded by sourmash commands.

**We strongly suggest using .zip files to store signatures: they are fast,
small, and fully supported by all the sourmash commands.**

### Loading many signatures

#### Loading signatures within a directory hierarchy
5 changes: 5 additions & 0 deletions doc/developer.md
@@ -105,6 +105,11 @@ Code coverage can be viewed interactively at [codecov.io][1].
[1]: https://codecov.io/gh/sourmash-bio/sourmash/
[2]: https://github.com/sourmash-bio/sourmash/actions

## Writing docs

Please see [the docs README](README.md) for information on how we
write and build the sourmash docs.

## Code organization

There are three main components in the sourmash repo:
62 changes: 59 additions & 3 deletions doc/sourmash-sketch.md
@@ -1,5 +1,9 @@
# `sourmash sketch` documentation

```{contents} Contents
:depth: 3
```

Most of the commands in sourmash work with **signatures**, which contain information about genomic or proteomic sequences. Each signature contains one or more **sketches**, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways.

To create a signature with one or more sketches, you use the `sourmash sketch` command. There are four main commands:
@@ -8,6 +12,7 @@ To create a signature with one or more sketches, you use the `sourmash sketch` c
sourmash sketch dna
sourmash sketch protein
sourmash sketch translate
sourmash sketch fromfile
```

The `sketch dna` command reads in **DNA sequences** and outputs **DNA sketches**.
@@ -16,10 +21,14 @@ The `sketch protein` command reads in **protein sequences** and outputs **protei

The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**.

The `sketch fromfile` command takes in a CSV file containing the
locations of genomes and proteomes, and outputs all of the requested
sketches. It is primarily intended for large-scale database construction.

All `sourmash sketch` commands take FASTA or FASTQ sequences as input;
input data can be uncompressed, compressed with gzip, or compressed
with bzip2. The output will be one or more JSON signature files that
can be used with the other sourmash commands.
with bzip2. The output will be one or more signature files that
can be used by other sourmash commands.

## Quickstart

@@ -61,6 +70,53 @@ If you want to use different encodings, you can specify them in a few ways; here
sourmash sketch protein -p k=25,scaled=500,dayhoff genome.faa
```

### Translated DNA sketches for metagenomes

The command
```
sourmash sketch translate metagenome.fq
```
will take each read in the FASTQ file and translate the read into
amino acid sequence in all six possible coding frames. No attempt is
made to determine the right frame (but we are working on ways to
determine this; see [orpheum](https://github.com/czbiohub/orpheum)).
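
As a rough illustration of what "all six possible coding frames" means, the Python sketch below translates a DNA string in three forward and three reverse-complement frames. This is not how sourmash itself performs translation; the function names (`reverse_complement`, `translate`, `six_frames`) are hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G, or T is rendered as `X`.

```python
# Illustrative only: not sourmash's implementation of `sketch translate`.
BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame, dropping any trailing partial codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate offsets 0, 1, 2 of the sequence, then of its reverse complement."""
    rc = reverse_complement(dna)
    return [translate(seq[offset:]) for seq in (dna, rc) for offset in range(3)]
```

Because no frame is preferred, every read contributes six protein sequences, which is part of why predicting coding sequences first is usually the more efficient route for assembled genomes.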

We suggest using this primarily on unassembled metagenome data. For
most microbial genomes, it is both higher quality and more efficient
to first predict the coding sequences (using e.g. prodigal) and then
use `sketch protein` to build signatures.

### Bulk sketch construction from many files

The `sourmash sketch fromfile` command is intended for use when
building many signatures as part of a larger workflow. It supports a
variety of options to build new signatures, parallelize
signature construction, and otherwise aid in tracking and managing
database construction.

The command
```
sourmash sketch fromfile datasets.csv -p dna -p protein -o database.zip
```
will ingest a CSV spreadsheet containing (at a minimum) the three columns
`name`, `genome_filename`, and `protein_filename`, and build all of
the signatures requested by the parameter strings. Other columns in
this file will be ignored.

If no protein, hp, or dayhoff sketches are requested, `protein_filename`
can be empty for a given row; likewise, if no DNA sketches are requested,
`genome_filename` can be empty for a given row.
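
A minimal sketch of generating such a CSV with Python's standard `csv` module is shown below. The helper name `write_fromfile_csv` and the example paths are hypothetical; only the three column names come from the documentation above. Rejecting empty names is an assumption made here for illustration.

```python
import csv

# Column names required by `sketch fromfile`, per the docs above.
FROMFILE_COLUMNS = ["name", "genome_filename", "protein_filename"]


def write_fromfile_csv(rows, csv_path):
    """Write (name, genome_filename, protein_filename) tuples to a CSV file.

    A filename may be None or '' when the matching sketch type is not needed.
    """
    with open(csv_path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(FROMFILE_COLUMNS)
        for name, genome_filename, protein_filename in rows:
            if not name:
                # Assumption for this sketch: refuse rows without a name.
                raise ValueError("every row needs a non-empty name")
            writer.writerow([name, genome_filename or "", protein_filename or ""])
```

For example, `write_fromfile_csv([("genome A", "a.fna.gz", "a.faa.gz"), ("genome B", "b.fna.gz", None)], "datasets.csv")` would produce a file with a header row and two data rows, the second with an empty `protein_filename`.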

Some of the key command-line options supported by `fromfile` are:
* `-o/--output-signatures` will save generated signatures to any of the [standard supported output formats](command-line.md#saving-signatures-more-generally).
* `--output-csv-info` will save a CSV file of input filenames and parameter strings for use with the `sourmash sketch` command line; this can be used to construct signatures in parallel.
* `--already-done` will take a list of existing signatures/databases to check against; signatures with matching names and parameter strings will not be rebuilt.
* `--output-manifest-matching` will output a manifest of already-existing signatures, which can then be used with `sourmash sig cat` to collate signatures across databases; see [using manifests](command-line.md#using-manifests-to-explicitly-refer-to-collections-of-files). (This provides [`sourmash sig check` functionality](command-line.md#sourmash-signature-check---compare-picklists-and-manifests) in `sketch fromfile`.)
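
The `--already-done` behavior described above can be pictured as a set difference over (name, parameter string) pairs. The sketch below is an illustration under that assumption, not the actual implementation: the real command checks against existing signature collections, whereas here plain tuples stand in for them, and `plan_sketches` is a hypothetical name.

```python
def plan_sketches(requested, already_done):
    """Split requested (name, param_str) pairs into those to build and those to skip.

    `requested` is an iterable of pairs wanted by the CSV and -p options;
    `already_done` is an iterable of pairs found in existing collections.
    """
    done = set(already_done)
    to_build = [pair for pair in requested if pair not in done]
    skipped = [pair for pair in requested if pair in done]
    return to_build, skipped
```

In this picture, the skipped pairs correspond to what `--output-manifest-matching` reports, and the remaining pairs are the sketches still to be built.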

If you would like help and advice on constructing large databases, or
pointers to code for generating the `fromfile` CSV format, please ask
[on the sourmash issue tracker](https://github.com/sourmash-bio/sourmash/issues) or [gitter support channel](https://gitter.im/sourmash-bio/community).

## More detailed documentation

### Input formats
@@ -189,7 +245,7 @@ Unfortunately, changing the k-mer size or using different DNA/protein encodings

### Examining the output of `sourmash sketch`

You can use `sourmash sig describe` to get detailed information about the contents of a signature file. This can help if you want to see exactly what a particular `sourmash sketch` command does!
You can use `sourmash sig describe` to get detailed information about the contents of a signature file, and `sourmash sig fileinfo` to get a human-readable summary of the contents. This can help if you want to see exactly what a particular `sourmash sketch` command does!

### Filing issues and asking for help

1 change: 1 addition & 0 deletions src/sourmash/cli/sketch/__init__.py
@@ -10,6 +10,7 @@
from . import protein as aa
from . import protein as prot
from . import translate
from . import fromfile
from ..utils import command_list
from argparse import SUPPRESS, RawDescriptionHelpFormatter
import os
2 changes: 1 addition & 1 deletion src/sourmash/cli/sketch/dna.py
@@ -38,7 +38,7 @@ def subparser(subparsers):
)
subparser.add_argument(
'--check-sequence', action='store_true',
help='complain if input sequence is invalid (NOTE: only checks DNA)'
help='complain if input sequence is invalid DNA'
)
subparser.add_argument(
'-p', '--param-string', default=[],
78 changes: 78 additions & 0 deletions src/sourmash/cli/sketch/fromfile.py
@@ -0,0 +1,78 @@
"""create signatures from a CSV file"""

usage="""

sourmash sketch fromfile <csv file> --output-signatures <location> -p <...>

The 'sketch fromfile' command takes in a CSV file with list of names
and filenames to be used for building signatures. It is intended for
batch use, when building large collections of signatures.

One or more parameter strings must be specified with '-p'.

One or more existing collections of signatures can be provided via
'--already-done' and already-existing signatures (based on name and
sketch type) will not be recalculated or output.

If a location is provided via '--output-signatures', signatures will be saved
to that location.

Please see the 'sketch' documentation for more details:
https://sourmash.readthedocs.io/en/latest/sourmash-sketch.html
"""

import sourmash
from sourmash.logging import notify, print_results, error

from sourmash import command_sketch


def subparser(subparsers):
    subparser = subparsers.add_parser('fromfile',
                                      usage=usage)
    subparser.add_argument(
        'csvs', nargs='+',
        help="input CSVs providing 'name', 'genome_filename', and 'protein_filename'"
    )
    subparser.add_argument(
        '-p', '--param-string', default=[],
        help='signature parameters to use.', action='append',
    )
    subparser.add_argument(
        '--already-done', nargs='+', default=[],
        help='one or more collections of existing signatures to avoid recalculating'
    )
    subparser.add_argument(
        '--license', default='CC0', type=str,
        help='signature license. Currently only CC0 is supported.'
    )
    subparser.add_argument(
        '--check-sequence', action='store_true',
        help='complain if input sequence is invalid (NOTE: only checks DNA)'
    )
    file_args = subparser.add_argument_group('File handling options')
    file_args.add_argument(
        '-o', '--output-signatures',
        help='output computed signatures to this file',
    )
    file_args.add_argument(
        '--force-output-already-exists', action='store_true',
        help='overwrite/append to --output-signatures location'
    )
    file_args.add_argument(
        '--ignore-missing', action='store_true',
        help='proceed with building possible signatures, even if some input files are missing'
    )
    file_args.add_argument(
        '--output-csv-info',
        help='output information about what signatures need to be generated'
    )
    file_args.add_argument(
        '--output-manifest-matching',
        help='output a manifest file of already-existing signatures'
    )


def main(args):
    import sourmash.command_sketch
    return sourmash.command_sketch.fromfile(args)
4 changes: 0 additions & 4 deletions src/sourmash/cli/sketch/protein.py
@@ -36,10 +36,6 @@ def subparser(subparsers):
'--license', default='CC0', type=str,
help='signature license. Currently only CC0 is supported.'
)
subparser.add_argument(
'--check-sequence', action='store_true',
help='complain if input sequence is invalid'
)
subparser.add_argument(
'-p', '--param-string', default=[],
help='signature parameters to use.', action='append',
2 changes: 1 addition & 1 deletion src/sourmash/cli/sketch/translate.py
@@ -38,7 +38,7 @@ def subparser(subparsers):
)
subparser.add_argument(
'--check-sequence', action='store_true',
help='complain if input sequence is invalid'
help='complain if input sequence is invalid DNA'
)
subparser.add_argument(
'-p', '--param-string', default=[],
36 changes: 28 additions & 8 deletions src/sourmash/command_compute.py
@@ -14,6 +14,7 @@
from ._lowlevel import ffi, lib

DEFAULT_COMPUTE_K = '21,31,51'
DEFAULT_MMHASH_SEED = 42
DEFAULT_LINE_COUNT = 1500


@@ -197,8 +198,13 @@ def _compute_individual(args, signatures_factory):
if args.singleton:
for n, record in enumerate(screed_iter):
sigs = signatures_factory()
add_seq(sigs, record.sequence,
args.input_is_protein, args.check_sequence)
try:
add_seq(sigs, record.sequence,
args.input_is_protein, args.check_sequence)
except ValueError as exc:
error(f"ERROR when reading from '{filename}' - ")
error(str(exc))
sys.exit(-1)

set_sig_name(sigs, filename, name=record.name)
save_sigs_to_location(sigs, save_sigs)
@@ -211,7 +217,7 @@ def _compute_individual(args, signatures_factory):
sigs = signatures_factory()

# consume & calculate signatures
notify('... reading sequences from {}', filename)
notify(f'... reading sequences from {filename}')
name = None
for n, record in enumerate(screed_iter):
if n % 10000 == 0:
@@ -220,8 +226,13 @@ def _compute_individual(args, signatures_factory):
elif args.name_from_first:
name = record.name

add_seq(sigs, record.sequence,
args.input_is_protein, args.check_sequence)
try:
add_seq(sigs, record.sequence,
args.input_is_protein, args.check_sequence)
except ValueError as exc:
error(f"ERROR when reading from '{filename}' - ")
error(str(exc))
sys.exit(-1)

notify('...{} {} sequences', filename, n, end='')

@@ -348,12 +359,11 @@ def from_manifest_row(cls, row):
else:
ksize = row['ksize'] * 3

p = cls([ksize], 42, is_protein, is_dayhoff, is_hp, is_dna,
p = cls([ksize], DEFAULT_MMHASH_SEED, is_protein, is_dayhoff, is_hp, is_dna,
row['num'], row['with_abundance'], row['scaled'])

return p


def to_param_str(self):
"Convert object to equivalent params str."
pi = []
@@ -388,7 +398,7 @@ def to_param_str(self):
pi.append("abund")
# noabund is default

if self.seed != 42:
if self.seed != DEFAULT_MMHASH_SEED:
pi.append(f"seed={self.seed}")
# self.seed
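
The hunks above follow one pattern: a value is written into the parameter string only when it differs from its default. A standalone sketch of that pattern follows. The function name `sketch_param_str`, its arguments, and the ordering of fields are assumptions made for illustration, since the full body of `to_param_str` is not shown in this diff.

```python
DEFAULT_MMHASH_SEED = 42  # same value as the constant introduced in this diff


def sketch_param_str(moltype, ksize, scaled, abund=False, seed=DEFAULT_MMHASH_SEED):
    """Build a comma-separated parameter string, omitting default values."""
    parts = [moltype, f"k={ksize}", f"scaled={scaled}"]
    if abund:
        # noabund is the default, so only 'abund' is ever written out
        parts.append("abund")
    if seed != DEFAULT_MMHASH_SEED:
        # the seed appears only when it is not the default
        parts.append(f"seed={seed}")
    return ",".join(parts)
```

Naming the default once, as the diff does with `DEFAULT_MMHASH_SEED`, keeps the constructor call in `from_manifest_row` and the comparison in `to_param_str` from drifting apart.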

Expand Down Expand Up @@ -474,6 +484,16 @@ def dna(self):
def dna(self, v):
return self._methodcall(lib.computeparams_set_dna, v)

@property
def moltype(self):
if self.dna: moltype = 'DNA'
elif self.protein: moltype = 'protein'
elif self.hp: moltype = 'hp'
elif self.dayhoff: moltype = 'dayhoff'
else: assert 0

return moltype

@property
def num_hashes(self):
return self._methodcall(lib.computeparams_num_hashes)
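
The `moltype` property added above resolves four boolean flags in a fixed order and uses `assert 0` as its fallback. As a hedged alternative, the sketch below shows the same precedence on a hypothetical stand-in class (`MoltypeFlags` is not part of sourmash) and raises `ValueError` instead, since `assert` statements are skipped when Python runs with `-O`.

```python
from dataclasses import dataclass


@dataclass
class MoltypeFlags:
    """Hypothetical stand-in for the four molecule-type flags on ComputeParameters."""
    dna: bool = False
    protein: bool = False
    hp: bool = False
    dayhoff: bool = False

    @property
    def moltype(self):
        # Same precedence as the diff: DNA first, then protein, hp, dayhoff.
        for flag, label in (("dna", "DNA"), ("protein", "protein"),
                            ("hp", "hp"), ("dayhoff", "dayhoff")):
            if getattr(self, flag):
                return label
        # Unlike `assert 0`, this still fires under `python -O`.
        raise ValueError("no molecule type is set")
```

Whether an explicit exception is preferable here depends on whether an unset molecule type can actually occur in practice; the diff treats it as an internal invariant.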