diff --git a/doc/README.md b/doc/README.md new file mode 100644 index 0000000000..f53f3c1900 --- /dev/null +++ b/doc/README.md @@ -0,0 +1,18 @@ +# Documentation on the docs + +We use +[MyST](https://myst-parser.readthedocs.io/en/latest/sphinx/intro.html) +to generate Sphinx doc output from Markdown input. + +## Useful tips and tricks: + +### Linking internally between sections in the docs + +For linking within the sourmash docs, you should use the +[auto-generated header anchors](https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#auto-generated-header-anchors) +provided by MyST. + +You can generate a list of these for a given document with: +``` +myst-anchors -l 3 command-line.md +``` diff --git a/doc/command-line.md b/doc/command-line.md index b2ce8fb8fd..88cae603cf 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -119,12 +119,13 @@ information for each command. Most of the commands in sourmash work with **signatures**, which contain information about genomic or proteomic sequences. Each signature contains one or more **sketches**, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways. -To create a signature with one or more sketches, you use the `sourmash sketch` command. There are three main commands: +To create a signature with one or more sketches, you use the `sourmash sketch` command. There are four main commands: ``` sourmash sketch dna sourmash sketch protein sourmash sketch translate +sourmash sketch fromfile ``` The `sketch dna` command reads in **DNA sequences** and outputs **DNA sketches**. @@ -133,10 +134,14 @@ The `sketch protein` command reads in **protein sequences** and outputs **protei The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**. 
-`sourmash sketch` takes FASTA or FASTQ sequences as input; input data can be -uncompressed, compressed with gzip, or compressed with bzip2. The output -will be one or more JSON signature files that can be used with the other -sourmash commands. +The `sketch fromfile` command takes in a CSV file containing the +locations of genomes and proteomes, and outputs all of the requested +sketches. It is primarily intended for large-scale database construction. + +All of the `sourmash sketch` commands take FASTA or FASTQ sequences as +input; input data can be uncompressed, compressed with gzip, or +compressed with bzip2. The output will be one or more signature files +that can be used by other sourmash commands. Please see [the `sourmash sketch` documentation page](sourmash-sketch.md) for @@ -1585,6 +1590,9 @@ to stdout. All of these save formats can be loaded by sourmash commands. +**We strongly suggest using .zip files to store signatures: they are fast, +small, and fully supported by all the sourmash commands.** + ### Loading many signatures #### Loading signatures within a directory hierarchy diff --git a/doc/developer.md b/doc/developer.md index f21ee4b03d..1e4d4dbbeb 100644 --- a/doc/developer.md +++ b/doc/developer.md @@ -105,6 +105,11 @@ Code coverage can be viewed interactively at [codecov.io][1]. [1]: https://codecov.io/gh/sourmash-bio/sourmash/ [2]: https://github.com/sourmash-bio/sourmash/actions +## Writing docs + +Please see [the docs README](README.md) for information on how we +write and build the sourmash docs. + ## Code organization There are three main components in the sourmash repo: diff --git a/doc/sourmash-sketch.md b/doc/sourmash-sketch.md index 37264ec673..1dd380d76e 100644 --- a/doc/sourmash-sketch.md +++ b/doc/sourmash-sketch.md @@ -1,5 +1,9 @@ # `sourmash sketch` documentation +```{contents} Contents +:depth: 3 +``` + Most of the commands in sourmash work with **signatures**, which contain information about genomic or proteomic sequences.
Each signature contains one or more **sketches**, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways. To create a signature with one or more sketches, you use the `sourmash sketch` command. There are three main commands: @@ -8,6 +12,7 @@ To create a signature with one or more sketches, you use the `sourmash sketch` c sourmash sketch dna sourmash sketch protein sourmash sketch translate +sourmash sketch fromfile ``` The `sketch dna` command reads in **DNA sequences** and outputs **DNA sketches**. @@ -16,10 +21,14 @@ The `sketch protein` command reads in **protein sequences** and outputs **protei The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**. +The `sketch fromfile` command takes in a CSV file containing the +locations of genomes and proteomes, and outputs all of the requested +sketches. It is primarily intended for large-scale database construction. + All `sourmash sketch` commands take FASTA or FASTQ sequences as input; input data can be uncompressed, compressed with gzip, or compressed -with bzip2. The output will be one or more JSON signature files that -can be used with the other sourmash commands. +with bzip2. The output will be one or more signature files that +can be used by other sourmash commands. ## Quickstart @@ -61,6 +70,53 @@ If you want to use different encodings, you can specify them in a few ways; here sourmash sketch protein -p k=25,scaled=500,dayhoff genome.faa ``` +### Translated DNA sketches for metagenomes + +The command +``` +sourmash sketch translate metagenome.fq +``` +will take each read in the FASTQ file and translate the read into +amino acid sequence in all six possible coding frames. No attempt is +made to determine the right frame (but we are working on ways to +determine this; see [orpheum](https://github.com/czbiohub/orpheum)). 
+ +We suggest using this primarily on unassembled metagenome data. For +most microbial genomes, it is both higher quality and more efficient +to first predict the coding sequences (using e.g. prodigal) and then +use `sketch protein` to build signatures. + +### Bulk sketch construction from many files + +The `sourmash sketch fromfile` command is intended for use when +building many signatures as part of a larger workflow. It supports a +variety of options to build new signatures, parallelize +signature construction, and otherwise aid in tracking and managing +database construction. + +The command +``` +sourmash sketch fromfile datasets.csv -p dna -p protein -o database.zip +``` +will ingest a CSV spreadsheet containing (at a minimum) the three columns +`name`, `genome_filename`, and `protein_filename`, and build all of +the signatures requested by the parameter strings. Other columns in +this file will be ignored. + +If no protein, hp, or dayhoff sketches are requested, `protein_filename` +can be empty for a given row; likewise, if no DNA sketches are requested, +`genome_filename` can be empty for a given row. + +Some of the key command-line options supported by `fromfile` are: +* `-o/--output-signatures` will save generated signatures to any of the [standard supported output formats](command-line.md#saving-signatures-more-generally). +* `--output-csv-info` will save a CSV file of input filenames and parameter strings for use with the `sourmash sketch` command line; this can be used to construct signatures in parallel. +* `--already-done` will take a list of existing signatures/databases to check against; signatures with matching names and parameter strings will not be rebuilt. +* `--output-manifest-matching` will output a manifest of already-existing signatures, which can then be used with `sourmash sig cat` to collate signatures across databases; see [using manifests](command-line.md#using-manifests-to-explicitly-refer-to-collections-of-files).
(This provides [`sourmash sig check` functionality](command-line.md#sourmash-signature-check---compare-picklists-and-manifests) in `sketch fromfile`.) + +If you would like help and advice on constructing large databases, or +pointers to code for generating the `fromfile` CSV format, please ask +[on the sourmash issue tracker](https://github.com/sourmash-bio/sourmash/issues) or [gitter support channel](https://gitter.im/sourmash-bio/community). + ## More detailed documentation ### Input formats @@ -189,7 +245,7 @@ Unfortunately, changing the k-mer size or using different DNA/protein encodings ### Examining the output of `sourmash sketch` -You can use `sourmash sig describe` to get detailed information about the contents of a signature file. This can help if you want to see exactly what a particular `sourmash sketch` command does! +You can use `sourmash sig describe` to get detailed information about the contents of a signature file, and `sourmash sig fileinfo` to get a human-readable summary of the contents. This can help if you want to see exactly what a particular `sourmash sketch` command does! ### Filing issues and asking for help diff --git a/src/sourmash/cli/sketch/__init__.py b/src/sourmash/cli/sketch/__init__.py index 81808fc327..22abf26ed1 100644 --- a/src/sourmash/cli/sketch/__init__.py +++ b/src/sourmash/cli/sketch/__init__.py @@ -10,6 +10,7 @@ from . import protein as aa from . import protein as prot from . import translate +from . 
import fromfile from ..utils import command_list from argparse import SUPPRESS, RawDescriptionHelpFormatter import os diff --git a/src/sourmash/cli/sketch/dna.py b/src/sourmash/cli/sketch/dna.py index ea4f45358f..1d82f9df65 100644 --- a/src/sourmash/cli/sketch/dna.py +++ b/src/sourmash/cli/sketch/dna.py @@ -38,7 +38,7 @@ def subparser(subparsers): ) subparser.add_argument( '--check-sequence', action='store_true', - help='complain if input sequence is invalid (NOTE: only checks DNA)' + help='complain if input sequence is invalid DNA' ) subparser.add_argument( '-p', '--param-string', default=[], diff --git a/src/sourmash/cli/sketch/fromfile.py b/src/sourmash/cli/sketch/fromfile.py new file mode 100644 index 0000000000..84291b2931 --- /dev/null +++ b/src/sourmash/cli/sketch/fromfile.py @@ -0,0 +1,78 @@ +"""create signatures from a CSV file""" + +usage=""" + + sourmash sketch fromfile --output-signatures -p <...> + +The 'sketch fromfile' command takes in a CSV file with list of names +and filenames to be used for building signatures. It is intended for +batch use, when building large collections of signatures. + +One or more parameter strings must be specified with '-p'. + +One or more existing collections of signatures can be provided via +'--already-done' and already-existing signatures (based on name and +sketch type) will not be recalculated or output. + +If a location is provided via '--output-signatures', signatures will be saved +to that location. 
+ +Please see the 'sketch' documentation for more details: + https://sourmash.readthedocs.io/en/latest/sourmash-sketch.html +""" + +import sourmash +from sourmash.logging import notify, print_results, error + +from sourmash import command_sketch + + +def subparser(subparsers): + subparser = subparsers.add_parser('fromfile', + usage=usage) + subparser.add_argument( + 'csvs', nargs='+', + help="input CSVs providing 'name', 'genome_filename', and 'protein_filename'" + ) + subparser.add_argument( + '-p', '--param-string', default=[], + help='signature parameters to use.', action='append', + ) + subparser.add_argument( + '--already-done', nargs='+', default=[], + help='one or more collections of existing signatures to avoid recalculating' + ) + subparser.add_argument( + '--license', default='CC0', type=str, + help='signature license. Currently only CC0 is supported.' + ) + subparser.add_argument( + '--check-sequence', action='store_true', + help='complain if input sequence is invalid (NOTE: only checks DNA)' + ) + file_args = subparser.add_argument_group('File handling options') + file_args.add_argument( + '-o', '--output-signatures', + help='output computed signatures to this file', + ) + file_args.add_argument( + '--force-output-already-exists', action='store_true', + help='overwrite/append to --output-signatures location' + ) + file_args.add_argument( + '--ignore-missing', action='store_true', + help='proceed with building possible signatures, even if some input files are missing' + ) + file_args.add_argument( + '--output-csv-info', + help='output information about what signatures need to be generated' + ) + file_args.add_argument( + '--output-manifest-matching', + help='output a manifest file of already-existing signatures' + ) + + +def main(args): + import sourmash.command_sketch + return sourmash.command_sketch.fromfile(args) diff --git a/src/sourmash/cli/sketch/protein.py b/src/sourmash/cli/sketch/protein.py index edc199b83c..24324ea905 100644 --- 
a/src/sourmash/cli/sketch/protein.py +++ b/src/sourmash/cli/sketch/protein.py @@ -36,10 +36,6 @@ def subparser(subparsers): '--license', default='CC0', type=str, help='signature license. Currently only CC0 is supported.' ) - subparser.add_argument( - '--check-sequence', action='store_true', - help='complain if input sequence is invalid' - ) subparser.add_argument( '-p', '--param-string', default=[], help='signature parameters to use.', action='append', diff --git a/src/sourmash/cli/sketch/translate.py b/src/sourmash/cli/sketch/translate.py index 79356bd5a0..df48d4818a 100644 --- a/src/sourmash/cli/sketch/translate.py +++ b/src/sourmash/cli/sketch/translate.py @@ -38,7 +38,7 @@ def subparser(subparsers): ) subparser.add_argument( '--check-sequence', action='store_true', - help='complain if input sequence is invalid' + help='complain if input sequence is invalid DNA' ) subparser.add_argument( '-p', '--param-string', default=[], diff --git a/src/sourmash/command_compute.py b/src/sourmash/command_compute.py index 9e1aa323c9..be0d87db00 100644 --- a/src/sourmash/command_compute.py +++ b/src/sourmash/command_compute.py @@ -14,6 +14,7 @@ from ._lowlevel import ffi, lib DEFAULT_COMPUTE_K = '21,31,51' +DEFAULT_MMHASH_SEED = 42 DEFAULT_LINE_COUNT = 1500 @@ -197,8 +198,13 @@ def _compute_individual(args, signatures_factory): if args.singleton: for n, record in enumerate(screed_iter): sigs = signatures_factory() - add_seq(sigs, record.sequence, - args.input_is_protein, args.check_sequence) + try: + add_seq(sigs, record.sequence, + args.input_is_protein, args.check_sequence) + except ValueError as exc: + error(f"ERROR when reading from '{filename}' - ") + error(str(exc)) + sys.exit(-1) set_sig_name(sigs, filename, name=record.name) save_sigs_to_location(sigs, save_sigs) @@ -211,7 +217,7 @@ def _compute_individual(args, signatures_factory): sigs = signatures_factory() # consume & calculate signatures - notify('... reading sequences from {}', filename) + notify(f'... 
reading sequences from {filename}') name = None for n, record in enumerate(screed_iter): if n % 10000 == 0: @@ -220,8 +226,13 @@ def _compute_individual(args, signatures_factory): elif args.name_from_first: name = record.name - add_seq(sigs, record.sequence, - args.input_is_protein, args.check_sequence) + try: + add_seq(sigs, record.sequence, + args.input_is_protein, args.check_sequence) + except ValueError as exc: + error(f"ERROR when reading from '{filename}' - ") + error(str(exc)) + sys.exit(-1) notify('...{} {} sequences', filename, n, end='') @@ -348,12 +359,11 @@ def from_manifest_row(cls, row): else: ksize = row['ksize'] * 3 - p = cls([ksize], 42, is_protein, is_dayhoff, is_hp, is_dna, + p = cls([ksize], DEFAULT_MMHASH_SEED, is_protein, is_dayhoff, is_hp, is_dna, row['num'], row['with_abundance'], row['scaled']) return p - def to_param_str(self): "Convert object to equivalent params str." pi = [] @@ -388,7 +398,7 @@ def to_param_str(self): pi.append("abund") # noabund is default - if self.seed != 42: + if self.seed != DEFAULT_MMHASH_SEED: pi.append(f"seed={self.seed}") # self.seed @@ -474,6 +484,16 @@ def dna(self): def dna(self, v): return self._methodcall(lib.computeparams_set_dna, v) + @property + def moltype(self): + if self.dna: moltype = 'DNA' + elif self.protein: moltype = 'protein' + elif self.hp: moltype = 'hp' + elif self.dayhoff: moltype = 'dayhoff' + else: assert 0 + + return moltype + @property def num_hashes(self): return self._methodcall(lib.computeparams_num_hashes) diff --git a/src/sourmash/command_sketch.py b/src/sourmash/command_sketch.py index 753f7cfaab..dd02dcb8e2 100644 --- a/src/sourmash/command_sketch.py +++ b/src/sourmash/command_sketch.py @@ -2,12 +2,23 @@ Functions implementing the 'sketch' subcommands and related functions. 
""" import sys +import os +from collections import defaultdict, Counter +import csv +import shlex +import screed + +import sourmash from .signature import SourmashSignature -from .logging import notify, error, set_quiet +from .logging import notify, error, set_quiet, print_results from .command_compute import (_compute_individual, _compute_merged, - ComputeParameters) + ComputeParameters, add_seq, set_sig_name, + DEFAULT_MMHASH_SEED) +from sourmash import sourmash_args from sourmash.sourmash_args import check_scaled_bounds, check_num_bounds +from sourmash.sig.__main__ import _summarize_manifest, _SketchInfo +from sourmash.manifest import CollectionManifest DEFAULTS = dict( dna='k=31,scaled=1000,noabund', @@ -114,7 +125,7 @@ def get_compute_params(self, *, split_ksizes=False): for moltype, params_d in self.params_list: # get defaults for this moltype from self.defaults: default_params = self.defaults[moltype] - def_seed = default_params.get('seed', 42) + def_seed = default_params.get('seed', DEFAULT_MMHASH_SEED) def_num = default_params.get('num', 0) def_abund = default_params['track_abundance'] def_scaled = default_params.get('scaled', 0) @@ -234,6 +245,7 @@ def protein(args): """ # for protein: args.input_is_protein = True + args.check_sequence = False # provide good defaults for dayhoff/hp/protein! if args.dayhoff and args.hp: @@ -283,3 +295,303 @@ def translate(args): _add_from_file_to_filenames(args) _execute_sketch(args, signatures_factory) + + +def _compute_sigs(to_build, output, *, check_sequence=False): + "actually build the signatures in 'to_build' and output them to 'output'" + save_sigs = sourmash_args.SaveSignaturesToLocation(output) + save_sigs.open() + + for (name, filename), param_objs in to_build.items(): + assert param_objs + + # now, set up to iterate over sequences. 
+ with screed.open(filename) as screed_iter: + if not screed_iter: + error(f"ERROR: no sequences found in '{filename}'?!") + sys.exit(-1) + + # build the set of empty sigs + sigs = [] + + is_dna = param_objs[0].dna + for p in param_objs: + if p.dna: assert is_dna + sig = SourmashSignature.from_params(p) + sigs.append(sig) + + input_is_protein = not is_dna + + # read sequence records & sketch + notify(f'... reading sequences from {filename}') + for n, record in enumerate(screed_iter): + if n % 10000 == 0: + if n: + notify('\r...{} {}', filename, n, end='') + + try: + add_seq(sigs, record.sequence, input_is_protein, + check_sequence) + except ValueError as exc: + error(f"ERROR when reading from '{filename}' - ") + error(str(exc)) + sys.exit(-1) + + notify('...{} {} sequences', filename, n, end='') + + set_sig_name(sigs, filename, name) + for sig in sigs: + save_sigs.add(sig) + + notify(f'calculated {len(sigs)} signatures for {n+1} sequences in {filename}') + + + save_sigs.close() + notify(f"saved {len(save_sigs)} signature(s) to '{save_sigs.location}'. Note: signature license is CC0.") + + +def _output_csv_info(filename, sigs_to_build): + "output information about what signatures to build, in CSV format" + output_n = 0 + with sourmash_args.FileOutputCSV(filename) as csv_fp: + w = csv.DictWriter(csv_fp, fieldnames=['filename', 'sketchtype', + 'output_index', 'name', + 'param_strs']) + w.writeheader() + + output_n = 0 + for (name, filename), param_objs in sigs_to_build.items(): + param_strs = [] + + # should all be the same! 
+ if param_objs[0].dna: + assert all( ( p.dna for p in param_objs ) ) + sketchtype = "dna" + else: + assert not any( ( p.dna for p in param_objs ) ) + sketchtype = "protein" + + for p in param_objs: + param_strs.append(p.to_param_str()) + + row = dict(filename=filename, sketchtype=sketchtype, + param_strs="-p " + " -p ".join(param_strs), + name=name, output_index=output_n) + + w.writerow(row) + + output_n += 1 + + +def fromfile(args): + if args.license != 'CC0': + error('error: sourmash only supports CC0-licensed signatures. sorry!') + sys.exit(-1) + + if args.output_signatures and os.path.exists(args.output_signatures): + if not args.force_output_already_exists: + error(f"** ERROR: output location '{args.output_signatures}' already exists!") + error(f"** Not overwriting/appending.") + error(f"** Use --force-output-already-exists if you want to overwrite/append.") + sys.exit(-1) + + # now, create the set of desired sketch specs. + try: + # omit a default moltype - must be provided in param string. + sig_factory = _signatures_for_sketch_factory(args.param_string, None) + except ValueError as e: + error(f"Error creating signatures: {str(e)}") + sys.exit(-1) + + # take the signatures factory => convert into a bunch of ComputeParameters + # objects. + build_params = list(sig_factory.get_compute_params(split_ksizes=True)) + + # confirm that they do not adjust seed, which is not supported in + # 'fromfile' b/c we don't store that info in manifests. (see #1849) + for p in build_params: + if p.seed != DEFAULT_MMHASH_SEED: + error("** ERROR: cannot set 'seed' in 'sketch fromfile'") + sys.exit(-1) + + # cross-product all of the names in the input CSV file + # with the sketch spec(s) provided on the command line. 
+ + to_build = defaultdict(list) + all_names = {} + total_rows = 0 + skipped_sigs = 0 + n_missing_name = 0 + n_duplicate_name = 0 + + for csvfile in args.csvs: + with open(csvfile, newline="") as fp: + r = csv.DictReader(fp) + + for row in r: + name = row['name'] + if not name: + n_missing_name += 1 + continue + + genome = row['genome_filename'] + proteome = row['protein_filename'] + total_rows += 1 + + if name in all_names: + n_duplicate_name += 1 + else: + all_names[name] = (genome, proteome) + + fail_exit = False + if n_duplicate_name: + error(f"** ERROR: {n_duplicate_name} entries have duplicate 'name' records. Exiting!") + fail_exit = True + + if n_missing_name: + error(f"** ERROR: {n_missing_name} entries have blank 'name's? Exiting!") + fail_exit = True + + if fail_exit: + sys.exit(-1) + + # load manifests from '--already-done' databases => turn into + # ComputeParameters objects, indexed by name. + + already_done = defaultdict(list) + already_done_rows = [] + for filename in args.already_done: + idx = sourmash.load_file_as_index(filename) + manifest = idx.manifest + assert manifest + + # for each manifest row, + for row in manifest.rows: + name = row['name'] + if name: + # build a ComputeParameters object for later comparison + p = ComputeParameters.from_manifest_row(row) + + # add to list for this name + already_done[name].append(p) + + # matching name? check if we already have sig. if so, store! + if name in all_names: + if p in build_params: + already_done_rows.append(row) + + already_done_manifest = CollectionManifest(already_done_rows) + if args.already_done: + notify(f"Loaded {len(already_done)} pre-existing names from manifest(s)") + notify(f"collected {len(already_done_rows)} rows for already-done signatures.") + + ## now check which are already done and track only those that + ## need to be done.
+ + total_sigs = 0 + missing = defaultdict(list) + missing_count = 0 + for name, (genome, proteome) in all_names.items(): + plist = already_done.get(name, []) + + # check list of already done against build parameters + for p in build_params: + total_sigs += 1 + + # does this signature already exist? + if p not in plist: + # nope - figure out genome/proteome needed + filename = genome if p.dna else proteome + filetype = 'genome' if p.dna else 'proteome' + + if filename: + # add to build list + to_build[(name, filename)].append(p) + else: + notify(f"WARNING: fromfile entry '{name}' is missing a {filetype}") + missing[name].append(p) + missing_count += 1 + else: + skipped_sigs += 1 + + ## we now have 'to_build' which contains the things we can build, + ## and 'missing', which contains anything we cannot build. Report! + + notify(f"Read {total_rows} rows, requesting that {total_sigs} signatures be built.") + + if already_done_manifest: + info_d = _summarize_manifest(already_done_manifest) + print_results('---') + print_results("summary of already-done sketches:") + + for ski in info_d['sketch_info']: + mh_type = f"num={ski['num']}" if ski['num'] else f"scaled={ski['scaled']}" + mh_abund = ", abund" if ski['abund'] else "" + + sketch_str = f"{ski['count']} sketches with {ski['moltype']}, k={ski['ksize']}, {mh_type}{mh_abund}" + + print_results(f" {sketch_str: <50} {ski['n_hashes']} total hashes") + + print_results('---') + + if args.output_manifest_matching: + already_done_manifest.write_to_filename(args.output_manifest_matching) + notify(f"output {len(already_done_manifest)} already-done signatures to '{args.output_manifest_matching}' in manifest format.") + + if missing: + error("** ERROR: we cannot build some of the requested signatures.") + error(f"** {missing_count} total signatures (for {len(missing)} names) cannot be built.") + if args.ignore_missing: + error("** (continuing past this error because --ignore-missing was set)") + else: + sys.exit(-1) + + 
notify(f"** {total_sigs - skipped_sigs} new signatures to build from {len(to_build)} files;") + if not to_build: + notify(f"** Nothing to build. Exiting!") + sys.exit(0) + + if skipped_sigs: + notify(f"** {skipped_sigs} already exist, so skipping those.") + else: + notify(f"** we found no pre-existing signatures that match.") + + ## first, print out a summary of to_build: + + print_results('---') + print_results("summary of sketches to build:") + + counter = Counter() + build_info_d = {} + for filename, param_objs in to_build.items(): + for p in param_objs: + moltype = p.moltype + assert len(p.ksizes) == 1 + ksize = p.ksizes[0] + if not p.dna: ksize //= 3 + + ski = _SketchInfo(ksize=ksize, moltype=p.moltype, + scaled=p.scaled, num=p.num_hashes, + abund=p.track_abundance) + counter[ski] += 1 + + for ski, count in counter.items(): + mh_type = f"num={ski.num}" if ski.num else f"scaled={ski.scaled}" + mh_abund = ", abund" if ski.abund else "" + + sketch_str = f"{count} sketches with {ski.moltype}, k={ski.ksize}, {mh_type}{mh_abund}" + + print_results(f" {sketch_str: <50}") + + print_results('---') + + ## now, onward ho - do we build anything, or output stuff, or just exit? 
+ + if args.output_signatures: # actually compute + _compute_sigs(to_build, args.output_signatures, + check_sequence=args.check_sequence) + + if args.output_csv_info: # output info necessary to construct + _output_csv_info(args.output_csv_info, to_build) + + notify(f"** {total_sigs} total requested; output {total_sigs - skipped_sigs}, skipped {skipped_sigs}") diff --git a/tests/test-data/sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz b/tests/test-data/sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz new file mode 100644 index 0000000000..e052b274e2 Binary files /dev/null and b/tests/test-data/sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz differ diff --git a/tests/test-data/sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz b/tests/test-data/sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz new file mode 100644 index 0000000000..5406c2c63b Binary files /dev/null and b/tests/test-data/sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz differ diff --git a/tests/test-data/sketch_fromfile/salmonella-badseq.csv b/tests/test-data/sketch_fromfile/salmonella-badseq.csv new file mode 100644 index 0000000000..d1ffecfd2f --- /dev/null +++ b/tests/test-data/sketch_fromfile/salmonella-badseq.csv @@ -0,0 +1,2 @@ +ident,full_ident,name,genome_filename,protein_filename +GCA_903797575,GCA_903797575.1,GCA_903797575 Salmonella enterica,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz, diff --git a/tests/test-data/sketch_fromfile/salmonella-dna-protein.zip b/tests/test-data/sketch_fromfile/salmonella-dna-protein.zip new file mode 100644 index 0000000000..5fd26246a0 Binary files /dev/null and b/tests/test-data/sketch_fromfile/salmonella-dna-protein.zip differ diff --git a/tests/test-data/sketch_fromfile/salmonella-missing.csv b/tests/test-data/sketch_fromfile/salmonella-missing.csv new file mode 100644 index 0000000000..b6ef55bcec --- /dev/null +++ b/tests/test-data/sketch_fromfile/salmonella-missing.csv 
@@ -0,0 +1,2 @@ +ident,full_ident,name,genome_filename,protein_filename +GCA_903797575,GCA_903797575.1,GCA_903797575 Salmonella enterica,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz, diff --git a/tests/test-data/sketch_fromfile/salmonella-mult.csv b/tests/test-data/sketch_fromfile/salmonella-mult.csv new file mode 100644 index 0000000000..251e324a1c --- /dev/null +++ b/tests/test-data/sketch_fromfile/salmonella-mult.csv @@ -0,0 +1,3 @@ +ident,full_ident,name,genome_filename,protein_filename +GCA_903797575,GCA_903797575.1,GCA_903797575 Salmonella enterica,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz +xxGCA_903797575,xxGCA_903797575.1,xxGCA_903797575 Salmonella enterica,sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_genomic.fna.gz,sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_protein.faa.gz diff --git a/tests/test-data/sketch_fromfile/salmonella-noname.csv b/tests/test-data/sketch_fromfile/salmonella-noname.csv new file mode 100644 index 0000000000..b464244315 --- /dev/null +++ b/tests/test-data/sketch_fromfile/salmonella-noname.csv @@ -0,0 +1,2 @@ +ident,full_ident,name,genome_filename,protein_filename +GCA_903797575,GCA_903797575.1,,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz diff --git a/tests/test-data/sketch_fromfile/salmonella.csv b/tests/test-data/sketch_fromfile/salmonella.csv new file mode 100644 index 0000000000..5c1fc10508 --- /dev/null +++ b/tests/test-data/sketch_fromfile/salmonella.csv @@ -0,0 +1,2 @@ +ident,full_ident,name,genome_filename,protein_filename +GCA_903797575,GCA_903797575.1,GCA_903797575 Salmonella enterica,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz,sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz diff --git a/tests/test-data/sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_genomic.fna.gz 
b/tests/test-data/sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_genomic.fna.gz new file mode 100644 index 0000000000..e052b274e2 Binary files /dev/null and b/tests/test-data/sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_genomic.fna.gz differ diff --git a/tests/test-data/sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_protein.faa.gz b/tests/test-data/sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_protein.faa.gz new file mode 100644 index 0000000000..5406c2c63b Binary files /dev/null and b/tests/test-data/sketch_fromfile/xxGCA_903797575.1_PARATYPHIC668_protein.faa.gz differ diff --git a/tests/test_sourmash_sketch.py b/tests/test_sourmash_sketch.py index 5d78367365..9f7912ade9 100644 --- a/tests/test_sourmash_sketch.py +++ b/tests/test_sourmash_sketch.py @@ -19,6 +19,7 @@ from sourmash.command_compute import ComputeParameters from sourmash.cli.compute import subparser from sourmash.cli import SourmashParser +from sourmash import manifest from sourmash import signature from sourmash import VERSION @@ -444,6 +445,7 @@ def test_manifest_row_to_compute_parameters_1(): assert not p.protein assert not p.dayhoff assert not p.hp + assert p.moltype == 'DNA' assert p.num_hashes == 0 assert p.scaled == 1000 assert p.ksizes == [21] @@ -460,6 +462,7 @@ def test_manifest_row_to_compute_parameters_2(): p = ComputeParameters.from_manifest_row(row) assert not p.dna assert p.protein + assert p.moltype == 'protein' assert not p.dayhoff assert not p.hp assert p.num_hashes == 0 @@ -479,6 +482,7 @@ def test_manifest_row_to_compute_parameters_3(): assert not p.dna assert not p.protein assert p.dayhoff + assert p.moltype == 'dayhoff' assert not p.hp assert p.num_hashes == 0 assert p.scaled == 200 @@ -498,12 +502,20 @@ def test_manifest_row_to_compute_parameters_4(): assert not p.protein assert not p.dayhoff assert p.hp + assert p.moltype == 'hp' assert p.num_hashes == 0 assert p.scaled == 200 assert p.ksizes == [96] assert not p.track_abundance assert p.seed == 42 + +def 
test_bad_compute_parameters(): + p = ComputeParameters([31], 42, 0, 0, 0, 0, 0, True, 1000) + with pytest.raises(AssertionError): + p.moltype + + ### command line tests @@ -539,6 +551,42 @@ def test_do_sourmash_sketchdna(runtmp): assert str(sig).endswith('short.fa') +def test_do_sourmash_sketchdna_check_sequence_succeed(runtmp): + testdata1 = utils.get_test_data('short.fa') + runtmp.sourmash('sketch', 'dna', testdata1, '--check-sequence') + + sigfile = runtmp.output('short.fa.sig') + assert os.path.exists(sigfile) + + sig = next(signature.load_signatures(sigfile)) + assert str(sig).endswith('short.fa') + + +def test_do_sourmash_sketchdna_check_sequence_fail(runtmp): + testdata1 = utils.get_test_data('shewanella.faa') + + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.sourmash('sketch', 'dna', testdata1, '--check-sequence') + + err = runtmp.last_result.err + print(err) + assert "ERROR when reading from " in err + assert "invalid DNA character in input k-mer: MCGIVGAVAQRDVAEILVEGLRRLEYRGYDS" in err + + +def test_do_sourmash_sketchdna_check_sequence_fail_singleton(runtmp): + testdata1 = utils.get_test_data('shewanella.faa') + + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.sourmash('sketch', 'dna', testdata1, '--check-sequence', + '--singleton') + + err = runtmp.last_result.err + print(err) + assert "ERROR when reading from " in err + assert "invalid DNA character in input k-mer: MCGIVGAVAQRDVAEILVEGLRRLEYRGYDS" in err + + def test_do_sourmash_sketchdna_from_file(runtmp): testdata1 = utils.get_test_data('short.fa') @@ -1477,3 +1525,445 @@ def test_dayhoff_with_stop_codons(runtmp): assert cli_mh2.contained_by(cli_mh1) < 1 assert py_mh2.contained_by(cli_mh1) < 1 assert h_mh2.contained_by(h_mh1) < 1 + + +### test sourmash sketch fromfile + + +def test_fromfile_dna(runtmp): + # does it run? yes, hopefully. 
+ test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.zip')) + idx = sourmash.load_file_as_index(runtmp.output('out.zip')) + siglist = list(idx.signatures()) + + assert len(siglist) == 1 + ss = siglist[0] + assert ss.name == 'GCA_903797575 Salmonella enterica' + assert ss.minhash.moltype == 'DNA' + assert "** 1 total requested; output 1, skipped 0" in runtmp.last_result.err + + +def test_fromfile_dna_empty(runtmp): + # test what happens on empty files. + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + # zero out the file + with gzip.open(runtmp.output('sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz'), 'w') as fp: + pass + + # now what happens? + with pytest.raises(SourmashCommandFailed): + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna') + + print(runtmp.last_result.out) + err = runtmp.last_result.err + print(err) + + assert "ERROR: no sequences found in " in err + + +def test_fromfile_dna_check_sequence_succeed(runtmp): + # does it run? yes, hopefully. 
+ test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna', '--check-sequence') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.zip')) + idx = sourmash.load_file_as_index(runtmp.output('out.zip')) + siglist = list(idx.signatures()) + + assert len(siglist) == 1 + ss = siglist[0] + assert ss.name == 'GCA_903797575 Salmonella enterica' + assert ss.minhash.moltype == 'DNA' + assert "** 1 total requested; output 1, skipped 0" in runtmp.last_result.err + + +def test_fromfile_dna_check_sequence_fail(runtmp): + # does --check-sequence fail when the genome file contains non-DNA sequence? + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + with pytest.raises(SourmashCommandFailed): + runtmp.sourmash('sketch', 'fromfile', + 'sketch_fromfile/salmonella-badseq.csv', + '-o', 'out.zip', '-p', 'dna', '--check-sequence') + + print(runtmp.last_result.out) + err = runtmp.last_result.err + print(err) + + assert "ERROR when reading from " in err + assert "invalid DNA character in input k-mer: MTNILKLFSRKAGEPLDSLAVKSVRQHLSGD" in err + + +def test_fromfile_dna_and_protein(runtmp): + # does it run and produce DNA _and_ protein signatures?
+ test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna', '-p', 'protein') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.zip')) + idx = sourmash.load_file_as_index(runtmp.output('out.zip')) + siglist = list(idx.signatures()) + + assert len(siglist) == 2 + + prot_sig = [ ss for ss in siglist if ss.minhash.moltype == 'protein' ] + assert len(prot_sig) == 1 + prot_sig = prot_sig[0] + assert prot_sig.name == 'GCA_903797575 Salmonella enterica' + + dna_sig = [ ss for ss in siglist if ss.minhash.moltype == 'DNA' ] + assert len(dna_sig) == 1 + dna_sig = dna_sig[0] + assert dna_sig.name == 'GCA_903797575 Salmonella enterica' + + assert "** 2 total requested; output 2, skipped 0" in runtmp.last_result.err + + +def test_fromfile_dna_and_protein_and_hp_and_dayhoff(runtmp): + # does it run and produce DNA, protein, hp, _and_ dayhoff signatures?
+ test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna', '-p', 'dna,k=25', + '-p', 'protein', + '-p', 'hp', '-p', 'dayhoff') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.zip')) + idx = sourmash.load_file_as_index(runtmp.output('out.zip')) + siglist = list(idx.signatures()) + + assert len(siglist) == 5 + + prot_sig = [ ss for ss in siglist if ss.minhash.moltype == 'protein' ] + assert len(prot_sig) == 1 + prot_sig = prot_sig[0] + assert prot_sig.name == 'GCA_903797575 Salmonella enterica' + + prot_sig = [ ss for ss in siglist if ss.minhash.moltype == 'hp' ] + assert len(prot_sig) == 1 + prot_sig = prot_sig[0] + assert prot_sig.name == 'GCA_903797575 Salmonella enterica' + + prot_sig = [ ss for ss in siglist if ss.minhash.moltype == 'dayhoff' ] + assert len(prot_sig) == 1 + prot_sig = prot_sig[0] + assert prot_sig.name == 'GCA_903797575 Salmonella enterica' + + dna_sig = [ ss for ss in siglist if ss.minhash.moltype == 'DNA' ] + assert len(dna_sig) == 2 + dna_sig = dna_sig[0] + assert dna_sig.name == 'GCA_903797575 Salmonella enterica' + + assert "** 5 total requested; output 5, skipped 0" in runtmp.last_result.err + + +def test_fromfile_dna_and_protein_noname(runtmp): + # nothing in the name column + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + with pytest.raises(SourmashCommandFailed): + runtmp.sourmash('sketch', 'fromfile', + 'sketch_fromfile/salmonella-noname.csv', + '-o', 'out.zip', '-p', 'dna', '-p', 'protein') + + out = runtmp.last_result.out + err = runtmp.last_result.err + + print(out) + print(err) + assert "ERROR: 1 entries have blank 'name's? Exiting!" 
in err + + +def test_fromfile_dna_and_protein_dup_name(runtmp): + # duplicate names + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + with pytest.raises(SourmashCommandFailed): + runtmp.sourmash('sketch', 'fromfile', + 'sketch_fromfile/salmonella.csv', + 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna', '-p', 'protein') + + out = runtmp.last_result.out + err = runtmp.last_result.err + + print(out) + print(err) + assert "ERROR: 1 entries have duplicate 'name' records. Exiting!" in err + + +def test_fromfile_dna_and_protein_missing(runtmp): + # test what happens when missing protein. + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + with pytest.raises(SourmashCommandFailed): + runtmp.sourmash('sketch', 'fromfile', + 'sketch_fromfile/salmonella-missing.csv', + '-o', 'out.zip', '-p', 'protein') + + out = runtmp.last_result.out + err = runtmp.last_result.err + + print(out) + print(err) + + assert "WARNING: fromfile entry 'GCA_903797575 Salmonella enterica' is missing a proteome" in err + assert "** ERROR: we cannot build some of the requested signatures." in err + assert "** 1 total signatures (for 1 names) cannot be built." in err + + +def test_fromfile_dna_and_protein_missing_ignore(runtmp): + # test what happens when missing protein + --ignore-missing + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', + 'sketch_fromfile/salmonella-missing.csv', + '-o', 'out.zip', '-p', 'protein', '--ignore-missing') + + out = runtmp.last_result.out + err = runtmp.last_result.err + + print(out) + print(err) + + assert "WARNING: fromfile entry 'GCA_903797575 Salmonella enterica' is missing a proteome" in err + + assert "** ERROR: we cannot build some of the requested signatures." 
in err + assert "** 1 total signatures (for 1 names) cannot be built." in err + + assert "** (continuing past this error because --ignore-missing was set)" in err + assert "** 1 new signatures to build from 0 files;" in err + + +def test_fromfile_no_overwrite(runtmp): + # test --force-output-already-exists + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.zip')) + + # now run again; will fail since already exists + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'protein') + + err = runtmp.last_result.err + + assert "ERROR: output location 'out.zip' already exists!" in err + assert "Use --force-output-already-exists if you want to overwrite/append." 
in err + + +def test_fromfile_force_overwrite(runtmp): + # test --force-output-already-exists + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.zip')) + + # now run again, with --force + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'protein', '--force-output') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.zip')) + idx = sourmash.load_file_as_index(runtmp.output('out.zip')) + siglist = list(idx.signatures()) + + assert len(siglist) == 2 + names = list(set([ ss.name for ss in siglist ])) + assert names[0] == 'GCA_903797575 Salmonella enterica' + assert "** 1 total requested; output 1, skipped 0" in runtmp.last_result.err + + +def test_fromfile_need_params(runtmp): + # check that we need a -p + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip') + + print(str(exc)) + assert "Error creating signatures: No default moltype and none specified in param string" in str(exc) + + +def test_fromfile_seed_not_allowed(runtmp): + # check that we cannot adjust 'seed' + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna,seed=43') + print(str(exc)) + + assert "ERROR: cannot set 'seed' in 'sketch fromfile'" in str(exc) + + +def test_fromfile_license_not_allowed(runtmp): + # 
check that license is CC0 + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + with pytest.raises(SourmashCommandFailed) as exc: + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-o', 'out.zip', '-p', 'dna', + '--license', 'BSD') + + print(str(exc)) + assert 'sourmash only supports CC0-licensed signatures' in str(exc) + + +def test_fromfile_dna_and_protein_csv_output(runtmp): + # does it run and produce DNA _and_ protein signatures? + test_inp = utils.get_test_data('sketch_fromfile') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '--output-csv', 'out.csv', '-p', 'dna', '-p', 'protein') + + print(runtmp.last_result.out) + print(runtmp.last_result.err) + + assert os.path.exists(runtmp.output('out.csv')) + + with open(runtmp.output('out.csv'), newline='') as fp: + r = csv.DictReader(fp) + # filename,sketchtype,output_index,name,param_strs + + x = [] + for row in r: + x.append(row) + + x.sort(key=lambda x: x['filename']) + + assert len(x) == 2 + assert x[0]['sketchtype'] == 'dna' + assert x[0]['param_strs'] == '-p dna,k=31,scaled=1000' + assert x[0]['filename'] == 'sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz' + + assert x[1]['sketchtype'] == 'protein' + assert x[1]['param_strs'] == '-p protein,k=10,scaled=200' + assert x[1]['filename'] == 'sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz' + + # same name... + assert x[0]['name'] == x[1]['name'] == "GCA_903797575 Salmonella enterica" + # ...different output index. + assert x[1]['output_index'] != x[0]['output_index'] + + +def test_fromfile_dna_and_protein_already_exists(runtmp): + # does it properly ignore existing (--already-done) sigs? 
+ test_inp = utils.get_test_data('sketch_fromfile') + already_done = utils.get_test_data('sketch_fromfile/salmonella-dna-protein.zip') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-p', 'dna', '-p', 'protein', + '--already-done', already_done, + '--output-manifest', 'matching.csv') + + print(runtmp.last_result.out) + err = runtmp.last_result.err + print(err) + + assert 'Loaded 1 pre-existing names from manifest(s)' in err + assert 'Read 1 rows, requesting that 2 signatures be built.' in err + assert '** 0 new signatures to build from 0 files;' in err + assert '** Nothing to build. Exiting!' in err + + assert "output 2 already-done signatures to 'matching.csv' in manifest format." in err + mf = manifest.CollectionManifest.load_from_filename(runtmp.output('matching.csv')) + assert len(mf) == 2 + + +def test_fromfile_dna_and_protein_partly_already_exists(runtmp): + # does it properly ignore existing (--already-done) sigs? + test_inp = utils.get_test_data('sketch_fromfile') + already_done = utils.get_test_data('sketch_fromfile/salmonella-dna-protein.zip') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella-mult.csv', + '-p', 'dna', '-p', 'protein', + '--already-done', already_done) + + print(runtmp.last_result.out) + err = runtmp.last_result.err + print(err) + + assert 'Loaded 1 pre-existing names from manifest(s)' in err + assert 'Read 2 rows, requesting that 4 signatures be built.' in err + assert '** 2 new signatures to build from 2 files;' in err + assert "** 2 already exist, so skipping those." 
in err + assert "** 4 total requested; output 2, skipped 2" in err + + +def test_fromfile_dna_and_protein_already_exists_noname(runtmp): + # check that no name in already_exists is handled + test_inp = utils.get_test_data('sketch_fromfile') + already_done = utils.get_test_data('sketch_fromfile/salmonella-dna-protein.zip') + shutil.copytree(test_inp, runtmp.output('sketch_fromfile')) + + # run rename to get rid of names + runtmp.sourmash('sig', 'rename', already_done, '', + '-o', 'already-done.zip') + + runtmp.sourmash('sketch', 'fromfile', 'sketch_fromfile/salmonella.csv', + '-p', 'dna', '-p', 'protein', + '--already-done', 'already-done.zip') + + print(runtmp.last_result.out) + err = runtmp.last_result.err + print(err) + + assert 'Loaded 0 pre-existing names from manifest(s)' in err + assert 'Read 1 rows, requesting that 2 signatures be built.' in err + assert '** 2 new signatures to build from 2 files;' in err + assert '** 2 total requested; output 2, skipped 0' in err
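
The `sketch fromfile` tests above all start from a small CSV with the columns `ident,full_ident,name,genome_filename,protein_filename`, and two of them expect the command to refuse blank or duplicate `name` values. The following is a minimal sketch of how such an input file could be produced and pre-checked with the standard library before handing it to `sourmash sketch fromfile`. The helper names `write_fromfile_csv` and `check_fromfile_rows` are hypothetical and are not part of sourmash; the check only approximates the conditions the tests exercise and is not sourmash's own validation code.

```python
import csv
import io
from collections import Counter

# Column names copied from the test CSVs added in this diff.
FROMFILE_COLUMNS = ["ident", "full_ident", "name",
                    "genome_filename", "protein_filename"]


def write_fromfile_csv(rows, fp):
    """Hypothetical helper: write rows (dicts) in the fromfile CSV layout."""
    writer = csv.DictWriter(fp, fieldnames=FROMFILE_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def check_fromfile_rows(rows):
    """Hypothetical helper: return (n_blank_names, n_duplicated_names).

    Approximates the two error conditions the tests expect; this is an
    illustration, not sourmash's implementation.
    """
    names = [row["name"].strip() for row in rows]
    n_blank = sum(1 for name in names if not name)
    counts = Counter(name for name in names if name)
    n_duplicated = sum(1 for count in counts.values() if count > 1)
    return n_blank, n_duplicated


# One row, using the values from salmonella.csv in this diff.
row = {
    "ident": "GCA_903797575",
    "full_ident": "GCA_903797575.1",
    "name": "GCA_903797575 Salmonella enterica",
    "genome_filename":
        "sketch_fromfile/GCA_903797575.1_PARATYPHIC668_genomic.fna.gz",
    "protein_filename":
        "sketch_fromfile/GCA_903797575.1_PARATYPHIC668_protein.faa.gz",
}

# Round-trip through an in-memory file to show the on-disk layout.
buf = io.StringIO()
write_fromfile_csv([row], buf)
buf.seek(0)
loaded = list(csv.DictReader(buf))

n_blank, n_duplicated = check_fromfile_rows(loaded)
print(f"{len(loaded)} rows; {n_blank} blank names; "
      f"{n_duplicated} duplicated names")
```

Running a check like this up front is only a convenience for large database builds; the command itself reports the same problems, as the `noname` and `dup_name` tests show.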