From 7ee0052eff1abac177a04bde8c05153964adecd5 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Thu, 22 Feb 2024 08:21:10 -0800 Subject: [PATCH 01/30] rework the manifest documentation --- doc/command-line.md | 37 +++++++++++++++++++++++++----------- doc/databases-advanced.md | 12 ++++++------ src/sourmash/sig/__main__.py | 2 +- 3 files changed, 33 insertions(+), 18 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 7b9cbe2615..ee29fa0962 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1869,7 +1869,9 @@ will continue processing input sequences. ### `sourmash signature manifest` - output a manifest for a file -Output a manifest for a file, database, or collection. +Output a manifest for a file, database, or collection. Note that these +manifests are not always suitable for use as standalone manifests; +the `sourmash sig collect` command produces standalone manifests. For example, ``` @@ -1917,7 +1919,12 @@ collections of signatures and identifiers. ### `sourmash signature collect` - collect manifests across databases Collect manifests from across (many) files and merge into a single -standalone manifest. +standalone manifest. Standalone manifests can be used directly as a +sourmash database; they support efficient searching and selection of +sketches, as well as lazy loading of individual sketches from large +collections. See +[advanced usage information on sourmash databases](databases-advanced.md) +for more information. For example, ``` @@ -1932,6 +1939,9 @@ This manifest file can be loaded directly from the command line by sourmash. particularly useful when working with large collections of signatures and identifiers, and has command line options for merging and updating manifests. +Standalone manifests produced by `sig collect` work most efficiently when +constructed from many small zip file collections. 
+ ## Advanced command-line usage ### Loading signatures and databases @@ -2028,7 +2038,7 @@ The following `coltype`s are currently supported for picklists: * `gather` - use the CSV output of `sourmash gather` as a picklist * `prefetch` - use the CSV output of `sourmash prefetch` as a picklist * `search` - use the CSV output of `sourmash prefetch` as a picklist -* `manifest` - use the CSV output of `sourmash sig manifest` as a picklist +* `manifest` - use CSV manifests as a picklist Identifiers are constructed by using the first space delimited word in the signature name. @@ -2037,7 +2047,7 @@ One way to build a picklist is to use `sourmash sig grep --csv out.csv` to construct a CSV file containing a list of all sketches that match the pattern (which can be a string or regexp). The `out.csv` file can be used as a picklist via the picklist -manifest format with `--picklist out.csv::manifest`. +manifest CSV format with `--picklist out.csv::manifest`. You can also use `sourmash sig describe --csv out.csv ` or `sourmash sig manifest -o out.csv ` to construct an @@ -2216,7 +2226,8 @@ sourmash sig fileinfo manifest.sqlmf ``` This manifest contains _references_ to the signatures (but not the signatures themselves) and can then be used as a database target for most -sourmash operations - search, gather, etc. +sourmash operations - search, gather, etc. Manifests support +fast selection and lazy loading of sketches in many situations. Note that `sig collect` will generate manifests containing the pathnames given to it - so if you use relative paths, the references @@ -2225,12 +2236,16 @@ run. You can use `sig collect --abspath` to rewrite the paths into absolute paths. 
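To make the path handling described above concrete, here is a minimal Python sketch that uses only `os.path`. The helper name `rewrite_location` is hypothetical and is not part of sourmash; it only mirrors the idea that a location can be stored as given (relative to the working directory), as an absolute path, or relative to the directory that holds the manifest (the `--relpath` behaviour introduced later in this patch series).

```python
import os


def rewrite_location(loc, manifest_path, *, abspath=False, relpath=False):
    """Hypothetical helper: pick the path to record for one collection."""
    if abspath and relpath:
        raise ValueError("choose at most one of abspath and relpath")
    if abspath:
        # Stable no matter where later commands are run from.
        return os.path.abspath(loc)
    if relpath:
        # Path from the manifest's directory back to the working directory,
        # followed by the location as given on the command line.
        manifest_dir = os.path.dirname(manifest_path)
        back_to_cwd = os.path.relpath(os.curdir, manifest_dir)
        return os.path.join(back_to_cwd, loc)
    # Default: keep the path as given, i.e. relative to the working directory.
    return loc


as_given = rewrite_location("zip_dir/prot.zip", "mf_dir/mf.csv")
absolute = rewrite_location("zip_dir/prot.zip", "mf_dir/mf.csv", abspath=True)
relative = rewrite_location("zip_dir/prot.zip", "mf_dir/mf.csv", relpath=True)
print(as_given)
print(absolute)
print(relative)
```

With the manifest in `mf_dir/` and the zip file in `zip_dir/`, the relative form is `../zip_dir/prot.zip` on POSIX systems, which resolves correctly when it is read relative to the manifest's own directory.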
**Our advice:** We suggest using zip file collections for most -situations; we primarily recommend using explicit manifests for -situations where you have a **very large** collection of signatures -(1000s or more), and don't want to make multiple copies of signatures -in the collection (as you would have to, with a zipfile). This can be -useful if you want to refer to different subsets of the collection -without making multiple copies in a zip file. +situations; we strongly recommend using standalone manifests for +situations where you have **very large** sketches or a **very large** +collection of sketches (1000s or more), and don't want to make +multiple copies of signatures in the collection (as you would have to, +with a zipfile). This is particularly useful if you want to refer to different +subsets of the collection without making multiple copies in a zip +file. + +You can read more about the details of zip files and manifests in +[the advanced usage information for databases](databases-advanced.md). ### Using sourmash plugins diff --git a/doc/databases-advanced.md b/doc/databases-advanced.md index 9e4d1c25d7..f3249c10ea 100644 --- a/doc/databases-advanced.md +++ b/doc/databases-advanced.md @@ -54,18 +54,18 @@ Both SBTs and LCA databases can only store homogenous collections of signature t We recommend SBT and LCA databases for use only in specific situations - e.g. SBTs are great for single-genome "best match" search for SBTs, and `sourmash lca` commands require LCA databases. -### Manifests +### Standalone manifests Manifests are catalogs of signature metadata - name, molecule type, k-mer size, and other information - that can be used to select specific signatures for searching or processing. 
Typically when using manifests the actual signatures themselves are not loaded until they are needed, although the efficiency of this depends on the signature storage mechanism; for example, JSON-format containers (`.sig` and `.lca.json` files) must be entirely loaded before any signature in the file them can be used, unlike zip containers. -As of sourmash 4.4 manifests can be *directly* loaded from the command line as standalone collections. This lets manifests serve as a catalog of signatures stored in many different locations. +As of sourmash 4.4 manifests can be *directly* loaded from the command line as standalone collections. This lets manifests serve as a catalog of signatures stored in many different locations. Sketches can be selected by name, k-mer size, molecule type, and other features without loading the actual sketch data. -Standalone manifests are preferable to both directory storage and pathlists (below), because they support fast selection and direct lazy loading. They are the most effective solution for managing custom collections of thousands to millions of signatures. +Standalone manifests are preferable to both directory storage and pathlists (below), because they support fast selection and direct lazy loading. They are the most effective solution for managing custom collections of thousands to millions of signatures, as well as working with multiple large sketches. Standalone manifests can be created with `sourmash sig collect` (sourmash v4.4 and later). -Sourmash supports two manifest file formats - CSV and SQLite. SQLite manifests are much faster and lower-memory than CSV manifests in exchange for consuming some extra disk space. +Sourmash supports two manifest file formats - CSV and SQLite. SQLite manifests are much faster and lower-memory than CSV manifests. 
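As a rough illustration of selecting sketches without loading sketch data, the sketch below filters manifest rows held in an in-memory CSV. The column layout is a simplified assumption (real manifests carry more metadata), and `select_rows` is a hypothetical helper, not part of sourmash.

```python
import csv
import io

# Simplified, assumed manifest layout; one row per sketch.
MANIFEST_CSV = """\
internal_location,ksize,moltype,name
a.zip,31,DNA,genome A
a.zip,21,DNA,genome A
b.zip,31,protein,proteome B
"""


def select_rows(manifest_text, *, ksize=None, moltype=None):
    """Return the manifest rows that match the requested ksize and moltype."""
    rows = csv.DictReader(io.StringIO(manifest_text))
    return [
        row
        for row in rows
        if (ksize is None or int(row["ksize"]) == ksize)
        and (moltype is None or row["moltype"] == moltype)
    ]


matches = select_rows(MANIFEST_CSV, ksize=31, moltype="DNA")
# Only the containers named by matching rows would need to be opened later.
locations = sorted({row["internal_location"] for row in matches})
print(locations)
```

The point of the design is that selection touches only the catalog; the sketch data behind `internal_location` is opened afterwards, and only for rows that survive the filter.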
### Directories @@ -75,7 +75,7 @@ To read from a directory, specify the directory name on the sourmash command lin When directories are specified as outputs, the signatures will be saved by their complete md5sum underneath the directory. -We don't particularly recommend storing signatures in directory hierarchies, since most of their use cases are now covered by other approaches. +We don't recommend storing signatures in directory hierarchies, since most of their use cases are now covered by other approaches. ### Pathlists @@ -83,7 +83,7 @@ Pathlists are text files containing paths to one or more sourmash databases; any The paths in pathlists can be relative or absolute within the file system. If they are relative, they must resolve with respect to the current working directory of the sourmash command. -We don't recommend using pathlists any more, since the original use cases are now supported with picklists, but they are still supported! +We don't recommend using pathlists, since the original use cases are now supported with picklists and standalone manifests, but they are still supported. Pathlists are not output by any sourmash commands. diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 1a89d6239f..80db52460f 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1587,7 +1587,7 @@ def collect(args): for n_files, loc in enumerate(args.locations): notify(f"Loading signature information from {loc}.") - if n_files % 100 == 0: + if n_files % 100 == 0 and n_files: notify(f"... loaded {len(collected_mf)} sigs from {n_files} files") idx = sourmash.load_file_as_index(loc) if idx.manifest is None and require_manifest: From 5f6ef82bbb3181df797e9c245e41ef8db8492c23 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Fri, 1 Mar 2024 05:48:01 -0800 Subject: [PATCH 02/30] load manifest paths relative to cwd --- src/sourmash/index/__init__.py | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/src/sourmash/index/__init__.py b/src/sourmash/index/__init__.py index 154f37c126..8ff1706cb2 100644 --- a/src/sourmash/index/__init__.py +++ b/src/sourmash/index/__init__.py @@ -1155,7 +1155,16 @@ def load(cls, location, *, prefix=None): m = CollectionManifest.load_from_filename(location) if prefix is None: - prefix = os.path.dirname(location) + # @CTB hmm, good or bad idea? + if location.startswith('/'): + prefix = os.path.dirname(location) + else: + prefix = os.path.dirname(location) + print('XXX prefix is', (prefix,)) + relpath = os.path.relpath(os.curdir, prefix) + print('YYY relpath is', (relpath,)) + prefix = os.path.join(prefix, relpath) + print('ZZZ prefix is now', (prefix,)) return cls(m, location, prefix=prefix) From 2405e9d5d7101d1d02f08b48345fa6bf119a4e74 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 1 Mar 2024 13:53:24 +0000 Subject: [PATCH 03/30] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- src/sourmash/index/__init__.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/sourmash/index/__init__.py b/src/sourmash/index/__init__.py index 8ff1706cb2..616c2f8295 100644 --- a/src/sourmash/index/__init__.py +++ b/src/sourmash/index/__init__.py @@ -1156,15 +1156,15 @@ def load(cls, location, *, prefix=None): if prefix is None: # @CTB hmm, good or bad idea? 
- if location.startswith('/'): + if location.startswith("/"): prefix = os.path.dirname(location) else: prefix = os.path.dirname(location) - print('XXX prefix is', (prefix,)) + print("XXX prefix is", (prefix,)) relpath = os.path.relpath(os.curdir, prefix) - print('YYY relpath is', (relpath,)) + print("YYY relpath is", (relpath,)) prefix = os.path.join(prefix, relpath) - print('ZZZ prefix is now', (prefix,)) + print("ZZZ prefix is now", (prefix,)) return cls(m, location, prefix=prefix) From df870878931bdfc4348a965e8c098e0e994d895d Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Fri, 1 Mar 2024 06:30:24 -0800 Subject: [PATCH 04/30] more better tests --- src/sourmash/index/__init__.py | 5 +-- tests/test_cmd_signature.py | 82 ++++++++++++++++++++++++++++++++++ 2 files changed, 84 insertions(+), 3 deletions(-) diff --git a/src/sourmash/index/__init__.py b/src/sourmash/index/__init__.py index 616c2f8295..6d8fed9702 100644 --- a/src/sourmash/index/__init__.py +++ b/src/sourmash/index/__init__.py @@ -1156,15 +1156,14 @@ def load(cls, location, *, prefix=None): if prefix is None: # @CTB hmm, good or bad idea? + # if we disable, tests break. maybe change tests? if location.startswith("/"): prefix = os.path.dirname(location) else: + # calculate paths relative to cwd; @CTB more/better docs. 
prefix = os.path.dirname(location) - print("XXX prefix is", (prefix,)) relpath = os.path.relpath(os.curdir, prefix) - print("YYY relpath is", (relpath,)) prefix = os.path.join(prefix, relpath) - print("ZZZ prefix is now", (prefix,)) return cls(m, location, prefix=prefix) diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index 7f8365118f..c49796c93f 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -5348,3 +5348,85 @@ def test_sig_check_3_no_manifest_ok(runtmp): "for given picklist, found 7 matches to 7 distinct values" in runtmp.last_result.err ) + + +def test_sig_check_4_manifest_cwd_cwd(runtmp): + # check: manifest and sigs in cwd + prot_zip = utils.get_test_data('prot/all.zip') + + shutil.copyfile(prot_zip, runtmp.output('prot.zip')) + + # generate a picklist, whatever + runtmp.sourmash('sig', 'manifest', 'prot.zip', '-o', 'picklist.csv') + assert os.path.exists(runtmp.output('picklist.csv')) + + # use picklist with sig check to generate a manifest + runtmp.sourmash('sig', 'check', '-m', 'mf.csv', + '--picklist', 'picklist.csv::manifest', + 'prot.zip') + + # check that it all works + runtmp.sourmash('sig', 'cat', 'mf.csv') + + +def test_sig_check_4_manifest_subdir_cwd(runtmp): + # check: manifest in subdir and sigs in cwd + prot_zip = utils.get_test_data('prot/all.zip') + + shutil.copyfile(prot_zip, runtmp.output('prot.zip')) + os.mkdir(runtmp.output('mf_dir')) + + # generate a picklist, whatever + runtmp.sourmash('sig', 'manifest', 'prot.zip', '-o', 'picklist.csv') + assert os.path.exists(runtmp.output('picklist.csv')) + + # use picklist with sig check to generate a manifest + runtmp.sourmash('sig', 'check', '-m', 'mf_dir/mf.csv', + '--picklist', 'picklist.csv::manifest', + 'prot.zip') + + # check that it all works + runtmp.sourmash('sig', 'cat', 'mf_dir/mf.csv') + + +def test_sig_check_4_manifest_cwd_subdir(runtmp): + # check: manifest in cwd and sigs in subdir + prot_zip = utils.get_test_data('prot/all.zip') + 
+ os.mkdir(runtmp.output('zip_dir')) + shutil.copyfile(prot_zip, runtmp.output('zip_dir/prot.zip')) + + # generate a picklist, whatever + runtmp.sourmash('sig', 'manifest', 'zip_dir/prot.zip', + '-o', 'picklist.csv') + assert os.path.exists(runtmp.output('picklist.csv')) + + # use picklist with sig check to generate a manifest + runtmp.sourmash('sig', 'check', '-m', 'mf.csv', + '--picklist', 'picklist.csv::manifest', + 'zip_dir/prot.zip') + + # check that it all works + runtmp.sourmash('sig', 'cat', 'mf.csv') + + +def test_sig_check_4_manifest_subdir_subdir(runtmp): + # check: manifest and sigs in subdir + prot_zip = utils.get_test_data('prot/all.zip') + + os.mkdir(runtmp.output('zip_dir')) + shutil.copyfile(prot_zip, runtmp.output('zip_dir/prot.zip')) + os.mkdir(runtmp.output('mf_dir')) + + # generate a picklist, whatever + runtmp.sourmash('sig', 'manifest', 'zip_dir/prot.zip', + '-o', 'picklist.csv') + assert os.path.exists(runtmp.output('picklist.csv')) + + # use picklist with sig check to generate a manifest + runtmp.sourmash('sig', 'check', '-m', 'mf_dir/mf.csv', + '--picklist', 'picklist.csv::manifest', + 'zip_dir/prot.zip') + + # check that it all works + runtmp.sourmash('sig', 'cat', 'mf_dir/mf.csv') From d7da0aac053e6f2b0bfec8d16d367056b9c46954 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 1 Mar 2024 14:33:27 +0000 Subject: [PATCH 05/30] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- tests/test_cmd_signature.py | 98 +++++++++++++++++++++++-------------- 1 file changed, 60 insertions(+), 38 deletions(-) diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index c49796c93f..1edb653903 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -5352,81 +5352,103 @@ def test_sig_check_3_no_manifest_ok(runtmp): def test_sig_check_4_manifest_cwd_cwd(runtmp): # check: manifest and sigs in cwd - 
prot_zip = utils.get_test_data('prot/all.zip') + prot_zip = utils.get_test_data("prot/all.zip") - shutil.copyfile(prot_zip, runtmp.output('prot.zip')) + shutil.copyfile(prot_zip, runtmp.output("prot.zip")) # generate a picklist, whatever - runtmp.sourmash('sig', 'manifest', 'prot.zip', '-o', 'picklist.csv') - assert os.path.exists(runtmp.output('picklist.csv')) + runtmp.sourmash("sig", "manifest", "prot.zip", "-o", "picklist.csv") + assert os.path.exists(runtmp.output("picklist.csv")) # use picklist with sig check to generate a manifest - runtmp.sourmash('sig', 'check', '-m', 'mf.csv', - '--picklist', 'picklist.csv::manifest', - 'prot.zip') + runtmp.sourmash( + "sig", + "check", + "-m", + "mf.csv", + "--picklist", + "picklist.csv::manifest", + "prot.zip", + ) # check that it all works - runtmp.sourmash('sig', 'cat', 'mf.csv') + runtmp.sourmash("sig", "cat", "mf.csv") def test_sig_check_4_manifest_subdir_cwd(runtmp): # check: manifest in subdir and sigs in cwd - prot_zip = utils.get_test_data('prot/all.zip') + prot_zip = utils.get_test_data("prot/all.zip") - shutil.copyfile(prot_zip, runtmp.output('prot.zip')) - os.mkdir(runtmp.output('mf_dir')) + shutil.copyfile(prot_zip, runtmp.output("prot.zip")) + os.mkdir(runtmp.output("mf_dir")) # generate a picklist, whatever - runtmp.sourmash('sig', 'manifest', 'prot.zip', '-o', 'picklist.csv') - assert os.path.exists(runtmp.output('picklist.csv')) + runtmp.sourmash("sig", "manifest", "prot.zip", "-o", "picklist.csv") + assert os.path.exists(runtmp.output("picklist.csv")) # use picklist with sig check to generate a manifest - runtmp.sourmash('sig', 'check', '-m', 'mf_dir/mf.csv', - '--picklist', 'picklist.csv::manifest', - 'prot.zip') + runtmp.sourmash( + "sig", + "check", + "-m", + "mf_dir/mf.csv", + "--picklist", + "picklist.csv::manifest", + "prot.zip", + ) # check that it all works - runtmp.sourmash('sig', 'cat', 'mf_dir/mf.csv') + runtmp.sourmash("sig", "cat", "mf_dir/mf.csv") def 
test_sig_check_4_manifest_cwd_subdir(runtmp): # check: manifest in cwd and sigs in subdir - prot_zip = utils.get_test_data('prot/all.zip') + prot_zip = utils.get_test_data("prot/all.zip") - os.mkdir(runtmp.output('zip_dir')) - shutil.copyfile(prot_zip, runtmp.output('zip_dir/prot.zip')) + os.mkdir(runtmp.output("zip_dir")) + shutil.copyfile(prot_zip, runtmp.output("zip_dir/prot.zip")) # generate a picklist, whatever - runtmp.sourmash('sig', 'manifest', 'zip_dir/prot.zip', - '-o', 'picklist.csv') - assert os.path.exists(runtmp.output('picklist.csv')) + runtmp.sourmash("sig", "manifest", "zip_dir/prot.zip", "-o", "picklist.csv") + assert os.path.exists(runtmp.output("picklist.csv")) # use picklist with sig check to generate a manifest - runtmp.sourmash('sig', 'check', '-m', 'mf.csv', - '--picklist', 'picklist.csv::manifest', - 'zip_dir/prot.zip') + runtmp.sourmash( + "sig", + "check", + "-m", + "mf.csv", + "--picklist", + "picklist.csv::manifest", + "zip_dir/prot.zip", + ) # check that it all works - runtmp.sourmash('sig', 'cat', 'mf.csv') + runtmp.sourmash("sig", "cat", "mf.csv") def test_sig_check_4_manifest_subdir_subdir(runtmp): # check: manifest and sigs in subdir - prot_zip = utils.get_test_data('prot/all.zip') + prot_zip = utils.get_test_data("prot/all.zip") - os.mkdir(runtmp.output('zip_dir')) - shutil.copyfile(prot_zip, runtmp.output('zip_dir/prot.zip')) - os.mkdir(runtmp.output('mf_dir')) + os.mkdir(runtmp.output("zip_dir")) + shutil.copyfile(prot_zip, runtmp.output("zip_dir/prot.zip")) + os.mkdir(runtmp.output("mf_dir")) # generate a picklist, whatever - runtmp.sourmash('sig', 'manifest', 'zip_dir/prot.zip', - '-o', 'picklist.csv') - assert os.path.exists(runtmp.output('picklist.csv')) + runtmp.sourmash("sig", "manifest", "zip_dir/prot.zip", "-o", "picklist.csv") + assert os.path.exists(runtmp.output("picklist.csv")) # use picklist with sig check to generate a manifest - runtmp.sourmash('sig', 'check', '-m', 'mf_dir/mf.csv', - '--picklist', 
'picklist.csv::manifest', - 'zip_dir/prot.zip') + runtmp.sourmash( + "sig", + "check", + "-m", + "mf_dir/mf.csv", + "--picklist", + "picklist.csv::manifest", + "zip_dir/prot.zip", + ) # check that it all works - runtmp.sourmash('sig', 'cat', 'mf_dir/mf.csv') + runtmp.sourmash("sig", "cat", "mf_dir/mf.csv") From 0caee7fed9ed30d2427623b8fae247a3c20ff565 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 2 Mar 2024 11:36:51 -0500 Subject: [PATCH 06/30] implement --abspath, --relpath for sig check --- src/sourmash/cli/sig/check.py | 12 +++++++ src/sourmash/index/__init__.py | 11 ++---- src/sourmash/sig/__main__.py | 19 +++++++++- tests/test_cmd_signature.py | 64 +++++++++++++++++++++++++--------- 4 files changed, 79 insertions(+), 27 deletions(-) diff --git a/src/sourmash/cli/sig/check.py b/src/sourmash/cli/sig/check.py index a4c940eecb..b9896ff75f 100644 --- a/src/sourmash/cli/sig/check.py +++ b/src/sourmash/cli/sig/check.py @@ -67,6 +67,18 @@ def subparser(subparsers): default="csv", choices=["csv", "sql"], ) + subparser.add_argument( + "--abspath", help="convert all locations to absolute paths", action="store_true" + ) + subparser.add_argument( + "--no-abspath", help="do not convert all locations to absolute paths", action="store_false", dest='abspath' + ) + subparser.add_argument( + "--relpath", help="convert all locations to paths relative to the output manifest", action="store_true" + ) + subparser.add_argument( + "--no-relpath", help="do not convert all locations to paths relative to the output manifest", action="store_false", dest='relpath' + ) add_ksize_arg(subparser) add_moltype_args(subparser) diff --git a/src/sourmash/index/__init__.py b/src/sourmash/index/__init__.py index 6d8fed9702..80a0361de0 100644 --- a/src/sourmash/index/__init__.py +++ b/src/sourmash/index/__init__.py @@ -1155,15 +1155,8 @@ def load(cls, location, *, prefix=None): m = CollectionManifest.load_from_filename(location) if prefix is None: - # @CTB hmm, good or bad idea? 
- # if we disable, tests break. maybe change tests? - if location.startswith("/"): - prefix = os.path.dirname(location) - else: - # calculate paths relative to cwd; @CTB more/better docs. - prefix = os.path.dirname(location) - relpath = os.path.relpath(os.curdir, prefix) - prefix = os.path.join(prefix, relpath) + # by default, calculate paths relative to manifest location. + prefix = os.path.dirname(location) return cls(m, location, prefix=prefix) diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 1a89d6239f..867274b305 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1437,9 +1437,25 @@ def check(args): total_manifest_rows = CollectionManifest([]) + output_manifest_dir = os.curdir # CTB cleanup / test + if args.save_manifest_matching: + output_manifest_dir = os.path.dirname(args.save_manifest_matching) + relpath = os.path.relpath(os.curdir, output_manifest_dir) + # start loading! total_rows_examined = 0 for filename in args.signatures: + if args.abspath: + # convert to abspath + new_iloc = os.path.abspath(filename) + elif args.relpath: + # interpret paths relative to manifest directory + prefix = os.path.dirname(filename) + new_iloc = os.path.join(relpath, filename) + else: + # default: paths are relative to cwd + new_iloc = filename + idx = sourmash_args.load_file_as_index(filename, yield_all_files=args.force) idx = idx.select(ksize=args.ksize, moltype=moltype) @@ -1457,8 +1473,9 @@ def check(args): # rewrite locations so that each signature can be found by filename # of its container; this follows `sig collect` logic. + # CTB: note that this is relative to cwd, not manifest location. 
for row in sub_manifest.rows: - row["internal_location"] = filename + row["internal_location"] = new_iloc total_manifest_rows.add_row(row) # the len(sub_manifest) here should only be run when needed :) diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index 1edb653903..a30266e498 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -31,6 +31,11 @@ def _write_file(runtmp, basename, lines, *, gz=False): return loc +@pytest.fixture(params=['--no-abspath', '--abspath']) +def abspath(request): + return request.param + + def test_run_sourmash_signature_cmd(): status, out, err = utils.runscript("sourmash", ["signature"], fail_ok=True) assert "sourmash: error: argument cmd: invalid choice:" not in err @@ -3798,6 +3803,7 @@ def test_sig_describe_2_exclude_db_pattern(runtmp): def test_sig_describe_3_manifest_works(runtmp): # test on a manifest with relative paths, in proper location + # @CTB => has a / in it mf = utils.get_test_data("scaled/mf.csv") runtmp.sourmash("sig", "describe", mf, "--csv", "out.csv") @@ -4799,13 +4805,13 @@ def test_sig_kmers_2_hp(runtmp): assert check_mh2.similarity(mh) == 1.0 -def test_sig_check_1(runtmp): +def test_sig_check_1(runtmp, abspath): # basic check functionality sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") runtmp.sourmash( - "sig", "check", *sigfiles, "--picklist", f"{picklist}::manifest", "-m", "mf.csv" + "sig", "check", *sigfiles, "--picklist", f"{picklist}::manifest", "-m", "mf.csv", abspath ) out_mf = runtmp.output("mf.csv") @@ -4826,7 +4832,7 @@ def test_sig_check_1(runtmp): assert 31 in ksizes -def test_sig_check_1_mf_csv_gz(runtmp): +def test_sig_check_1_mf_csv_gz(runtmp, abspath): # basic check functionality, with gzipped manifest output sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4839,6 +4845,7 @@ def 
test_sig_check_1_mf_csv_gz(runtmp): f"{picklist}::manifest", "-m", "mf.csv.gz", + abspath ) out_mf = runtmp.output("mf.csv.gz") @@ -4859,7 +4866,7 @@ def test_sig_check_1_mf_csv_gz(runtmp): assert 31 in ksizes -def test_sig_check_1_gz(runtmp): +def test_sig_check_1_gz(runtmp, abspath): # basic check functionality with gzipped picklist sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4877,6 +4884,7 @@ def test_sig_check_1_gz(runtmp): "salmonella.csv.gz::manifest", "-m", "mf.csv", + abspath ) out_mf = runtmp.output("mf.csv") @@ -4897,7 +4905,7 @@ def test_sig_check_1_gz(runtmp): assert 31 in ksizes -def test_sig_check_1_nofail(runtmp): +def test_sig_check_1_nofail(runtmp, abspath): # basic check functionality with --fail-if-missing sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4911,6 +4919,7 @@ def test_sig_check_1_nofail(runtmp): "-m", "mf.csv", "--fail-if-missing", + abspath ) out_mf = runtmp.output("mf.csv") @@ -4952,7 +4961,7 @@ def test_sig_check_1_no_picklist(runtmp): ("name", "identprefix"), ), ) -def test_sig_check_1_column(runtmp, column, coltype): +def test_sig_check_1_column(runtmp, column, coltype, abspath): # basic check functionality for various columns/coltypes sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4967,6 +4976,7 @@ def test_sig_check_1_column(runtmp, column, coltype): "mf.csv", "-o", "missing.csv", + abspath ) out_mf = runtmp.output("mf.csv") @@ -4987,7 +4997,7 @@ def test_sig_check_1_column(runtmp, column, coltype): assert 31 in ksizes -def test_sig_check_1_diff_col_name(runtmp): +def test_sig_check_1_diff_col_name(runtmp, abspath): # 'sig check' with 'name2' column instead of default name sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = 
utils.get_test_data("gather/salmonella-picklist-diffcolumn.csv") @@ -5002,6 +5012,7 @@ def test_sig_check_1_diff_col_name(runtmp): "missing.csv", "-m", "mf.csv", + abspath ) out_mf = runtmp.output("mf.csv") @@ -5036,7 +5047,7 @@ def test_sig_check_1_diff_col_name(runtmp): assert rows[1][0] == "NOT THERE" -def test_sig_check_1_diff_col_name_zip(runtmp): +def test_sig_check_1_diff_col_name_zip(runtmp, abspath): # 'sig check' with 'name2' column instead of default name, on a zip file sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist-diffcolumn.csv") @@ -5055,6 +5066,7 @@ def test_sig_check_1_diff_col_name_zip(runtmp): "missing.csv", "-m", "mf.csv", + abspath, ) out_mf = runtmp.output("mf.csv") @@ -5089,7 +5101,7 @@ def test_sig_check_1_diff_col_name_zip(runtmp): assert rows[1][0] == "NOT THERE" -def test_sig_check_1_diff_col_name_exclude(runtmp): +def test_sig_check_1_diff_col_name_exclude(runtmp, abspath): # 'sig check' with 'name2' column, :exclude picklist sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist-diffcolumn.csv") @@ -5102,6 +5114,7 @@ def test_sig_check_1_diff_col_name_exclude(runtmp): f"{picklist}:name2:name:exclude", "-m", "mf.csv", + abspath ) out_mf = runtmp.output("mf.csv") @@ -5122,7 +5135,7 @@ def test_sig_check_1_diff_col_name_exclude(runtmp): assert 31 in ksizes -def test_sig_check_1_ksize(runtmp): +def test_sig_check_1_ksize(runtmp, abspath): # basic check functionality with selection for ksize sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -5137,6 +5150,7 @@ def test_sig_check_1_ksize(runtmp): f"{picklist}::manifest", "-m", "mf.csv", + abspath, ) out_mf = runtmp.output("mf.csv") @@ -5155,7 +5169,7 @@ def test_sig_check_1_ksize(runtmp): assert 31 in ksizes -def test_sig_check_1_ksize_output_sql(runtmp): +def 
test_sig_check_1_ksize_output_sql(runtmp, abspath): # basic check functionality with selection for ksize sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -5172,6 +5186,7 @@ def test_sig_check_1_ksize_output_sql(runtmp): "mf.mfsql", "-F", "sql", + abspath ) out_mf = runtmp.output("mf.mfsql") @@ -5190,7 +5205,7 @@ def test_sig_check_1_ksize_output_sql(runtmp): assert 31 in ksizes -def test_sig_check_2_output_missing(runtmp): +def test_sig_check_2_output_missing(runtmp, abspath): # output missing all as identical to input picklist sigfiles = utils.get_test_data("gather/combined.sig") picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -5205,6 +5220,7 @@ def test_sig_check_2_output_missing(runtmp): "missing.csv", "-m", "mf.csv", + abspath, ) out_csv = runtmp.output("missing.csv") @@ -5265,7 +5281,7 @@ def test_sig_check_2_output_missing_error_exit(runtmp): ("name", "identprefix"), ), ) -def test_sig_check_2_output_missing_column(runtmp, column, coltype): +def test_sig_check_2_output_missing_column(runtmp, column, coltype, abspath): # output missing all as identical to input picklist sigfiles = utils.get_test_data("gather/combined.sig") picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -5278,6 +5294,7 @@ def test_sig_check_2_output_missing_column(runtmp, column, coltype): f"{picklist}::manifest", "-o", "missing.csv", + abspath ) out_csv = runtmp.output("missing.csv") @@ -5328,7 +5345,7 @@ def test_sig_check_3_no_manifest(runtmp): assert "sig check requires a manifest by default, but no manifest present." 
in err -def test_sig_check_3_no_manifest_ok(runtmp): +def test_sig_check_3_no_manifest_ok(runtmp, abspath): # generate manifest if --no-require-manifest sbt = utils.get_test_data("v6.sbt.zip") picklist = utils.get_test_data("v6.sbt.zip.mf.csv") @@ -5340,6 +5357,7 @@ def test_sig_check_3_no_manifest_ok(runtmp): "--no-require-manifest", "--picklist", f"{picklist}::manifest", + abspath, ) print(runtmp.last_result.out) @@ -5350,7 +5368,7 @@ def test_sig_check_3_no_manifest_ok(runtmp): ) -def test_sig_check_4_manifest_cwd_cwd(runtmp): +def test_sig_check_4_manifest_cwd_cwd(runtmp, abspath): # check: manifest and sigs in cwd prot_zip = utils.get_test_data("prot/all.zip") @@ -5369,6 +5387,7 @@ def test_sig_check_4_manifest_cwd_cwd(runtmp): "--picklist", "picklist.csv::manifest", "prot.zip", + abspath, ) # check that it all works @@ -5395,13 +5414,16 @@ def test_sig_check_4_manifest_subdir_cwd(runtmp): "--picklist", "picklist.csv::manifest", "prot.zip", + "--relpath", ) + print(runtmp.last_result.out) + print(runtmp.last_result.err) + # check that it all works runtmp.sourmash("sig", "cat", "mf_dir/mf.csv") - -def test_sig_check_4_manifest_cwd_subdir(runtmp): +def test_sig_check_4_manifest_cwd_subdir(runtmp, abspath): # check: manifest in cwd and sigs in subdir prot_zip = utils.get_test_data("prot/all.zip") @@ -5421,8 +5443,12 @@ def test_sig_check_4_manifest_cwd_subdir(runtmp): "--picklist", "picklist.csv::manifest", "zip_dir/prot.zip", + abspath, ) + print(runtmp.last_result.out) + print(runtmp.last_result.err) + # check that it all works runtmp.sourmash("sig", "cat", "mf.csv") @@ -5448,7 +5474,11 @@ def test_sig_check_4_manifest_subdir_subdir(runtmp): "--picklist", "picklist.csv::manifest", "zip_dir/prot.zip", + "--relpath", ) + print(runtmp.last_result.out) + print(runtmp.last_result.err) + # check that it all works runtmp.sourmash("sig", "cat", "mf_dir/mf.csv") From 1fba4ee0e7077af0360e7ecaa4b3f61e653b00d2 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sat, 2 Mar 2024 11:45:37 -0500 Subject: [PATCH 07/30] clean up relpath a bit --- src/sourmash/sig/__main__.py | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 867274b305..bdf839787e 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1435,22 +1435,30 @@ def check(args): else: debug("sig check: manifest required") - total_manifest_rows = CollectionManifest([]) + # abspath/relpath checks + if args.abspath and args.relpath: + error("** Cannot specify both --abspath and --relpath; pick one!") + sys.exit(-1) + + if args.relpath or args.abspath and not args.save_manifest_matching: + notify("** WARNING: --abspath and --relpath only have effects when saving a manifest") - output_manifest_dir = os.curdir # CTB cleanup / test - if args.save_manifest_matching: + relpath = "." + if args.relpath and args.save_manifest_matching: output_manifest_dir = os.path.dirname(args.save_manifest_matching) relpath = os.path.relpath(os.curdir, output_manifest_dir) + total_manifest_rows = CollectionManifest([]) + # start loading! total_rows_examined = 0 for filename in args.signatures: + # if saving a manifest, think about how to rewrite locations. if args.abspath: # convert to abspath new_iloc = os.path.abspath(filename) elif args.relpath: # interpret paths relative to manifest directory - prefix = os.path.dirname(filename) new_iloc = os.path.join(relpath, filename) else: # default: paths are relative to cwd From f3079b9e5e9fc5b09fd2645f25c9bc5ba2de9eb5 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sat, 2 Mar 2024 11:48:38 -0500 Subject: [PATCH 08/30] add --relpath to sig collect --- src/sourmash/cli/sig/collect.py | 9 +++++++++ src/sourmash/sig/__main__.py | 11 +++++++++++ 2 files changed, 20 insertions(+) diff --git a/src/sourmash/cli/sig/collect.py b/src/sourmash/cli/sig/collect.py index 1e5d8ded2f..8b8e57b8ab 100644 --- a/src/sourmash/cli/sig/collect.py +++ b/src/sourmash/cli/sig/collect.py @@ -56,6 +56,15 @@ def subparser(subparsers): subparser.add_argument( "--abspath", help="convert all locations to absolute paths", action="store_true" ) + subparser.add_argument( + "--no-abspath", help="do not convert all locations to absolute paths", action="store_false", dest='abspath' + ) + subparser.add_argument( + "--relpath", help="convert all locations to paths relative to the output manifest", action="store_true" + ) + subparser.add_argument( + "--no-relpath", help="do not convert all locations to paths relative to the output manifest", action="store_false", dest='relpath' + ) add_ksize_arg(subparser) add_moltype_args(subparser) diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index bdf839787e..cea4624775 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1625,6 +1625,17 @@ def collect(args): mf = sourmash_args.get_manifest(idx) + # decide how to rewrite locations to container: + if args.abspath: + # convert to abspath + new_iloc = os.path.abspath(loc) + elif args.relpath: + # interpret paths relative to manifest directory + new_iloc = os.path.join(relpath, loc) + else: + # default: paths are relative to cwd + new_iloc = loc + for row in mf.rows: row["internal_location"] = loc collected_mf.add_row(row) From b17cd4f94932fcfa511bdab208226375628f9c44 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sun, 3 Mar 2024 09:19:36 -0500 Subject: [PATCH 09/30] implement --relpath for sig collect too --- doc/command-line.md | 4 ++++ src/sourmash/cli/sig/check.py | 6 +++--- src/sourmash/cli/sig/collect.py | 4 ++-- src/sourmash/sig/__main__.py | 9 +++++---- 4 files changed, 14 insertions(+), 9 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 3279921a36..d511c35796 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1956,6 +1956,10 @@ picklist CSV. With `--save-manifest-matching`, `sig check` will save all of the _matched_ elements to a manifest file, which can then be used as a sourmash database. +When saving manifests with matched elements, sourmash will by default +not rewrite paths to the containers for the matched elements. This will +create buggy manifests ... @CTB. + `sourmash sig check` is particularly useful when working with large collections of signatures and identifiers. diff --git a/src/sourmash/cli/sig/check.py b/src/sourmash/cli/sig/check.py index b9896ff75f..b91cb34ce8 100644 --- a/src/sourmash/cli/sig/check.py +++ b/src/sourmash/cli/sig/check.py @@ -68,13 +68,13 @@ def subparser(subparsers): choices=["csv", "sql"], ) subparser.add_argument( - "--abspath", help="convert all locations to absolute paths", action="store_true" + "--abspath", "--use-absolute-paths", help="convert all locations to absolute paths", action="store_true" ) subparser.add_argument( - "--no-abspath", help="do not convert all locations to absolute paths", action="store_false", dest='abspath' + "--no-abspath", help="do not convert all locations to absolute paths", action="store_false", dest='abspath', ) subparser.add_argument( - "--relpath", help="convert all locations to paths relative to the output manifest", action="store_true" + "--relpath", "--use-relative-paths", help="convert all locations to paths relative to the output manifest", action="store_true", ) subparser.add_argument( "--no-relpath", help="do not convert all 
locations to paths relative to the output manifest", action="store_false", dest='relpath' diff --git a/src/sourmash/cli/sig/collect.py b/src/sourmash/cli/sig/collect.py index 8b8e57b8ab..56dfe6284c 100644 --- a/src/sourmash/cli/sig/collect.py +++ b/src/sourmash/cli/sig/collect.py @@ -54,13 +54,13 @@ def subparser(subparsers): help="merge new manifests into existing", ) subparser.add_argument( - "--abspath", help="convert all locations to absolute paths", action="store_true" + "--abspath", "--use-absolute-paths", help="convert all locations to absolute paths", action="store_true" ) subparser.add_argument( "--no-abspath", help="do not convert all locations to absolute paths", action="store_false", dest='abspath' ) subparser.add_argument( - "--relpath", help="convert all locations to paths relative to the output manifest", action="store_true" + "--relpath", "--use-relative-paths", help="convert all locations to paths relative to the output manifest", action="store_true", ) subparser.add_argument( "--no-relpath", help="do not convert all locations to paths relative to the output manifest", action="store_false", dest='relpath' diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index cea4624775..71d02a08de 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1604,9 +1604,10 @@ def collect(args): # load from_file _extend_signatures_with_from_file(args, target_attr="locations") - # convert to abspath - if args.abspath: - args.locations = [os.path.abspath(iloc) for iloc in args.locations] + relpath = None + if args.relpath: + output_manifest_dir = os.path.dirname(args.output) + relpath = os.path.relpath(os.curdir, output_manifest_dir) # iterate through, loading all the manifests from all the locations. 
for n_files, loc in enumerate(args.locations): @@ -1637,7 +1638,7 @@ def collect(args): new_iloc = loc for row in mf.rows: - row["internal_location"] = loc + row["internal_location"] = new_iloc collected_mf.add_row(row) if args.manifest_format == "csv": From 72a9062ff235c92070e1dc81670284a57f4d9b88 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Sun, 3 Mar 2024 14:19:51 +0000 Subject: [PATCH 10/30] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- src/sourmash/cli/sig/check.py | 20 ++++++++++++++++---- src/sourmash/cli/sig/collect.py | 20 ++++++++++++++++---- src/sourmash/sig/__main__.py | 4 +++- tests/test_cmd_signature.py | 28 ++++++++++++++++++---------- 4 files changed, 53 insertions(+), 19 deletions(-) diff --git a/src/sourmash/cli/sig/check.py b/src/sourmash/cli/sig/check.py index b91cb34ce8..93335d4308 100644 --- a/src/sourmash/cli/sig/check.py +++ b/src/sourmash/cli/sig/check.py @@ -68,16 +68,28 @@ def subparser(subparsers): choices=["csv", "sql"], ) subparser.add_argument( - "--abspath", "--use-absolute-paths", help="convert all locations to absolute paths", action="store_true" + "--abspath", + "--use-absolute-paths", + help="convert all locations to absolute paths", + action="store_true", ) subparser.add_argument( - "--no-abspath", help="do not convert all locations to absolute paths", action="store_false", dest='abspath', + "--no-abspath", + help="do not convert all locations to absolute paths", + action="store_false", + dest="abspath", ) subparser.add_argument( - "--relpath", "--use-relative-paths", help="convert all locations to paths relative to the output manifest", action="store_true", + "--relpath", + "--use-relative-paths", + help="convert all locations to paths relative to the output manifest", + action="store_true", ) subparser.add_argument( - "--no-relpath", help="do not convert all locations to paths relative to the output 
manifest", action="store_false", dest='relpath' + "--no-relpath", + help="do not convert all locations to paths relative to the output manifest", + action="store_false", + dest="relpath", ) add_ksize_arg(subparser) diff --git a/src/sourmash/cli/sig/collect.py b/src/sourmash/cli/sig/collect.py index 56dfe6284c..73077fe9fb 100644 --- a/src/sourmash/cli/sig/collect.py +++ b/src/sourmash/cli/sig/collect.py @@ -54,16 +54,28 @@ def subparser(subparsers): help="merge new manifests into existing", ) subparser.add_argument( - "--abspath", "--use-absolute-paths", help="convert all locations to absolute paths", action="store_true" + "--abspath", + "--use-absolute-paths", + help="convert all locations to absolute paths", + action="store_true", ) subparser.add_argument( - "--no-abspath", help="do not convert all locations to absolute paths", action="store_false", dest='abspath' + "--no-abspath", + help="do not convert all locations to absolute paths", + action="store_false", + dest="abspath", ) subparser.add_argument( - "--relpath", "--use-relative-paths", help="convert all locations to paths relative to the output manifest", action="store_true", + "--relpath", + "--use-relative-paths", + help="convert all locations to paths relative to the output manifest", + action="store_true", ) subparser.add_argument( - "--no-relpath", help="do not convert all locations to paths relative to the output manifest", action="store_false", dest='relpath' + "--no-relpath", + help="do not convert all locations to paths relative to the output manifest", + action="store_false", + dest="relpath", ) add_ksize_arg(subparser) diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 71d02a08de..da7bb86278 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1441,7 +1441,9 @@ def check(args): sys.exit(-1) if args.relpath or args.abspath and not args.save_manifest_matching: - notify("** WARNING: --abspath and --relpath only have effects when saving a 
manifest") + notify( + "** WARNING: --abspath and --relpath only have effects when saving a manifest" + ) relpath = "." if args.relpath and args.save_manifest_matching: diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index a30266e498..e9e7b5e9fb 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -31,7 +31,7 @@ def _write_file(runtmp, basename, lines, *, gz=False): return loc -@pytest.fixture(params=['--no-abspath', '--abspath']) +@pytest.fixture(params=["--no-abspath", "--abspath"]) def abspath(request): return request.param @@ -4811,7 +4811,14 @@ def test_sig_check_1(runtmp, abspath): picklist = utils.get_test_data("gather/salmonella-picklist.csv") runtmp.sourmash( - "sig", "check", *sigfiles, "--picklist", f"{picklist}::manifest", "-m", "mf.csv", abspath + "sig", + "check", + *sigfiles, + "--picklist", + f"{picklist}::manifest", + "-m", + "mf.csv", + abspath, ) out_mf = runtmp.output("mf.csv") @@ -4845,7 +4852,7 @@ def test_sig_check_1_mf_csv_gz(runtmp, abspath): f"{picklist}::manifest", "-m", "mf.csv.gz", - abspath + abspath, ) out_mf = runtmp.output("mf.csv.gz") @@ -4884,7 +4891,7 @@ def test_sig_check_1_gz(runtmp, abspath): "salmonella.csv.gz::manifest", "-m", "mf.csv", - abspath + abspath, ) out_mf = runtmp.output("mf.csv") @@ -4919,7 +4926,7 @@ def test_sig_check_1_nofail(runtmp, abspath): "-m", "mf.csv", "--fail-if-missing", - abspath + abspath, ) out_mf = runtmp.output("mf.csv") @@ -4976,7 +4983,7 @@ def test_sig_check_1_column(runtmp, column, coltype, abspath): "mf.csv", "-o", "missing.csv", - abspath + abspath, ) out_mf = runtmp.output("mf.csv") @@ -5012,7 +5019,7 @@ def test_sig_check_1_diff_col_name(runtmp, abspath): "missing.csv", "-m", "mf.csv", - abspath + abspath, ) out_mf = runtmp.output("mf.csv") @@ -5114,7 +5121,7 @@ def test_sig_check_1_diff_col_name_exclude(runtmp, abspath): f"{picklist}:name2:name:exclude", "-m", "mf.csv", - abspath + abspath, ) out_mf = runtmp.output("mf.csv") @@ -5186,7 
+5193,7 @@ def test_sig_check_1_ksize_output_sql(runtmp, abspath): "mf.mfsql", "-F", "sql", - abspath + abspath, ) out_mf = runtmp.output("mf.mfsql") @@ -5294,7 +5301,7 @@ def test_sig_check_2_output_missing_column(runtmp, column, coltype, abspath): f"{picklist}::manifest", "-o", "missing.csv", - abspath + abspath, ) out_csv = runtmp.output("missing.csv") @@ -5423,6 +5430,7 @@ def test_sig_check_4_manifest_subdir_cwd(runtmp): # check that it all works runtmp.sourmash("sig", "cat", "mf_dir/mf.csv") + def test_sig_check_4_manifest_cwd_subdir(runtmp, abspath): # check: manifest in cwd and sigs in subdir prot_zip = utils.get_test_data("prot/all.zip") From 5cf47498a0fe392ad8130d22478312a9294aeea0 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 3 Mar 2024 09:31:38 -0500 Subject: [PATCH 11/30] straighten out tests --- tests/test_cmd_signature.py | 87 ++++++++++++++++++++----------------- 1 file changed, 48 insertions(+), 39 deletions(-) diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index e9e7b5e9fb..745e51cbac 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -30,9 +30,17 @@ def _write_file(runtmp, basename, lines, *, gz=False): fp.write("\n".join(lines)) return loc +# these should both always succeed for 'sig check' and 'sig collect' output +# manifests. +@pytest.fixture(params=["--abspath", "--relpath"]) +def abspath_or_relpath(request): + return request.param -@pytest.fixture(params=["--no-abspath", "--abspath"]) -def abspath(request): +# this will fail if subdirs used; see #3008. but ths ensures v4 behavior of +# sig collect/sig check works, where manifest paths interpreted relative +# to cwd. 
+@pytest.fixture(params=["--no-abspath", "--abspath", "--relpath"]) +def abspath_relpath_v4(request): return request.param @@ -3803,7 +3811,6 @@ def test_sig_describe_2_exclude_db_pattern(runtmp): def test_sig_describe_3_manifest_works(runtmp): # test on a manifest with relative paths, in proper location - # @CTB => has a / in it mf = utils.get_test_data("scaled/mf.csv") runtmp.sourmash("sig", "describe", mf, "--csv", "out.csv") @@ -4805,7 +4812,7 @@ def test_sig_kmers_2_hp(runtmp): assert check_mh2.similarity(mh) == 1.0 -def test_sig_check_1(runtmp, abspath): +def test_sig_check_1(runtmp, abspath_relpath_v4): # basic check functionality sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4818,7 +4825,7 @@ def test_sig_check_1(runtmp, abspath): f"{picklist}::manifest", "-m", "mf.csv", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -4839,7 +4846,7 @@ def test_sig_check_1(runtmp, abspath): assert 31 in ksizes -def test_sig_check_1_mf_csv_gz(runtmp, abspath): +def test_sig_check_1_mf_csv_gz(runtmp, abspath_relpath_v4): # basic check functionality, with gzipped manifest output sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4852,7 +4859,7 @@ def test_sig_check_1_mf_csv_gz(runtmp, abspath): f"{picklist}::manifest", "-m", "mf.csv.gz", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv.gz") @@ -4873,7 +4880,7 @@ def test_sig_check_1_mf_csv_gz(runtmp, abspath): assert 31 in ksizes -def test_sig_check_1_gz(runtmp, abspath): +def test_sig_check_1_gz(runtmp, abspath_relpath_v4): # basic check functionality with gzipped picklist sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4891,7 +4898,7 @@ def test_sig_check_1_gz(runtmp, abspath): "salmonella.csv.gz::manifest", "-m", "mf.csv", - abspath, + 
abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -4912,7 +4919,7 @@ def test_sig_check_1_gz(runtmp, abspath): assert 31 in ksizes -def test_sig_check_1_nofail(runtmp, abspath): +def test_sig_check_1_nofail(runtmp, abspath_relpath_v4): # basic check functionality with --fail-if-missing sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4926,7 +4933,7 @@ def test_sig_check_1_nofail(runtmp, abspath): "-m", "mf.csv", "--fail-if-missing", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -4968,7 +4975,7 @@ def test_sig_check_1_no_picklist(runtmp): ("name", "identprefix"), ), ) -def test_sig_check_1_column(runtmp, column, coltype, abspath): +def test_sig_check_1_column(runtmp, column, coltype, abspath_relpath_v4): # basic check functionality for various columns/coltypes sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -4983,7 +4990,7 @@ def test_sig_check_1_column(runtmp, column, coltype, abspath): "mf.csv", "-o", "missing.csv", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -5004,7 +5011,7 @@ def test_sig_check_1_column(runtmp, column, coltype, abspath): assert 31 in ksizes -def test_sig_check_1_diff_col_name(runtmp, abspath): +def test_sig_check_1_diff_col_name(runtmp, abspath_relpath_v4): # 'sig check' with 'name2' column instead of default name sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist-diffcolumn.csv") @@ -5019,7 +5026,7 @@ def test_sig_check_1_diff_col_name(runtmp, abspath): "missing.csv", "-m", "mf.csv", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -5054,7 +5061,7 @@ def test_sig_check_1_diff_col_name(runtmp, abspath): assert rows[1][0] == "NOT THERE" -def test_sig_check_1_diff_col_name_zip(runtmp, abspath): +def 
test_sig_check_1_diff_col_name_zip(runtmp, abspath_relpath_v4): # 'sig check' with 'name2' column instead of default name, on a zip file sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist-diffcolumn.csv") @@ -5073,7 +5080,7 @@ def test_sig_check_1_diff_col_name_zip(runtmp, abspath): "missing.csv", "-m", "mf.csv", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -5108,7 +5115,7 @@ def test_sig_check_1_diff_col_name_zip(runtmp, abspath): assert rows[1][0] == "NOT THERE" -def test_sig_check_1_diff_col_name_exclude(runtmp, abspath): +def test_sig_check_1_diff_col_name_exclude(runtmp, abspath_relpath_v4): # 'sig check' with 'name2' column, :exclude picklist sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist-diffcolumn.csv") @@ -5121,7 +5128,7 @@ def test_sig_check_1_diff_col_name_exclude(runtmp, abspath): f"{picklist}:name2:name:exclude", "-m", "mf.csv", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -5142,7 +5149,7 @@ def test_sig_check_1_diff_col_name_exclude(runtmp, abspath): assert 31 in ksizes -def test_sig_check_1_ksize(runtmp, abspath): +def test_sig_check_1_ksize(runtmp, abspath_relpath_v4): # basic check functionality with selection for ksize sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -5157,7 +5164,7 @@ def test_sig_check_1_ksize(runtmp, abspath): f"{picklist}::manifest", "-m", "mf.csv", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.csv") @@ -5176,7 +5183,7 @@ def test_sig_check_1_ksize(runtmp, abspath): assert 31 in ksizes -def test_sig_check_1_ksize_output_sql(runtmp, abspath): +def test_sig_check_1_ksize_output_sql(runtmp, abspath_relpath_v4): # basic check functionality with selection for ksize sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = 
utils.get_test_data("gather/salmonella-picklist.csv") @@ -5193,7 +5200,7 @@ def test_sig_check_1_ksize_output_sql(runtmp, abspath): "mf.mfsql", "-F", "sql", - abspath, + abspath_relpath_v4, ) out_mf = runtmp.output("mf.mfsql") @@ -5212,7 +5219,7 @@ def test_sig_check_1_ksize_output_sql(runtmp, abspath): assert 31 in ksizes -def test_sig_check_2_output_missing(runtmp, abspath): +def test_sig_check_2_output_missing(runtmp, abspath_relpath_v4): # output missing all as identical to input picklist sigfiles = utils.get_test_data("gather/combined.sig") picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -5227,7 +5234,7 @@ def test_sig_check_2_output_missing(runtmp, abspath): "missing.csv", "-m", "mf.csv", - abspath, + abspath_relpath_v4, ) out_csv = runtmp.output("missing.csv") @@ -5288,7 +5295,7 @@ def test_sig_check_2_output_missing_error_exit(runtmp): ("name", "identprefix"), ), ) -def test_sig_check_2_output_missing_column(runtmp, column, coltype, abspath): +def test_sig_check_2_output_missing_column(runtmp, column, coltype, abspath_relpath_v4): # output missing all as identical to input picklist sigfiles = utils.get_test_data("gather/combined.sig") picklist = utils.get_test_data("gather/salmonella-picklist.csv") @@ -5301,7 +5308,7 @@ def test_sig_check_2_output_missing_column(runtmp, column, coltype, abspath): f"{picklist}::manifest", "-o", "missing.csv", - abspath, + abspath_relpath_v4, ) out_csv = runtmp.output("missing.csv") @@ -5352,7 +5359,7 @@ def test_sig_check_3_no_manifest(runtmp): assert "sig check requires a manifest by default, but no manifest present." 
in err -def test_sig_check_3_no_manifest_ok(runtmp, abspath): +def test_sig_check_3_no_manifest_ok(runtmp, abspath_relpath_v4): # generate manifest if --no-require-manifest sbt = utils.get_test_data("v6.sbt.zip") picklist = utils.get_test_data("v6.sbt.zip.mf.csv") @@ -5364,7 +5371,7 @@ def test_sig_check_3_no_manifest_ok(runtmp, abspath): "--no-require-manifest", "--picklist", f"{picklist}::manifest", - abspath, + abspath_relpath_v4, ) print(runtmp.last_result.out) @@ -5375,7 +5382,7 @@ def test_sig_check_3_no_manifest_ok(runtmp, abspath): ) -def test_sig_check_4_manifest_cwd_cwd(runtmp, abspath): +def test_sig_check_4_manifest_cwd_cwd(runtmp, abspath_relpath_v4): # check: manifest and sigs in cwd prot_zip = utils.get_test_data("prot/all.zip") @@ -5394,15 +5401,16 @@ def test_sig_check_4_manifest_cwd_cwd(runtmp, abspath): "--picklist", "picklist.csv::manifest", "prot.zip", - abspath, + abspath_relpath_v4, ) # check that it all works runtmp.sourmash("sig", "cat", "mf.csv") -def test_sig_check_4_manifest_subdir_cwd(runtmp): - # check: manifest in subdir and sigs in cwd +def test_sig_check_4_manifest_subdir_cwd(runtmp, abspath_or_relpath): + # check: manifest in subdir and sigs in cwd. note, + # fails with default v4 behavior. see #3008. 
prot_zip = utils.get_test_data("prot/all.zip") shutil.copyfile(prot_zip, runtmp.output("prot.zip")) @@ -5421,7 +5429,7 @@ def test_sig_check_4_manifest_subdir_cwd(runtmp): "--picklist", "picklist.csv::manifest", "prot.zip", - "--relpath", + abspath_or_relpath, ) print(runtmp.last_result.out) @@ -5431,7 +5439,7 @@ def test_sig_check_4_manifest_subdir_cwd(runtmp): runtmp.sourmash("sig", "cat", "mf_dir/mf.csv") -def test_sig_check_4_manifest_cwd_subdir(runtmp, abspath): +def test_sig_check_4_manifest_cwd_subdir(runtmp, abspath_relpath_v4): # check: manifest in cwd and sigs in subdir prot_zip = utils.get_test_data("prot/all.zip") @@ -5451,7 +5459,7 @@ def test_sig_check_4_manifest_cwd_subdir(runtmp, abspath): "--picklist", "picklist.csv::manifest", "zip_dir/prot.zip", - abspath, + abspath_relpath_v4, ) print(runtmp.last_result.out) @@ -5461,8 +5469,9 @@ def test_sig_check_4_manifest_cwd_subdir(runtmp, abspath): runtmp.sourmash("sig", "cat", "mf.csv") -def test_sig_check_4_manifest_subdir_subdir(runtmp): - # check: manifest and sigs in subdir +def test_sig_check_4_manifest_subdir_subdir(runtmp, abspath_or_relpath): + # check: manifest and sigs in subdir. note, fails with default v4 behavior. + # see #3008. prot_zip = utils.get_test_data("prot/all.zip") os.mkdir(runtmp.output("zip_dir")) @@ -5482,7 +5491,7 @@ def test_sig_check_4_manifest_subdir_subdir(runtmp): "--picklist", "picklist.csv::manifest", "zip_dir/prot.zip", - "--relpath", + abspath_or_relpath, ) print(runtmp.last_result.out) From d6dfe353f0f55d75a736b6ca02f07d86f23962ef Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sun, 3 Mar 2024 09:41:38 -0500 Subject: [PATCH 12/30] add abspath/relpath to sig collect tests --- tests/conftest.py | 14 ++++++++++++++ tests/test_cmd_signature.py | 13 ------------- tests/test_cmd_signature_collect.py | 27 ++++++++++++++------------- 3 files changed, 28 insertions(+), 26 deletions(-) diff --git a/tests/conftest.py b/tests/conftest.py index 00476e8a7d..a4061e4102 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -89,6 +89,20 @@ def sig_save_extension_abund(request): return request.param +# these should both always succeed for 'sig check' and 'sig collect' output +# manifests. +@pytest.fixture(params=["--abspath", "--relpath"]) +def abspath_or_relpath(request): + return request.param + +# this will fail if subdirs used; see #3008. but ths ensures v4 behavior of +# sig collect/sig check works, where manifest paths interpreted relative +# to cwd. +@pytest.fixture(params=["--no-abspath", "--abspath", "--relpath"]) +def abspath_relpath_v4(request): + return request.param + + # --- BEGIN - Only run tests using a particular fixture --- # # Cribbed from: http://pythontesting.net/framework/pytest/pytest-run-tests-using-particular-fixture/ def pytest_collection_modifyitems(items, config): diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index 745e51cbac..921050b16e 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -30,19 +30,6 @@ def _write_file(runtmp, basename, lines, *, gz=False): fp.write("\n".join(lines)) return loc -# these should both always succeed for 'sig check' and 'sig collect' output -# manifests. -@pytest.fixture(params=["--abspath", "--relpath"]) -def abspath_or_relpath(request): - return request.param - -# this will fail if subdirs used; see #3008. but ths ensures v4 behavior of -# sig collect/sig check works, where manifest paths interpreted relative -# to cwd. 
-@pytest.fixture(params=["--no-abspath", "--abspath", "--relpath"]) -def abspath_relpath_v4(request): - return request.param - def test_run_sourmash_signature_cmd(): status, out, err = utils.runscript("sourmash", ["signature"], fail_ok=True) diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index edd7c16a29..6e81a96fd1 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -13,13 +13,13 @@ from sourmash_tst_utils import SourmashCommandFailed -def test_sig_collect_0_nothing(runtmp, manifest_db_format): +def test_sig_collect_0_nothing(runtmp, manifest_db_format, abspath_relpath_v4): # run with just output ext = "sqlmf" if manifest_db_format == "sql" else "csv" if manifest_db_format != "sql": return - runtmp.sourmash("sig", "collect", "-o", f"mf.{ext}", "-F", manifest_db_format) + runtmp.sourmash("sig", "collect", "-o", f"mf.{ext}", "-F", manifest_db_format, abspath_relpath_v4) manifest_fn = runtmp.output(f"mf.{ext}") manifest = BaseCollectionManifest.load_from_filename(manifest_fn) @@ -27,14 +27,14 @@ def test_sig_collect_0_nothing(runtmp, manifest_db_format): assert len(manifest) == 0 -def test_sig_collect_1_zipfile(runtmp, manifest_db_format): +def test_sig_collect_1_zipfile(runtmp, manifest_db_format, abspath_relpath_v4): # collect a manifest from a .zip file protzip = utils.get_test_data("prot/protein.zip") ext = "sqlmf" if manifest_db_format == "sql" else "csv" runtmp.sourmash( - "sig", "collect", protzip, "-o", f"mf.{ext}", "-F", manifest_db_format + "sig", "collect", protzip, "-o", f"mf.{ext}", "-F", manifest_db_format, abspath_relpath_v4 ) manifest_fn = runtmp.output(f"mf.{ext}") @@ -46,11 +46,11 @@ def test_sig_collect_1_zipfile(runtmp, manifest_db_format): assert "120d311cc785cc9d0df9dc0646b2b857" in md5_list -def test_sig_collect_1_zipfile_csv_gz(runtmp): +def test_sig_collect_1_zipfile_csv_gz(runtmp, abspath_relpath_v4): # collect a manifest from a .zip file, save to csv.gz 
protzip = utils.get_test_data("prot/protein.zip") - runtmp.sourmash("sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv") + runtmp.sourmash("sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv", abspath_relpath_v4) manifest_fn = runtmp.output("mf.csv.gz") @@ -67,11 +67,11 @@ def test_sig_collect_1_zipfile_csv_gz(runtmp): assert "120d311cc785cc9d0df9dc0646b2b857" in md5_list -def test_sig_collect_1_zipfile_csv_gz_roundtrip(runtmp): +def test_sig_collect_1_zipfile_csv_gz_roundtrip(runtmp, abspath_relpath_v4): # collect a manifest from a .zip file, save to csv.gz; then load again protzip = utils.get_test_data("prot/protein.zip") - runtmp.sourmash("sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv") + runtmp.sourmash("sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv", abspath_relpath_v4) manifest_fn = runtmp.output("mf.csv.gz") @@ -125,7 +125,7 @@ def test_sig_collect_2_exists_fail(runtmp, manifest_db_format): ) -def test_sig_collect_2_exists_merge(runtmp, manifest_db_format): +def test_sig_collect_2_exists_merge(runtmp, manifest_db_format, abspath_relpath_v4): # collect a manifest from two .zip files protzip = utils.get_test_data("prot/protein.zip") allzip = utils.get_test_data("prot/all.zip") @@ -133,7 +133,7 @@ def test_sig_collect_2_exists_merge(runtmp, manifest_db_format): ext = "sqlmf" if manifest_db_format == "sql" else "csv" runtmp.sourmash( - "sig", "collect", protzip, "-o", f"mf.{ext}", "-F", manifest_db_format + "sig", "collect", protzip, "-o", f"mf.{ext}", "-F", manifest_db_format, abspath_relpath_v4, ) manifest_fn = runtmp.output(f"mf.{ext}") @@ -205,7 +205,7 @@ def test_sig_collect_2_exists_csv_merge_sql(runtmp): assert "ERROR loading" in runtmp.last_result.err -def test_sig_collect_2_no_exists_merge(runtmp, manifest_db_format): +def test_sig_collect_2_no_exists_merge(runtmp, manifest_db_format, abspath_relpath_v4): # test 'merge' when args.output doesn't already exist => warning utils.get_test_data("prot/protein.zip") allzip = 
utils.get_test_data("prot/all.zip") @@ -215,7 +215,7 @@ def test_sig_collect_2_no_exists_merge(runtmp, manifest_db_format): # run with --merge but no previous: runtmp.sourmash( - "sig", "collect", allzip, "-o", manifest_fn, "-F", manifest_db_format, "--merge" + "sig", "collect", allzip, "-o", manifest_fn, "-F", manifest_db_format, "--merge", abspath_relpath_v4, ) manifest = BaseCollectionManifest.load_from_filename(manifest_fn) @@ -424,7 +424,7 @@ def test_sig_collect_5_no_manifest_sbt_fail(runtmp, manifest_db_format): ) -def test_sig_collect_5_no_manifest_sbt_succeed(runtmp, manifest_db_format): +def test_sig_collect_5_no_manifest_sbt_succeed(runtmp, manifest_db_format, abspath_relpath_v4): # generate a manifest from files that don't have one when --no-require sbt_zip = utils.get_test_data("v6.sbt.zip") @@ -439,6 +439,7 @@ def test_sig_collect_5_no_manifest_sbt_succeed(runtmp, manifest_db_format): f"mf.{ext}", "-F", manifest_db_format, + abspath_relpath_v4 ) manifest_fn = runtmp.output(f"mf.{ext}") From da3f16586503f990f4124e67d2fbff2639978fea Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Sun, 3 Mar 2024 14:41:49 +0000 Subject: [PATCH 13/30] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- tests/conftest.py | 1 + tests/test_cmd_signature_collect.py | 52 ++++++++++++++++++++++++----- 2 files changed, 45 insertions(+), 8 deletions(-) diff --git a/tests/conftest.py b/tests/conftest.py index a4061e4102..7ef08e9b71 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -95,6 +95,7 @@ def sig_save_extension_abund(request): def abspath_or_relpath(request): return request.param + # this will fail if subdirs used; see #3008. but ths ensures v4 behavior of # sig collect/sig check works, where manifest paths interpreted relative # to cwd. 
diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index 6e81a96fd1..3d584a10b8 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -19,7 +19,15 @@ def test_sig_collect_0_nothing(runtmp, manifest_db_format, abspath_relpath_v4): if manifest_db_format != "sql": return - runtmp.sourmash("sig", "collect", "-o", f"mf.{ext}", "-F", manifest_db_format, abspath_relpath_v4) + runtmp.sourmash( + "sig", + "collect", + "-o", + f"mf.{ext}", + "-F", + manifest_db_format, + abspath_relpath_v4, + ) manifest_fn = runtmp.output(f"mf.{ext}") manifest = BaseCollectionManifest.load_from_filename(manifest_fn) @@ -34,7 +42,14 @@ def test_sig_collect_1_zipfile(runtmp, manifest_db_format, abspath_relpath_v4): ext = "sqlmf" if manifest_db_format == "sql" else "csv" runtmp.sourmash( - "sig", "collect", protzip, "-o", f"mf.{ext}", "-F", manifest_db_format, abspath_relpath_v4 + "sig", + "collect", + protzip, + "-o", + f"mf.{ext}", + "-F", + manifest_db_format, + abspath_relpath_v4, ) manifest_fn = runtmp.output(f"mf.{ext}") @@ -50,7 +65,9 @@ def test_sig_collect_1_zipfile_csv_gz(runtmp, abspath_relpath_v4): # collect a manifest from a .zip file, save to csv.gz protzip = utils.get_test_data("prot/protein.zip") - runtmp.sourmash("sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv", abspath_relpath_v4) + runtmp.sourmash( + "sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv", abspath_relpath_v4 + ) manifest_fn = runtmp.output("mf.csv.gz") @@ -71,7 +88,9 @@ def test_sig_collect_1_zipfile_csv_gz_roundtrip(runtmp, abspath_relpath_v4): # collect a manifest from a .zip file, save to csv.gz; then load again protzip = utils.get_test_data("prot/protein.zip") - runtmp.sourmash("sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv", abspath_relpath_v4) + runtmp.sourmash( + "sig", "collect", protzip, "-o", "mf.csv.gz", "-F", "csv", abspath_relpath_v4 + ) manifest_fn = runtmp.output("mf.csv.gz") @@ -133,7 +152,14 @@ 
def test_sig_collect_2_exists_merge(runtmp, manifest_db_format, abspath_relpath_ ext = "sqlmf" if manifest_db_format == "sql" else "csv" runtmp.sourmash( - "sig", "collect", protzip, "-o", f"mf.{ext}", "-F", manifest_db_format, abspath_relpath_v4, + "sig", + "collect", + protzip, + "-o", + f"mf.{ext}", + "-F", + manifest_db_format, + abspath_relpath_v4, ) manifest_fn = runtmp.output(f"mf.{ext}") @@ -215,7 +241,15 @@ def test_sig_collect_2_no_exists_merge(runtmp, manifest_db_format, abspath_relpa # run with --merge but no previous: runtmp.sourmash( - "sig", "collect", allzip, "-o", manifest_fn, "-F", manifest_db_format, "--merge", abspath_relpath_v4, + "sig", + "collect", + allzip, + "-o", + manifest_fn, + "-F", + manifest_db_format, + "--merge", + abspath_relpath_v4, ) manifest = BaseCollectionManifest.load_from_filename(manifest_fn) @@ -424,7 +458,9 @@ def test_sig_collect_5_no_manifest_sbt_fail(runtmp, manifest_db_format): ) -def test_sig_collect_5_no_manifest_sbt_succeed(runtmp, manifest_db_format, abspath_relpath_v4): +def test_sig_collect_5_no_manifest_sbt_succeed( + runtmp, manifest_db_format, abspath_relpath_v4 +): # generate a manifest from files that don't have one when --no-require sbt_zip = utils.get_test_data("v6.sbt.zip") @@ -439,7 +475,7 @@ def test_sig_collect_5_no_manifest_sbt_succeed(runtmp, manifest_db_format, abspa f"mf.{ext}", "-F", manifest_db_format, - abspath_relpath_v4 + abspath_relpath_v4, ) manifest_fn = runtmp.output(f"mf.{ext}") From b185be914d4a51b77f0657e8c945450d2021b856 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sun, 3 Mar 2024 09:51:37 -0500 Subject: [PATCH 14/30] add abspath/relpath tests for sig collect --- src/sourmash/sig/__main__.py | 2 +- tests/test_cmd_signature_collect.py | 78 +++++++++++++++++++++++++++++ 2 files changed, 79 insertions(+), 1 deletion(-) diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index da7bb86278..1d4a861912 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1615,7 +1615,7 @@ def collect(args): for n_files, loc in enumerate(args.locations): notify(f"Loading signature information from {loc}.") - if n_files % 100 == 0: + if n_files and n_files % 100 == 0: notify(f"... loaded {len(collected_mf)} sigs from {n_files} files") idx = sourmash.load_file_as_index(loc) if idx.manifest is None and require_manifest: diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index 6e81a96fd1..d063ef9759 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -449,3 +449,81 @@ def test_sig_collect_5_no_manifest_sbt_succeed(runtmp, manifest_db_format, abspa locations = set([row["internal_location"] for row in manifest.rows]) assert len(locations) == 1, locations assert sbt_zip in locations + + +def test_sig_collect_6_path_cwd_cwd(runtmp, manifest_db_format, abspath_relpath_v4): + # check: manifest and sigs in cwd + protzip = utils.get_test_data("prot/protein.zip") + + ext = "sqlmf" if manifest_db_format == "sql" else "csv" + + protzip_path = "protein.zip" + shutil.copyfile(protzip, runtmp.output(protzip_path)) + + mf_path = f"mf.{ext}" + + runtmp.sourmash( + "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_relpath_v4 + ) + + runtmp.sourmash("sig", "cat", mf_path) + + +def test_sig_collect_6_path_cwd_subdir(runtmp, manifest_db_format, abspath_relpath_v4): + # check: manifest in cwd, sigs in subdir + protzip = utils.get_test_data("prot/protein.zip") + + ext = "sqlmf" if manifest_db_format == 
"sql" else "csv" + + os.mkdir(runtmp.output("sigs_dir")) + protzip_path = "sigs_dir/protein.zip" + shutil.copyfile(protzip, runtmp.output(protzip_path)) + + mf_path = f"mf.{ext}" + + runtmp.sourmash( + "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_relpath_v4 + ) + + runtmp.sourmash("sig", "cat", mf_path) + + +def test_sig_collect_6_path_subdir_cwd(runtmp, manifest_db_format, abspath_or_relpath): + # check: manifest in cwd, sigs in subdir. note, fails with default v4 + # behavior. see #3008. + protzip = utils.get_test_data("prot/protein.zip") + + ext = "sqlmf" if manifest_db_format == "sql" else "csv" + + protzip_path = "protein.zip" + shutil.copyfile(protzip, runtmp.output(protzip_path)) + + os.mkdir(runtmp.output("mf_dir")) + mf_path = f"mf_dir/mf.{ext}" + + runtmp.sourmash( + "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_or_relpath, + ) + + runtmp.sourmash("sig", "cat", mf_path) + + +def test_sig_collect_6_path_subdir_subdir(runtmp, manifest_db_format, abspath_or_relpath): + # check: manifest and sigs in subdir. note, fails with default v4 + # behavior. see #3008. 
+ protzip = utils.get_test_data("prot/protein.zip") + + ext = "sqlmf" if manifest_db_format == "sql" else "csv" + + os.mkdir(runtmp.output("sigs_dir")) + protzip_path = "sigs_dir/protein.zip" + shutil.copyfile(protzip, runtmp.output(protzip_path)) + + os.mkdir(runtmp.output("mf_dir")) + mf_path = f"mf_dir/mf.{ext}" + + runtmp.sourmash( + "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_or_relpath, + ) + + runtmp.sourmash("sig", "cat", mf_path) From 28e0a5b3516f718b9038fde80ed53f9f1c2dc9fc Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Sun, 3 Mar 2024 14:52:39 +0000 Subject: [PATCH 15/30] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- tests/test_cmd_signature_collect.py | 40 +++++++++++++++++++++++++---- 1 file changed, 35 insertions(+), 5 deletions(-) diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index 41301ecb5a..b30ac1b8a0 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -499,7 +499,14 @@ def test_sig_collect_6_path_cwd_cwd(runtmp, manifest_db_format, abspath_relpath_ mf_path = f"mf.{ext}" runtmp.sourmash( - "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_relpath_v4 + "sig", + "collect", + protzip_path, + "-o", + mf_path, + "-F", + manifest_db_format, + abspath_relpath_v4, ) runtmp.sourmash("sig", "cat", mf_path) @@ -518,7 +525,14 @@ def test_sig_collect_6_path_cwd_subdir(runtmp, manifest_db_format, abspath_relpa mf_path = f"mf.{ext}" runtmp.sourmash( - "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_relpath_v4 + "sig", + "collect", + protzip_path, + "-o", + mf_path, + "-F", + manifest_db_format, + abspath_relpath_v4, ) runtmp.sourmash("sig", "cat", mf_path) @@ -538,13 +552,22 @@ def test_sig_collect_6_path_subdir_cwd(runtmp, manifest_db_format, abspath_or_re 
mf_path = f"mf_dir/mf.{ext}" runtmp.sourmash( - "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_or_relpath, + "sig", + "collect", + protzip_path, + "-o", + mf_path, + "-F", + manifest_db_format, + abspath_or_relpath, ) runtmp.sourmash("sig", "cat", mf_path) -def test_sig_collect_6_path_subdir_subdir(runtmp, manifest_db_format, abspath_or_relpath): +def test_sig_collect_6_path_subdir_subdir( + runtmp, manifest_db_format, abspath_or_relpath +): # check: manifest and sigs in subdir. note, fails with default v4 # behavior. see #3008. protzip = utils.get_test_data("prot/protein.zip") @@ -559,7 +582,14 @@ def test_sig_collect_6_path_subdir_subdir(runtmp, manifest_db_format, abspath_or mf_path = f"mf_dir/mf.{ext}" runtmp.sourmash( - "sig", "collect", protzip_path, "-o", mf_path, "-F", manifest_db_format, abspath_or_relpath, + "sig", + "collect", + protzip_path, + "-o", + mf_path, + "-F", + manifest_db_format, + abspath_or_relpath, ) runtmp.sourmash("sig", "cat", mf_path) From e9b9a34386833e9455d21cb4b5eebc734fe4f25b Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 3 Mar 2024 10:26:35 -0500 Subject: [PATCH 16/30] fix/update comments --- src/sourmash/sig/__main__.py | 9 +++++---- tests/conftest.py | 4 ++-- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 1d4a861912..69e89cc56f 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1460,10 +1460,11 @@ def check(args): # convert to abspath new_iloc = os.path.abspath(filename) elif args.relpath: - # interpret paths relative to manifest directory + # interpret paths relative to manifest directory. new_iloc = os.path.join(relpath, filename) else: - # default: paths are relative to cwd + # default: paths are relative to cwd. This breaks when sketches + # are in subdirectories; will be deprecated for v5. 
new_iloc = filename idx = sourmash_args.load_file_as_index(filename, yield_all_files=args.force) @@ -1483,7 +1484,6 @@ def check(args): # rewrite locations so that each signature can be found by filename # of its container; this follows `sig collect` logic. - # CTB: note that this is relative to cwd, not manifest location. for row in sub_manifest.rows: row["internal_location"] = new_iloc total_manifest_rows.add_row(row) @@ -1636,7 +1636,8 @@ def collect(args): # interpret paths relative to manifest directory new_iloc = os.path.join(relpath, loc) else: - # default: paths are relative to cwd + # default: paths are relative to cwd. This breaks when sketches + # are in subdirectories; will be deprecated for v5. new_iloc = loc for row in mf.rows: diff --git a/tests/conftest.py b/tests/conftest.py index 7ef08e9b71..4274382e74 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -96,8 +96,8 @@ def abspath_or_relpath(request): return request.param -# this will fail if subdirs used; see #3008. but ths ensures v4 behavior of -# sig collect/sig check works, where manifest paths interpreted relative +# this will fail if subdirs used; see #3008. but this ensures v4 behavior of +# sig collect/sig check works, where manifest paths are interpreted relative # to cwd. @pytest.fixture(params=["--no-abspath", "--abspath", "--relpath"]) def abspath_relpath_v4(request): From f88c662c135031f7c5974ba9a8e5a0b1ba5e2ae1 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sun, 3 Mar 2024 14:54:10 -0500 Subject: [PATCH 17/30] update docs for sig check and sig collect --- doc/command-line.md | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index d511c35796..26f865d714 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -551,7 +551,7 @@ sourmash multigather --query --db ``` Note that multigather is single threaded, so it offers no substantial -efficiency gains over just running gather multiple times! Nontheless, it +efficiency gains over just running gather multiple times! Nonetheless, it is useful for situations where you have many sketches organized in a combined file, e.g. sketches built with `sourmash sketch ... --singleton`). @@ -1956,13 +1956,19 @@ picklist CSV. With `--save-manifest-matching`, `sig check` will save all of the _matched_ elements to a manifest file, which can then be used as a sourmash database. -When saving manifests with matched elements, sourmash will by default -not rewrite paths to the containers for the matched elements. This will -create buggy manifests ... @CTB. - `sourmash sig check` is particularly useful when working with large collections of signatures and identifiers. +With `-m/--save-manifest-matching`, `sig check` creates a standalone +manifest. In these manifests, sourmash v4 will by default write paths +to the matched elements that are relative to the current working +directory. In some cases - when the matched elements are in +subdirectories - this will create manifests that do not work properly +with sourmash. The `--relpath` argument will rewrite the paths to be +relative to the manifest, while the `--abspath` argument will rewrite +paths to be absolute. The `--relpath` behavior will be the default in +sourmash v5. 
+ ### `sourmash signature collect` - collect manifests across databases Collect manifests from across (many) files and merge into a single @@ -1981,6 +1987,15 @@ This manifest file can be loaded directly from the command line by sourmash. particularly useful when working with large collections of signatures and identifiers, and has command line options for merging and updating manifests. +As with `sig check`, the standalone manifests created by `sig collect` +in sourmash v4 will by default write paths to the matched elements +relative to the current working directory. When the matched elements +are in subdirectories this will create manifests that do not work +properly with sourmash. The `--relpath` argument will rewrite the +paths to be relative to the manifest, while the `--abspath` argument +will rewrite paths to be absolute. The `--relpath` behavior will be +the default in sourmash v5. + ## Advanced command-line usage ### Loading signatures and databases From 4e108020d5f7507cf82facda02c5250eb17cad3b Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 3 Mar 2024 14:56:33 -0500 Subject: [PATCH 18/30] add in some documentation about relpath in sourmash internals --- doc/sourmash-internals.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index a615844290..4001f70336 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -368,7 +368,10 @@ Thus, while standalone manifests can point at any kind of container, including JSON files or LCA databases, they are most efficient when `internal_location` points at a file with either a single sketch in it, or a manifest that supports direct loading of sketches. Therefore, -we suggest using standalone manifest indices. +we suggest using standalone manifest indices. 
Note that sourmash +interprets paths to locations in standalone manifests relative to the +manifest filename; see the `--relpath` behavior in `sig check` and +`sig collect` for details. Note that searching a standalone manifest is currently done through a linear iteration, and does not use any features of indexed containers From 6dd3e3a45843781061b01cdaaa5917e64af764b7 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 3 Mar 2024 15:26:03 -0500 Subject: [PATCH 19/30] path rewriting tests for sig check --- tests/test_cmd_signature.py | 154 ++++++++++++++++++++++++++++++++++++ 1 file changed, 154 insertions(+) diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index 921050b16e..010459454d 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -5486,3 +5486,157 @@ def test_sig_check_4_manifest_subdir_subdir(runtmp, abspath_or_relpath): # check that it all works runtmp.sourmash("sig", "cat", "mf_dir/mf.csv") + + +def test_sig_check_5_relpath(runtmp): + # check path rewriting when sketches are in a subdir. + # this will be the default behavior in v5 => remove --relpath. + sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) + picklist = utils.get_test_data("gather/salmonella-picklist.csv") + + os.mkdir(runtmp.output('mf_dir')) + + os.mkdir(runtmp.output('sigs_dir')) + new_names = [] + for f in sigfiles: + basename = os.path.basename(f) + filename = os.path.join('sigs_dir', basename) + + shutil.copyfile(f, runtmp.output(filename)) + new_names.append(filename) + + runtmp.sourmash( + "sig", + "check", + *new_names, + "--picklist", + f"{picklist}::manifest", + "-m", + "mf_dir/mf.csv", + "--relpath" + ) + + out_mf = runtmp.output("mf_dir/mf.csv") + assert os.path.exists(out_mf) + + # all should match. 
+ with open(out_mf, newline="") as fp: + mf = CollectionManifest.load_from_csv(fp) + assert len(mf) == 24 + + locations = [ row['internal_location'] for row in mf.rows ] + expected_names = [ '../' + f for f in new_names ] + assert set(locations).issubset(expected_names), (locations, expected_names) + + +def test_sig_check_5_relpath_subdir(runtmp): + # check path rewriting when both sigs and mf are in different subdirs. + # this will be the default behavior in v5 => can remove --relpath then. + sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) + picklist = utils.get_test_data("gather/salmonella-picklist.csv") + + os.mkdir(runtmp.output('sigs_dir')) + new_names = [] + for f in sigfiles: + basename = os.path.basename(f) + filename = os.path.join('sigs_dir', basename) + + shutil.copyfile(f, runtmp.output(filename)) + new_names.append(filename) + + runtmp.sourmash( + "sig", + "check", + *new_names, + "--picklist", + f"{picklist}::manifest", + "-m", + "mf.csv", + "--relpath" + ) + + out_mf = runtmp.output("mf.csv") + assert os.path.exists(out_mf) + + # all should match. + with open(out_mf, newline="") as fp: + mf = CollectionManifest.load_from_csv(fp) + assert len(mf) == 24 + + locations = [ row['internal_location'] for row in mf.rows ] + print('XXX', locations) + print('YYY', new_names) + expected_names = [ './' + f for f in new_names ] + assert set(locations).issubset(expected_names), (locations, expected_names) + + +def test_sig_check_5_abspath(runtmp): + # check path rewriting with `--abspath` => absolute paths. 
+ sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) + picklist = utils.get_test_data("gather/salmonella-picklist.csv") + + for f in sigfiles: + shutil.copyfile(f, runtmp.output(os.path.basename(f))) + + # strip off abspath + sigfiles = [ os.path.basename(f) for f in sigfiles ] + + runtmp.sourmash( + "sig", + "check", + *sigfiles, + "--picklist", + f"{picklist}::manifest", + "-m", + "mf.csv", + "--abspath" + ) + + out_mf = runtmp.output("mf.csv") + assert os.path.exists(out_mf) + + # all should match. + with open(out_mf, newline="") as fp: + mf = CollectionManifest.load_from_csv(fp) + assert len(mf) == 24 + + locations = [ row['internal_location'] for row in mf.rows ] + for k in locations: + assert k.startswith('/') # absolute + assert os.path.basename(k) in sigfiles # converts back to basic + + +def test_sig_check_5_no_abspath(runtmp): + # check path rewriting for default (--no-relpath --no-abspath) + # this behavior will change in v5; specify `--no-abspath` then? + sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) + picklist = utils.get_test_data("gather/salmonella-picklist.csv") + + for f in sigfiles: + shutil.copyfile(f, runtmp.output(os.path.basename(f))) + + # strip off abspath + sigfiles = [ os.path.basename(f) for f in sigfiles ] + + runtmp.sourmash( + "sig", + "check", + *sigfiles, + "--picklist", + f"{picklist}::manifest", + "-m", + "mf.csv", + # "--no-abspath" # => default behavior + ) + + out_mf = runtmp.output("mf.csv") + assert os.path.exists(out_mf) + + # all should match. + with open(out_mf, newline="") as fp: + mf = CollectionManifest.load_from_csv(fp) + assert len(mf) == 24 + + locations = [ row['internal_location'] for row in mf.rows ] + # no rewriting + assert set(locations).issubset(sigfiles) From e3bb509729c6aa867d1c0d948ccd8c6e453b1854 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sun, 3 Mar 2024 15:37:22 -0500 Subject: [PATCH 20/30] update docs --- doc/command-line.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 26f865d714..71173792cf 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1962,8 +1962,8 @@ collections of signatures and identifiers. With `-m/--save-manifest-matching`, `sig check` creates a standalone manifest. In these manifests, sourmash v4 will by default write paths to the matched elements that are relative to the current working -directory. In some cases - when the matched elements are in -subdirectories - this will create manifests that do not work properly +directory. In some cases - when the output manifest is in a different +directory - this will create manifests that do not work properly with sourmash. The `--relpath` argument will rewrite the paths to be relative to the manifest, while the `--abspath` argument will rewrite paths to be absolute. The `--relpath` behavior will be the default in @@ -1989,8 +1989,8 @@ identifiers, and has command line options for merging and updating manifests. As with `sig check`, the standalone manifests created by `sig collect` in sourmash v4 will by default write paths to the matched elements -relative to the current working directory. When the matched elements -are in subdirectories this will create manifests that do not work +relative to the current working directory. When the output manifest +is in a different directory, this will create manifests that do not work properly with sourmash. The `--relpath` argument will rewrite the paths to be relative to the manifest, while the `--abspath` argument will rewrite paths to be absolute. The `--relpath` behavior will be From 703585cbca8d9bbd6ade6c15cd31a6c603764c34 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sun, 3 Mar 2024 15:37:29 -0500 Subject: [PATCH 21/30] add --relpath tests for sig collect --- tests/test_cmd_signature_collect.py | 88 +++++++++++++++++++++++++++++ 1 file changed, 88 insertions(+) diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index b30ac1b8a0..fb44a64e19 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -445,6 +445,94 @@ def test_sig_collect_4_multiple_no_abspath(runtmp, manifest_db_format): assert "47.fa.sig" in locations assert "63.fa.sig" in locations + runtmp.sourmash("sig", "cat", f"mf.{ext}") + + +def test_sig_collect_4_multiple_subdir_subdir_no_abspath(runtmp, manifest_db_format): + # collect a manifest from sig files, no abspath; use a subdir for sketches + # this should work with default behavior. + sig43 = utils.get_test_data("47.fa.sig") + sig63 = utils.get_test_data("63.fa.sig") + + # copy files to tmp, where they will not have full paths + os.mkdir(runtmp.output('sigs_dir')) + shutil.copyfile(sig43, runtmp.output("sigs_dir/47.fa.sig")) + shutil.copyfile(sig63, runtmp.output("sigs_dir/63.fa.sig")) + + # put manifest in subdir too. 
+ os.mkdir(runtmp.output('mf_dir')) + + ext = "sqlmf" if manifest_db_format == "sql" else "csv" + + runtmp.sourmash( + "sig", + "collect", + "sigs_dir/47.fa.sig", + "sigs_dir/63.fa.sig", + "-o", + f"mf_dir/mf.{ext}", + "-F", + manifest_db_format, + "--relpath" + ) + + manifest_fn = runtmp.output(f"mf_dir/mf.{ext}") + manifest = BaseCollectionManifest.load_from_filename(manifest_fn) + + assert len(manifest) == 2 + md5_list = [row["md5"] for row in manifest.rows] + assert "09a08691ce52952152f0e866a59f6261" in md5_list + assert "38729c6374925585db28916b82a6f513" in md5_list + + locations = set([row["internal_location"] for row in manifest.rows]) + print(locations) + assert len(locations) == 2, locations + assert "../sigs_dir/47.fa.sig" in locations + assert "../sigs_dir/63.fa.sig" in locations + + runtmp.sourmash("sig", "cat", f"mf_dir/mf.{ext}") + + +def test_sig_collect_4_multiple_cwd_subdir_no_abspath(runtmp, manifest_db_format): + # collect a manifest from sig files, no abspath; use a subdir for sketches + # this should work with default behavior. 
+ sig43 = utils.get_test_data("47.fa.sig") + sig63 = utils.get_test_data("63.fa.sig") + + # copy files to tmp, where they will not have full paths + os.mkdir(runtmp.output('sigs_dir')) + shutil.copyfile(sig43, runtmp.output("sigs_dir/47.fa.sig")) + shutil.copyfile(sig63, runtmp.output("sigs_dir/63.fa.sig")) + + ext = "sqlmf" if manifest_db_format == "sql" else "csv" + + runtmp.sourmash( + "sig", + "collect", + "sigs_dir/47.fa.sig", + "sigs_dir/63.fa.sig", + "-o", + f"mf.{ext}", + "-F", + manifest_db_format, + ) + + manifest_fn = runtmp.output(f"mf.{ext}") + manifest = BaseCollectionManifest.load_from_filename(manifest_fn) + + assert len(manifest) == 2 + md5_list = [row["md5"] for row in manifest.rows] + assert "09a08691ce52952152f0e866a59f6261" in md5_list + assert "38729c6374925585db28916b82a6f513" in md5_list + + locations = set([row["internal_location"] for row in manifest.rows]) + print(locations) + assert len(locations) == 2, locations + assert "sigs_dir/47.fa.sig" in locations + assert "sigs_dir/63.fa.sig" in locations + + runtmp.sourmash("sig", "cat", f"mf.{ext}") + def test_sig_collect_5_no_manifest_sbt_fail(runtmp, manifest_db_format): # collect a manifest from files that don't have one From 346ed82bcda02cdb89af9950be6abb7effd36d81 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Sun, 3 Mar 2024 20:37:45 +0000 Subject: [PATCH 22/30] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- tests/test_cmd_signature.py | 40 ++++++++++++++--------------- tests/test_cmd_signature_collect.py | 8 +++--- 2 files changed, 24 insertions(+), 24 deletions(-) diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index 010459454d..1098c07f93 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -5494,13 +5494,13 @@ def test_sig_check_5_relpath(runtmp): sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) 
picklist = utils.get_test_data("gather/salmonella-picklist.csv") - os.mkdir(runtmp.output('mf_dir')) + os.mkdir(runtmp.output("mf_dir")) - os.mkdir(runtmp.output('sigs_dir')) + os.mkdir(runtmp.output("sigs_dir")) new_names = [] for f in sigfiles: basename = os.path.basename(f) - filename = os.path.join('sigs_dir', basename) + filename = os.path.join("sigs_dir", basename) shutil.copyfile(f, runtmp.output(filename)) new_names.append(filename) @@ -5513,7 +5513,7 @@ def test_sig_check_5_relpath(runtmp): f"{picklist}::manifest", "-m", "mf_dir/mf.csv", - "--relpath" + "--relpath", ) out_mf = runtmp.output("mf_dir/mf.csv") @@ -5524,8 +5524,8 @@ def test_sig_check_5_relpath(runtmp): mf = CollectionManifest.load_from_csv(fp) assert len(mf) == 24 - locations = [ row['internal_location'] for row in mf.rows ] - expected_names = [ '../' + f for f in new_names ] + locations = [row["internal_location"] for row in mf.rows] + expected_names = ["../" + f for f in new_names] assert set(locations).issubset(expected_names), (locations, expected_names) @@ -5535,11 +5535,11 @@ def test_sig_check_5_relpath_subdir(runtmp): sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") - os.mkdir(runtmp.output('sigs_dir')) + os.mkdir(runtmp.output("sigs_dir")) new_names = [] for f in sigfiles: basename = os.path.basename(f) - filename = os.path.join('sigs_dir', basename) + filename = os.path.join("sigs_dir", basename) shutil.copyfile(f, runtmp.output(filename)) new_names.append(filename) @@ -5552,7 +5552,7 @@ def test_sig_check_5_relpath_subdir(runtmp): f"{picklist}::manifest", "-m", "mf.csv", - "--relpath" + "--relpath", ) out_mf = runtmp.output("mf.csv") @@ -5563,10 +5563,10 @@ def test_sig_check_5_relpath_subdir(runtmp): mf = CollectionManifest.load_from_csv(fp) assert len(mf) == 24 - locations = [ row['internal_location'] for row in mf.rows ] - print('XXX', locations) - print('YYY', new_names) - expected_names = [ './' + f 
for f in new_names ] + locations = [row["internal_location"] for row in mf.rows] + print("XXX", locations) + print("YYY", new_names) + expected_names = ["./" + f for f in new_names] assert set(locations).issubset(expected_names), (locations, expected_names) @@ -5579,7 +5579,7 @@ def test_sig_check_5_abspath(runtmp): shutil.copyfile(f, runtmp.output(os.path.basename(f))) # strip off abspath - sigfiles = [ os.path.basename(f) for f in sigfiles ] + sigfiles = [os.path.basename(f) for f in sigfiles] runtmp.sourmash( "sig", @@ -5589,7 +5589,7 @@ def test_sig_check_5_abspath(runtmp): f"{picklist}::manifest", "-m", "mf.csv", - "--abspath" + "--abspath", ) out_mf = runtmp.output("mf.csv") @@ -5600,10 +5600,10 @@ def test_sig_check_5_abspath(runtmp): mf = CollectionManifest.load_from_csv(fp) assert len(mf) == 24 - locations = [ row['internal_location'] for row in mf.rows ] + locations = [row["internal_location"] for row in mf.rows] for k in locations: - assert k.startswith('/') # absolute - assert os.path.basename(k) in sigfiles # converts back to basic + assert k.startswith("/") # absolute + assert os.path.basename(k) in sigfiles # converts back to basic def test_sig_check_5_no_abspath(runtmp): @@ -5616,7 +5616,7 @@ def test_sig_check_5_no_abspath(runtmp): shutil.copyfile(f, runtmp.output(os.path.basename(f))) # strip off abspath - sigfiles = [ os.path.basename(f) for f in sigfiles ] + sigfiles = [os.path.basename(f) for f in sigfiles] runtmp.sourmash( "sig", @@ -5637,6 +5637,6 @@ def test_sig_check_5_no_abspath(runtmp): mf = CollectionManifest.load_from_csv(fp) assert len(mf) == 24 - locations = [ row['internal_location'] for row in mf.rows ] + locations = [row["internal_location"] for row in mf.rows] # no rewriting assert set(locations).issubset(sigfiles) diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index fb44a64e19..54e0602638 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -455,12 
+455,12 @@ def test_sig_collect_4_multiple_subdir_subdir_no_abspath(runtmp, manifest_db_for sig63 = utils.get_test_data("63.fa.sig") # copy files to tmp, where they will not have full paths - os.mkdir(runtmp.output('sigs_dir')) + os.mkdir(runtmp.output("sigs_dir")) shutil.copyfile(sig43, runtmp.output("sigs_dir/47.fa.sig")) shutil.copyfile(sig63, runtmp.output("sigs_dir/63.fa.sig")) # put manifest in subdir too. - os.mkdir(runtmp.output('mf_dir')) + os.mkdir(runtmp.output("mf_dir")) ext = "sqlmf" if manifest_db_format == "sql" else "csv" @@ -473,7 +473,7 @@ def test_sig_collect_4_multiple_subdir_subdir_no_abspath(runtmp, manifest_db_for f"mf_dir/mf.{ext}", "-F", manifest_db_format, - "--relpath" + "--relpath", ) manifest_fn = runtmp.output(f"mf_dir/mf.{ext}") @@ -500,7 +500,7 @@ def test_sig_collect_4_multiple_cwd_subdir_no_abspath(runtmp, manifest_db_format sig63 = utils.get_test_data("63.fa.sig") # copy files to tmp, where they will not have full paths - os.mkdir(runtmp.output('sigs_dir')) + os.mkdir(runtmp.output("sigs_dir")) shutil.copyfile(sig43, runtmp.output("sigs_dir/47.fa.sig")) shutil.copyfile(sig63, runtmp.output("sigs_dir/63.fa.sig")) From 637b1fcda1280f22ace5b1f5aa47afd4e91ef1fe Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 3 Mar 2024 15:58:28 -0500 Subject: [PATCH 23/30] a few more tests --- doc/sourmash-internals.md | 3 ++- src/sourmash/sig/__main__.py | 5 ++++ tests/test_cmd_signature.py | 37 +++++++++++++++++++++++++++++ tests/test_cmd_signature_collect.py | 19 +++++++++++++++ 4 files changed, 63 insertions(+), 1 deletion(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 4001f70336..ebf66ae219 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -371,7 +371,8 @@ it, or a manifest that supports direct loading of sketches. Therefore, we suggest using standalone manifest indices. 
Note that sourmash interprets paths to locations in standalone manifests relative to the manifest filename; see the `--relpath` behavior in `sig check` and -`sig collect` for details. +`sig collect` to output manifests that deal with relative filenames +properly. Note that searching a standalone manifest is currently done through a linear iteration, and does not use any features of indexed containers diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 69e89cc56f..8a8168607e 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1562,6 +1562,11 @@ def collect(args): f"WARNING: --merge-previous specified, but output file '{args.output}' does not already exist?" ) + # abspath/relpath checks + if args.abspath and args.relpath: + error("** Cannot specify both --abspath and --relpath; pick one!") + sys.exit(-1) + # load previous manifest for --merge-previous. This gets tricky with # mismatched manifest types, which we forbid. try: diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index 1098c07f93..68ff5027bb 100644 --- a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -4833,6 +4833,43 @@ def test_sig_check_1(runtmp, abspath_relpath_v4): assert 31 in ksizes +def test_sig_check_1_fail_abspath_relpath(runtmp): + # basic check functionality + sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) + picklist = utils.get_test_data("gather/salmonella-picklist.csv") + + with pytest.raises(SourmashCommandFailed, + match="Cannot specify both --abspath and --relpath; pick one!"): + runtmp.sourmash( + "sig", + "check", + *sigfiles, + "--picklist", + f"{picklist}::manifest", + "-m", + "mf.csv", + "--abspath", "--relpath" + ) + + +def test_sig_check_1_warn_abspath_relpath(runtmp, abspath_or_relpath): + # warn that without -m, --abspath/--relpath are not helpful + sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) + picklist = utils.get_test_data("gather/salmonella-picklist.csv") 
+ + runtmp.sourmash( + "sig", + "check", + *sigfiles, + "--picklist", + f"{picklist}::manifest", + abspath_or_relpath, + ) + + err = runtmp.last_result.err + assert " WARNING: --abspath and --relpath only have effects when saving a manifest" in err + + def test_sig_check_1_mf_csv_gz(runtmp, abspath_relpath_v4): # basic check functionality, with gzipped manifest output sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index 54e0602638..6235918a96 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -35,6 +35,25 @@ def test_sig_collect_0_nothing(runtmp, manifest_db_format, abspath_relpath_v4): assert len(manifest) == 0 +def test_sig_collect_0_fail_abspath_relpath(runtmp, manifest_db_format): + # check that it complains if both --abspath and --relpath are specified + ext = "sqlmf" if manifest_db_format == "sql" else "csv" + if manifest_db_format != "sql": + return + + with pytest.raises(SourmashCommandFailed, + match="Cannot specify both --abspath and --relpath; pick one!"): + runtmp.sourmash( + "sig", + "collect", + "-o", + f"mf.{ext}", + "-F", + manifest_db_format, + "--abspath", "--relpath" + ) + + def test_sig_collect_1_zipfile(runtmp, manifest_db_format, abspath_relpath_v4): # collect a manifest from a .zip file protzip = utils.get_test_data("prot/protein.zip") From 8b1bc2ccff13cf8b29c3e5ccac9eaf7e96e4d7fa Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Sun, 3 Mar 2024 20:58:41 +0000 Subject: [PATCH 24/30] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- tests/test_cmd_signature.py | 14 ++++++++++---- tests/test_cmd_signature_collect.py | 9 ++++++--- 2 files changed, 16 insertions(+), 7 deletions(-) diff --git a/tests/test_cmd_signature.py b/tests/test_cmd_signature.py index 68ff5027bb..9f14b6df58 100644 --- 
a/tests/test_cmd_signature.py +++ b/tests/test_cmd_signature.py @@ -4838,8 +4838,10 @@ def test_sig_check_1_fail_abspath_relpath(runtmp): sigfiles = glob.glob(utils.get_test_data("gather/GCF*.sig")) picklist = utils.get_test_data("gather/salmonella-picklist.csv") - with pytest.raises(SourmashCommandFailed, - match="Cannot specify both --abspath and --relpath; pick one!"): + with pytest.raises( + SourmashCommandFailed, + match="Cannot specify both --abspath and --relpath; pick one!", + ): runtmp.sourmash( "sig", "check", @@ -4848,7 +4850,8 @@ def test_sig_check_1_fail_abspath_relpath(runtmp): f"{picklist}::manifest", "-m", "mf.csv", - "--abspath", "--relpath" + "--abspath", + "--relpath", ) @@ -4867,7 +4870,10 @@ def test_sig_check_1_warn_abspath_relpath(runtmp, abspath_or_relpath): ) err = runtmp.last_result.err - assert " WARNING: --abspath and --relpath only have effects when saving a manifest" in err + assert ( + " WARNING: --abspath and --relpath only have effects when saving a manifest" + in err + ) def test_sig_check_1_mf_csv_gz(runtmp, abspath_relpath_v4): diff --git a/tests/test_cmd_signature_collect.py b/tests/test_cmd_signature_collect.py index 6235918a96..917286e881 100644 --- a/tests/test_cmd_signature_collect.py +++ b/tests/test_cmd_signature_collect.py @@ -41,8 +41,10 @@ def test_sig_collect_0_fail_abspath_relpath(runtmp, manifest_db_format): if manifest_db_format != "sql": return - with pytest.raises(SourmashCommandFailed, - match="Cannot specify both --abspath and --relpath; pick one!"): + with pytest.raises( + SourmashCommandFailed, + match="Cannot specify both --abspath and --relpath; pick one!", + ): runtmp.sourmash( "sig", "collect", @@ -50,7 +52,8 @@ def test_sig_collect_0_fail_abspath_relpath(runtmp, manifest_db_format): f"mf.{ext}", "-F", manifest_db_format, - "--abspath", "--relpath" + "--abspath", + "--relpath", ) From c9bf80e664623edcdde000a9327690389f98d1cf Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sun, 3 Mar 2024 20:38:25 -0500 Subject: [PATCH 25/30] more documentation foo --- doc/command-line.md | 24 ++++++++++++++++-------- doc/databases-advanced.md | 2 +- 2 files changed, 17 insertions(+), 9 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 86ffea8018..36cb365469 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1944,8 +1944,10 @@ CSV and SQLite manifest files. ### `sourmash signature check` - compare picklists and manifests -Compare picklists and manifests across databases, and optionally output matches -and missing items. +Compare picklists and manifests across databases, and optionally +output matches and missing items. In particular, `sig check` can be +used to create standalone manifests for a subset of a large collection, +using picklists. For example, ``` @@ -1964,7 +1966,7 @@ collections of signatures and identifiers. With `-m/--save-manifest-matching`, `sig check` creates a standalone manifest. In these manifests, sourmash v4 will by default write paths to the matched elements that are relative to the current working -directory. In some cases - when the output manifest is in different +directory. In some cases - when the output manifest is in a different directory - this will create manifests that do not work properly with sourmash. The `--relpath` argument will rewrite the paths to be relative to the manifest, while the `--abspath` argument will rewrite @@ -2293,14 +2295,20 @@ signatures themselves) and can then be used as a database target for most sourmash operations - search, gather, etc. Manifests support fast selection and lazy loading of sketches in many situations. 
-Note that `sig collect` will generate manifests containing the -pathnames given to it - so if you use relative paths, the references -will be relative to the working directory in which `sig collect` was +The `sig check` command can also be used to create standalone manifests +from collections using a picklist, with the `-m/--save-manifest-matching` +option. This is useful for commands that don't support picklists natively, +e.g. plugins and extensions. + +Note that `sig collect` and `sig check` will generate manifests containing the +pathnames given to them - so if you use relative paths, the references +will be relative to the working directory in which the command was run. You can use `sig collect --abspath` to rewrite the paths -into absolute paths. +into absolute paths, or `sig collect --relpath` to rewrite the paths +relative to the manifest file. **Our advice:** We suggest using zip file collections for most -situations; we stronlgy recommend using standalone manifests for +situations; we strongly recommend using standalone manifests for situations where you have **very large** sketches or a **very large** collection of sketches (1000s or more), and don't want to make multiple copies of signatures in the collection (as you would have to, diff --git a/doc/databases-advanced.md b/doc/databases-advanced.md index f3249c10ea..8107195686 100644 --- a/doc/databases-advanced.md +++ b/doc/databases-advanced.md @@ -75,7 +75,7 @@ To read from a directory, specify the directory name on the sourmash command lin When directories are specified as outputs, the signatures will be saved by their complete md5sum underneath the directory. -We don't recommend storing signatures in directory hierarchies, since most of their use cases are now covered by other approaches. 
+We don't recommend storing signatures in directory hierarchies, since the implementation is not particularly memory efficient most of the use cases for directories are now covered by other approaches - in particular, standalone manifests. ### Pathlists From 5d9f2193614a94a17362ef48fa6f046be9a07ed9 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 3 Mar 2024 21:02:55 -0500 Subject: [PATCH 26/30] more docs --- doc/command-line.md | 81 +++++++++++++++++++----------------- src/sourmash/sig/__main__.py | 2 - 2 files changed, 43 insertions(+), 40 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 36cb365469..b54a163b0e 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -2012,7 +2012,9 @@ the default in sourmash v5. ### Loading signatures and databases -sourmash uses several different command-line styles. +sourmash uses several different command-line styles. Most sourmash +commands can load sketches from any standard collection; we primarily +recommend using zipfiles (but read on!) Briefly, @@ -2023,22 +2025,18 @@ Briefly, need to provide a selector (ksize with `-k`, moltype with `--dna` etc, or md5sum with `--query-md5`) that picks out a single signature. -* `compare` takes multiple signatures and can load them from files, - directories, and indexed databases (SBT or LCA). It can also take - a list of file paths in a text file, using `--from-file` (see below). +* `compare` takes multiple signatures and can load them from any + sourmash collection type. * the `lca classify` and `lca summarize` commands take multiple signatures with `--query`, and multiple LCA databases, with `--db`. `sourmash multigather` also uses this style. This allows these commands to specify multiple queries **and** multiple databases without - (too much) confusion. These commands will take files containing - signature files using `--query-from-file` (see below). + (too much) confusion. The databases must be LCA databases.
* `index` and `lca index` take a few fixed parameters (database name, and for `lca index`, a taxonomy file) and then an arbitrary number of - other files that contain signatures, including files, directories, - and indexed databases. These commands will also take `--from-file` - (see below). + other files that contain signatures. None of these commands currently support searching, comparing, or indexing signatures with multiple ksizes or moltypes at the same time; you need @@ -2156,7 +2154,9 @@ slow, especially for many (100s or 1000s) of signatures. All of the `sourmash` commands support loading collections of signatures from zip files. You can create a compressed collection of signatures using `sourmash sig cat *.sig -o collections.zip` and then -specifying `collections.zip` on the command line in place of `*.sig`. +specifying `collections.zip` on the command line in place of `*.sig`; +you can also sketch FASTA/FASTQ files directly into a zip file with +`-o collections.zip`. ### Choosing signature output formats @@ -2183,7 +2183,7 @@ to stdout. All of these save formats can be loaded by sourmash commands. **We strongly suggest using .zip files to store signatures: they are fast, -small, and fully supported by all the sourmash commands.** +small, and fully supported by all the sourmash commands and API.** Note that when outputting large collections of signatures, some save formats require holding all the sketches in memory until they can be @@ -2198,19 +2198,6 @@ databases!](databases-advanced.md) ### Loading many signatures -#### Loading signatures within a directory hierarchy - -All of the `sourmash` commands support loading signatures from -beneath directories; provide the paths on the command line. - -#### Passing in lists of files - -Most sourmash commands will also take a `--from-file` or -`--query-from-file`, which will take the location of a text file containing -a list of file paths. 
This can be useful for situations where you want -to specify thousands of queries, or a subset of signatures produced by -some other command. - #### Indexed databases Indexed databases can make searching signatures much faster. SBT @@ -2221,9 +2208,6 @@ SQLite databases (new in sourmash v4.4.0) are typically larger on disk than SBTs and LCAs, but in turn are fast to load and support very low memory search. -(LCA databases also directly permit taxonomic searches using `sourmash lca` -functions.) - Commands that take multiple signatures or collections of signatures will also work with indexed databases. @@ -2235,9 +2219,9 @@ only at one scaled value. If the database signature type is incompatible with the other signatures, sourmash will complain appropriately. -In contrast, signature files, zip collections, and directory -hierarchies can contain many different types of signatures, and -compatible ones will be selected automatically. +In contrast, signature files and zip collections can contain many +different types of signatures, and compatible ones will be selected +automatically. Use the `sourmash index` command to create an SBT. @@ -2247,6 +2231,26 @@ database can be saved in JSON or SQL format with `-F json` or `-F sql`. Use `sourmash sig cat -o .sqldb` to create a SQLite indexed database. +#### Loading signatures within a directory hierarchy + +All of the `sourmash` commands support loading signatures from +within directories; provide the paths on the command line. + +This is no longer recommended; we instead suggest passing all of the +sketch files in the directory into `sig collect` to build a standalone +manifest, or using `sig cat` on the directory to generate a zip file. + +#### Passing in lists of files + +Most sourmash commands will also take a `--from-file` or +`--query-from-file`, which will take the location of a text file containing +a list of file paths. 
This can be useful for situations where you want +to specify thousands of queries, or a subset of signatures produced by +some other command. + +This is no longer recommended; we instead suggest using standalone manifests +built with `sig collect`. + ### Combining search databases on the command line All of the commands in sourmash operate in "online" mode, so you can @@ -2254,7 +2258,7 @@ combine multiple databases and signatures on the command line and get the same answer as if you built a single large database from all of them. The only caveat to this rule is that if you have multiple identical matches present across the databases, the order in which -they are found will differ depending on the order that the files are +they are used may depend on the order that the files are passed in on the command line. ### Using stdin @@ -2262,11 +2266,12 @@ passed in on the command line. Most commands will take signature JSON data via stdin using the usual UNIX convention, `-`. Moreover, `sourmash sketch` and the `sourmash sig` commands will output to stdout. So, for example, +``` +sourmash sketch ... -o - | sourmash sig describe - +``` +will describe the signatures that were just created. -`sourmash sketch ... -o - | sourmash sig describe -` will describe the -signatures that were just created. - -### Using manifests to explicitly refer to collections of files +### Using standalone manifests to explicitly refer to collections of files (sourmash v4.4 and later) @@ -2276,9 +2281,9 @@ internals to speed up signature selection through picklists and pattern matching. Manifests can _also_ be used externally (via the command-line), and -may be useful for organizing large collections of signatures. They can -be generated with the `sig collect`, `sig manifest`, and `sig check` -subcommands. +these "standalone manifests" may be useful for organizing large +collections of signatures. They can be generated with the `sig +collect`, `sig manifest`, and `sig check` subcommands. 
Suppose you have a large collection of signatures (`.sig` or `.sig.gz` files) in a location (e.g., under a directory, or in a zip file). You diff --git a/src/sourmash/sig/__main__.py b/src/sourmash/sig/__main__.py index 8a8168607e..261721e371 100644 --- a/src/sourmash/sig/__main__.py +++ b/src/sourmash/sig/__main__.py @@ -1546,8 +1546,6 @@ def check(args): def collect(args): "Collect signature metadata across many locations, save to manifest" - # TODO: - # test what happens with directories :) set_quiet(False, args.debug) if os.path.exists(args.output): From 3741203be8b04adfa7e274f28f1bb78931c0e597 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 17 Mar 2024 17:41:07 -0400 Subject: [PATCH 27/30] more updates --- doc/command-line.md | 61 ++++++++++++++++++++++++++++----------------- 1 file changed, 38 insertions(+), 23 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index b54a163b0e..8bf4816ef0 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1914,9 +1914,10 @@ will continue processing input sequences. ### `sourmash signature manifest` - output a manifest for a file -Output a manifest for a file, database, or collection. Note that these -manifests are not always suitable for use as standalone manifests; -the `sourmash sig collect` command produces standalone manifests. +Output a manifest for a file, database, or collection. Note that +these manifests are not usually suitable for use as standalone +manifests; the `sourmash sig collect` and `sourmash sig check` +commands produce standalone manifests. For example, ``` @@ -1961,7 +1962,7 @@ of the _matched_ elements to a manifest file, which can then be used as a sourmash database. `sourmash sig check` is particularly useful when working with large -collections of signatures and identifiers. +collections of signatures and identifiers. With `-m/--save-manifest-matching`, `sig check` creates a standalone manifest. 
In these manifests, sourmash v4 will by default write paths @@ -1973,6 +1974,12 @@ relative to the manifest, while the `--abspath` argument will rewrite paths to be absolute. The `--relpath` behavior will be the default in sourmash v5. +Standalone manifests created with `-m/--save-manifest-matching` will +use the paths given to `sig check` on the command line; we recommend +using zip files and sig files, and avoiding directory hierarchies or +path lists. You can also use `--from-file` to pass in long lists of +filenames. + ### `sourmash signature collect` - collect manifests across databases Collect manifests from across (many) files and merge into a single @@ -1996,17 +2003,22 @@ This manifest file can be loaded directly from the command line by sourmash. particularly useful when working with large collections of signatures and identifiers, and has command line options for merging and updating manifests. -Standalone manifests produced by `sig collect` work most efficiently when -constructed from many small zip file collections. +The standalone manifests created by `sig collect` will reference the +paths given on the command line; we recommend using zip files and sig +files, and avoiding directory hierarchies or path lists. You can also +use `--from-file` to pass in long lists of filenames. + +Standalone manifests produced by `sig collect` work most efficiently +when constructed from many small zip file collections. As with `sig check`, the standalone manifests created by `sig collect` in sourmash v4 will by default write paths to the matched elements relative to the current working directory. When the output manifest -is in a different directory, this will create manifests that do not work -properly with sourmash. The `--relpath` argument will rewrite the -paths to be relative to the manifest, while the `--abspath` argument -will rewrite paths to be absolute. The `--relpath` behavior will be -the default in sourmash v5. 
+is in a different directory, this will create manifests that do not +work properly with sourmash. The `--relpath` argument will rewrite +the paths to be relative to the manifest, while the `--abspath` +argument will rewrite paths to be absolute. The `--relpath` behavior +will be the default in sourmash v5. ## Advanced command-line usage @@ -2233,23 +2245,26 @@ a SQLite indexed database. #### Loading signatures within a directory hierarchy -All of the `sourmash` commands support loading signatures from -within directories; provide the paths on the command line. +All of the `sourmash` commands support loading signatures (`.sig` or +`.sig.gz` files) from within directory hierarchies; you can just +provide the paths to the top-level directory on the command line. -This is no longer recommended; we instead suggest passing all of the -sketch files in the directory into `sig collect` to build a standalone -manifest, or using `sig cat` on the directory to generate a zip file. +However, this is no longer recommended because it can lead to +inefficiencies; we instead suggest passing all of the sketch files in +the directory into `sig collect` to build a standalone manifest, or +using `sig cat` on the directory to generate a zip file. #### Passing in lists of files -Most sourmash commands will also take a `--from-file` or -`--query-from-file`, which will take the location of a text file containing -a list of file paths. This can be useful for situations where you want -to specify thousands of queries, or a subset of signatures produced by -some other command. +sourmash commands support `--from-file` or `--query-from-file`, which +will take the location of a text file containing a list of file +paths. This can be useful for situations where you want to specify +thousands of queries, or a subset of signatures produced by some other +command. -This is no longer recommended; we instead suggest using standalone manifests -built with `sig collect`. 
+This is no longer recommended when using large collections; we instead +suggest using standalone manifests built with `sig collect` and `sig +check`, which will include extra metadata that supports fast loading. ### Combining search databases on the command line From eba2541db8ef9c638e1c257ec1e837b16650f8b1 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 17 Mar 2024 17:53:03 -0400 Subject: [PATCH 28/30] more --- doc/command-line.md | 4 +- doc/databases-advanced.md | 78 ++++++++++++++++++++++++++++++--------- 2 files changed, 63 insertions(+), 19 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 8bf4816ef0..9861da039b 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1962,7 +1962,7 @@ of the _matched_ elements to a manifest file, which can then be used as a sourmash database. `sourmash sig check` is particularly useful when working with large -collections of signatures and identifiers. +collections of signatures and identifiers. With `-m/--save-manifest-matching`, `sig check` creates a standalone manifest. In these manifests, sourmash v4 will by default write paths @@ -2114,7 +2114,7 @@ The following `coltype`s are currently supported for picklists: * `gather` - use the CSV output of `sourmash gather` as a picklist * `prefetch` - use the CSV output of `sourmash prefetch` as a picklist * `search` - use the CSV output of `sourmash prefetch` as a picklist -* `manifest` - use CSV manifests as a picklist +* `manifest` - use CSV manifests produced by `sig manifest` as a picklist Identifiers are constructed by using the first space delimited word in the signature name. diff --git a/doc/databases-advanced.md b/doc/databases-advanced.md index 8107195686..2a1f61fd28 100644 --- a/doc/databases-advanced.md +++ b/doc/databases-advanced.md @@ -56,37 +56,81 @@ We recommend SBT and LCA databases for use only in specific situations - e.g.
SB ### Standalone manifests -Manifests are catalogs of signature metadata - name, molecule type, k-mer size, and other information - that can be used to select specific signatures for searching or processing. Typically when using manifests the actual signatures themselves are not loaded until they are needed, although the efficiency of this depends on the signature storage mechanism; for example, JSON-format containers (`.sig` and `.lca.json` files) must be entirely loaded before any signature in the file them can be used, unlike zip containers. - -As of sourmash 4.4 manifests can be *directly* loaded from the command line as standalone collections. This lets manifests serve as a catalog of signatures stored in many different locations. Sketches can be selected by name, k-mer size, molecule type, and other features without loading the actual sketch data. - -Standalone manifests are preferable to both directory storage and pathlists (below), because they support fast selection and direct lazy loading. They are the most effective solution for managing custom collections of thousands to millions of signatures, as well as working with multiple large sketches. - -Standalone manifests can be created with `sourmash sig collect` -(sourmash v4.4 and later). - -Sourmash supports two manifest file formats - CSV and SQLite. SQLite manifests are much faster and lower-memory than CSV manifests. +Manifests are catalogs of signature metadata - name, molecule type, +k-mer size, and other information - that can be used to select +specific signatures for searching or processing. Typically when using +manifests the actual signatures themselves are not loaded until they +are needed, although the efficiency of this depends on the signature +storage mechanism; for example, JSON-format containers (`.sig` and +`.lca.json` files) must be entirely loaded before any signature in the +file can be used, unlike zip containers.
+ +As of sourmash 4.4 manifests can be *directly* loaded from the command +line as standalone collections. This lets manifests serve as a catalog +of signatures stored in many different locations. Sketches can be +selected by name, k-mer size, molecule type, and other features +without loading the actual sketch data. + +Standalone manifests are preferable to both directory storage and +pathlists (below), because they support fast selection and direct lazy +loading. This means that sourmash operations that support streaming or +online search (such as `prefetch` and `gather`, among others) can +avoid loading everything all at once. + +Standalone manifests are the most effective solution for managing custom +collections of thousands to millions of signatures, as well as working +with multiple large sketches. + +They can be created with `sourmash sig collect` and `sourmash sig +check` (sourmash v4.4 and later). + +Sourmash supports two manifest file formats - CSV and SQLite. SQLite +manifests are much faster and lower-memory than CSV manifests. ### Directories -Directory hierarchies of signatures are read natively by sourmash, and can be created or extended by specifying `-o dirname/` (with a trailing slash). +Directory hierarchies of signatures are read natively by sourmash, and +can be created or extended by specifying `-o dirname/` (with a +trailing slash). -To read from a directory, specify the directory name on the sourmash command line. When reading from directories, the entire directory hierarchy is traversed and all `.sig` and `.sig.gz` files are loaded as signatures. If `--force` is specified, _all_ files will be read, and failures will be ignored. +To read from a directory, specify the directory name on the sourmash +command line. When reading from directories, the entire directory +hierarchy is traversed and all `.sig` and `.sig.gz` files are loaded +as signatures. If `--force` is specified, _all_ files will be read, +and failures will be ignored. 
-When directories are specified as outputs, the signatures will be saved by their complete md5sum underneath the directory. +When directories are specified as outputs, the signatures will be +saved by their complete md5sum underneath the directory. -We don't recommend storing signatures in directory hierarchies, since the implementation is not particularly memory efficient most of the use cases for directories are now covered by other approaches - in particular, standalone manifests. +We don't recommend loading signatures from directory hierarchies, +since the implementation is not particularly memory efficient and most +of the use cases for directories are now covered by other approaches - +in particular, standalone manifests. ### Pathlists -Pathlists are text files containing paths to one or more sourmash databases; any type of sourmash-readable collection can be listed. +Pathlists are text files containing paths to one or more sourmash +databases; any type of sourmash-readable collection can be listed. -The paths in pathlists can be relative or absolute within the file system. If they are relative, they must resolve with respect to the current working directory of the sourmash command. +The paths in pathlists can be relative or absolute within the file +system. If they are relative, they must resolve with respect to the +current working directory of the sourmash command. -We don't recommend using pathlists, since the original use cases are now supported with picklists and standalone manifests, but they are still supported. +We don't recommend using pathlists, since the original use cases are +now supported with picklists and standalone manifests, but they are +still supported. Loading sketches from pathlists is also not very +efficient. Pathlists are not output by any sourmash commands. +Many commands support `--query-from-file` or `--from-file` as a way to +pass in a file containing many paths to sketches or collections. 
The +internal implementation of sourmash simply adds these to the +command-line arguments, and this is an effective and efficient way to +provide long lists of files to commands like `sig check` and `sig +collect` that create standalone manifests to support efficient lazy +loading. + ## Storing taxonomies sourmash supports taxonomic information output via the `sourmash lca` and `sourmash tax` subcommands. Both sets of commands rely on the same 7 taxonomic ranks: superkingdom, phylum, class, order, family, genus, and species (with limited support for a 'strain' rank). And both sets of subcommands take lineage spreadsheets that link specific identifiers to taxonomic lineages. From c34421d7664db2bc53211ed436d7854ca0602092 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 19 Mar 2024 14:21:53 -0700 Subject: [PATCH 29/30] misc fixes --- doc/command-line.md | 12 ++++++------ doc/faq.md | 2 +- doc/release-notes/sourmash-2.0.md | 2 +- doc/sourmash-sketch.md | 6 +++--- doc/using-sourmash-a-guide.md | 2 +- 5 files changed, 12 insertions(+), 12 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 9861da039b..1260afb69c 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -1977,8 +1977,8 @@ sourmash v5. Standalone manifests created with `-m/--save-manifest-matching` will use the paths given to `sig check` on the command line; we recommend using zip files and sig files, and avoiding directory hierarchies or -path lists. You can also use `--from-file` to pass in long lists of -filenames. +path lists. You can use `--from-file` to pass in long lists of +filenames via a text file. ### `sourmash signature collect` - collect manifests across databases @@ -2025,8 +2025,8 @@ will be the default in sourmash v5. ### Loading signatures and databases sourmash uses several different command-line styles. Most sourmash -commands can load sketches from any standard collection; we primarily -recommend using zipfiles (but read on!) 
+commands can load sketches from any standard collection type; we +primarily recommend using zipfiles (but read on!) Briefly, @@ -2249,8 +2249,8 @@ All of the `sourmash` commands support loading signatures (`.sig` or `.sig.gz` files) from within directory hierarchies; you can just provide the paths to the top-level directory on the command line. -However, this is no longer recommended because it can lead to -inefficiencies; we instead suggest passing all of the sketch files in +However, this is no longer recommended because it can be very +inefficient; we instead suggest passing all of the sketch files in the directory into `sig collect` to build a standalone manifest, or using `sig cat` on the directory to generate a zip file. diff --git a/doc/faq.md b/doc/faq.md index d8d9da0622..227952ff40 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -139,7 +139,7 @@ you use [the precomputed databases](databases.md), you will always end up using your query sketches at a minimum scaled of 1000, even if you created them with a lower scaled value. -Please also see [What resolution should my signatures be?](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them). +Please also see [What resolution should my signatures be?](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-and-how-should-i-create-them). ## What threshold-bp value should I use with `sourmash prefetch` and `sourmash gather`? diff --git a/doc/release-notes/sourmash-2.0.md b/doc/release-notes/sourmash-2.0.md index c3b8647dd5..fbb541ad49 100644 --- a/doc/release-notes/sourmash-2.0.md +++ b/doc/release-notes/sourmash-2.0.md @@ -23,7 +23,7 @@ This is a list of substantial new features and functionality in sourmash 2.0. * Created [precomputed databases](../databases.md) for most of GenBank genomes. * Added taxonomic reporting functionality in the `sourmash lca` submodule - [see command-line docs](../command-line.md#sourmash-lca-subcommands-for-in-memory-taxonomy-integration). 
* Added signature manipulation utilities in the `sourmash signature` submodule - [see command-line docs](../command-line.md#sourmash-signature-subcommands-for-signature-manipulation) -* Introduced new modulo hash or "scaled" signatures for containment analysis; see [Using sourmash: a practical guide](../using-sourmash-a-guide.md#what-resolution-should-my-signatures-be--how-should-i-create-them) and [more details in the Python API examples](../api-example.md#advanced-features-of-sourmash-minhash-objects---scaled-and-num). +* Introduced new modulo hash or "scaled" signatures for containment analysis; see [Using sourmash: a practical guide](../using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-and-how-should-i-create-them) and [more details in the Python API examples](../api-example.md#advanced-features-of-sourmash-minhash-objects---scaled-and-num). * Switched to using JSON instead of YAML for signatures. * Many performance optimizations! * Many more tests! diff --git a/doc/sourmash-sketch.md b/doc/sourmash-sketch.md index caba1a19a8..5ad43d266e 100644 --- a/doc/sourmash-sketch.md +++ b/doc/sourmash-sketch.md @@ -146,7 +146,7 @@ Some of the key command-line options supported by `fromfile` are: * `-o/--output-signatures` will save generated signatures to any of the [standard supported output formats](command-line.md#choosing-signature-output-formats). * `-o/--output-csv-info` will save a CSV file of input filenames and parameter strings for use with the `sourmash sketch` command line; this can be used to construct signatures in parallel. * `--already-done` will take a list of existing signatures/databases to check against; signatures with matching names and parameter strings will not be rebuilt. 
-* `--output-manifest-matching` will output a manifest of already-existing signatures, which can then be used with `sourmash sig cat` to collate signatures across databases; see [using manifests](command-line.md#using-manifests-to-explicitly-refer-to-collections-of-files). (This provides [`sourmash sig check` functionality](command-line.md#sourmash-signature-check---compare-picklists-and-manifests) in `sketch fromfile`.) +* `--output-manifest-matching` will output a manifest of already-existing signatures, which can then be used with `sourmash sig cat` to collate signatures across databases; see [using manifests](command-line.md#using-standalone-manifests-to-explicitly-refer-to-collections-of-files). (This provides [`sourmash sig check` functionality](command-line.md#sourmash-signature-check---compare-picklists-and-manifests) in `sketch fromfile`.) If you would like help and advice on constructing large databases, or pointers to code for generating the `fromfile` CSV format, please ask @@ -200,8 +200,8 @@ The `-p` argument to `sourmash sketch` provides parameter strings to sourmash, a A parameter string is a space-delimited collection that can contain one or more fields, comma-separated. * `k=<ksize>` - create a sketch at this k-mer size; can provide more than one time in a parameter string. Typically `ksize` is between 4 and 100. -* `scaled=` - create a scaled MinHash with k-mers sampled deterministically at 1 per `<scaled>` value. This controls sketch compression rates and resolution; for example, a 5 Mbp genome sketched with a scaled of 1000 would yield approximately 5,000 k-mers. `scaled` is incompatible with `num`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be--how-should-i-create-them) for more information. -* `num=` - create a standard MinHash with no more than `<num>` k-mers kept. This will produce sketches identical to [mash sketches](https://mash.readthedocs.io/en/latest/). `num` is incompatible with `scaled`.
See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be--how-should-i-create-them) for more information. +* `scaled=` - create a scaled MinHash with k-mers sampled deterministically at 1 per `<scaled>` value. This controls sketch compression rates and resolution; for example, a 5 Mbp genome sketched with a scaled of 1000 would yield approximately 5,000 k-mers. `scaled` is incompatible with `num`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-and-how-should-i-create-them) for more information. +* `num=` - create a standard MinHash with no more than `<num>` k-mers kept. This will produce sketches identical to [mash sketches](https://mash.readthedocs.io/en/latest/). `num` is incompatible with `scaled`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-and-how-should-i-create-them) for more information. * `abund` / `noabund` - create abundance-weighted (or not) sketches. See [Classify signatures: Abundance Weighting](classifying-signatures.md#abundance-weighting) for details of how this works. * `dna`, `protein`, `dayhoff`, `hp` - create this kind of sketch. Note that `sourmash sketch dna -p protein` and `sourmash sketch protein -p dna` are invalid; please use `sourmash sketch translate` for the former. * `seed=` - set the random number seed used for k-mer hashing. This is for advanced users who want to choose a completely different set of k-mers for sketches! The default is 42. diff --git a/doc/using-sourmash-a-guide.md b/doc/using-sourmash-a-guide.md index 29ccc52ec1..a3600c1337 100644 --- a/doc/using-sourmash-a-guide.md +++ b/doc/using-sourmash-a-guide.md @@ -41,7 +41,7 @@ however, and it probably doesn't really matter. (When we have blog posts or publications providing more formal guidance, we'll link to them here!) -## What resolution should my signatures be / how should I create them?
+## What resolution should my signatures be and how should I create them? sourmash supports two ways of choosing the resolution or size of your signatures: using `num` to specify the maximum number of hashes, From cb6bae3728f373cedc01ceca58afbc798350be8b Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 20 Mar 2024 14:12:59 -0700 Subject: [PATCH 30/30] Update doc/command-line.md Co-authored-by: Tessa Pierce Ward --- doc/command-line.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/command-line.md b/doc/command-line.md index 1260afb69c..90633d342e 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -2114,7 +2114,7 @@ The following `coltype`s are currently supported for picklists: * `gather` - use the CSV output of `sourmash gather` as a picklist * `prefetch` - use the CSV output of `sourmash prefetch` as a picklist * `search` - use the CSV output of `sourmash prefetch` as a picklist -* `manifest` - use CSV manifests produced by `sig manfiest` as a picklist +* `manifest` - use CSV manifests produced by `sig manifest` as a picklist Identifiers are constructed by using the first space delimited word in the signature name.