Refactor gather functionality for speed & modularity; provide `pref…

…etch` functionality. (#1370) * more refactor - filename stuff * add 'location' to SBT objects * finish removing filename * fix prefetch after merging in #1373 * implement a CounterGatherIndex * remove sort * update counter logic to remove proper intersection * make 'find' a generator * remove comment * begin refactoring 'categorize' * have the 'find' function for SBTs return signatures * fix majority of tests * comment & then fix test * torture the tests into working * split find and _find_nodes to take different kinds of functions * redo 'find' on index * refactor lca_db to use new find * refactor SBT to use new find * comment/cleanup * refactor out common code * fix up gather * use 'passes' properly * attempted cleanup * minor fixes * get a start on correct downsampling * adjust tree downsampling for regular minhashes, too * remove now-unused search functions in sbtmh * refactor categorize to use new find * cleanup and removal * remove redundant code in lca_db * remove redundant code in SBT * add notes * remove more unused code * refactor most of the test_sbt tests * fix one minor issue * fix jaccard calculation in sbt * check for compatibility of search fn and query signature * switch tests over to jaccard similarity, not containment * fix test * remove test for unimplemented LCA_Database.find method * document threshold change; update test * refuse to run abund signatures * flatten sigs internally for gather * reinflate abundances for saving * fix problem where sbt indices coudl be created with abund signatures * more * split flat and abund search * make ignore_abundance work again for categorize * turn off best-only, since it triggers on self-hits. 
* add test: 'sourmash index' flattens sigs * add note about something to test * fix typo; still broken tho * location is now a property * move search code into search.py * remove redundant scaled checking code * best-only now works properly for two tests * 'fix' tests by removing v1 and v2 SBT compatibility * simplify (?) downsampling code * require keyword args in MinHash.downsample(...) * fix bug with downsample * require keyword args in MinHash.downsample(...) * fix test to use proper downsampling, reverse order to match scaled * add test for revealed bug * remove unnecessary comment * flatten subject MinHash, too * add testme comment * clean up sbt find * clean up lca find * add IndexSearchResult namedtuple for search and gather results * add more tests for Index classes * add tests for subj & query num downsampling * tests for Index.search_abund * refactor a bit * refactor make_jaccard_search_query; start tests * even more tests * test collect, best_only * more search tests * remove unnec space * add minor comment * deal with status == None on SystemExit * upgrade and simplify categorize * restore test * merge * fix abundance search in SBT for categorize * code cleanup and refactoring; check for proper error messages * add explicit test for incompatible num * refactor MinHash.downsample * deal with status == None on SystemExit * fix test * fix comment mispelling * properly pass kwargs; fix search_sbt_index * add simple tests for SBT load and search API * allow arbitrary kwargs for LCA_DAtabase.find * add testing of passthru-kwargs * re-enable test * add notes to update docstrings * docstring updates * fix test * fix location reporting in prefetch * fix prefetch location by fixing MultiIndex * temporary prefetch_gather intervention * 'gather' only returns best match * turn prefetch on by default, for now * better tests for gather --save-unassigned * remove unused print * remove unnecessary check-me comment * clear out docstring * SBT search doesn't work on v1 
and v2 SBTs b/c no min_n_below * start adding tests * test some basic prefetch stuff * update index for prefetch * add fairly thorough tests * fix my dumb mistake with gather * simplify, refactor, fix * fix remaining tests * propogate ValueErrors better * fix tests * flatten prefetch queries * fix for genome-grist alpha test * fix threshold bugarooni * fix gather/prefetch interactions * fix sourmash prefetch return value * minor fixes * pay proper attention to threshold * cleanup and refactoring * remove unnecessary 'scaled' * minor cleanup * added LazyLinearLindex and prefetch --linear * fix abundance problem * save matches to a directory * test for saving matches to a directory * add a flexible progressive signature output class * add tests for .sig.gz and .zip outputs * update save_signatures code; add tests; use in gather and search too * update comment * cleanup and refactor of SaveSignaturesToLocation code * docstrings & cleanup * add 'run' and 'runtmp' test fixtures * remove unnecessary track_abundance fixture call * restore original; * linear and prefetch fixtures + runtmp * fix use of runtmp * copy over SaveSignaturesToLocation code from other branch * docs for sourmash prefetch * more doc * minor edits * Re-implement the actual gather protocol with a cleaner interface. 
(#1489) * initial refactor of CounterGather stuff * refactor into peek and consume * move next method over to query specific class * replace gather implementation with new CounterGather * many more tests for CounterGather * remove scaled arg from peek * open-box test for counter internal data structures * add num query & subj tests * add repr; add tests; support stdout * refactor signature saving to use new sourmash_args collection saving * specify utf-8 encoding for output * add flexible output to compute/sketch * add test to trigger rust panic * test search --save-matches * add --save-prefetch to sourmash gather * remove --no-prefetch option :) * added --save-prefetch functionality * add back a mostly-functioning --no-prefetch argument :) * add --no-prefetch back in * check for JSON in first byte of LCA DB file * start adding linear tests * use fixtures to test prefetch and linear more thoroughly * comments, etc * upgrade docs for --linear and --prefetch * 'fix' issue and test * fix a last test ;) * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update src/sourmash/cli/sig/rename.py Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update tests/test_sourmash_args.py Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * write tests for LazyLinearIndex * add some basic prefetch tests * properly test linear! 
* add more tests for LazyLinearIndex * test zipfile bool * remove unnecessary try/except; comment * fix signatures() call * fix --prefetch snafu; doc * do not overwrite signature even if duplicate md5sum (#1497) * try adding loc to return values from Index.find * made use of new IndexSearchResult.find throughout * adjust note * provide signatures_with_location on all Index objects * cleanup and fix * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * Update doc/command-line.md Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * fix bug around --save-prefetch with multiple databases * comment/doc minor updates Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com> Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
sourmash-bio · May 10, 2021 · f60c44d
1 parent 18cd040
commit f60c44d
Showing 17 changed files with 2,350 additions and 252 deletions.
diff --git a/doc/command-line.md b/doc/command-line.md
@@ -57,16 +57,17 @@ species, while the third is from a completely different genus.
 
 To get a list of subcommands, run `sourmash` without any arguments.
 
-There are six main subcommands: `sketch`, `compare`, `plot`,
-`search`, `gather`, and `index`.  See [the tutorial](tutorials.md) for a
-walkthrough of these commands.
+There are seven main subcommands: `sketch`, `compare`, `plot`,
+`search`, `gather`, `index`, and `prefetch`.  See
+[the tutorial](tutorials.md) for a walkthrough of these commands.
 
 * `sketch` creates signatures.
 * `compare` compares signatures and builds a distance matrix.
 * `plot` plots distance matrices created by `compare`.
 * `search` finds matches to a query signature in a collection of signatures.
 * `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures.
 * `index` builds a fast index for many (thousands) of signatures.
+* `prefetch` selects signatures of interest from a very large collection of signatures, for later processing.
 
 There are also a number of commands that work with taxonomic
 information; these are grouped under the `sourmash lca`
@@ -295,6 +296,29 @@ genomes with no (or incomplete) taxonomic information.  Use `sourmash
 lca summarize` to classify a metagenome using a collection of genomes
 with taxonomic information.
 
+### Alternative search mode for low-memory (but slow) search: `--linear`
+
+By default, `sourmash gather` uses all information available for
+faster search. In particular, for SBTs, the prefetch step will prune the
+search tree.  Pruning can be slow and/or memory intensive for very large
+databases, and `--linear` asks `sourmash gather` to instead use a linear
+search across all leaf nodes in the tree.
+
+The results are the same whether `--no-linear` or `--linear` is
+used.
+
+### Alternative search mode: `--no-prefetch`
+
+By default, `sourmash gather` does a "prefetch" to find *all* candidate
+signatures across all databases, before removing overlaps between the
+candidates. In rare circumstances, depending on the databases and parameters
+used, this may be slower or more memory intensive than doing iterative
+overlap removal. Prefetch behavior can be turned off with `--no-prefetch`.
+
+The results are the same whether `--prefetch` or `--no-prefetch` is
+used.  This option can be used with or without `--linear` (although
+`--no-prefetch --linear` will generally be MUCH slower).
+
 ### `sourmash index` - build an SBT index of signatures
 
 The `sourmash index` command creates a Zipped SBT database
@@ -305,11 +329,11 @@ used to create databases for e.g. subsets of GenBank.
 These databases support fast search and gather on large collections
 of signatures in low memory.
 
-SBTs can only be created on scaled signatures, and all signatures in
+All signatures in
 an SBT must be of compatible types (i.e. the same k-mer size and
 molecule type). You can specify the usual command line selectors
 (`-k`, `--scaled`, `--dna`, `--protein`, etc.) to pick out the types
-of signatures to include.
+of signatures to include when running `index`.
 
 Usage:
 ```
@@ -326,6 +350,58 @@ containing a list of file names to index; you can also provide individual
 signature files, directories full of signatures, or other sourmash
 databases.
 
+### `sourmash prefetch` - select subsets of very large databases for more processing
+
+The `prefetch` subcommand searches a collection of scaled signatures
+for matches in a large database, using containment. It is similar to
+`search --containment`, while taking a `--threshold-bp` argument like
+`gather` does for thresholding matches (instead of using Jaccard
+similarity or containment).
+
+`sourmash prefetch` is intended to select a subset of a large database
+for further processing. As such, it can search very large collections
+of signatures (potentially millions or more), operates in very low
+memory (see `--linear` option, below), and does no post-processing of signatures.
+
+`prefetch` has four main output options, which can all be used individually
+or together:
+* `-o/--output` produces a CSV summary file;
+* `--save-matches` saves all matching signatures;
+* `--save-matching-hashes` saves a single signature containing all of the hashes that matched any signature in the database at or above the specified threshold;
+* `--save-unmatched-hashes` saves a single signature containing the complement of `--save-matching-hashes`.
+
+Other options include:
+* the usual `-k/--ksize` and `--dna`/`--protein`/`--dayhoff`/`--hp` signature selectors;
+* `--threshold-bp` to require a minimum estimated bp overlap for output;
+* `--scaled` for downsampling;
+* `--force` to continue past survivable errors.
+
+### Alternative search mode for low-memory (but slow) search: `--linear`
+
+By default, `sourmash prefetch` uses all information available for
+faster search. In particular, for SBTs, `prefetch` will prune the search
+tree.  This can be slow and/or memory intensive for very large databases,
+and `--linear` asks `sourmash prefetch` to instead use a linear search
+across all leaf nodes in the tree.
+
+### Caveats and comments
+
+`sourmash prefetch` provides no guarantees on output order. It runs in
+"streaming mode" on its inputs, in that each input file is loaded,
+searched, and then unloaded.  `sourmash prefetch` can also be run
+separately on multiple databases, after which the results can be
+searched in combination with `search`, `gather`, `compare`, etc.
+
+A motivating use case for `sourmash prefetch` is to run it on multiple
+large databases with a metagenome query using `--threshold-bp=0`,
+`--save-matching-hashes matching_hashes.sig`, and `--save-matches
+db-matches.sig`, and then run `sourmash gather matching_hashes.sig
+db-matches.sig`.
+
+This combination of commands ensures that the more time- and
+memory-intensive `gather` step is run only on a small set of relevant
+signatures, rather than all the signatures in the database.
+
 ## `sourmash lca` subcommands for taxonomic classification
 
 These commands use LCA databases (created with `lca index`, below, or
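
The documentation added above describes `prefetch` as reporting every database signature whose estimated overlap with the query meets `--threshold-bp`, while `gather` repeatedly takes the best match against the *remaining* query. The following is a minimal sketch of that distinction on toy data. It is not sourmash's implementation: sketches are modelled as plain Python sets of integer hashes, the overlap in bp is assumed to be roughly the number of shared hashes times `scaled`, and every function name here is hypothetical.

```python
# Toy model of prefetch vs. gather. Illustrative only; none of these
# names are sourmash APIs, and sketches are just sets of integer hashes.

def prefetch(query, database, threshold_bp, scaled):
    """Return (name, overlap_bp) for every sketch meeting the threshold."""
    hits = []
    for name, hashes in database.items():
        # Assumption: each retained hash stands for about `scaled` bp.
        overlap_bp = len(query & hashes) * scaled
        if overlap_bp >= threshold_bp:
            hits.append((name, overlap_bp))
    return hits


def split_hashes(query, database, hit_names):
    """Split the query into hashes found in any hit and the complement."""
    matching = set()
    for name in hit_names:
        matching |= query & database[name]
    return matching, query - matching


def gather(query, database, threshold_bp, scaled):
    """Greedy decomposition: best match first, then remove its hashes."""
    remaining = set(query)
    results = []
    while True:
        best = None
        for name in sorted(database):
            overlap_bp = len(remaining & database[name]) * scaled
            if overlap_bp > 0 and overlap_bp >= threshold_bp:
                if best is None or overlap_bp > best[1]:
                    best = (name, overlap_bp)
        if best is None:
            return results
        results.append(best)
        remaining -= database[best[0]]


scaled = 1000
query = {1, 2, 3, 4, 5, 6, 7, 8}
database = {
    "genomeA": {1, 2, 3, 4, 5},
    "genomeB": {4, 5, 6},
    "genomeC": {100, 101},
}

hits = prefetch(query, database, threshold_bp=2000, scaled=scaled)
matching, unmatched = split_hashes(query, database, [n for n, _ in hits])

# Running gather on the full database or only on the prefetched subset
# gives the same answer in this toy case.
candidates = {name: database[name] for name, _ in hits}
full = gather(query, database, threshold_bp=2000, scaled=scaled)
subset = gather(query, candidates, threshold_bp=2000, scaled=scaled)
```

In this toy run `genomeB` passes prefetch because it is compared against the whole query, but it drops out of gather because most of its overlap is already explained by `genomeA`.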

diff --git a/src/sourmash/cli/__init__.py b/src/sourmash/cli/__init__.py
@@ -26,6 +26,7 @@
 from . import migrate
 from . import multigather
 from . import plot
+from . import prefetch
 from . import sbt_combine
 from . import search
 from . import watch

diff --git a/src/sourmash/cli/gather.py b/src/sourmash/cli/gather.py
@@ -27,9 +27,14 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '--save-matches', metavar='FILE',
-        help='save the matched signatures from the database to the '
+        help='save gather matched signatures from the database to the '
         'specified file'
     )
+    subparser.add_argument(
+        '--save-prefetch', metavar='FILE',
+        help='save all prefetch-matched signatures from the databases to the '
+        'specified file or directory'
+    )
     subparser.add_argument(
         '--threshold-bp', metavar='REAL', type=float, default=5e4,
         help='reporting threshold (in bp) for estimated overlap with remaining query (default=50kb)'
@@ -58,6 +63,23 @@ def subparser(subparsers):
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)
 
+    # advanced parameters
+    subparser.add_argument(
+        '--linear', dest="linear", action='store_true',
+        help="force a low-memory but maybe slower database search",
+    )
+    subparser.add_argument(
+        '--no-linear', dest="linear", action='store_false',
+    )
+    subparser.add_argument(
+        '--no-prefetch', dest="prefetch", action='store_false',
+        help="do not use prefetch before gather; see documentation",
+    )
+    subparser.add_argument(
+        '--prefetch', dest="prefetch", action='store_true',
+        help="use prefetch before gather; see documentation",
+    )
+
 
 def main(args):
     import sourmash
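
The hunk above registers `--linear`/`--no-linear` and `--no-prefetch`/`--prefetch` as pairs that write to the same `dest`. A small standalone sketch of how standard `argparse` treats such pairs is shown below. It assumes no `set_defaults` call elsewhere changes the outcome, and the parser name is made up for illustration. When two actions share a `dest`, the default comes from whichever action was registered first, so the registration order in the diff yields `linear=False` and `prefetch=True` when no flag is given.

```python
import argparse

# Standalone parser mirroring only the four flags from the diff;
# "gather-sketch" is a hypothetical program name.
parser = argparse.ArgumentParser(prog="gather-sketch")
parser.add_argument("--linear", dest="linear", action="store_true")
parser.add_argument("--no-linear", dest="linear", action="store_false")
parser.add_argument("--no-prefetch", dest="prefetch", action="store_false")
parser.add_argument("--prefetch", dest="prefetch", action="store_true")

# No flags: the first-registered action for each dest supplies the default
# (store_true defaults to False, store_false defaults to True).
defaults = parser.parse_args([])

# Explicit flags override the defaults.
overridden = parser.parse_args(["--linear", "--no-prefetch"])

# If both members of a pair are given, the last one on the command line wins.
last_wins = parser.parse_args(["--linear", "--no-linear"])

print(defaults, overridden, last_wins)
```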

diff --git a/src/sourmash/cli/prefetch.py b/src/sourmash/cli/prefetch.py
@@ -0,0 +1,70 @@
+"""search a signature against dbs, find all overlaps"""
+
+from sourmash.cli.utils import add_ksize_arg, add_moltype_args
+
+
+def subparser(subparsers):
+    subparser = subparsers.add_parser('prefetch')
+    subparser.add_argument('query', help='query signature')
+    subparser.add_argument("databases",
+        nargs="*",
+        help="one or more databases to search",
+    )
+    subparser.add_argument(
+        "--db-from-file",
+        default=None,
+        help="list of paths containing signatures to search"
+    )
+    subparser.add_argument(
+        "--linear", action='store_true',
+        help="force linear traversal of indexes to minimize loading time and memory use"
+    )
+    subparser.add_argument(
+        '--no-linear', dest="linear", action='store_false',
+    )
+
+    subparser.add_argument(
+        '-q', '--quiet', action='store_true',
+        help='suppress non-error output'
+    )
+    subparser.add_argument(
+        '-d', '--debug', action='store_true'
+    )
+    subparser.add_argument(
+        '-o', '--output', metavar='FILE',
+        help='output CSV containing matches to this file'
+    )
+    subparser.add_argument(
+        '--save-matches', metavar='FILE',
+        help='save all matching signatures from the databases to the '
+        'specified file or directory'
+    )
+    subparser.add_argument(
+        '--threshold-bp', metavar='REAL', type=float, default=5e4,
+        help='reporting threshold (in bp) for estimated overlap with remaining query hashes (default=50kb)'
+    )
+    subparser.add_argument(
+        '--save-unmatched-hashes', metavar='FILE',
+        help='output unmatched query hashes as a signature to the '
+        'specified file'
+    )
+    subparser.add_argument(
+        '--save-matching-hashes', metavar='FILE',
+        help='output matching query hashes as a signature to the '
+        'specified file'
+    )
+    subparser.add_argument(
+        '--scaled', metavar='FLOAT', type=float, default=None,
+        help='downsample signatures to the specified scaled factor'
+    )
+    subparser.add_argument(
+        '--md5', default=None,
+        help='select the signature with this md5 as query'
+    )
+    add_ksize_arg(subparser, 31)
+    add_moltype_args(subparser)
+
+
+def main(args):
+    import sourmash
+    return sourmash.commands.prefetch(args)
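
The new module follows a two-function layout: `subparser(subparsers)` registers the command's arguments and `main(args)` runs it. The sketch below shows one way such modules could be wired into a top-level parser with plain `argparse`. The dispatch code, the `COMMANDS` table, and the stand-in modules are all hypothetical and are not taken from sourmash, whose actual dispatch logic is not shown in this diff.

```python
import argparse
import types


def make_command(name):
    """Build a stand-in for a CLI module exposing subparser() and main()."""

    def subparser(subparsers):
        sub = subparsers.add_parser(name)
        sub.add_argument("query", help="query signature")
        sub.add_argument("databases", nargs="*")
        sub.add_argument("--threshold-bp", type=float, default=5e4)

    def main(args):
        return f"{name}: {args.query} vs {len(args.databases)} database(s)"

    return types.SimpleNamespace(subparser=subparser, main=main)


# Hypothetical registry; a real package would import its modules instead.
COMMANDS = {name: make_command(name) for name in ("gather", "prefetch")}


def build_parser():
    parser = argparse.ArgumentParser(prog="toy")
    subparsers = parser.add_subparsers(dest="cmd", required=True)
    for module in COMMANDS.values():
        module.subparser(subparsers)  # each module registers its own flags
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    return COMMANDS[args.cmd].main(args)  # dispatch on the chosen subcommand


result = run(["prefetch", "query.sig", "db1.zip", "db2.zip"])
parsed = build_parser().parse_args(["prefetch", "query.sig"])
```

Keeping argument registration next to the command's entry point means adding a subcommand touches one new module plus a single import line, which matches the shape of the `__init__.py` change in this commit.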