Refactor gather functionality for speed & modularity; provide `prefetch` functionality. (#1370)

* more refactor - filename stuff

* add 'location' to SBT objects

* finish removing filename

* fix prefetch after merging in #1373

* implement a CounterGatherIndex

* remove sort

* update counter logic to remove proper intersection

* make 'find' a generator

* remove comment

* begin refactoring 'categorize'

* have the 'find' function for SBTs return signatures

* fix majority of tests

* comment & then fix test

* torture the tests into working

* split find and _find_nodes to take different kinds of functions

* redo 'find' on index

* refactor lca_db to use new find

* refactor SBT to use new find

* comment/cleanup

* refactor out common code

* fix up gather

* use 'passes' properly

* attempted cleanup

* minor fixes

* get a start on correct downsampling

* adjust tree downsampling for regular minhashes, too

* remove now-unused search functions in sbtmh

* refactor categorize to use new find

* cleanup and removal

* remove redundant code in lca_db

* remove redundant code in SBT

* add notes

* remove more unused code

* refactor most of the test_sbt tests

* fix one minor issue

* fix jaccard calculation in sbt

* check for compatibility of search fn and query signature

* switch tests over to jaccard similarity, not containment

* fix test

* remove test for unimplemented LCA_Database.find method

* document threshold change; update test

* refuse to run abund signatures

* flatten sigs internally for gather

* reinflate abundances for saving

* fix problem where sbt indices could be created with abund signatures

* more

* split flat and abund search

* make ignore_abundance work again for categorize

* turn off best-only, since it triggers on self-hits.

* add test: 'sourmash index' flattens sigs

* add note about something to test

* fix typo; still broken tho

* location is now a property

* move search code into search.py

* remove redundant scaled checking code

* best-only now works properly for two tests

* 'fix' tests by removing v1 and v2 SBT compatibility

* simplify (?) downsampling code

* require keyword args in MinHash.downsample(...)

* fix bug with downsample

* require keyword args in MinHash.downsample(...)

* fix test to use proper downsampling, reverse order to match scaled

* add test for revealed bug

* remove unnecessary comment

* flatten subject MinHash, too

* add testme comment

* clean up sbt find

* clean up lca find

* add IndexSearchResult namedtuple for search and gather results

* add more tests for Index classes

* add tests for subj & query num downsampling

* tests for Index.search_abund

* refactor a bit

* refactor make_jaccard_search_query; start tests

* even more tests

* test collect, best_only

* more search tests

* remove unnec space

* add minor comment

* deal with status == None on SystemExit

* upgrade and simplify categorize

* restore test

* merge

* fix abundance search in SBT for categorize

* code cleanup and refactoring; check for proper error messages

* add explicit test for incompatible num

* refactor MinHash.downsample

* deal with status == None on SystemExit

* fix test

* fix comment misspelling

* properly pass kwargs; fix search_sbt_index

* add simple tests for SBT load and search API

* allow arbitrary kwargs for LCA_Database.find

* add testing of passthru-kwargs

* re-enable test

* add notes to update docstrings

* docstring updates

* fix test

* fix location reporting in prefetch

* fix prefetch location by fixing MultiIndex

* temporary prefetch_gather intervention

* 'gather' only returns best match

* turn prefetch on by default, for now

* better tests for gather --save-unassigned

* remove unused print

* remove unnecessary check-me comment

* clear out docstring

* SBT search doesn't work on v1 and v2 SBTs b/c no min_n_below

* start adding tests

* test some basic prefetch stuff

* update index for prefetch

* add fairly thorough tests

* fix my dumb mistake with gather

* simplify, refactor, fix

* fix remaining tests

* propagate ValueErrors better

* fix tests

* flatten prefetch queries

* fix for genome-grist alpha test

* fix threshold bugarooni

* fix gather/prefetch interactions

* fix sourmash prefetch return value

* minor fixes

* pay proper attention to threshold

* cleanup and refactoring

* remove unnecessary 'scaled'

* minor cleanup

* added LazyLinearIndex and prefetch --linear

* fix abundance problem

* save matches to a directory

* test for saving matches to a directory

* add a flexible progressive signature output class

* add tests for .sig.gz and .zip outputs

* update save_signatures code; add tests; use in gather and search too

* update comment

* cleanup and refactor of SaveSignaturesToLocation code

* docstrings & cleanup

* add 'run' and 'runtmp' test fixtures

* remove unnecessary track_abundance fixture call

* restore original;

* linear and prefetch fixtures + runtmp

* fix use of runtmp

* copy over SaveSignaturesToLocation code from other branch

* docs for sourmash prefetch

* more doc

* minor edits

* Re-implement the actual gather protocol with a cleaner interface. (#1489)

* initial refactor of CounterGather stuff

* refactor into peek and consume

* move next method over to query specific class

* replace gather implementation with new CounterGather

* many more tests for CounterGather

* remove scaled arg from peek

* open-box test for counter internal data structures

* add num query & subj tests

* add repr; add tests; support stdout

* refactor signature saving to use new sourmash_args collection saving

* specify utf-8 encoding for output

* add flexible output to compute/sketch

* add test to trigger rust panic

* test search --save-matches

* add --save-prefetch to sourmash gather

* remove --no-prefetch option :)

* added --save-prefetch functionality

* add back a mostly-functioning --no-prefetch argument :)

* add --no-prefetch back in

* check for JSON in first byte of LCA DB file

* start adding linear tests

* use fixtures to test prefetch and linear more thoroughly

* comments, etc

* upgrade docs for --linear and --prefetch

* 'fix' issue and test

* fix a last test ;)

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update src/sourmash/cli/sig/rename.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update tests/test_sourmash_args.py

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* write tests for LazyLinearIndex

* add some basic prefetch tests

* properly test linear!

* add more tests for LazyLinearIndex

* test zipfile bool

* remove unnecessary try/except; comment

* fix signatures() call

* fix --prefetch snafu; doc

* do not overwrite signature even if duplicate md5sum (#1497)

* try adding loc to return values from Index.find

* made use of new IndexSearchResult.find throughout

* adjust note

* provide signatures_with_location on all Index objects

* cleanup and fix

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* fix bug around --save-prefetch with multiple databases

* comment/doc minor updates

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
3 people committed May 10, 2021
1 parent 18cd040 commit f60c44d
Showing 17 changed files with 2,350 additions and 252 deletions.
86 changes: 81 additions & 5 deletions doc/command-line.md
@@ -57,16 +57,17 @@ species, while the third is from a completely different genus.

To get a list of subcommands, run `sourmash` without any arguments.

-There are six main subcommands: `sketch`, `compare`, `plot`,
-`search`, `gather`, and `index`. See [the tutorial](tutorials.md) for a
-walkthrough of these commands.
+There are seven main subcommands: `sketch`, `compare`, `plot`,
+`search`, `gather`, `index`, and `prefetch`. See
+[the tutorial](tutorials.md) for a walkthrough of these commands.

* `sketch` creates signatures.
* `compare` compares signatures and builds a distance matrix.
* `plot` plots distance matrices created by `compare`.
* `search` finds matches to a query signature in a collection of signatures.
* `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures.
* `index` builds a fast index for many (thousands of) signatures.
+* `prefetch` selects signatures of interest from a very large collection of signatures, for later processing.

There are also a number of commands that work with taxonomic
information; these are grouped under the `sourmash lca`
@@ -295,6 +296,29 @@ genomes with no (or incomplete) taxonomic information. Use `sourmash
lca summarize` to classify a metagenome using a collection of genomes
with taxonomic information.

### Alternative search mode for low-memory (but slow) search: `--linear`

By default, `sourmash gather` uses all information available for
faster search. In particular, for SBTs, the prefetch step will prune the
search tree. This can be slow and/or memory intensive for very large
databases, and `--linear` asks `sourmash gather` to instead use a linear
search across all leaf nodes in the tree.

The results are the same whether `--no-linear` or `--linear` is
used.

### Alternative search mode: `--no-prefetch`

By default, `sourmash gather` does a "prefetch" to find *all* candidate
signatures across all databases, before removing overlaps between the
candidates. In rare circumstances, depending on the databases and parameters
used, this may be slower or more memory intensive than doing iterative
overlap removal. Prefetch behavior can be turned off with `--no-prefetch`.

The results are the same whether `--prefetch` or `--no-prefetch` is
used. This option can be used with or without `--linear` (although
`--no-prefetch --linear` will generally be MUCH slower).
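
To make the equivalence of the two modes concrete, here is a toy sketch that models each signature as a plain Python set of hashes. It is not sourmash's implementation: the function names (`greedy_cover`, `gather_with_prefetch`, `gather_without_prefetch`) and the hash-count `threshold` parameter are assumptions made purely for illustration.

```python
def greedy_cover(query, candidates, threshold):
    """Repeatedly pick the candidate with the largest overlap with what is
    left of the query, then remove that overlap (iterative overlap removal)."""
    remaining = set(query)
    unused = dict(candidates)
    results = []
    while unused:
        # sorted() makes tie-breaking deterministic (alphabetical by name)
        best = max(sorted(unused), key=lambda name: len(unused[name] & remaining))
        overlap = unused.pop(best) & remaining
        if len(overlap) < threshold:
            break
        results.append((best, len(overlap)))
        remaining -= overlap
    return results


def gather_without_prefetch(query, database, threshold=1):
    """Search the whole database again on every round."""
    return greedy_cover(query, database, threshold)


def gather_with_prefetch(query, database, threshold=1):
    """First keep only signatures that overlap the original query at all,
    then do overlap removal on that (usually much smaller) candidate set."""
    candidates = {
        name: hashes
        for name, hashes in database.items()
        if len(hashes & query) >= threshold
    }
    return greedy_cover(query, candidates, threshold)


if __name__ == "__main__":
    query = set(range(1, 11))
    database = {"A": {1, 2, 3, 4, 5, 6}, "B": {5, 6, 7, 8}, "C": {9, 10, 11}, "D": {100}}
    print(gather_with_prefetch(query, database))
    print(gather_without_prefetch(query, database))
```

In this model, a signature that fails the threshold against the full query can never pass it against a shrinking remainder, which is why dropping it up front does not change the result.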

### `sourmash index` - build an SBT index of signatures

The `sourmash index` command creates a Zipped SBT database
@@ -305,11 +329,11 @@ used to create databases for e.g. subsets of GenBank.
These databases support fast search and gather on large collections
of signatures in low memory.

-SBTs can only be created on scaled signatures, and all signatures in
+All signatures in
an SBT must be of compatible types (i.e. the same k-mer size and
molecule type). You can specify the usual command line selectors
(`-k`, `--scaled`, `--dna`, `--protein`, etc.) to pick out the types
-of signatures to include.
+of signatures to include when running `index`.

Usage:
```
@@ -326,6 +350,58 @@ containing a list of file names to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.

### `sourmash prefetch` - select subsets of very large databases for more processing

The `prefetch` subcommand searches a collection of scaled signatures
for matches in a large database, using containment. It is similar to
`search --containment`, while taking a `--threshold-bp` argument like
`gather` does for thresholding matches (instead of using Jaccard
similarity or containment).
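
As a rough illustration of the two quantities mentioned above, the following sketch treats signatures as Python sets of hashes. The helper names are hypothetical, and the estimate "shared hashes times `scaled`" is an assumption about how a bp overlap could be approximated from scaled sketches, not a quotation of sourmash's code. The `5e4` default mirrors the `--threshold-bp` default shown in the CLI definition.

```python
def containment(query, subject):
    """Fraction of the query's hashes that also occur in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


def estimated_overlap_bp(query, subject, scaled):
    """Assumed estimate: each shared hash stands in for about `scaled` bp."""
    return len(query & subject) * scaled


def passes_threshold(query, subject, scaled, threshold_bp=5e4):
    """Report a match only if the estimated overlap is at or above the threshold."""
    return estimated_overlap_bp(query, subject, scaled) >= threshold_bp


if __name__ == "__main__":
    query = set(range(100))
    subject = set(range(50, 200))
    print(containment(query, subject))
    print(estimated_overlap_bp(query, subject, scaled=1000))
    print(passes_threshold(query, subject, scaled=1000))
```

Thresholding on estimated bp rather than on a containment fraction means a large absolute overlap is reported even when it is a small fraction of a very large query.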

`sourmash prefetch` is intended to select a subset of a large database
for further processing. As such, it can search very large collections
of signatures (potentially millions or more), operates in very low
memory (see `--linear` option, below), and does no post-processing of signatures.

`prefetch` has four main output options, which can all be used individually
or together:
* `-o/--output` produces a CSV summary file;
* `--save-matches` saves all matching signatures;
* `--save-matching-hashes` saves a single signature containing all of the hashes that matched any signature in the database at or above the specified threshold;
* `--save-unmatched-hashes` saves a single signature containing the complement of `--save-matching-hashes`.
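
The relationship between the last two outputs can be sketched with plain sets. This is an illustrative model only: `split_query_hashes` and its `threshold_hashes` parameter are hypothetical names, and real signatures carry more than a bare hash set.

```python
def split_query_hashes(query, database, threshold_hashes=1):
    """Return (matching, unmatched) query hashes.

    `matching` collects query hashes found in any database signature whose
    overlap with the query is at or above the threshold; `unmatched` is its
    complement within the query.
    """
    matching = set()
    for subject in database:
        overlap = query & subject
        if len(overlap) >= threshold_hashes:
            matching |= overlap
    unmatched = query - matching
    return matching, unmatched


if __name__ == "__main__":
    query = {1, 2, 3, 4, 5}
    database = [{1, 2, 9}, {3}, {7, 8}]
    print(split_query_hashes(query, database))
```

By construction the two sets are disjoint and together reconstruct the query, which is what "complement" means in the list above.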

Other options include:
* the usual `-k/--ksize` and `--dna`/`--protein`/`--dayhoff`/`--hp` signature selectors;
* `--threshold-bp` to require a minimum estimated bp overlap for output;
* `--scaled` for downsampling;
* `--force` to continue past survivable errors.

### Alternative search mode for low-memory (but slow) search: `--linear`

By default, `sourmash prefetch` uses all information available for
faster search. In particular, for SBTs, `prefetch` will prune the search
tree. This can be slow and/or memory intensive for very large databases,
and `--linear` asks `sourmash prefetch` to instead use a linear search
across all leaf nodes in the tree.
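
The difference between the two traversal strategies can be shown with a toy tree. This sketch is a simplification under stated assumptions: real SBT internal nodes are Bloom filters, whereas here each internal node stores the exact union of the hashes below it, and `Node`, `build_tree`, `search_tree`, and `search_linear` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    hashes: frozenset      # leaf: its own hashes; internal: union of its leaves
    name: str = ""         # only set on leaves
    children: tuple = ()   # empty for leaves


def build_tree(leaves):
    """Pair nodes up level by level until one root remains (needs >= 1 leaf)."""
    level = list(leaves)
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            group = tuple(level[i:i + 2])
            union = frozenset().union(*(node.hashes for node in group))
            paired.append(Node(hashes=union, children=group))
        level = paired
    return level[0]


def search_tree(node, query, threshold):
    """Pruned search: skip a whole subtree if even its union overlaps too little."""
    if len(node.hashes & query) < threshold:
        return
    if not node.children:
        yield node.name
    else:
        for child in node.children:
            yield from search_tree(child, query, threshold)


def search_linear(leaves, query, threshold):
    """Linear search: look at every leaf once, never touching internal nodes."""
    for leaf in leaves:
        if len(leaf.hashes & query) >= threshold:
            yield leaf.name
```

Both functions return the same leaves, because an internal node's overlap is never smaller than that of any leaf beneath it. The linear version never needs the internal nodes at all, which is the trade-off the `--linear` option describes.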

### Caveats and comments

`sourmash prefetch` provides no guarantees on output order. It runs in
"streaming mode" on its inputs: each input file is loaded,
searched, and then unloaded. `sourmash prefetch` can also be run
separately on multiple databases, after which the results can be
searched in combination with `search`, `gather`, `compare`, etc.

A motivating use case for `sourmash prefetch` is to run it on multiple
large databases with a metagenome query using `--threshold-bp=0`,
`--save-matching-hashes matching-hashes.sig`, and `--save-matches
db-matches.sig`, and then run `sourmash gather matching-hashes.sig
db-matches.sig`.

This combination of commands ensures that the more time- and
memory-intensive `gather` step is run only on a small set of relevant
signatures, rather than all the signatures in the database.
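
The two-step workflow described above can be scripted. The sketch below only assembles the command lines, using the flags named in this section; the helper name, the query file name, and the database file names are hypothetical, and nothing is executed.

```python
import shlex


def prefetch_then_gather_commands(query, databases,
                                  matching_hashes="matching-hashes.sig",
                                  matches="db-matches.sig"):
    """Build argv lists for a prefetch run followed by a gather run."""
    prefetch = [
        "sourmash", "prefetch", query, *databases,
        "--threshold-bp=0",
        "--save-matching-hashes", matching_hashes,
        "--save-matches", matches,
    ]
    # gather then runs on the reduced query and the reduced database
    gather = ["sourmash", "gather", matching_hashes, matches]
    return prefetch, gather


if __name__ == "__main__":
    # hypothetical input file names
    for cmd in prefetch_then_gather_commands("metagenome.sig", ["db1.sbt.zip", "db2.sbt.zip"]):
        print(shlex.join(cmd))
```

With sourmash installed, each list could be passed to `subprocess.run(cmd, check=True)`; building the lists separately keeps the flag combination easy to inspect first.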

## `sourmash lca` subcommands for taxonomic classification

These commands use LCA databases (created with `lca index`, below, or
1 change: 1 addition & 0 deletions src/sourmash/cli/__init__.py
@@ -26,6 +26,7 @@
from . import migrate
from . import multigather
from . import plot
from . import prefetch
from . import sbt_combine
from . import search
from . import watch
24 changes: 23 additions & 1 deletion src/sourmash/cli/gather.py
@@ -27,9 +27,14 @@ def subparser(subparsers):
     )
     subparser.add_argument(
         '--save-matches', metavar='FILE',
-        help='save the matched signatures from the database to the '
+        help='save gather matched signatures from the database to the '
         'specified file'
     )
+    subparser.add_argument(
+        '--save-prefetch', metavar='FILE',
+        help='save all prefetch-matched signatures from the databases to the '
+        'specified file or directory'
+    )
     subparser.add_argument(
         '--threshold-bp', metavar='REAL', type=float, default=5e4,
         help='reporting threshold (in bp) for estimated overlap with remaining query (default=50kb)'
@@ -58,6 +63,23 @@ def subparser(subparsers):
     add_ksize_arg(subparser, 31)
     add_moltype_args(subparser)

+    # advanced parameters
+    subparser.add_argument(
+        '--linear', dest="linear", action='store_true',
+        help="force a low-memory but maybe slower database search",
+    )
+    subparser.add_argument(
+        '--no-linear', dest="linear", action='store_false',
+    )
+    subparser.add_argument(
+        '--no-prefetch', dest="prefetch", action='store_false',
+        help="do not use prefetch before gather; see documentation",
+    )
+    subparser.add_argument(
+        '--prefetch', dest="prefetch", action='store_true',
+        help="use prefetch before gather; see documentation",
+    )
+

 def main(args):
     import sourmash
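
The paired flags added in this hunk share a `dest`, so which default applies when neither flag is given depends on registration order. The standalone sketch below reproduces just those four arguments on a throwaway parser (the `prog` name and `build_parser` helper are hypothetical) to show the behaviour. As far as argparse's documented behaviour goes, the first action registered for a `dest` supplies its default, and the last flag on the command line wins.

```python
import argparse


def build_parser():
    """Recreate only the --linear/--prefetch flag pairs from the hunk above."""
    parser = argparse.ArgumentParser(prog="gather-flags-demo")
    # store_true registered first, so 'linear' defaults to False
    parser.add_argument("--linear", dest="linear", action="store_true")
    parser.add_argument("--no-linear", dest="linear", action="store_false")
    # store_false registered first, so 'prefetch' defaults to True
    parser.add_argument("--no-prefetch", dest="prefetch", action="store_false")
    parser.add_argument("--prefetch", dest="prefetch", action="store_true")
    return parser


if __name__ == "__main__":
    parser = build_parser()
    print(parser.parse_args([]))
    print(parser.parse_args(["--linear", "--no-prefetch"]))
```

This ordering is consistent with the commit note "turn prefetch on by default, for now": swapping the two `prefetch` registrations would flip the default without changing either flag's meaning.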
70 changes: 70 additions & 0 deletions src/sourmash/cli/prefetch.py
@@ -0,0 +1,70 @@
"""search a signature against dbs, find all overlaps"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args


def subparser(subparsers):
    subparser = subparsers.add_parser('prefetch')
    subparser.add_argument('query', help='query signature')
    subparser.add_argument("databases",
        nargs="*",
        help="one or more databases to search",
    )
    subparser.add_argument(
        "--db-from-file",
        default=None,
        help="list of paths containing signatures to search"
    )
    subparser.add_argument(
        "--linear", action='store_true',
        help="force linear traversal of indexes to minimize loading time and memory use"
    )
    subparser.add_argument(
        '--no-linear', dest="linear", action='store_false',
    )

    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-d', '--debug', action='store_true'
    )
    subparser.add_argument(
        '-o', '--output', metavar='FILE',
        help='output CSV containing matches to this file'
    )
    subparser.add_argument(
        '--save-matches', metavar='FILE',
        help='save all matching signatures from the databases to the '
        'specified file or directory'
    )
    subparser.add_argument(
        '--threshold-bp', metavar='REAL', type=float, default=5e4,
        help='reporting threshold (in bp) for estimated overlap with remaining query hashes (default=50kb)'
    )
    subparser.add_argument(
        '--save-unmatched-hashes', metavar='FILE',
        help='output unmatched query hashes as a signature to the '
        'specified file'
    )
    subparser.add_argument(
        '--save-matching-hashes', metavar='FILE',
        help='output matching query hashes as a signature to the '
        'specified file'
    )
    subparser.add_argument(
        '--scaled', metavar='FLOAT', type=float, default=None,
        help='downsample signatures to the specified scaled factor'
    )
    subparser.add_argument(
        '--md5', default=None,
        help='select the signature with this md5 as query'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)


def main(args):
    import sourmash
    return sourmash.commands.prefetch(args)