[MRG] add sig collect command #2036

Merged
merged 27 commits into from
May 13, 2022
76 changes: 35 additions & 41 deletions doc/command-line.md
@@ -1436,6 +1436,24 @@ sourmash database.
`sourmash sig check` is particularly useful when working with large
collections of signatures and identifiers.

### `sourmash signature collect` - collect manifests across databases

Collect manifests from across (many) files and merge into a single
standalone manifest.

For example,
```
sourmash sig collect tests/test-data/gather/GCF*.sig -o mf.sqlmf
```
will load all of the `GCF` signatures and build a manifest file `mf.sqlmf`
that contains references to all of the signatures, but not the signatures
themselves.
This manifest file can be loaded directly from the command line by sourmash.

`sourmash sig collect` defaults to outputting SQLite manifests. It is
particularly useful when working with large collections of signatures and
identifiers, and has command line options for merging and updating manifests.

## Advanced command-line usage

### Loading signatures and databases
@@ -1689,7 +1707,7 @@ signatures that were just created.

### Using manifests to explicitly refer to collections of files

(sourmash v4.4.0 and later)
(sourmash v4.4 and later)

Manifests are metadata catalogs of signatures that are used for
signature selection and loading. They are used extensively by sourmash
@@ -1698,52 +1716,28 @@ pattern matching.

Manifests can _also_ be used externally (via the command-line), and
may be useful for organizing large collections of signatures. They can
be generated with `sourmash sig manifest` as well as `sourmash sig check`.
be generated with the `sig collect`, `sig manifest`, and `sig check`
subcommands.

Suppose you have a large collection of signature (`.sig` or `.sig.gz`
files) under a directory. You can create a manifest file for them like so:
Suppose you have a large collection of signatures (`.sig` or `.sig.gz`
files) in a location (e.g., under a directory, or in a zip file). You
can create a manifest file for them like so:
```
sourmash sig manifest <dir> -o <dir>/manifest.csv
sourmash sig collect <dir> <zipfile> -o manifest.sqlmf
```
and then use the manifest directly for sourmash operations:
and then use the manifest directly for sourmash operations, for example:
```
sourmash sig fileinfo <dir>/manifest.csv
sourmash sig fileinfo manifest.sqlmf
```
This manifest can be used as a database target for most sourmash
operations - search, gather, etc. Note that manifests for directories
must be placed within (and loaded from) the directory from which the
manifest was generated; the specific manifest filename does not
matter.

A more advanced and slightly tricky way to use explicit manifest files
is with lists of files. If you create a file with a path list
containing the locations of loadable sourmash collections, you can run
`sourmash sig manifest pathlist.txt -o mf.csv` to generate a manifest
of all of the files. The resulting manifest in `mf.csv` can then be
loaded directly. This is very handy when you have many sourmash
signatures, or large signature files. The tricky part in doing this
is that the manifest will store the same paths listed in the pathlist
file - whether they are relative or absolute paths - and these paths
must be resolvable by sourmash from the current working directory.
This makes explicit manifests built from pathlist files less portable
within or across systems than the other sourmash collections, which
are all relocatable.
This manifest contains _references_ to the signatures (but not the
signatures themselves) and can then be used as a database target for most
sourmash operations - search, gather, etc.

For example, if you create a pathlist file `paths.txt` containing the
following:
```
/path/to/zipfile.zip
local_directory/some_signature.sig.gz
local_dir2/
```
and then run:
```
sourmash sig manifest paths.txt -o mf.csv
```
you will be able to use `mf.csv` as a database for `sourmash search`
and `sourmash gather` commands. But, because it contains two relative paths,
you will only be able to use it _from the directory that contains those
two relative paths_.
Note that `sig collect` will generate manifests containing the
pathnames given to it - so if you use relative paths, the references
will be relative to the working directory in which `sig collect` was
run. You can use `sig collect --abspath` to rewrite the paths
into absolute paths.
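
The relative-path caveat can be illustrated with a short standard-library sketch. It does not call sourmash; the file names are hypothetical, and `os.path.abspath` is used only as an analogy for what `--abspath` is described as doing (whether sourmash uses that exact call is an assumption).

```python
import os
import tempfile

# Hypothetical relative location, as it might be recorded in a manifest.
relative_ref = os.path.join("sigs", "genome1.sig")

with tempfile.TemporaryDirectory() as workdir:
    # Create the file under workdir so the reference resolves from there.
    os.makedirs(os.path.join(workdir, "sigs"))
    with open(os.path.join(workdir, relative_ref), "w") as fp:
        fp.write("[]")

    original_cwd = os.getcwd()
    try:
        os.chdir(workdir)
        # From the directory where the reference was recorded, it resolves.
        found_from_workdir = os.path.exists(relative_ref)
        # An absolute path stays valid after changing directories.
        absolute_ref = os.path.abspath(relative_ref)
    finally:
        os.chdir(original_cwd)

    # Back in the original directory, only the absolute form still resolves
    # (assuming no file named sigs/genome1.sig happens to exist here).
    found_from_elsewhere = os.path.exists(relative_ref)
    found_absolute = os.path.exists(absolute_ref)

print(found_from_workdir, found_from_elsewhere, found_absolute)
```

This is why a manifest built from relative paths is only usable from the directory in which it was generated, while one built from absolute paths can be loaded from anywhere on the same system.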

**Our advice:** We suggest using zip file collections for most
situations; we primarily recommend using explicit manifests for
11 changes: 6 additions & 5 deletions doc/databases-advanced.md
@@ -2,11 +2,11 @@

sourmash uses a variety of different mechanisms and formats for storing, organizing, and searching signatures. Some of these mechanisms, "collections", just store the signatures; others ("indexed" databases) provide indices on the signatures for fast content-based search. _Most_ of the mechanisms now use manifests that permit fast selection and loading of signatures based on metadata. Below we refer to "databases" generically as any on-disk storage mechanism for sourmash signatures.

Which database type is best to use depends on what you're doing - which is what this document is about! In general, however, sourmash should be fast enough that database choice will only impacts performance when searching 1000s of signatures, or doing many 1000s of searches.
Which database type is best to use depends on what you're doing - which is what this document is about! In general, however, sourmash should be fast enough that database choice will only impact performance when searching thousands of signatures, or doing thousands of searches.

The recommended file extensions below are conventions used to signal the output format when using `-o` with `sourmash sketch` and the `sourmash sig` subcommands; so, for example, `sourmash sketch dna *.fa -o xyz.zip` will output signatures in the .zip format.

sourmash will automatically detect and load the database, based on the database _content_ in most cases.
sourmash will automatically detect and load the database, based on the database _content_ and not the database extension, in most cases.

Unless noted otherwise, the below database formats are supported in all releases since sourmash v3.5.

@@ -56,15 +56,16 @@ We recommend SBT and LCA databases for use only in specific situations - e.g. SB

### Manifests

Manifests are catalogs of signature metadata - name, molecule type, k-mer size, and other information - that can be used to select specific signatures for searching or processing. Typically when using manifests the actual signatures themselves are not loaded until they are needed.
Manifests are catalogs of signature metadata - name, molecule type, k-mer size, and other information - that can be used to select specific signatures for searching or processing. Typically when using manifests the actual signatures themselves are not loaded until they are needed, although the efficiency of this depends on the signature storage mechanism; for example, JSON-format containers (`.sig` and `.lca.json` files) must be entirely loaded before any signature in them can be used, unlike zip containers.

As of sourmash 4.4 manifests can be *directly* loaded from the command line as standalone collections. This lets manifests serve as a catalog of signatures stored in many different locations.

Standalone manifests are preferable to both directory storage and pathlists (below), because they support fast selection and direct lazy loading. They are the most effective solution for managing custom collections of thousands to millions of signatures.

Manifests can be created with `sourmash sig manifest` and `sourmash sig check`. For complex situations, we recommend using custom Python scripts to manage them - for example, see [sigs-to-manifest.py in database-examples](https://github.com/sourmash-bio/database-examples/blob/main/sigs-to-manifest.py).
Standalone manifests can be created with `sourmash sig collect`
(sourmash v4.4 and later).

Sourmash supports two manifest file formats - CSV and SQLite. SQLite manifests are much faster than CSV manifests in exchange for extra disk space.
Sourmash supports two manifest file formats - CSV and SQLite. SQLite manifests are much faster and lower-memory than CSV manifests in exchange for consuming some extra disk space.
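
As a rough illustration of why a metadata catalog allows selection without loading signatures, here is a minimal standard-library sketch. It is not the sourmash API: the manifest text is made up, the layout is simplified, and the column names (`internal_location`, `ksize`, `moltype`, `name`) are assumptions chosen for illustration.

```python
import csv
import io

# Made-up manifest contents; column names are illustrative assumptions.
MANIFEST_CSV = """\
internal_location,ksize,moltype,name
sigs/a.sig,31,DNA,genome A
sigs/a.sig,21,DNA,genome A
sigs/b.sig,31,DNA,genome B
sigs/c.sig,10,protein,proteome C
"""


def select_locations(manifest_text, *, ksize, moltype):
    """Return locations of entries matching ksize and moltype.

    Only the metadata rows are read; no signature file is opened.
    """
    reader = csv.DictReader(io.StringIO(manifest_text))
    return [
        row["internal_location"]
        for row in reader
        if int(row["ksize"]) == ksize and row["moltype"] == moltype
    ]


selected = select_locations(MANIFEST_CSV, ksize=31, moltype="DNA")
print(selected)  # ['sigs/a.sig', 'sigs/b.sig']
```

Only the files named in `selected` would then need to be opened, which is the lazy-loading behavior described above.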

### Directories

1 change: 1 addition & 0 deletions src/sourmash/cli/sig/__init__.py
@@ -16,6 +16,7 @@
from . import grep
from . import kmers
from . import check
from . import collect
from . import intersect
from . import inflate
from . import manifest
62 changes: 62 additions & 0 deletions src/sourmash/cli/sig/collect.py
@@ -0,0 +1,62 @@
"""collect manifest information across many files"""

usage="""

sourmash sig collect <filenames> -o all.sqlmf

This will collect manifests from across many files and save the information
into a standalone manifest database.

By default, 'sig collect' requires a pre-existing manifest for collections;
this prevents potentially slow manifest rebuilding. You
can turn this check off with '--no-require-manifest'.

"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
                                add_picklist_args, add_pattern_args)


def subparser(subparsers):
    subparser = subparsers.add_parser('collect', usage=usage)
    subparser.add_argument('locations', nargs='*',
                           help='locations of input signatures')
    subparser.add_argument('-o', '--output', help='manifest output file',
                           required=True)
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-d', '--debug', action='store_true',
        help='provide debugging output'
    )
    subparser.add_argument(
        '--from-file',
        help='a text file containing a list of files to load signatures from'
    )
    subparser.add_argument(
        '--no-require-manifest',
        help='do not require a manifest; generate dynamically if needed',
        action='store_true'
    )
    subparser.add_argument(
        '-F', '--manifest-format',
        help="format of manifest output file; default is 'sql'",
        default='sql',
        choices=['csv', 'sql'],
    )

    subparser.add_argument('--merge-previous', action='store_true',
                           help='merge new manifests into existing')
    subparser.add_argument('--abspath',
                           help="convert all locations to absolute paths",
                           action='store_true')

    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)


def main(args):
    import sourmash
    return sourmash.sig.__main__.collect(args)
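
To see how these options parse, the following stand-alone sketch mirrors a subset of the arguments above using only `argparse`. It is not wired into sourmash: `build_parser` and the program name `collect-sketch` are hypothetical, the sourmash-specific helpers (`add_ksize_arg`, `add_moltype_args`) are left out, and the input file names are invented.

```python
import argparse


def build_parser():
    # Stand-alone parser mirroring a subset of the 'collect' options.
    parser = argparse.ArgumentParser(prog="collect-sketch")
    parser.add_argument("locations", nargs="*",
                        help="locations of input signatures")
    parser.add_argument("-o", "--output", required=True,
                        help="manifest output file")
    parser.add_argument("--from-file",
                        help="text file listing locations to load")
    parser.add_argument("--no-require-manifest", action="store_true")
    parser.add_argument("-F", "--manifest-format", default="sql",
                        choices=["csv", "sql"])
    parser.add_argument("--merge-previous", action="store_true")
    parser.add_argument("--abspath", action="store_true")
    return parser


args = build_parser().parse_args(
    ["db1.zip", "sigs/", "-o", "all.sqlmf", "--abspath"]
)
print(args.locations, args.output, args.manifest_format, args.abspath)
# ['db1.zip', 'sigs/'] all.sqlmf sql True
```

Because `-F` is not given, the format falls back to the `default='sql'` value, matching the documented behavior that `sig collect` outputs SQLite manifests by default.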
32 changes: 31 additions & 1 deletion src/sourmash/index/sqlite_index.py
@@ -141,7 +141,7 @@ def load_sqlite_index(filename, *, request_manifest=False):
is_lca_db = True
debug_literal("load_sqlite_index: it's got a lineage table!")

if internal_d['SqliteManifest']:
if 'SqliteManifest' in internal_d:
v = internal_d['SqliteManifest']
if v != '1.0':
raise IndexNotSupported
@@ -598,6 +598,17 @@ def create(cls, filename):
        cls._create_tables(cursor)
        return cls(conn)

    @classmethod
    def create_or_open(cls, filename):
        "Connect to 'filename' and create tables if not exist."
        conn = sqlite3.connect(filename)
        cursor = conn.cursor()
        try:
            cls._create_tables(cursor)
        except sqlite3.OperationalError:
            pass
        return cls(conn)

    @classmethod
    def load_from_manifest(cls, manifest, *, dbfile=":memory:", append=False):
        "Create a new sqlite manifest from an existing manifest object."
@@ -646,6 +657,10 @@ def _create_tables(cls, cursor):
)
""")

    def add_row(self, row):
        c = self.conn.cursor()
        self._insert_row(c, row)

    def _insert_row(self, cursor, row, *, call_is_from_index=False):
        "Insert a new manifest row."
        # check - is this manifest managed by SqliteIndex? If so, prevent
@@ -699,6 +714,21 @@ def __len__(self):
self._num_rows = sum(1 for _ in self.rows)
return self._num_rows

    def __iadd__(self, other):
        c = self.conn.cursor()
        for row in other.rows:
            self._insert_row(c, row)
        return self

    def __add__(self, other):
        new_mf = self.create(":memory:")
        new_mf += self
        new_mf += other
        return new_mf

    def close(self):
        self.conn.commit()

    def _make_select(self):
        """Build a set of SQL SELECT conditions and matching value tuple
        that can be used to select the right sketches from the
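
The pattern added in this file (open a database, create tables only if they are missing, append rows, commit on close) can be sketched in isolation with the standard `sqlite3` module. `TinyManifest` and its one-column `entries` table are hypothetical stand-ins, not the real manifest class or schema. Unlike the `close()` in the diff, this sketch also closes the connection.

```python
import os
import sqlite3
import tempfile


class TinyManifest:
    """Hypothetical stand-in for a SQLite-backed manifest (one column)."""

    def __init__(self, conn):
        self.conn = conn

    @classmethod
    def create_or_open(cls, filename):
        conn = sqlite3.connect(filename)
        try:
            # Raises OperationalError if the table already exists.
            conn.execute("CREATE TABLE entries (location TEXT NOT NULL)")
        except sqlite3.OperationalError:
            pass
        return cls(conn)

    def add_row(self, location):
        self.conn.execute(
            "INSERT INTO entries (location) VALUES (?)", (location,)
        )

    def __iadd__(self, other):
        # In-place merge: copy every row from 'other' into this manifest.
        for location in other.locations():
            self.add_row(location)
        return self

    def locations(self):
        cursor = self.conn.execute(
            "SELECT location FROM entries ORDER BY rowid"
        )
        return [row[0] for row in cursor]

    def close(self):
        # Commit pending inserts, then release the connection.
        self.conn.commit()
        self.conn.close()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "mf.sqlite")

    first = TinyManifest.create_or_open(path)
    first.add_row("a.sig")
    first.close()

    # Reopening keeps the existing rows instead of failing.
    second = TinyManifest.create_or_open(path)
    extra = TinyManifest.create_or_open(":memory:")
    extra.add_row("b.sig")
    second += extra
    merged = second.locations()
    second.close()
    extra.close()

print(merged)  # ['a.sig', 'b.sig']
```

One design note: catching `OperationalError` broadly, as both the diff and this sketch do, also hides unrelated failures such as a locked database; `CREATE TABLE IF NOT EXISTS` is a narrower alternative.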
11 changes: 8 additions & 3 deletions src/sourmash/manifest.py
@@ -78,15 +78,17 @@ def load_from_csv(cls, fp):
row['signature'] = None
manifest_list.append(row)

        return cls(manifest_list)
        return CollectionManifest(manifest_list)

    @classmethod
    def load_from_sql(cls, filename):
        from sourmash.index.sqlite_index import load_sqlite_index
        db = load_sqlite_index(filename, request_manifest=True)
        if db:
        if db is not None:
            return db.manifest

        return None

    def write_to_filename(self, filename, *, database_format='csv',
                          ok_if_exists=False):
        if database_format == 'csv':
@@ -207,7 +209,7 @@ class CollectionManifest(BaseCollectionManifest):
    """
    An in-memory manifest that simply stores the rows in a list.
    """
    def __init__(self, rows):
    def __init__(self, rows=[]):
        "Initialize from an iterable of metadata dictionaries."
        self.rows = []
        self._md5_set = set()
@@ -219,6 +221,9 @@ def load_from_manifest(cls, manifest, **kwargs):
        "Load this manifest from another manifest object."
        return cls(manifest.rows)

    def add_row(self, row):
        self._add_rows([row])

    def _add_rows(self, rows):
        self.rows.extend(rows)
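
The `add_row` / `_add_rows` delegation, together with an optional `rows` argument, can be sketched as follows. `ListManifest` is a hypothetical minimal class, not the real `CollectionManifest`; it uses an empty tuple as the default rather than `[]`, which is a common way to sidestep Python's shared-mutable-default pitfall (the diff's version is also safe in practice because it copies into a fresh `self.rows` list).

```python
class ListManifest:
    """Hypothetical list-backed manifest; rows are plain dicts."""

    def __init__(self, rows=()):
        # A fresh list per instance, so no state is shared via the default.
        self.rows = []
        self._add_rows(rows)

    def add_row(self, row):
        # Single-row convenience wrapper around the bulk method.
        self._add_rows([row])

    def _add_rows(self, rows):
        self.rows.extend(rows)

    def __len__(self):
        return len(self.rows)


first = ListManifest()
first.add_row({"name": "genome A", "ksize": 31})
second = ListManifest()
print(len(first), len(second))  # 1 0
```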
