MRG: rework the manifest documentation; do misc cleanup #3027

Merged
merged 39 commits into from
Mar 20, 2024
Changes from 37 commits
Commits
7ee0052
rework the manifest documentation
ctb Feb 22, 2024
5f6ef82
load manifest paths relative to cwd
ctb Mar 1, 2024
2405e9d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2024
df87087
more better tests
ctb Mar 1, 2024
d7da0aa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2024
0caee7f
implement --abspath, --relpath for sig check
ctb Mar 2, 2024
1fba4ee
clean up relpath a bit
ctb Mar 2, 2024
f3079b9
add --relpath to sig collect
ctb Mar 2, 2024
b17cd4f
implement --relpath for sig collect too
ctb Mar 3, 2024
72a9062
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
5cf4749
straighten out tests
ctb Mar 3, 2024
d6dfe35
add abspath/relpath to sig collect tests
ctb Mar 3, 2024
da3f165
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
b185be9
add abspath/relpath tests for sig collect
ctb Mar 3, 2024
9d2018c
Merge branch 'manifest_relpath' of https://github.com/sourmash-bio/so…
ctb Mar 3, 2024
28e0a5b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
e9b9a34
fix/update comments
ctb Mar 3, 2024
f88c662
update docs for sig check and sig collect
ctb Mar 3, 2024
4e10802
add in some documentation about relpath in sourmash internals
ctb Mar 3, 2024
6dd3e3a
path rewriting tests for sig check
ctb Mar 3, 2024
e3bb509
update docs
ctb Mar 3, 2024
703585c
add --relpath tests for sig collect
ctb Mar 3, 2024
346ed82
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
637b1fc
a few more tests
ctb Mar 3, 2024
8b1bc2c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
5937cb0
Merge branch 'manifest_relpath' into update_sig_collect
ctb Mar 3, 2024
c9bf80e
more documentation foo
ctb Mar 4, 2024
5d9f219
more docs
ctb Mar 4, 2024
a13b77e
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 8, 2024
cbc7fbc
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 9, 2024
bad1a71
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 10, 2024
104042a
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 12, 2024
3741203
more updates
ctb Mar 17, 2024
eba2541
more
ctb Mar 17, 2024
c42fd00
Merge branch 'latest' of github.com:sourmash-bio/sourmash into update…
ctb Mar 17, 2024
0d9ec6d
Merge branch 'latest' of github.com:sourmash-bio/sourmash into update…
ctb Mar 19, 2024
c34421d
misc fixes
ctb Mar 19, 2024
cb6bae3
Update doc/command-line.md
ctb Mar 20, 2024
8711c3d
Merge branch 'latest' into update_sig_collect
ctb Mar 20, 2024
165 changes: 104 additions & 61 deletions doc/command-line.md
@@ -1914,7 +1914,10 @@ will continue processing input sequences.

### `sourmash signature manifest` - output a manifest for a file

Output a manifest for a file, database, or collection.
Output a manifest for a file, database, or collection. Note that
these manifests are not usually suitable for use as standalone
manifests; the `sourmash sig collect` and `sourmash sig check`
commands produce standalone manifests.

For example,
```
@@ -1942,8 +1945,10 @@ CSV and SQLite manifest files.

### `sourmash signature check` - compare picklists and manifests

Compare picklists and manifests across databases, and optionally output matches
and missing items.
Compare picklists and manifests across databases, and optionally
output matches and missing items. In particular, `sig check` can be
used to create standalone manifests for a subset of a large collection,
using picklists.

For example,
```
@@ -1962,17 +1967,28 @@ collections of signatures and identifiers.
With `-m/--save-manifest-matching`, `sig check` creates a standalone
manifest. In these manifests, sourmash v4 will by default write paths
to the matched elements that are relative to the current working
directory. In some cases - when the output manifest is in different
directory. In some cases - when the output manifest is in a different
directory - this will create manifests that do not work properly
with sourmash. The `--relpath` argument will rewrite the paths to be
relative to the manifest, while the `--abspath` argument will rewrite
paths to be absolute. The `--relpath` behavior will be the default in
sourmash v5.
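
To make the three path styles concrete, here is a minimal Python sketch of the rewriting described above. It is an illustration only, not sourmash's implementation: the function `rewrite_path` and the `mode` names are hypothetical, and `posixpath` is used so the result is the same on every platform.

```python
import posixpath


def rewrite_path(sketch_path, manifest_path, cwd, mode):
    """Return the location to record in a manifest for one sketch file.

    Hypothetical helper: `mode` is "cwd" (path kept as given, relative to
    the working directory), "relpath", or "abspath".
    """
    # Resolve both paths against the working directory first.
    sketch_abs = posixpath.normpath(posixpath.join(cwd, sketch_path))
    manifest_abs = posixpath.normpath(posixpath.join(cwd, manifest_path))
    manifest_dir = posixpath.dirname(manifest_abs)

    if mode == "abspath":
        return sketch_abs
    if mode == "relpath":
        # Relative to the directory that holds the manifest.
        return posixpath.relpath(sketch_abs, start=manifest_dir)
    # "cwd": only resolves correctly if the manifest is read from `cwd`.
    return sketch_path


cwd = "/work"
print(rewrite_path("sigs/a.zip", "out/mf.csv", cwd, "cwd"))      # sigs/a.zip
print(rewrite_path("sigs/a.zip", "out/mf.csv", cwd, "relpath"))  # ../sigs/a.zip
print(rewrite_path("sigs/a.zip", "out/mf.csv", cwd, "abspath"))  # /work/sigs/a.zip
```

With the manifest in `out/`, only the `relpath` and `abspath` results still point at the sketch file when resolved from the manifest's own directory.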

Standalone manifests created with `-m/--save-manifest-matching` will
use the paths given to `sig check` on the command line; we recommend
using zip files and sig files, and avoiding directory hierarchies or
path lists. You can use `--from-file` to pass in long lists of
filenames via a text file.

### `sourmash signature collect` - collect manifests across databases

Collect manifests from across (many) files and merge into a single
standalone manifest.
standalone manifest. Standalone manifests can be used directly as a
sourmash database; they support efficient searching and selection of
sketches, as well as lazy loading of individual sketches from large
collections. See
[advanced usage information on sourmash databases](databases-advanced.md)
for more information.

For example,
```
@@ -1987,20 +2003,30 @@ This manifest file can be loaded directly from the command line by sourmash.
particularly useful when working with large collections of signatures and
identifiers, and has command line options for merging and updating manifests.

The standalone manifests created by `sig collect` will reference the
paths given on the command line; we recommend using zip files and sig
files, and avoiding directory hierarchies or path lists. You can also
use `--from-file` to pass in long lists of filenames.

Standalone manifests produced by `sig collect` work most efficiently
when constructed from many small zip file collections.

As with `sig check`, the standalone manifests created by `sig collect`
in sourmash v4 will by default write paths to the matched elements
relative to the current working directory. When the output manifest
is in a different directory, this will create manifests that do not work
properly with sourmash. The `--relpath` argument will rewrite the
paths to be relative to the manifest, while the `--abspath` argument
will rewrite paths to be absolute. The `--relpath` behavior will be
the default in sourmash v5.
is in a different directory, this will create manifests that do not
work properly with sourmash. The `--relpath` argument will rewrite
the paths to be relative to the manifest, while the `--abspath`
argument will rewrite paths to be absolute. The `--relpath` behavior
will be the default in sourmash v5.

## Advanced command-line usage

### Loading signatures and databases

sourmash uses several different command-line styles.
sourmash uses several different command-line styles. Most sourmash
commands can load sketches from any standard collection type; we
primarily recommend using zipfiles (but read on!).

Briefly,

@@ -2011,22 +2037,18 @@ Briefly,
need to provide a selector (ksize with `-k`, moltype with `--dna` etc,
or md5sum with `--query-md5`) that picks out a single signature.

* `compare` takes multiple signatures and can load them from files,
directories, and indexed databases (SBT or LCA). It can also take
a list of file paths in a text file, using `--from-file` (see below).
* `compare` takes multiple signatures and can load them from any
sourmash collection type.

* the `lca classify` and `lca summarize` commands take multiple
signatures with `--query`, and multiple LCA databases, with
`--db`. `sourmash multigather` also uses this style. This allows these
commands to specify multiple queries **and** multiple databases without
(too much) confusion. These commands will take files containing
signature files using `--query-from-file` (see below).
(too much) confusion. The databases must be LCA databases.

* `index` and `lca index` take a few fixed parameters (database name,
and for `lca index`, a taxonomy file) and then an arbitrary number of
other files that contain signatures, including files, directories,
and indexed databases. These commands will also take `--from-file`
(see below).
other files that contain signatures.

None of these commands currently support searching, comparing, or indexing
signatures with multiple ksizes or moltypes at the same time; you need
@@ -2092,7 +2114,7 @@ The following `coltype`s are currently supported for picklists:
* `gather` - use the CSV output of `sourmash gather` as a picklist
* `prefetch` - use the CSV output of `sourmash prefetch` as a picklist
* `search` - use the CSV output of `sourmash search` as a picklist
* `manifest` - use the CSV output of `sourmash sig manifest` as a picklist
* `manifest` - use CSV manifests produced by `sig manifest` as a picklist
ctb marked this conversation as resolved.

Identifiers are constructed by using the first space delimited word in
the signature name.
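
As a small illustration of the rule above, the following Python sketch extracts an identifier from a signature name. The function name `ident_from_name` and the example name are made up for this sketch; it is not sourmash's own code.

```python
def ident_from_name(signature_name):
    """Return the first space-delimited word of a signature name."""
    # Split on the first space only; everything after it is ignored.
    return signature_name.split(" ", 1)[0]


# Hypothetical signature name, used only for illustration.
print(ident_from_name("GCF_000005845.2 Escherichia coli str. K-12"))
# GCF_000005845.2
```
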
@@ -2101,7 +2123,7 @@ One way to build a picklist is to use `sourmash sig grep <pattern>
<collection> --csv out.csv` to construct a CSV file containing a list
of all sketches that match the pattern (which can be a string or
regexp). The `out.csv` file can be used as a picklist via the picklist
manifest format with `--picklist out.csv::manifest`.
manifest CSV format with `--picklist out.csv::manifest`.

You can also use `sourmash sig describe --csv out.csv <signatures>` or
`sourmash sig manifest -o out.csv <filename_or_db>` to construct an
@@ -2144,7 +2166,9 @@ slow, especially for many (100s or 1000s) of signatures.
All of the `sourmash` commands support loading collections of
signatures from zip files. You can create a compressed collection of
signatures using `sourmash sig cat *.sig -o collections.zip` and then
specifying `collections.zip` on the command line in place of `*.sig`.
specifying `collections.zip` on the command line in place of `*.sig`;
you can also sketch FASTA/FASTQ files directly into a zip file with
`-o collections.zip`.

### Choosing signature output formats

@@ -2171,7 +2195,7 @@ to stdout.
All of these save formats can be loaded by sourmash commands.

**We strongly suggest using .zip files to store signatures: they are fast,
small, and fully supported by all the sourmash commands.**
small, and fully supported by all the sourmash commands and API.**

Note that when outputting large collections of signatures, some save
formats require holding all the sketches in memory until they can be
@@ -2186,19 +2210,6 @@ databases!](databases-advanced.md)

### Loading many signatures

#### Loading signatures within a directory hierarchy

All of the `sourmash` commands support loading signatures from
beneath directories; provide the paths on the command line.

#### Passing in lists of files

Most sourmash commands will also take a `--from-file` or
`--query-from-file`, which will take the location of a text file containing
a list of file paths. This can be useful for situations where you want
to specify thousands of queries, or a subset of signatures produced by
some other command.

#### Indexed databases

Indexed databases can make searching signatures much faster. SBT
@@ -2209,9 +2220,6 @@ SQLite databases (new in sourmash v4.4.0) are typically larger on disk
than SBTs and LCAs, but in turn are fast to load and support very low
memory search.

(LCA databases also directly permit taxonomic searches using `sourmash lca`
functions.)

Commands that take multiple signatures or collections of signatures
will also work with indexed databases.

@@ -2223,9 +2231,9 @@ only at one scaled value. If the database signature type is
incompatible with the other signatures, sourmash will complain
appropriately.

In contrast, signature files, zip collections, and directory
hierarchies can contain many different types of signatures, and
compatible ones will be selected automatically.
In contrast, signature files and zip collections can contain many
different types of signatures, and compatible ones will be selected
automatically.

Use the `sourmash index` command to create an SBT.

@@ -2235,26 +2243,50 @@ database can be saved in JSON or SQL format with `-F json` or `-F sql`.
Use `sourmash sig cat <list of signatures> -o <output>.sqldb` to create
a SQLite indexed database.

#### Loading signatures within a directory hierarchy

All of the `sourmash` commands support loading signatures (`.sig` or
`.sig.gz` files) from within directory hierarchies; you can just
provide the paths to the top-level directory on the command line.

However, this is no longer recommended because it can be very
inefficient; we instead suggest passing all of the sketch files in
the directory into `sig collect` to build a standalone manifest, or
using `sig cat` on the directory to generate a zip file.

#### Passing in lists of files

sourmash commands support `--from-file` or `--query-from-file`, which
will take the location of a text file containing a list of file
paths. This can be useful for situations where you want to specify
thousands of queries, or a subset of signatures produced by some other
command.

This is no longer recommended when using large collections; we instead
suggest using standalone manifests built with `sig collect` and `sig
check`, which will include extra metadata that supports fast loading.
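
For smaller collections where a path list is still convenient, one way to build the text file is sketched below in Python. This assumes the list holds one path per line; the helper names `list_sketches` and `write_path_list` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_sketches(top_dir):
    """Return sorted paths of all .sig / .sig.gz files beneath top_dir."""
    top = Path(top_dir)
    return sorted(
        str(p) for p in top.rglob("*")
        if p.is_file() and p.name.endswith((".sig", ".sig.gz"))
    )


def write_path_list(paths, list_file):
    """Write one path per line (assumed list format) and return the file."""
    list_file = Path(list_file)
    list_file.write_text("".join(f"{p}\n" for p in paths))
    return list_file


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.sig", "sub/b.sig.gz", "notes.txt"):
        (root / name).write_text("")
    out = write_path_list(list_sketches(root), root / "paths.txt")
    print(out.read_text())  # lists a.sig and sub/b.sig.gz, not notes.txt
```

The resulting `paths.txt` is the kind of file that `--from-file` takes.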

### Combining search databases on the command line

All of the commands in sourmash operate in "online" mode, so you can
combine multiple databases and signatures on the command line and get
the same answer as if you built a single large database from all of
them. The only caveat to this rule is that if you have multiple
identical matches present across the databases, the order in which
they are found will differ depending on the order that the files are
they are used may depend on the order that the files are
passed in on the command line.

### Using stdin

Most commands will take signature JSON data via stdin using the usual
UNIX convention, `-`. Moreover, `sourmash sketch` and the `sourmash
sig` commands will output to stdout. So, for example,
```
sourmash sketch ... -o - | sourmash sig describe -
```
will describe the signatures that were just created.

`sourmash sketch ... -o - | sourmash sig describe -` will describe the
signatures that were just created.

### Using manifests to explicitly refer to collections of files
### Using standalone manifests to explicitly refer to collections of files

(sourmash v4.4 and later)

@@ -2264,9 +2296,9 @@ internals to speed up signature selection through picklists and
pattern matching.

Manifests can _also_ be used externally (via the command-line), and
may be useful for organizing large collections of signatures. They can
be generated with the `sig collect`, `sig manifest`, and `sig check`
subcommands.
these "standalone manifests" may be useful for organizing large
collections of signatures. They can be generated with the `sig
collect`, `sig manifest`, and `sig check` subcommands.

Suppose you have a large collection of signatures (`.sig` or `.sig.gz`
files) in a location (e.g., under a directory, or in a zip file). You
@@ -2280,21 +2312,32 @@ sourmash sig fileinfo manifest.sqlmf
```
This manifest contains _references_ to the signatures (but not the
signatures themselves) and can then be used as a database target for most
sourmash operations - search, gather, etc.
sourmash operations - search, gather, etc. Manifests support
fast selection and lazy loading of sketches in many situations.
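
The idea behind fast selection and lazy loading can be sketched in a few lines of Python. This is a simplified model, not sourmash's implementation: the rows are made up, only a few manifest columns are shown, and `select_rows`, `lazy_load`, and `fake_loader` are hypothetical names.

```python
import csv
import io

# Simplified, made-up manifest rows; real manifests carry more columns.
MANIFEST_CSV = """internal_location,ksize,moltype,name
sigs/a.sig.gz,31,DNA,genome A
sigs/a.sig.gz,21,DNA,genome A
sigs/b.sig.gz,31,DNA,genome B
sigs/c.sig.gz,10,protein,genome C
"""


def select_rows(rows, ksize, moltype):
    """Pick rows using metadata only; no sketch file is opened here."""
    return [r for r in rows
            if int(r["ksize"]) == ksize and r["moltype"] == moltype]


def lazy_load(rows, loader):
    """Yield sketches one at a time, opening a file only when it is needed."""
    for row in rows:
        yield loader(row["internal_location"])


opened = []


def fake_loader(path):
    # Stand-in for reading a sketch from disk; records what was opened.
    opened.append(path)
    return f"<sketch from {path}>"


rows = list(csv.DictReader(io.StringIO(MANIFEST_CSV)))
selected = select_rows(rows, ksize=31, moltype="DNA")
sketches = lazy_load(selected, fake_loader)

print(next(sketches))  # <sketch from sigs/a.sig.gz>
print(opened)          # ['sigs/a.sig.gz']
```

Selection touches only the manifest rows, and the second matching file is not opened until the next sketch is requested.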

The `sig check` command can also be used to create standalone manifests
from collections using a picklist, with the `-m/--save-manifest-matching`
option. This is useful for commands that don't support picklists natively,
e.g. plugins and extensions.

Note that `sig collect` will generate manifests containing the
pathnames given to it - so if you use relative paths, the references
will be relative to the working directory in which `sig collect` was
Note that `sig collect` and `sig check` will generate manifests containing the
pathnames given to them - so if you use relative paths, the references
will be relative to the working directory in which the command was
run. You can use `sig collect --abspath` to rewrite the paths
into absolute paths.
into absolute paths, or `sig collect --relpath` to rewrite the paths
relative to the manifest file.

**Our advice:** We suggest using zip file collections for most
situations; we primarily recommend using explicit manifests for
situations where you have a **very large** collection of signatures
(1000s or more), and don't want to make multiple copies of signatures
in the collection (as you would have to, with a zipfile). This can be
useful if you want to refer to different subsets of the collection
without making multiple copies in a zip file.
situations; we strongly recommend using standalone manifests for
situations where you have **very large** sketches or a **very large**
collection of sketches (1000s or more), and don't want to make
multiple copies of signatures in the collection (as you would have to,
with a zipfile). This is particularly useful if you want to refer to different
subsets of the collection without making multiple copies in a zip
file.

You can read more about the details of zip files and manifests in
[the advanced usage information for databases](databases-advanced.md).

### Using sourmash plugins
