[MRG] Remove lca gather #1307

Merged: 2 commits, Feb 6, 2021
7 changes: 2 additions & 5 deletions doc/classifying-signatures.md
@@ -81,9 +81,6 @@ differences between the `sourmash lca` subcommands and the basic
output structured taxonomic information, and these are what you should look
to if you are interested in doing classification.

-The command `lca gather` applies the `gather` algorithm to search an
-LCA database; it reports taxonomy.
-
It's important to note that taxonomy based on k-mers is very, very
specific and if you get a match, it's pretty reliable. On the
converse, however, k-mer identification is very brittle with respect
@@ -120,8 +117,8 @@ containment queries against genome databases. This will give you
numbers that (approximately) match what you get from counting mapped
reads.

-If you compute your input signatures with `--track-abundance`, both
-`sourmash gather` and `sourmash lca gather` will use that information
+If you compute your input signatures with `--track-abundance`,
+`sourmash gather` will use that information
to calculate an abundance-weighted result. This will weight
each match to a hash value by the multiplicity of the hash value in
the query signature. You can turn off this behavior with
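
Note for readers of this hunk: the abundance weighting described in the paragraph above can be illustrated with a small sketch. This is not sourmash's implementation. The function name `weighted_fraction`, the `ignore_abundance` parameter, and the dictionary representation of a signature (hash value mapped to multiplicity) are all assumptions made for illustration.

```python
def weighted_fraction(query_abunds, match_hashes, ignore_abundance=False):
    """Fraction of the query covered by a match (illustrative only).

    query_abunds: dict mapping hash value -> multiplicity in the query,
                  a hypothetical stand-in for a signature computed with
                  --track-abundance.
    match_hashes: set of hash values present in the matching genome.
    """
    if ignore_abundance:
        # Flat mode: every query hash counts once, regardless of multiplicity.
        query_abunds = {h: 1 for h in query_abunds}

    total = sum(query_abunds.values())
    if total == 0:
        return 0.0

    # Each shared hash contributes its multiplicity in the query.
    shared = sum(n for h, n in query_abunds.items() if h in match_hashes)
    return shared / total


query = {1: 10, 2: 1, 3: 1}   # hash 1 is seen ten times in the query
match = {1}                   # the match contains only hash 1

print(weighted_fraction(query, match))                         # 10 of 12 weighted hashes
print(weighted_fraction(query, match, ignore_abundance=True))  # 1 of 3 distinct hashes
```

With weighting, a single high-multiplicity hash dominates the result; without it, the same match covers only one third of the distinct query hashes.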
38 changes: 2 additions & 36 deletions doc/command-line.md
@@ -77,7 +77,6 @@ walkthrough of these commands.

* `lca classify` classifies many signatures against an LCA database.
* `lca summarize` summarizes the content of metagenomes using an LCA database.
-* `lca gather` finds non-overlapping matches to a metagenome in an LCA database.
* `lca index` creates a database for use with LCA subcommands.
* `lca rankinfo` summarizes the content of a database.
* `lca compare_csv` compares lineage spreadsheets, e.g. those output by `lca classify`.
@@ -261,8 +260,8 @@ Note:

Use `sourmash gather` to classify a metagenome against a collection of
genomes with no (or incomplete) taxonomic information. Use `sourmash
-lca summarize` and `sourmash lca gather` to classify a metagenome
-using a collection of genomes with taxonomic information.
+lca summarize` to classify a metagenome using a collection of genomes
+with taxonomic information.

## `sourmash lca` subcommands for taxonomic classification

@@ -431,39 +430,6 @@ text file passed to `sourmash lca summarize` with the
`--query-from-file` flag; these files will be appended to the `--query`
input.

-### `sourmash lca gather` - find metagenome taxonomy (DEPRECATED for 4.0)
-
-The `sourmash lca gather` command finds all non-overlapping
-matches to the query, similar to the `sourmash gather` command. This
-is specifically meant for metagenome and genome bin analysis. (See
-[Classifying Signatures](classifying-signatures.md) for more
-information on the different approaches that can be used here.)
-
-If the input signature was computed with `--track-abundance`, output
-will be abundance weighted (unless `--ignore-abundances` is
-specified). `-o/--output` will create a CSV file containing the
-matches.
-
-Usage:
-
-```
-sourmash lca gather query.sig [<lca database> ...]
-```
-
-Example output:
-
-```
-overlap p_query p_match
---------- ------- --------
-1.8 Mbp 14.6% 9.1% Fusobacterium nucleatum
-1.0 Mbp 7.8% 16.3% Proteiniclasticum ruminis
-1.0 Mbp 7.7% 25.9% Haloferax volcanii
-0.9 Mbp 7.4% 11.8% Nostoc sp. PCC 7120
-0.9 Mbp 7.0% 5.8% Shewanella baltica
-0.8 Mbp 6.0% 8.6% Desulfovibrio vulgaris
-0.6 Mbp 4.9% 12.6% Thermus thermophilus
-```
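
Note for readers of this hunk: the "non-overlapping matches" behaviour that the removed section describes can be sketched as a greedy loop over sets of hashes. This is a minimal illustration of the general idea, not the sourmash implementation. The function name `greedy_gather`, the `threshold` parameter, and the plain-dict database are hypothetical.

```python
def greedy_gather(query_hashes, database, threshold=1):
    """Greedily report matches that do not overlap in the query.

    query_hashes: set of hash values in the query.
    database:     dict mapping a genome name -> set of hash values
                  (hypothetical stand-in for a real database).
    threshold:    minimum number of new hashes a match must explain.
    """
    remaining = set(query_hashes)
    results = []

    while remaining:
        # Find the genome that explains the most still-unassigned hashes.
        best_name, best_overlap = None, set()
        for name, hashes in database.items():
            overlap = remaining & hashes
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap

        # Stop when nothing matches, or the best match is too small.
        if best_name is None or len(best_overlap) < threshold:
            break

        results.append((best_name, len(best_overlap)))
        # Remove the explained hashes so later matches cannot reuse them.
        remaining -= best_overlap

    return results


query = set(range(1, 9))
db = {"A": {1, 2, 3, 4, 5}, "B": {4, 5, 6, 7}, "C": {8, 9}}
print(greedy_gather(query, db))  # [('A', 5), ('B', 2), ('C', 1)]
```

Genome `B` shares four hashes with the query, but two of them were already assigned to `A`, so it is credited with only the two that remain.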

### `sourmash lca index` - build an LCA database

The `sourmash lca index` command creates an LCA database from
1 change: 0 additions & 1 deletion src/sourmash/cli/lca/__init__.py
@@ -6,7 +6,6 @@

from . import classify
from . import compare_csv
-from . import gather
from . import index
from . import rankinfo
from . import summarize
32 changes: 0 additions & 32 deletions src/sourmash/cli/lca/gather.py

This file was deleted.

1 change: 0 additions & 1 deletion src/sourmash/lca/__init__.py
@@ -9,6 +9,5 @@
from .command_classify import classify
from .command_summarize import summarize_main
from .command_rankinfo import rankinfo_main
-from .command_gather import gather_main
from .__main__ import main

3 changes: 1 addition & 2 deletions src/sourmash/lca/__main__.py
@@ -5,7 +5,7 @@
import sys
import argparse

-from . import classify, index, summarize_main, rankinfo_main, gather_main
+from . import classify, index, summarize_main, rankinfo_main
from .command_compare_csv import compare_csv
from ..logging import set_quiet, error

@@ -16,7 +16,6 @@

index <taxonomy.csv> <output_db name> <signature [...]> - create LCA database
classify --db <db_name [...]> --query <signature [...]> - classify genomes
-gather <signature> <db_name [...]> - classify metagenomes
summarize --db <db_name [...]> --query <signature [...]> - summarize mixture
rankinfo <db_name [...]> - database rank info
compare_csv <csv1> <csv2> - compare spreadsheets
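
Note for readers of this hunk: the user-visible effect of dropping `gather` from the usage text and imports can be pictured with a small dispatch table. This is a hypothetical sketch, not the code in `src/sourmash/lca/__main__.py`. The names `COMMANDS`, `_stub`, and `dispatch`, and the error message, are assumptions, and the stub functions stand in for the real subcommand entry points.

```python
def _stub(name):
    """Build a placeholder subcommand (hypothetical; real ones live in sourmash.lca)."""
    def run(args):
        return f"ran {name} with {args}"
    return run


# Subcommands remaining after this change; 'gather' is intentionally absent.
COMMANDS = {
    "index": _stub("index"),
    "classify": _stub("classify"),
    "summarize": _stub("summarize"),
    "rankinfo": _stub("rankinfo"),
    "compare_csv": _stub("compare_csv"),
}


def dispatch(argv):
    """Look up the first argument in the table and run it on the rest."""
    if not argv or argv[0] not in COMMANDS:
        # Assumed behaviour: unknown subcommands end the program with a message.
        raise SystemExit(f"unknown lca subcommand: {argv[:1]}")
    return COMMANDS[argv[0]](argv[1:])


print(dispatch(["rankinfo", "db.lca.json"]))
```

Under this assumed design, calling `dispatch(["gather", "query.sig"])` raises `SystemExit`, since the name no longer resolves to a function.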