[MRG] implement tax grep to produce identifier picklists from taxonomies (#2178)

* add basics of sourmash tax grep command

* initial implementation

* fixed bug; added various args

* more tests

* support gzipped taxonomy files

* let 'prepare' save gzip csv files, too

* switch to outputting lineage; refactor

* fix and test invert match

* more tests

* add --count

* comment

* upd docs

* add test for tax prepare combining

* test multiple, duplicate tax

* finish? tests

* implement --force

* update usage string

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

ctb and bluegenes authored Aug 8, 2022
1 parent 6ac4862 commit 010718c
Showing 8 changed files with 512 additions and 6 deletions.
59 changes: 55 additions & 4 deletions doc/command-line.md
@@ -91,6 +91,7 @@ information; these are grouped under the `sourmash tax` and
* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
* `tax genome` - summarize single-genome gather results and report most likely classification.
* `tax annotate` - annotate gather results with lineage information (no summarization or classification).
* `tax grep` - subset taxonomies and create picklists based on taxonomy string matches.

`sourmash lca` commands:

@@ -706,19 +707,22 @@ To produce multiple output types from the same command, add the types into the
for each database match to gather output. Do not summarize or classify.
Note that this is not required for either `summarize` or `classify`.

By default, `annotate` uses the name of each input gather csv to write
an updated version with lineage information. For example, annotating
`sample1.gather.csv` would produce `sample1.gather.with-lineages.csv`.

```
sourmash tax annotate \
    --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
    --taxonomy gtdb-rs202.taxonomy.v2.csv
```
> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`

### `sourmash tax prepare` - prepare and/or combine taxonomy files

`sourmash tax prepare` prepares taxonomy files for other `sourmash tax`
commands.

All `sourmash tax` commands must be given one or more taxonomy files as
parameters to the `--taxonomy` argument. These files can be either CSV
files or (as of sourmash 4.2.1) sqlite3 databases. sqlite3 databases
@@ -743,6 +747,53 @@ can be set to CSV like so:
sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv -F csv
```
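
The changes to `tax_utils.py` below also let `tax prepare` write gzip-compressed CSV output; as a sketch, assuming a `.gz` suffix on the output filename is enough to trigger compression:
```
sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv.gz -F csv
```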

### `sourmash tax grep` - subset taxonomies and create picklists based on taxonomy string matches

(`sourmash tax grep` is a new command as of sourmash v4.5.0.)

`sourmash tax grep` searches taxonomies for matching strings,
optionally restricting the string search to a specific taxonomic rank.
It creates new files containing matching taxonomic entries; these new
files can serve as taxonomies and can also be used as
[picklists to restrict database matches](#using-picklists-to-subset-large-collections-of-signatures).

Usage:
```
sourmash tax grep <pattern> -t <taxonomy-db> [<taxonomy-db> ...]
```
where `pattern` is a regular expression; see Python's
[Regular Expression HOWTO for details on supported regexp features](https://docs.python.org/3/howto/regex.html#regex-howto).

For example,
```
sourmash tax grep Shew -t gtdb-rs207.taxonomy.sqldb -o shew-picklist.csv
```
will search for a string match to `Shew` within the entire GTDB RS207
taxonomy, and will output a subset taxonomy in `shew-picklist.csv`.
This picklist can be used with the GTDB
RS207 databases like so:
```
sourmash search query.sig gtdb-rs207.genomic.k31.zip \
--picklist shew-picklist.csv:ident:ident
```


`tax grep` can also restrict string matching to a specific taxonomic rank
with `-r/--rank`; for example,
```
sourmash tax grep Shew -t gtdb-rs207.taxonomy.sqldb \
-o shew-picklist.csv -r genus
```
will restrict matches to the rank of genus. Available ranks are
superkingdom, phylum, class, order, family, genus, and species.
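
Because the pattern is a regular expression, alternation can be combined with `-r` to collect several taxa into one picklist; a sketch (the output filename is arbitrary):
```
sourmash tax grep 'Shewanella|Vibrio' -t gtdb-rs207.taxonomy.sqldb \
    -r genus -o two-genera-picklist.csv
```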

`tax grep` also takes several standard grep arguments, including `-i`
to ignore case and `-v` to output only taxonomic lineages that do
_not_ match the pattern.
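
For instance, to build a picklist of every lineage that does _not_ contain `shew`, ignoring case (output filename again arbitrary):
```
sourmash tax grep -i -v shew -t gtdb-rs207.taxonomy.sqldb \
    -o not-shew-picklist.csv
```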

Currently only CSV output (optionally gzipped) is supported; use `sourmash tax prepare` to
convert CSV output from `tax grep` into a sqlite3 taxonomy database.
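
A sketch of that conversion, assuming the sqlite output format is selected with `-F sql` (mirroring the `-F csv` example under `tax prepare` above):
```
sourmash tax prepare --taxonomy shew-picklist.csv -o shew-picklist.sqldb -F sql
```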

## `sourmash lca` subcommands for in-memory taxonomy integration

These commands use LCA databases (created with `lca index`, below, or
2 changes: 2 additions & 0 deletions src/sourmash/cli/tax/__init__.py
@@ -8,6 +8,8 @@
from . import genome
from . import annotate
from . import prepare
from . import grep

from ..utils import command_list
from argparse import SUPPRESS, RawDescriptionHelpFormatter
import os
72 changes: 72 additions & 0 deletions src/sourmash/cli/tax/grep.py
@@ -0,0 +1,72 @@
"""search taxonomies and output picklists.."""

usage="""
sourmash tax grep <term> --taxonomy-csv <taxonomy_file> [ ... ]
`sourmash tax grep` searches taxonomies for matching strings,
optionally restricting the string search to a specific taxonomic rank.
It creates new files containing matching taxonomic entries; these new
files can serve as taxonomies and can also be used as picklists.
Please see the 'tax grep' documentation for more details:
https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-grep-subset-taxonomies-and-create-picklists-based-on-taxonomy-string-matches
"""

import sourmash
from sourmash.logging import notify, print_results, error


def subparser(subparsers):
    subparser = subparsers.add_parser('grep', usage=usage)
    subparser.add_argument('pattern')
    subparser.add_argument('-r', '--rank',
                           help="search only this rank",
                           choices=['superkingdom',
                                    'phylum',
                                    'class',
                                    'order',
                                    'family',
                                    'genus',
                                    'species'])
    subparser.add_argument(
        '-v', '--invert-match',
        help="select non-matching lineages",
        action="store_true"
    )
    subparser.add_argument(
        '-i', '--ignore-case',
        help="ignore case distinctions (search lower and upper case both)",
        action="store_true"
    )
    subparser.add_argument(
        '--silent', '--no-picklist-output',
        help="do not output picklist",
        action='store_true',
    )
    subparser.add_argument(
        '-c', '--count',
        help="only output a count of discovered lineages; implies --silent",
        action='store_true'
    )
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
        nargs="+", required=True,
        help='database lineages'
    )
    subparser.add_argument(
        '-o', '--output', default='-',
        help='output file (defaults to stdout)',
    )
    subparser.add_argument(
        '-f', '--force', action='store_true',
        help='continue past errors in file and taxonomy loading',
    )

def main(args):
    import sourmash
    return sourmash.tax.__main__.grep(args)
54 changes: 54 additions & 0 deletions src/sourmash/tax/__main__.py
@@ -5,6 +5,7 @@
import csv
import os
from collections import defaultdict
import re

import sourmash
from ..sourmash_args import FileOutputCSV, FileOutput
@@ -375,6 +376,59 @@ def prepare(args):
notify("done!")


def grep(args):
term = args.pattern
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
force=args.force)

silent = args.silent or args.count

notify(f"searching {len(args.taxonomy_csv)} taxonomy files for '{term}'")
if args.invert_match:
notify("-v/--invert-match specified; returning only lineages that do not match.")
if args.rank:
notify(f"limiting matches to {args.rank} level")

# build the search pattern
pattern = args.pattern
if args.ignore_case:
pattern = re.compile(pattern, re.IGNORECASE)
else:
pattern = re.compile(pattern)

# determine if lineage matches.
def find_pattern(lineage, select_rank):
for (rank, name) in lineage:
if select_rank is None or rank == select_rank:
if pattern.search(name):
return True
return False

if args.invert_match:
def search_pattern(l, r):
return not find_pattern(l, r)
else:
search_pattern = find_pattern

match_ident = []
for ident, lineage in tax_assign.items():
if search_pattern(lineage, args.rank):
match_ident.append((ident, lineage))

if silent:
notify(f"found {len(match_ident)} matches.")
notify("(no matches will be saved because of --silent/--count")
else:
with FileOutputCSV(args.output) as fp:
w = csv.writer(fp)

w.writerow(['ident'] + list(sourmash.lca.taxlist(include_strain=False)))
for ident, lineage in sorted(match_ident):
w.writerow([ident] + [ x.name for x in lineage ])

notify(f"found {len(match_ident)} matches; saved identifiers to picklist file '{args.output}'")


def main(arglist=None):
args = sourmash.cli.get_parser().parse_args(arglist)
submod = getattr(sourmash.cli.sig, args.subcmd)
17 changes: 15 additions & 2 deletions src/sourmash/tax/tax_utils.py
@@ -5,6 +5,7 @@
import csv
from collections import namedtuple, defaultdict
from collections import abc
import gzip

from sourmash import sqlite_utils
from sourmash.exceptions import IndexNotSupported
@@ -641,7 +642,16 @@ def load(cls, filename, *, delimiter=',', force=False,
        if os.path.isdir(filename):
            raise ValueError(f"'{filename}' is a directory")

        xopen = open
        try:
            with gzip.open(filename, 'r') as fp:
                # successful open/read? use gzip
                fp.read(1)
                xopen = gzip.open
        except gzip.BadGzipFile:
            pass

        with xopen(filename, 'rt', newline='') as fp:
            r = csv.DictReader(fp, delimiter=delimiter)
            header = r.fieldnames
            if not header:
Expand Down Expand Up @@ -931,7 +941,10 @@ def save(self, filename_or_fp, file_format):
        # we need a file handle; open file.
        fp = filename_or_fp
        if is_filename:
            if filename_or_fp.endswith('.gz'):
                fp = gzip.open(filename_or_fp, 'wt', newline="")
            else:
                fp = open(filename_or_fp, 'w', newline="")

        try:
            self._save_csv(fp)
Binary file added tests/test-data/tax/gtdb-tax-grep.sigs.zip