[MRG] implement tax grep to produce identifier picklists from taxonomies (#2178)

* add basics of sourmash tax grep command

* initial implementation

* fixed bug; added various args

* more tests

* support gzipped taxonomy files

* let 'prepare' save gzip csv files, too

* switch to outputting lineage; refactor

* fix and test invert match

* more tests

* add --count

* comment

* upd docs

* add test for tax prepare combining

* test multiple, duplicate tax

* finish? tests

* implement --force

* update usage string

* Update doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

ctb and bluegenes authored Aug 8, 2022
1 parent 6ac4862 commit 010718c
Showing 8 changed files with 512 additions and 6 deletions.
59 changes: 55 additions & 4 deletions doc/command-line.md
@@ -91,6 +91,7 @@ information; these are grouped under the `sourmash tax` and
* `tax metagenome` - summarize metagenome gather results at each taxonomic rank.
* `tax genome` - summarize single-genome gather results and report most likely classification.
* `tax annotate` - annotate gather results with lineage information (no summarization or classification).
* `tax grep` - subset taxonomies and create picklists based on taxonomy string matches.

`sourmash lca` commands:

@@ -706,19 +707,22 @@ To produce multiple output types from the same command, add the types into the
for each database match to gather output. Do not summarize or classify.
Note that this is not required for either `summarize` or `classify`.

By default, `annotate` uses the name of each input gather csv to write
an updated version with lineage information. For example, annotating
`sample1.gather.csv` would produce `sample1.gather.with-lineages.csv`.

```
sourmash tax annotate \
    --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
    --taxonomy gtdb-rs202.taxonomy.v2.csv
```
> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`

### `sourmash tax prepare` - prepare and/or combine taxonomy files

`sourmash tax prepare` prepares taxonomy files for other `sourmash tax`
commands.

All `sourmash tax` commands must be given one or more taxonomy files as
parameters to the `--taxonomy` argument. These files can be either CSV
files or (as of sourmash 4.2.1) sqlite3 databases. sqlite3 databases
@@ -743,6 +747,53 @@ can be set to CSV like so:
sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv -F csv
```
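
The changes to `tax_utils.py` below also let `tax prepare` write gzip-compressed CSV output; as a sketch, assuming a `.gz` suffix on the output filename is enough to trigger compression:
```
sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv.gz -F csv
```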

### `sourmash tax grep` - subset taxonomies and create picklists based on taxonomy string matches

(`sourmash tax grep` is a new command as of sourmash v4.5.0.)

`sourmash tax grep` searches taxonomies for matching strings,
optionally restricting the string search to a specific taxonomic rank.
It creates new files containing matching taxonomic entries; these new
files can serve as taxonomies and can also be used as
[picklists to restrict database matches](#using-picklists-to-subset-large-collections-of-signatures).

Usage:
```
sourmash tax grep <pattern> -t <taxonomy-db> [<taxonomy-db> ...]
```
where `pattern` is a regular expression; see Python's
[Regular Expression HOWTO for details on supported regexp features](https://docs.python.org/3/howto/regex.html#regex-howto).

For example,
```
sourmash tax grep Shew -t gtdb-rs207.taxonomy.sqldb -o shew-picklist.csv
```
will search for a string match to `Shew` within the entire GTDB RS207
taxonomy, and will output a subset taxonomy in `shew-picklist.csv`.
This picklist can be used with the GTDB
RS207 databases like so:
```
sourmash search query.sig gtdb-rs207.genomic.k31.zip \
--picklist shew-picklist.csv:ident:ident
```


`tax grep` can also restrict string matching to a specific taxonomic rank
with `-r/--rank`; for example,
```
sourmash tax grep Shew -t gtdb-rs207.taxonomy.sqldb \
-o shew-picklist.csv -r genus
```
will restrict matches to the rank of genus. Available ranks are
superkingdom, phylum, class, order, family, genus, and species.
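
Because the pattern is a regular expression, alternation can be combined with `-r` to collect several taxa into one picklist; a sketch (the output filename is arbitrary):
```
sourmash tax grep 'Shewanella|Vibrio' -t gtdb-rs207.taxonomy.sqldb \
    -r genus -o two-genera-picklist.csv
```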

`tax grep` also takes several standard grep arguments, including `-i`
to ignore case and `-v` to output only taxonomic lineages that do
_not_ match the pattern.
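
For instance, to build a picklist of every lineage that does _not_ contain `shew`, ignoring case (output filename again arbitrary):
```
sourmash tax grep -i -v shew -t gtdb-rs207.taxonomy.sqldb \
    -o not-shew-picklist.csv
```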

Currently only CSV output (optionally gzipped) is supported; use `sourmash tax prepare` to
convert CSV output from `tax grep` into a sqlite3 taxonomy database.
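
A sketch of that conversion, assuming the sqlite output format is selected with `-F sql` (mirroring the `-F csv` example under `tax prepare` above):
```
sourmash tax prepare --taxonomy shew-picklist.csv -o shew-picklist.sqldb -F sql
```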

## `sourmash lca` subcommands for in-memory taxonomy integration

These commands use LCA databases (created with `lca index`, below, or
2 changes: 2 additions & 0 deletions src/sourmash/cli/tax/__init__.py
@@ -8,6 +8,8 @@
from . import genome
from . import annotate
from . import prepare
from . import grep

from ..utils import command_list
from argparse import SUPPRESS, RawDescriptionHelpFormatter
import os
72 changes: 72 additions & 0 deletions src/sourmash/cli/tax/grep.py
@@ -0,0 +1,72 @@
"""search taxonomies and output picklists.."""

usage="""
sourmash tax grep <term> --taxonomy-csv <taxonomy_file> [ ... ]
`sourmash tax grep` searches taxonomies for matching strings,
optionally restricting the string search to a specific taxonomic rank.
It creates new files containing matching taxonomic entries; these new
files can serve as taxonomies and can also be used as picklists.
Please see the 'tax grep' documentation for more details:
https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-grep-subset-taxonomies-and-create-picklists-based-on-taxonomy-string-matches
"""

import sourmash
from sourmash.logging import notify, print_results, error


def subparser(subparsers):
    subparser = subparsers.add_parser('grep', usage=usage)
    subparser.add_argument('pattern')
    subparser.add_argument('-r', '--rank',
                           help="search only this rank",
                           choices=['superkingdom',
                                    'phylum',
                                    'class',
                                    'order',
                                    'family',
                                    'genus',
                                    'species'])
    subparser.add_argument(
        '-v', '--invert-match',
        help="select non-matching lineages",
        action="store_true"
    )
    subparser.add_argument(
        '-i', '--ignore-case',
        help="ignore case distinctions (search lower and upper case both)",
        action="store_true"
    )
    subparser.add_argument(
        '--silent', '--no-picklist-output',
        help="do not output picklist",
        action='store_true',
    )
    subparser.add_argument(
        '-c', '--count',
        help="only output a count of discovered lineages; implies --silent",
        action='store_true'
    )
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
        nargs="+", required=True,
        help='database lineages'
    )
    subparser.add_argument(
        '-o', '--output', default='-',
        help='output file (defaults to stdout)',
    )
    subparser.add_argument(
        '-f', '--force', action='store_true',
        help='continue past errors in file and taxonomy loading',
    )

def main(args):
    import sourmash
    return sourmash.tax.__main__.grep(args)
54 changes: 54 additions & 0 deletions src/sourmash/tax/__main__.py
@@ -5,6 +5,7 @@
import csv
import os
from collections import defaultdict
import re

import sourmash
from ..sourmash_args import FileOutputCSV, FileOutput
@@ -375,6 +376,59 @@ def prepare(args):
notify("done!")


def grep(args):
term = args.pattern
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
force=args.force)

silent = args.silent or args.count

notify(f"searching {len(args.taxonomy_csv)} taxonomy files for '{term}'")
if args.invert_match:
notify("-v/--invert-match specified; returning only lineages that do not match.")
if args.rank:
notify(f"limiting matches to {args.rank} level")

# build the search pattern
pattern = args.pattern
if args.ignore_case:
pattern = re.compile(pattern, re.IGNORECASE)
else:
pattern = re.compile(pattern)

# determine if lineage matches.
def find_pattern(lineage, select_rank):
for (rank, name) in lineage:
if select_rank is None or rank == select_rank:
if pattern.search(name):
return True
return False

if args.invert_match:
def search_pattern(l, r):
return not find_pattern(l, r)
else:
search_pattern = find_pattern

match_ident = []
for ident, lineage in tax_assign.items():
if search_pattern(lineage, args.rank):
match_ident.append((ident, lineage))

if silent:
notify(f"found {len(match_ident)} matches.")
notify("(no matches will be saved because of --silent/--count")
else:
with FileOutputCSV(args.output) as fp:
w = csv.writer(fp)

w.writerow(['ident'] + list(sourmash.lca.taxlist(include_strain=False)))
for ident, lineage in sorted(match_ident):
w.writerow([ident] + [ x.name for x in lineage ])

notify(f"found {len(match_ident)} matches; saved identifiers to picklist file '{args.output}'")


def main(arglist=None):
args = sourmash.cli.get_parser().parse_args(arglist)
submod = getattr(sourmash.cli.sig, args.subcmd)
17 changes: 15 additions & 2 deletions src/sourmash/tax/tax_utils.py
@@ -5,6 +5,7 @@
import csv
from collections import namedtuple, defaultdict
from collections import abc
import gzip

from sourmash import sqlite_utils
from sourmash.exceptions import IndexNotSupported
@@ -641,7 +642,16 @@ def load(cls, filename, *, delimiter=',', force=False,
        if os.path.isdir(filename):
            raise ValueError(f"'{filename}' is a directory")

        xopen = open
        try:
            with gzip.open(filename, 'r') as fp:
                # successful open/read? use gzip
                fp.read(1)
                xopen = gzip.open
        except gzip.BadGzipFile:
            pass

        with xopen(filename, 'rt', newline='') as fp:
            r = csv.DictReader(fp, delimiter=delimiter)
            header = r.fieldnames
            if not header:
Expand Down Expand Up @@ -931,7 +941,10 @@ def save(self, filename_or_fp, file_format):
        # we need a file handle; open file.
        fp = filename_or_fp
        if is_filename:
            if filename_or_fp.endswith('.gz'):
                fp = gzip.open(filename_or_fp, 'wt', newline="")
            else:
                fp = open(filename_or_fp, 'w', newline="")

        try:
            self._save_csv(fp)
Binary file added tests/test-data/tax/gtdb-tax-grep.sigs.zip