Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: modify annotate for flexibility #2518

Merged
merged 102 commits into from
Apr 5, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
102 commits
Select commit Hold shift + click to select a range
7cc6e3f
fix LineagePair usage?
ctb Feb 7, 2023
95bcf8e
read in taxids if avail and use for kreport
bluegenes Feb 8, 2023
3418594
fix comment
bluegenes Feb 8, 2023
956c158
mod lineage_dict init for taxpath
bluegenes Feb 8, 2023
50619cd
use RankLineageInfo to read and lineages csv
bluegenes Feb 9, 2023
03cf9e3
addl tests
bluegenes Feb 9, 2023
63d1d9d
init cami output
bluegenes Jan 17, 2023
01f8196
Merge branch 'latest' into alt-lindb
bluegenes Feb 9, 2023
71b901a
Merge branch 'alt-lindb' into add-cami-output
bluegenes Feb 9, 2023
8f722af
err if n_positions insufficient for provided lineage_str
bluegenes Jan 17, 2023
4ae1c7c
wording
bluegenes Jan 17, 2023
0db767e
test init fail
bluegenes Jan 17, 2023
a3cf4a1
fix
bluegenes Feb 9, 2023
7386e5b
fix2
bluegenes Feb 9, 2023
6b5f2cd
resolve issues from merge
bluegenes Feb 9, 2023
0f882d7
test for missing taxids; taxpath shorter than provided rank names
bluegenes Feb 9, 2023
842ba39
clarify comment
bluegenes Feb 9, 2023
ce3c991
clarify comment2
bluegenes Feb 9, 2023
87f7e50
undelete line
bluegenes Feb 10, 2023
5580bfc
Merge branch 'alt-lindb' into add-cami-output
bluegenes Feb 10, 2023
de496c4
Merge branch 'alt-lindb' into lins-v2
bluegenes Feb 10, 2023
9b139b6
add filled_pos
bluegenes Feb 10, 2023
d72df57
read LIN into LineageDB
bluegenes Feb 10, 2023
f39aa54
actually add LIN test taxonomy
bluegenes Feb 10, 2023
360cfd9
allow LIN with tax metagenome
bluegenes Feb 10, 2023
6a80bc1
Merge branch 'latest' into lins-v2
bluegenes Feb 13, 2023
5c3e0b5
Merge branch 'latest' into add-cami-output
bluegenes Feb 13, 2023
ee2bb20
actually save conflict resolution
bluegenes Feb 13, 2023
9c72740
add init LINSLineageInfo from tuples (for LineageDB compatibility)
bluegenes Feb 13, 2023
97e52cb
naming
bluegenes Feb 14, 2023
e7efbf7
tmp save
bluegenes Feb 14, 2023
d573cb1
add LINgroup summarization utilities
bluegenes Feb 15, 2023
69ed6a9
add LINgroup summarization
bluegenes Feb 15, 2023
871708b
add lingroup summarization method
bluegenes Feb 15, 2023
49558d9
add fn to read LINgroups file into dict
bluegenes Feb 15, 2023
6f26e0b
fix assigned; add full lg test
bluegenes Feb 15, 2023
a040a4b
test more lg reading failures
bluegenes Feb 15, 2023
7cb5700
test bad cli inputs
bluegenes Feb 15, 2023
10ad4e6
Merge branch 'latest' into lins-v2
bluegenes Feb 16, 2023
6e6a34c
rm print
bluegenes Feb 16, 2023
e0eff6e
Merge branch 'latest' into lins-v2
bluegenes Feb 17, 2023
acfc843
lingroup output as tsv
bluegenes Feb 17, 2023
bb1aea3
rm remaining lca_utils usage in tax
bluegenes Feb 17, 2023
7dba708
rm remaining lca_utils usage in tax main
bluegenes Feb 18, 2023
1bb6990
rm print st
bluegenes Feb 18, 2023
4558644
enable LIN for summarize to rm lca utilities
bluegenes Feb 18, 2023
6fe08f1
LIN pos for human summary; test LIN pos
bluegenes Feb 20, 2023
8fcd26b
allow completely empty LIN initialization
bluegenes Feb 20, 2023
8ebdfe2
add find_lca method to LineageInfo classes
bluegenes Feb 20, 2023
7379b02
Merge branch 'latest' into lins-v2
bluegenes Feb 28, 2023
6a4449c
enable LIN for tax annotate
bluegenes Mar 3, 2023
3402c73
punt tax genome to separate PR
bluegenes Mar 3, 2023
ad367f2
change LINs test filename
bluegenes Mar 3, 2023
2e82b19
clean up
bluegenes Mar 3, 2023
95ef04b
Merge branch 'latest' into lins-v2
bluegenes Mar 3, 2023
caa42f6
add some docs
bluegenes Mar 3, 2023
2dd45b6
MRG: LineageTree class to help with LINGroup ordering (#2496)
bluegenes Mar 6, 2023
d117fca
Merge branch 'latest' into lins-v2
bluegenes Mar 6, 2023
a08b46b
simplify linputs
bluegenes Mar 6, 2023
c57f688
allow --lins or --lin-taxonomy
bluegenes Mar 6, 2023
f21eb7c
add demo as tutorial
bluegenes Mar 7, 2023
ac32ff3
Merge branch 'latest' into lins-v2
bluegenes Mar 7, 2023
68e9afa
add data ref
bluegenes Mar 7, 2023
b89f826
fix typo
bluegenes Mar 7, 2023
edd360a
Merge branch 'latest' into lins-v2
bluegenes Mar 7, 2023
9617bea
fix typo
bluegenes Mar 7, 2023
035e4b2
better content headers
bluegenes Mar 7, 2023
39b6010
add refs for sourmash tax
bluegenes Mar 7, 2023
8f8d9a6
simplify lingroup file colnames and lingroup report name
bluegenes Mar 7, 2023
10e5dda
flex
bluegenes Mar 7, 2023
aac8669
more description
bluegenes Mar 7, 2023
7fcef3c
Merge branch 'latest' into lins-v2
bluegenes Mar 7, 2023
2751ebe
more description for tutorial
bluegenes Mar 7, 2023
e8cb7a0
better lingroup output documentation
bluegenes Mar 7, 2023
907b74c
rank arg tests
bluegenes Mar 7, 2023
6b2d15c
mod annotate for flexibility
bluegenes Mar 8, 2023
4831ffe
add tests
bluegenes Mar 8, 2023
fc84132
Merge branch 'lins-v2' into add-cami-output
bluegenes Mar 8, 2023
eca0f87
add tests; output bioboxes from tax metagenome
bluegenes Mar 8, 2023
d59fdc7
rm blank
bluegenes Mar 8, 2023
78fbbe0
rm comment
bluegenes Mar 8, 2023
512f237
add bioboxes lins test
bluegenes Mar 8, 2023
22ac224
add bioboxes to docs
bluegenes Mar 8, 2023
10c67e5
fix typo
bluegenes Mar 9, 2023
2a28569
one more cli test
bluegenes Mar 9, 2023
aaadb3f
init genome lins
bluegenes Mar 9, 2023
3ba7865
lingroups
bluegenes Mar 9, 2023
6ecef9c
Update doc/command-line.md
bluegenes Mar 9, 2023
fb4d213
clean up
bluegenes Mar 9, 2023
fd60be6
add lins/lingroup to docs
bluegenes Mar 9, 2023
521bc11
less forbidding :)
bluegenes Mar 9, 2023
a04a060
rm irrelevant comment
bluegenes Mar 9, 2023
e606a30
add note about lin positions starting at 0
bluegenes Mar 9, 2023
5804124
Merge branch 'latest' into lins-v2
bluegenes Apr 5, 2023
8a22ee0
Merge branch 'lins-v2' into add-cami-output
bluegenes Apr 5, 2023
76a3782
Merge branch 'lins-v2' into genome-lins
bluegenes Apr 5, 2023
250f40e
Merge branch 'lins-v2' into flex-annotate
bluegenes Apr 5, 2023
ce4b8c4
Merge branch 'latest' into add-cami-output
bluegenes Apr 5, 2023
6e36729
Merge branch 'add-cami-output' into genome-lins
bluegenes Apr 5, 2023
b71336c
Merge branch 'genome-lins' into flex-annotate
bluegenes Apr 5, 2023
4fe3b87
Merge branch 'latest' into flex-annotate
bluegenes Apr 5, 2023
4b6c535
Merge branch 'latest' into flex-annotate
ctb Apr 5, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 63 additions & 30 deletions src/sourmash/tax/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,11 @@
import re

import sourmash
from ..sourmash_args import FileOutputCSV, FileOutput
from ..sourmash_args import FileOutputCSV, FileInputCSV, FileOutput
from sourmash.logging import set_quiet, error, notify, print_results

from . import tax_utils
from .tax_utils import MultiLineageDB, GatherRow, RankLineageInfo, LINLineageInfo
from .tax_utils import MultiLineageDB, RankLineageInfo, LINLineageInfo, AnnotateTaxResult

usage='''
sourmash taxonomy <command> [<args>] - manipulate/work with taxonomy information.
Expand Down Expand Up @@ -302,12 +302,13 @@ def annotate(args):

set_quiet(args.quiet)

# first, load taxonomic_assignments
try:
# first, load taxonomic_assignments
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions,
force=args.force, lins=args.lins)

except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)
Expand All @@ -316,36 +317,68 @@ def annotate(args):
error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
sys.exit(-1)

# get gather_csvs from args
gather_csvs = tax_utils.collect_gather_csvs(args.gather_csv, from_file=args.from_file)
# get csv from args
input_csvs = tax_utils.collect_gather_csvs(args.gather_csv, from_file=args.from_file)

# handle each gather csv separately
for n, g_csv in enumerate(gather_csvs):
query_gather_results = tax_utils.check_and_load_gather_csvs(gather_csvs, tax_assign, force=args.force,
fail_on_missing_taxonomy=args.fail_on_missing_taxonomy,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
lins=args.lins)

if not query_gather_results:
continue

out_base = os.path.basename(g_csv.rsplit('.csv')[0])
this_outfile, limit_float = make_outfile(out_base, "annotate", output_dir=args.output_dir)

header = [field.name for field in fields(GatherRow)]
with FileOutputCSV(this_outfile) as out_fp:
header.append("lineage")
w = csv.DictWriter(out_fp, header, delimiter=',')
w.writeheader()

for gather_res in query_gather_results:
for taxres in gather_res.raw_taxresults:
gr = asdict(taxres.raw)
write_gr = {key: gr[key] for key in gr if key in header}
write_gr['lineage'] = taxres.lineageInfo.display_lineage(truncate_empty=True)
w.writerow(write_gr)
for n, in_csv in enumerate(input_csvs):
bluegenes marked this conversation as resolved.
Show resolved Hide resolved
try:
# Check for a column we can use to find lineage information:
with FileInputCSV(in_csv) as r:
header = r.fieldnames
# check for empty file
if not header:
raise ValueError(f"Cannot read from '{in_csv}'. Is file empty?")

# look for the column to match with taxonomic identifier
id_col = None
col_options = ['name', 'match_name', 'ident', 'accession']
for colname in col_options:
if colname in header:
id_col = colname
break

if not id_col:
raise ValueError(f"Cannot find taxonomic identifier column in '{in_csv}'. Tried: {', '.join(col_options)}")

notify(f"Starting annotation on '{in_csv}'. Using ID column: '{id_col}'")

# make output file for this input
out_base = os.path.basename(in_csv.rsplit('.csv')[0])
this_outfile, _ = make_outfile(out_base, "annotate", output_dir=args.output_dir)

out_header = header + ['lineage']

with FileOutputCSV(this_outfile) as out_fp:
w = csv.DictWriter(out_fp, out_header)
w.writeheader()

n = 0
n_missed = 0
for n, row in enumerate(r):
# find lineage and write annotated row
taxres = AnnotateTaxResult(raw=row, id_col=id_col, lins=args.lins,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions)
taxres.get_match_lineage(tax_assignments=tax_assign, fail_on_missing_taxonomy=args.fail_on_missing_taxonomy)

if taxres.missed_ident: # could not assign taxonomy
n_missed+=1
w.writerow(taxres.row_with_lineages())

rows_annotated = (n+1) - n_missed
if not rows_annotated:
raise ValueError(f"Could not annotate any rows from '{in_csv}'.")
else:
notify(f"Annotated {rows_annotated} of {n+1} total rows from '{in_csv}'.")

except ValueError as exc:
if args.force:
notify(str(exc))
notify('--force is set. Attempting to continue to next file.')
else:
error(f"ERROR: {str(exc)}")
sys.exit(-1)


def prepare(args):
Expand Down
113 changes: 74 additions & 39 deletions src/sourmash/tax/tax_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -1566,8 +1566,80 @@ def __post_init__(self):
def total_weighted_bp(self):
return self.total_weighted_hashes * self.scaled


@dataclass
class BaseTaxResult:
"""
Base class for sourmash taxonomic annotation.
"""
raw: dict # csv row
keep_full_identifiers: bool = False
keep_identifier_versions: bool = False
match_ident: str = field(init=False)
skipped_ident: bool = False
missed_ident: bool = False
match_lineage_attempted: bool = False
lins: bool = False

def get_ident(self, id_col=None):
# split identifiers = split on whitespace
# keep identifiers = don't split .[12] from assembly accessions
"Hack and slash identifiers."
if id_col:
self.match_ident = self.raw[id_col]
else:
self.match_ident = self.raw.name
if not self.keep_full_identifiers:
self.match_ident = self.match_ident.split(' ')[0]
else:
#overrides version bc can't keep full without keeping version
self.keep_identifier_versions = True
if not self.keep_identifier_versions:
self.match_ident = self.match_ident.split('.')[0]


def get_match_lineage(self, tax_assignments, skip_idents=None, fail_on_missing_taxonomy=False):
if skip_idents and self.match_ident in skip_idents:
self.skipped_ident = True
else:
lin = tax_assignments.get(self.match_ident)
if lin:
if self.lins:
self.lineageInfo = LINLineageInfo(lineage = lin)
else:
self.lineageInfo = RankLineageInfo(lineage = lin)
else:
self.missed_ident=True
self.match_lineage_attempted = True
if self.missed_ident and fail_on_missing_taxonomy:
raise ValueError(f"Error: ident '{self.match_ident}' is not in the taxonomy database. Failing, as requested via --fail-on-missing-taxonomy")


@dataclass
class TaxResult:
class AnnotateTaxResult(BaseTaxResult):
"""
Class to enable taxonomic annotation of any sourmash CSV.
"""
id_col: str = 'name'

def __post_init__(self):
if self.id_col not in self.raw.keys():
raise ValueError(f"ID column '{self.id_col}' not found.")
self.get_ident(id_col=self.id_col)
if self.lins:
self.lineageInfo = LINLineageInfo()
else:
self.lineageInfo = RankLineageInfo()

def row_with_lineages(self):
lineage = self.lineageInfo.display_lineage(truncate_empty=True)
rl = {"lineage": lineage}
rl.update(self.raw)
return rl


@dataclass
class TaxResult(BaseTaxResult):
"""
Class to store taxonomic result of a single row from a gather CSV, including accessible
query information (QueryInfo) and matched taxonomic lineage. TaxResult tracks whether
Expand All @@ -1589,19 +1661,11 @@ class TaxResult:
# get match lineage
tax_res.get_match_lineage(taxD=taxonomic_assignments)

Uses RankLineageInfo to store lineage information; this may need to be modified in the future.
Use RankLineageInfo or LINLineageInfo to store lineage information.
"""
raw: GatherRow
keep_full_identifiers: bool = False
keep_identifier_versions: bool = False

query_name: str = field(init=False)
query_info: QueryInfo = field(init=False)
match_ident: str = field(init=False)
skipped_ident: bool = False
missed_ident: bool = False
match_lineage_attempted: bool = False
lins: bool = False

def __post_init__(self):
self.get_ident()
Expand All @@ -1624,35 +1688,6 @@ def __post_init__(self):
else:
self.lineageInfo = RankLineageInfo()

def get_ident(self):
# split identifiers = split on whitespace
# keep identifiers = don't split .[12] from assembly accessions
"Hack and slash identifiers."
self.match_ident = self.raw.name
if not self.keep_full_identifiers:
self.match_ident = self.raw.name.split(' ')[0]
else:
#overrides version bc can't keep full without keeping version
self.keep_identifier_versions = True
if not self.keep_identifier_versions:
self.match_ident = self.match_ident.split('.')[0]


def get_match_lineage(self, tax_assignments, skip_idents=None, fail_on_missing_taxonomy=False):
if skip_idents and self.match_ident in skip_idents:
self.skipped_ident = True
else:
lin = tax_assignments.get(self.match_ident)
if lin:
if self.lins:
self.lineageInfo = LINLineageInfo(lineage = lin)
else:
self.lineageInfo = RankLineageInfo(lineage = lin)
else:
self.missed_ident=True
self.match_lineage_attempted = True
if self.missed_ident and fail_on_missing_taxonomy:
raise ValueError(f"Error: ident '{self.match_ident}' is not in the taxonomy database. Failing, as requested via --fail-on-missing-taxonomy")

@dataclass
class SummarizedGatherResult:
Expand Down
86 changes: 81 additions & 5 deletions tests/test_tax.py
Original file line number Diff line number Diff line change
Expand Up @@ -2458,28 +2458,104 @@ def test_annotate_empty_gather_results(runtmp):
with pytest.raises(SourmashCommandFailed) as exc:
runtmp.run_sourmash('tax', 'annotate', '-g', g_csv, '--taxonomy-csv', tax)

assert f"Cannot read gather results from '{g_csv}'. Is file empty?" in str(exc.value)
assert f"Cannot read from '{g_csv}'. Is file empty?" in str(exc.value)
assert runtmp.last_result.status == -1


def test_annotate_bad_gather_header(runtmp):
def test_annotate_prefetch_or_other_header(runtmp):
tax = utils.get_test_data('tax/test.taxonomy.csv')
g_csv = utils.get_test_data('tax/test1.gather.csv')

alt_csv = runtmp.output('g.csv')
for alt_col in ['match_name', 'ident', 'accession']:
#modify 'name' to other acceptable id_columns result
alt_g = [x.replace("name", alt_col) for x in open(g_csv, 'r')]
with open(alt_csv, 'w') as fp:
for line in alt_g:
fp.write(line)

runtmp.run_sourmash('tax', 'annotate', '-g', alt_csv, '--taxonomy-csv', tax)

assert runtmp.last_result.status == 0
print(runtmp.last_result.out)
print(runtmp.last_result.err)
assert f"Starting annotation on '{alt_csv}'. Using ID column: '{alt_col}'" in runtmp.last_result.err
assert f"Annotated 4 of 4 total rows from '{alt_csv}'" in runtmp.last_result.err


def test_annotate_bad_header(runtmp):
tax = utils.get_test_data('tax/test.taxonomy.csv')
g_csv = utils.get_test_data('tax/test1.gather.csv')

bad_g_csv = runtmp.output('g.csv')

#creates bad gather result
bad_g = [x.replace("query_name", "nope") for x in open(g_csv, 'r')]
bad_g = [x.replace("name", "nope") for x in open(g_csv, 'r')]
with open(bad_g_csv, 'w') as fp:
for line in bad_g:
fp.write(line)
print("bad_gather_results: \n", bad_g)
# print("bad_gather_results: \n", bad_g)

with pytest.raises(SourmashCommandFailed) as exc:
runtmp.run_sourmash('tax', 'annotate', '-g', bad_g_csv, '--taxonomy-csv', tax)

assert 'is missing columns needed for taxonomic summarization.' in str(exc.value)
assert f"ERROR: Cannot find taxonomic identifier column in '{bad_g_csv}'. Tried: name, match_name, ident, accession" in str(exc.value)
assert runtmp.last_result.status == -1
print(runtmp.last_result.out)
print(runtmp.last_result.err)


def test_annotate_no_tax_matches(runtmp):
tax = utils.get_test_data('tax/test.taxonomy.csv')
g_csv = utils.get_test_data('tax/test1.gather.csv')

bad_g_csv = runtmp.output('g.csv')

#mess up tax idents
bad_g = [x.replace("GCF_", "GGG_") for x in open(g_csv, 'r')]
with open(bad_g_csv, 'w') as fp:
for line in bad_g:
fp.write(line)
# print("bad_gather_results: \n", bad_g)

with pytest.raises(SourmashCommandFailed) as exc:
runtmp.run_sourmash('tax', 'annotate', '-g', bad_g_csv, '--taxonomy-csv', tax)

assert f"ERROR: Could not annotate any rows from '{bad_g_csv}'" in str(exc.value)
assert runtmp.last_result.status == -1
print(runtmp.last_result.out)
print(runtmp.last_result.err)

runtmp.run_sourmash('tax', 'annotate', '-g', bad_g_csv, '--taxonomy-csv', tax, '--force')

assert runtmp.last_result.status == 0
assert f"Could not annotate any rows from '{bad_g_csv}'" in runtmp.last_result.err
assert f"--force is set. Attempting to continue to next file." in runtmp.last_result.err
print(runtmp.last_result.out)
print(runtmp.last_result.err)


def test_annotate_missed_tax_matches(runtmp):
tax = utils.get_test_data('tax/test.taxonomy.csv')
g_csv = utils.get_test_data('tax/test1.gather.csv')

bad_g_csv = runtmp.output('g.csv')

with open(g_csv, 'r') as gather_lines, open(bad_g_csv, 'w') as fp:
for n, line in enumerate(gather_lines):
if n > 2:
# mess up tax idents of lines 3, 4
line = line.replace("GCF_", "GGG_")
fp.write(line)
# print("bad_gather_results: \n", bad_g)

runtmp.run_sourmash('tax', 'annotate', '-g', bad_g_csv, '--taxonomy-csv', tax)

print(runtmp.last_result.out)
print(runtmp.last_result.err)

assert runtmp.last_result.status == 0
assert f"Annotated 2 of 4 total rows from '{bad_g_csv}'." in runtmp.last_result.err


def test_annotate_empty_tax_lineage_input(runtmp):
Expand Down
Loading