MRG: fix multigather output by adding md5sum along with -U/--output-add-query-md5sum #2722

Merged
merged 28 commits into from
Feb 29, 2024
db08838
Uniquify csv output from multigather
olgabot May 28, 2022
3e4a2d3
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 19, 2023
db99f6d
fix merge mistake
ctb Aug 19, 2023
4f73ab7
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 19, 2023
29b7acf
update 2065
ctb Aug 19, 2023
096e116
deal with overwriting in tests
ctb Aug 19, 2023
8a54cc5
add tests for detecting overwrite
ctb Aug 19, 2023
5fcd12a
fix filename == '-' issue
ctb Aug 19, 2023
323d2c3
MRG: update #2065 (uniquify CSV output from multigather) (#2721)
ctb Aug 21, 2023
4e148a0
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Aug 21, 2023
6a8bf5f
Merge branch 'olgabot-patch-2' of https://github.com/sourmash-bio/sou…
ctb Aug 21, 2023
2c6b194
add some docs
ctb Aug 21, 2023
d6dd054
add new option + tests + docs
ctb Aug 22, 2023
0b7b7f6
cleanup on aisle 10
ctb Aug 23, 2023
53e7241
more cleanup of tests
ctb Aug 23, 2023
3d0467e
test -E/--extension
ctb Aug 23, 2023
5905e18
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Oct 19, 2023
eba4d2a
typo
ctb Oct 19, 2023
556bf3f
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Feb 27, 2024
7296cb7
fix multigather argparse
ctb Feb 28, 2024
e8bb12f
cleanup from merge
ctb Feb 28, 2024
3589780
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 28, 2024
7878886
text fix/cleanup attempt 1
ctb Feb 28, 2024
8b3f708
fix multigather test
ctb Feb 29, 2024
d436bca
fix up documentation a bit
ctb Feb 29, 2024
4d3059e
fix remaining @CTB
ctb Feb 29, 2024
839f4ef
add tests for -U
ctb Feb 29, 2024
1265d82
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 29, 2024
45 changes: 45 additions & 0 deletions doc/command-line.md
Expand Up @@ -347,6 +347,10 @@ metagenome and genome bin analysis. (See
[Classifying Signatures](classifying-signatures.md) for more
information on the different approaches that can be used here.)

`sourmash gather` takes exactly one query and one or more
[collections of signatures](#storing-and-searching-signatures). Please see
[`sourmash multigather`](#sourmash-multigather-do-gather-with-many-queries) if you have multiple queries!

If the input signature was created with `-p abund`, output
will be abundance weighted (unless `--ignore-abundances` is
specified). `-o/--output` will create a CSV file containing the
Expand Down Expand Up @@ -534,6 +538,47 @@ This combination of commands ensures that the more time- and
memory-intensive `gather` step is run only on a small set of relevant
signatures, rather than all the signatures in the database.

### `sourmash multigather` - do gather with many queries

The `multigather` subcommand runs `sourmash gather` on multiple
queries. (See
[`sourmash gather` docs](#sourmash-gather-find-metagenome-members) for
specifics on what gather does, and how!)

Usage:
```
sourmash multigather --query <queries ...> --db <collections>
```

Note that multigather is single threaded, so it offers no substantial
efficiency gains over just running gather multiple times! Nonetheless, it
is useful for situations where you have many sketches organized in a
combined file (e.g. sketches built with `sourmash sketch
... --singleton`).

#### `multigather` output files

multigather produces three output files for each
query:

* `<output_base>.csv` - gather CSV output
* `<output_base>.matches.sig` - all matching signatures
* `<output_base>.unassigned.sig` - all remaining unassigned hashes

As of sourmash v4.8.7, `<output_base>` is set as follows:
* the filename attribute of the query sketch, if it is not empty or `-`;
* the query sketch md5sum, if the query filename is empty or `-`;
* the query filename + the query sketch md5sum
  (`<query_file>.<md5sum>`), if `-U/--output-add-query-md5sum` is
  specified and the query filename is not empty or `-`.
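
The selection rules above can be sketched as a small standalone function. This is an illustration only, not sourmash's own code: `choose_output_base` and its parameters are hypothetical names, and the query filename and md5sum are passed in as plain strings instead of being read from a sketch. The sketch assumes that only the base name of the query filename is used and that any output directory is prepended afterwards.

```python
import os


def choose_output_base(query_filename, query_md5, add_md5sum=False, output_dir=None):
    """Pick the output prefix for one query (hypothetical helper)."""
    if not query_filename or query_filename == "-":
        # No usable filename recorded in the sketch: fall back to the md5sum.
        output_base = query_md5
    elif add_md5sum:
        # Corresponds to -U/--output-add-query-md5sum: <query_file>.<md5sum>
        output_base = os.path.basename(query_filename) + "." + query_md5
    else:
        output_base = os.path.basename(query_filename)

    if output_dir:
        # An output directory, if given, is prepended last.
        output_base = os.path.join(output_dir, output_base)
    return output_base
```

Under these assumptions, a query whose filename is empty or `-` gets the md5sum alone even when the md5sum option is enabled, because the first branch takes precedence.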

By default, `multigather` will complain and exit with an error if
the same `<output_base>` is used repeatedly and an output file is
going to be overwritten. With `-U/--output-add-query-md5sum` this
should only happen when identical sketches are present in a query
database. Use `--force-allow-overwrite-output`
to allow overwriting of output files without an error.
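
The overwrite check described here amounts to remembering every `<output_base>` already used. A minimal sketch, assuming a plain Python set for tracking: `register_output_base` is a hypothetical helper, and it raises an exception where the real command prints an error and exits.

```python
def register_output_base(output_base, seen, allow_overwrite=False):
    """Record output_base; raise if it was already used (hypothetical helper)."""
    if output_base in seen and not allow_overwrite:
        raise ValueError(f"'{output_base}' has already been used")
    seen.add(output_base)


seen = set()
register_output_base("reads.fa", seen)
try:
    # A second query with the same filename would overwrite the first outputs.
    register_output_base("reads.fa", seen)
except ValueError as exc:
    print(f"ERROR: {exc}")

# With the md5sum appended, different sketches from the same file no longer collide.
register_output_base("reads.fa.aaa111", seen)
register_output_base("reads.fa.bbb222", seen)
```

Two identical sketches still share an md5sum, so they would still collide; that is the remaining case the paragraph above refers to.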

## `sourmash tax` subcommands for integrating taxonomic information into gather results

The `sourmash tax` subcommands support taxonomic analysis of genomes
Expand Down
18 changes: 18 additions & 0 deletions src/sourmash/cli/multigather.py
Expand Up @@ -79,6 +79,11 @@ def subparser(subparsers):
action="store_true",
help="stop at databases that contain no compatible signatures",
)
subparser.add_argument(
"--force-allow-overwrite-output",
action="store_true",
help="allow output files to be overwritten",
)
subparser.add_argument(
"--no-fail-on-empty-database",
action="store_false",
Expand All @@ -92,6 +97,19 @@ def subparser(subparsers):
"--outdir",
help="output CSV results to this directory",
)
subparser.add_argument(
"-U",
"--output-add-query-md5sum",
action="store_true",
help="add md5sum of each query to ensure unique output file names",
)
subparser.add_argument(
"-E",
"--extension",
type=str,
default=".sig",
help="write signature files with this extension ('.sig' by default)",
)

add_ksize_arg(subparser)
add_moltype_args(subparser)
Expand Down
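
The three options added in this file can be tried in isolation with a throwaway parser. This is a sketch using only the standard `argparse` module: the parser name `multigather-sketch` is made up, and the rest of the real subparser setup is omitted.

```python
import argparse

# Standalone parser carrying only the options added in this diff.
parser = argparse.ArgumentParser(prog="multigather-sketch")
parser.add_argument(
    "--force-allow-overwrite-output",
    action="store_true",
    help="allow output files to be overwritten",
)
parser.add_argument(
    "-U",
    "--output-add-query-md5sum",
    action="store_true",
    help="add md5sum of each query to ensure unique output file names",
)
parser.add_argument(
    "-E",
    "--extension",
    type=str,
    default=".sig",
    help="write signature files with this extension ('.sig' by default)",
)

# argparse turns dashes in long option names into underscores on the namespace.
args = parser.parse_args(["-U", "-E", ".sig.gz"])
defaults = parser.parse_args([])
```

The underscore names on the parsed namespace are what `commands.py` reads, e.g. `args.output_add_query_md5sum` and `args.force_allow_overwrite_output`.
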
71 changes: 47 additions & 24 deletions src/sourmash/commands.py
Expand Up @@ -1162,6 +1162,7 @@
# run gather on all the queries.
n = 0
size_may_be_inaccurate = False
output_base_tracking = set() # make sure we are not reusing 'output_base'
for queryfile in inp_files:
# load the query signature(s) & figure out all the things
for query in sourmash_args.load_file_as_signatures(
Expand Down Expand Up @@ -1228,21 +1229,42 @@
result = None

query_filename = query.filename
if not query_filename:
if not query_filename or query_filename == "-":
# use md5sum if query.filename not properly set
query_filename = query.md5sum()
output_base = query.md5sum()
elif args.output_add_query_md5sum:
# Uniquify the output file if all signatures were made from the same file (e.g. with --singleton)
assert query_filename and query_filename != "-" # first branch
output_base = os.path.basename(query_filename) + "." + query.md5sum()
Comment on lines +1234 to +1238
Contributor

what happens in the case of duplicated md5sums? I assume this will be much more likely with short sequences that are sketched with --singleton

Contributor Author

Then the output_base will be flagged as a duplicate - which I think is appropriate, if not ideal ;). People can choose to force allow override, or... do something else.

else:
output_base = os.path.basename(query_filename)

output_base = os.path.basename(query_filename)
if args.output_dir:
output_base = os.path.join(args.output_dir, output_base)
output_csv = output_base + ".csv"

# track overwrites of output files!
if output_base in output_base_tracking:
error(
f"ERROR: detected overwritten outputs! '{output_base}' has already been used. Failing."
)
if args.force_allow_overwrite_output:
error("continuing because --force-allow-overwrite-output was specified")
else:
error(
"Consider using '-U/--output-add-query-md5sum' to build unique outputs"
)
error("and/or '--force-allow-overwrite-output'")
sys.exit(-1)

output_base_tracking.add(output_base)

output_matches = output_base + ".matches.sig"
save_sig_obj = SaveSignaturesToLocation(output_matches)
save_sig = save_sig_obj.__enter__()
notify(f"saving all matching signatures to '{output_matches}'")

# track matches
# write out basic CSV file
output_csv = output_base + ".csv"
notify(f'saving all CSV matches to "{output_csv}"')
csv_out_obj = FileOutputCSV(output_csv)
csv_outfp = csv_out_obj.__enter__()
Expand Down Expand Up @@ -1330,31 +1352,32 @@
notify("nothing found... skipping.")
continue

output_unassigned = output_base + ".unassigned.sig"
with open(output_unassigned, "w"):
remaining_query = gather_iter.query
if noident_mh:
remaining_mh = remaining_query.minhash.to_mutable()
remaining_mh += noident_mh.downsample(scaled=remaining_mh.scaled)
remaining_query.minhash = remaining_mh
output_unassigned = output_base + f".unassigned{args.extension}"
Contributor Author

Removing the open fixes a bug I introduced in `with open(output_unassigned, 'wt') as fp:`, where I mangled tessa's original code by using SaveSignaturesToLocation on an already opened filename. The switch to using args.extension revealed the bug.

Contributor Author

All the other changes here are just indentation based.

remaining_query = gather_iter.query
if noident_mh:
remaining_mh = remaining_query.minhash.to_mutable()
remaining_mh += noident_mh.downsample(scaled=remaining_mh.scaled)
remaining_query.minhash = remaining_mh

if is_abundance:
abund_query_mh = remaining_query.minhash.inflate(orig_query_mh)
remaining_query.minhash = abund_query_mh
if is_abundance:
abund_query_mh = remaining_query.minhash.inflate(orig_query_mh)
remaining_query.minhash = abund_query_mh

if found == 0:
notify("nothing found - entire query signature unassigned.")
elif not remaining_query:
notify("no unassigned hashes! not saving.")
else:
notify(f'saving unassigned hashes to "{output_unassigned}"')
if found == 0:
notify("nothing found - entire query signature unassigned.")

Codecov (codecov/patch) warning: added line #L1367 of src/sourmash/commands.py was not covered by tests.
elif not remaining_query:
notify("no unassigned hashes! not saving.")

Codecov (codecov/patch) warning: added line #L1369 of src/sourmash/commands.py was not covered by tests.
else:
notify(f'saving unassigned hashes to "{output_unassigned}"')

with SaveSignaturesToLocation(output_unassigned) as save_sig:
save_sig.add(remaining_query)

with SaveSignaturesToLocation(output_unassigned) as save_sig:
# CTB: note, multigather does not save abundances
Contributor Author

multigather does now save abundances, so the comment was false ;)

save_sig.add(remaining_query)
n += 1

# fini, next query!

# done! report at end.
notify(f"\nconducted gather searches on {n} signatures")
if size_may_be_inaccurate:
notify(
Expand Down
5 changes: 5 additions & 0 deletions tests/conftest.py
Expand Up @@ -84,6 +84,11 @@ def sig_save_extension(request):
return request.param


@pytest.fixture(params=["sig", "sig.gz", "zip", ".d/"])
def sig_save_extension_abund(request):
return request.param


# --- BEGIN - Only run tests using a particular fixture --- #
# Cribbed from: http://pythontesting.net/framework/pytest/pytest-run-tests-using-particular-fixture/
def pytest_collection_modifyitems(items, config):
Expand Down