-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: fix multigather
output by adding md5sum along with -U/--output-add-query-md5sum
#2722
Conversation
Hello! Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in `--query query.sig` were created from one fasta file with e.g. `--singleton`, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Other suggestions are welcome! (not tested yet) Warmest, Olga
…o ctb_update_2065
…o ctb_update_2065
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## latest #2722 +/- ##
==========================================
+ Coverage 86.76% 86.78% +0.01%
==========================================
Files 136 136
Lines 15509 15523 +14
Branches 2624 2626 +2
==========================================
+ Hits 13457 13471 +14
Misses 1751 1751
Partials 301 301
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
multigather
output by adding md5summultigather
output by adding md5sum along with -U/--output-add-query-md5sum
Note: PR into #2065. This PR updates #2065 with all the changes from `latest`. This includes the fix & update to multigather in #2322, which: - fixes the output of multigather to include more than one line 😆 - prints out the output filename - supports `--output-dir` No functional changes are made beyond the merge, so some tests are still failing; will discuss fixes in yet a new PR :). --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com> Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Taylor Reiter <taylorreiter@gmail.com> Co-authored-by: Erik Young <eeyoung@ucdavis.edu> Co-authored-by: David Koslicki <dmkoslicki@gmail.com> Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com> Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu> Co-authored-by: Luiz Irber <contact+github@luizirber.org> Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com> Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com> Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu> Co-authored-by: Jason Stajich <jason.stajich@ucr.edu> Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>
…o fix_multigather_output_merge
for more information, see https://pre-commit.ci
multigather
output by adding md5sum along with -U/--output-add-query-md5sum
multigather
output by adding md5sum along with -U/--output-add-query-md5sum
for more information, see https://pre-commit.ci
OK, I'm proposing this for merge based on a few things -
and finally, @bluegenes is doing different experiments in re output files with It might be that a long-term future for |
remaining_mh = remaining_query.minhash.to_mutable() | ||
remaining_mh += noident_mh.downsample(scaled=remaining_mh.scaled) | ||
remaining_query.minhash = remaining_mh | ||
output_unassigned = output_base + f".unassigned{args.extension}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the open
fixes a bug I introduced in
sourmash/src/sourmash/commands.py
Line 1085 in 2b86b0d
with open(output_unassigned, 'wt') as fp: |
SaveSignaturesToLocation
on an already opened filename. The switch to using args.extension
revealed the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the other changes here are just indentation based.
|
||
with SaveSignaturesToLocation(output_unassigned) as save_sig: | ||
# CTB: note, multigather does not save abundances |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multigather does now save abundances, so the comment was false ;)
@@ -4886,7 +4886,10 @@ def test_multigather_metagenome_output(runtmp): | |||
cmd = cmd.split(" ") | |||
c.run_sourmash(*cmd) | |||
|
|||
output_csv = runtmp.output("-.csv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixa da bug
@@ -4916,15 +4919,17 @@ def test_multigather_metagenome_output_outdir(runtmp): | |||
cmd = cmd.split(" ") | |||
c.run_sourmash(*cmd) | |||
|
|||
output_csv = runtmp.output("savehere/-.csv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixa da bug
testdata_glob = utils.get_test_data("gather/GCF*.sig") | ||
testdata_sigs = glob.glob(testdata_glob) | ||
|
||
utils.get_test_data("gather/combined.sig") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed no-op
@@ -5050,11 +5062,7 @@ def test_multigather_metagenome_query_with_sbt_addl_query(c): | |||
|
|||
assert os.path.exists(c.output("gcf_all.sbt.zip")) | |||
|
|||
another_query = utils.get_test_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed redundancy
|
||
assert ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed behavior
c.run_sourmash(*cmd) | ||
|
||
out = c.last_result.out | ||
print(out) | ||
err = c.last_result.err | ||
print(err) | ||
|
||
assert "conducted gather searches on 13 signatures" in err | ||
assert ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed behavior
@sourmash-bio/devs ready for review & merge! I've tried to clearly annotate the changed behavior and so on in the tests... good luck! |
output_base = query.md5sum() | ||
elif args.output_add_query_md5sum: | ||
# Uniquify the output file if all signatures were made from the same file (e.g. with --singleton) | ||
assert query_filename and query_filename != "-" # first branch | ||
output_base = os.path.basename(query_filename) + "." + query.md5sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens in the case of duplicated md5sums
? I assume this will be much more likely with short sequences that are sketched with --singleton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the output_base
will be flagged as a duplicate - which I think is appropriate, if not ideal ;). People can choose to force allow override, or... do something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than potential issues from duplicated md5sums, I think this looks ok.
duplicated md5sums would imply that the hashes are the same anyway, so really the issue would present as a non 1:1 query:outputfile issue -- all results would be available, just in slightly fewer files than 1 per query.
I'll merge then! Thanks! |
😮💨 |
# Release notes for sourmash v4.8.7 Note: This release changes the way `sourmash multigather` names output files in some situations. Please see #2722 for details. Minor new features: * support proper manifest creation with `--relpath` for `sig check` and `sig collect` (#3054) * fix `multigather` output by adding md5sum along with `-U/--output-add-query-md5sum` (#2722) * enable loading lineages from annotated gather with match_name instead of name (#3078) Bug fixes: * fix output for `sketch ... --singleton` (#3066) * fix `calculate_gather_stats` `threshold=0` bug (#3052) Cleanup and documentation updates: * adjust protein ksize for record/manifest (#3019) * Resolve `sourmash gather --help` issue (#3032) * rework the manifest documentation; do misc cleanup (#3027) * add branchwater web to docs (#3018) Developer updates: * make core Manifest booleans python compatible (core) (#3007) * safer ksize selection while still accommodating k=k*3 (#3028) * fix clippy beta issues (#3088) * tell dependabot to ignore upgrades to `byteorder`, `chrono`, `once_cell`, and `wasm-bindgen` (#3065) * update rust changelog for r0.13.0 in preparation for release (#3033) * Allow changing storage location for a collection in RevIndex (#3015) * Fix tox and nix configs so all tox tests execute correctly (#2992) * Calculate all gather stats in rust; use for rocksdb gather (#2943) * bump screed req to 1.1.3 (#3067) * bump to v4.8.7-dev (#2989) Dependabot updates: * Bump DeterminateSystems/magic-nix-cache-action from 1 to 3 (#2994) * Bump DeterminateSystems/magic-nix-cache-action from 3 to 4 (#3085) * Bump DeterminateSystems/nix-installer-action from 4 to 9 (#2995) * Bump DeterminateSystems/nix-installer-action from 9 to 10 (#3083) * Bump chrono from 0.4.33 to 0.4.34 (#3000) * Bump conda-incubator/setup-miniconda from 3.0.1 to 3.0.2 (#3046) * Bump conda-incubator/setup-miniconda from 3.0.2 to 3.0.3 (#3057) * Bump histogram from 0.9.0 to 0.9.1 (#3002) * Bump itertools from 0.12.0 to 0.12.1 (#3043) * Bump log from 0.4.20 to 0.4.21 (#3062) * Bump num-iter from 0.1.43 to 0.1.44 (#2997) * Bump pypa/cibuildwheel from 2.16.5 to 2.17.0 (#3084) * Bump rayon from 1.8.1 to 1.9.0 (#3058) * Bump roaring from 0.10.2 to 0.10.3 (#3014) * Bump serde from 1.0.196 to 1.0.197 (#3045) * Bump serde_json from 1.0.113 to 1.0.114 (#3044) * Bump tempfile from 3.10.0 to 3.10.1 (#3059) * Bump thiserror from 1.0.56 to 1.0.57 (#2999) * Bump thiserror from 1.0.57 to 1.0.58 (#3082) * Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060) * Bump wasm-bindgen-test from 0.3.40 to 0.3.41 (#2996) * Bump wasm-bindgen-test from 0.3.41 to 0.3.42 (#3063) * Bump web-sys from 0.3.67 to 0.3.68 (#2998) * Bump web-sys from 0.3.68 to 0.3.69 (#3061) * Revert "Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060)" (#3064) * Update asv to 0.6.2 (#3025) * Update pytest requirement from <8.1.0,>=6.2.4 to >=6.2.4,<8.2.0 (#3075)
This PR:
multigather
to sourmash docs!-U/--output-add-query-md5sum
tosourmash multigather
--force-allow-overwrite-output
tosourmash multigather
query.filename == '-'
as ifquery.filename
is empty, thus replacing it with md5sum-E/--extension
to allow output to files other than.sig
See discussion over in #2328:
multigather
CSV output uses signaturefilename
as basename.To add:
-U
;-E/--extension
--force-allow-overwrite-output
query_filename
being None/empty in-U
branch