-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: Calculate all gather stats in rust; use for rocksdb gather #2943
Conversation
…h-bio/sourmash into core-upds-for-branchwater-plugin
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## latest #2943 +/- ##
==========================================
+ Coverage 86.71% 86.74% +0.03%
==========================================
Files 135 136 +1
Lines 15396 15509 +113
Branches 2624 2624
==========================================
+ Hits 13350 13453 +103
- Misses 1745 1755 +10
Partials 301 301
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Lost track of a few fns that needed testing. Added remaining tests. There are a few codecov lines that I don't fully understand -- hoping they get updated as covered. |
- change signature for remove_many, takes an iterator now - implement PartialEq for SigStore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so cool, thanks Tessa!
I piggybacked some commits replacing I also changed The last commit is for some clippy fixes, I suspect CI is not running clippy with all features enabled, and some got away |
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
) Updates `src/core/CHANGELOG.md` with release notes; see below. ## Release checklist from #2987 (comment): - [x] verify minimum supported rust version (MSRV) for writing release notes (see #2988 for an example); MSRV is checked by CI in `.github/workflows/rust.yml`, `minimum_rust_version` - [x] write release notes using `git log --oneline r0.12.0..latest src/core | cut -d\ -f2- > /tmp/out.txt` - [x] verify that the ChangeLog is up to date: https://github.com/sourmash-bio/sourmash/blob/latest/src/core/CHANGELOG.md - [x] bump version in `src/core/Cargo.toml` ## Release notes for r0.13.0 MSRV: 1.65 Changes/additions: * Calculate all gather stats in rust; use for rocksdb gather (#2943) * adjust protein ksize for record/manifest (#3019) * Allow changing storage location for a collection in RevIndex (#3015) * make core Manifest booleans python compatible (core) (#3007) Updates: * Bump roaring from 0.10.2 to 0.10.3 (#3014) * Bump histogram from 0.9.0 to 0.9.1 (#3002) * Bump chrono from 0.4.33 to 0.4.34 (#3000) * Bump web-sys from 0.3.67 to 0.3.68 (#2998) * Bump num-iter from 0.1.43 to 0.1.44 (#2997) * Bump wasm-bindgen-test from 0.3.40 to 0.3.41 (#2996)
Adds ANI columns to pairwise and multisearch output, building off of @mr-eyes's ANI translations (#188) which got streamlined + added into sourmash core in sourmash-bio/sourmash#2943 Split from #234 to make things more concise/simpler. ## benchmark summary ## 12k ICTV viral genomes, scaled=200 79.8m comparisons | version | experiment | time | | -------- | -------- | -------- | | PR | no ANI | 21s | | PR | with ANI | 20s | | v0.9.0 | no ANI | 21s | ## 12k ICTV viral genomes, scaled=10 79.8m comparisons | version | experiment | time | | -------- | -------- | -------- | | PR | no ANI | 14m 0s | | PR | with ANI | 14m 6s | | main branch (~v0.9.0) | no ANI | 14m 47s |
# Release notes for sourmash v4.8.7 Note: This release changes the way `sourmash multigather` names output files in some situations. Please see #2722 for details. Minor new features: * support proper manifest creation with `--relpath` for `sig check` and `sig collect` (#3054) * fix `multigather` output by adding md5sum along with `-U/--output-add-query-md5sum` (#2722) * enable loading lineages from annotated gather with match_name instead of name (#3078) Bug fixes: * fix output for `sketch ... --singleton` (#3066) * fix `calculate_gather_stats` `threshold=0` bug (#3052) Cleanup and documentation updates: * adjust protein ksize for record/manifest (#3019) * Resolve `sourmash gather --help` issue (#3032) * rework the manifest documentation; do misc cleanup (#3027) * add branchwater web to docs (#3018) Developer updates: * make core Manifest booleans python compatible (core) (#3007) * safer ksize selection while still accommodating k=k*3 (#3028) * fix clippy beta issues (#3088) * tell dependabot to ignore upgrades to `byteorder`, `chrono`, `once_cell`, and `wasm-bindgen` (#3065) * update rust changelog for r0.13.0 in preparation for release (#3033) * Allow changing storage location for a collection in RevIndex (#3015) * Fix tox and nix configs so all tox tests execute correctly (#2992) * Calculate all gather stats in rust; use for rocksdb gather (#2943) * bump screed req to 1.1.3 (#3067) * bump to v4.8.7-dev (#2989) Dependabot updates: * Bump DeterminateSystems/magic-nix-cache-action from 1 to 3 (#2994) * Bump DeterminateSystems/magic-nix-cache-action from 3 to 4 (#3085) * Bump DeterminateSystems/nix-installer-action from 4 to 9 (#2995) * Bump DeterminateSystems/nix-installer-action from 9 to 10 (#3083) * Bump chrono from 0.4.33 to 0.4.34 (#3000) * Bump conda-incubator/setup-miniconda from 3.0.1 to 3.0.2 (#3046) * Bump conda-incubator/setup-miniconda from 3.0.2 to 3.0.3 (#3057) * Bump histogram from 0.9.0 to 0.9.1 (#3002) * Bump itertools from 0.12.0 to 0.12.1 (#3043) * Bump log from 0.4.20 to 0.4.21 (#3062) * Bump num-iter from 0.1.43 to 0.1.44 (#2997) * Bump pypa/cibuildwheel from 2.16.5 to 2.17.0 (#3084) * Bump rayon from 1.8.1 to 1.9.0 (#3058) * Bump roaring from 0.10.2 to 0.10.3 (#3014) * Bump serde from 1.0.196 to 1.0.197 (#3045) * Bump serde_json from 1.0.113 to 1.0.114 (#3044) * Bump tempfile from 3.10.0 to 3.10.1 (#3059) * Bump thiserror from 1.0.56 to 1.0.57 (#2999) * Bump thiserror from 1.0.57 to 1.0.58 (#3082) * Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060) * Bump wasm-bindgen-test from 0.3.40 to 0.3.41 (#2996) * Bump wasm-bindgen-test from 0.3.41 to 0.3.42 (#3063) * Bump web-sys from 0.3.67 to 0.3.68 (#2998) * Bump web-sys from 0.3.68 to 0.3.69 (#3061) * Revert "Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060)" (#3064) * Update asv to 0.6.2 (#3025) * Update pytest requirement from <8.1.0,>=6.2.4 to >=6.2.4,<8.2.0 (#3075)
This PR fixes an issue introduced in #2943 where we introduced a subtly broken calculation that uses the _current_ size of the query metagenome as the denominator for the `f_unique_to_query` calculation. Fixes #3137 This PR also adds some commented-out test code that demonstrates #3139 / sourmash-bio/sourmash_plugin_branchwater#322. That's something I haven't been able to debug, so I'd suggest fixing that independently - I'd rather fix _a_ problem _now_, rather than waiting until we can fix _multiple_ problems at some later indeterminate time :). ## Notes - [x] do we need to fix same problem in `linear.rs`? or just rename things per #3137? - [x] we should add some tests for this
This PR adds a
calculate_gather_stats
function (and criterion benchmark) to calculate all gather statistics directly in rust. It makes use of this function indisk_revindex.rs
, meaning that gather on a rocksdb database will now produce the fullGatherResult
.Note that this PR does not change the python layer to use these stats. This is just a step on the way to that :).
New functions:
Sketch::MinHash::sum_abunds()
- sum abundances in a MinHashSketch::MinHash::n_unique_kmers()
- len(self) * self.scaled.Sketch::inflate(abunds_from)
- inflate abundances in self using abund values fromabunds_from
MinHashSketch::inflated_abundances(abunds_from)
- same process asinflate
, but just return the abundance vector and sum of abundances.ani_utils::ani_from_containment
- a streamlined function to get ANI from containment. I suspect a lot of the time added for ANI calculation (ref: testing over in branchwater) is the way we calculate it in python (including calculating a total number of k-mers). This excludes anything not really needed for basic ANI without confidence intervals.ani_utils::ani_from_containment_ci
- get high and low CI ANI from containmentindex::calculate_gather_stats
- calculate all gather statistics that can be separated from the gather iteration and return the fullGatherResult
The "expensive" calculations:
median
andstddev
currently require cloning abunds, not sure how big of an issue that is.Punted Issues
linear
gather #3020: calculate all gather stats for all other database types. start withlinear.rs
since it usesGatherResult
struct already.calculate_gather_stats
outside of thegather
functions to enable postprocessing in parallel.sketch::minhash::sum_abunds
sketch::minhash::n_unique_kmers
,sketch::minhash::inflate
sketch::minhash::inflated_abundances
ani_utils::ani_from_containment
ani_utils::ani_from_containment_ci
ani_utils::prob_nothing_in_common
ref sourmash-bio/sourmash_plugin_branchwater#187.