Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: Calculate all gather stats in rust; use for rocksdb gather #2943

Merged
merged 73 commits into from
Feb 21, 2024

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Jan 24, 2024

This PR adds a calculate_gather_stats function (and criterion benchmark) to calculate all gather statistics directly in rust. It makes use of this function in disk_revindex.rs, meaning that gather on a rocksdb database will now produce the full GatherResult.

Note that this PR does not change the python layer to use these stats. This is just a step on the way to that :).

New functions:

  • Sketch::MinHash::sum_abunds() - sum abundances in a MinHash
  • Sketch::MinHash::n_unique_kmers() - len(self) * self.scaled.
  • Sketch::inflate(abunds_from) - inflate abundances in self using abund values from abunds_from MinHash
  • Sketch::inflated_abundances(abunds_from) - same process as inflate, but just return the abundance vector and sum of abundances.
  • ani_utils::ani_from_containment - a streamlined function to get ANI from containment. I suspect a lot of the time added for ANI calculation (ref: testing over in branchwater) is the way we calculate it in python (including calculating a total number of k-mers). This excludes anything not really needed for basic ANI without confidence intervals.
  • ani_utils::ani_from_containment_ci - get high and low CI ANI from containment
  • index::calculate_gather_stats - calculate all gather statistics that can be separated from the gather iteration and return the full GatherResult

The "expensive" calculations:

  • abundance-weighted values require re-calculating intersections and/or manipulating the abundance vector

    Thankfully I think the new inflate/inflated_abundances code is pretty efficient? It uses the itertools merge_join_by strategy from https://github.com/sourmash-bio/sourmash/compare/lirber/itertools_merge. Thanks @luizirber :)

  • abundance median and stddev currently require cloning abunds, not sure how big of an issue that is.
  • ANI with confidence intervals

Punted Issues

ref sourmash-bio/sourmash_plugin_branchwater#187.

@bluegenes bluegenes changed the base branch from latest to core-upds-for-branchwater-plugin February 4, 2024 16:32
Base automatically changed from core-upds-for-branchwater-plugin to latest February 5, 2024 18:57
@bluegenes bluegenes marked this pull request as draft February 6, 2024 02:28
Copy link

codecov bot commented Feb 6, 2024

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (fa4ae0b) 86.71% compared to head (7c0853c) 86.74%.

Files Patch % Lines
src/core/src/ani_utils.rs 82.35% 9 Missing ⚠️
src/core/src/sketch/minhash.rs 87.80% 5 Missing ⚠️
src/core/src/index/mod.rs 85.71% 3 Missing ⚠️
src/core/src/ffi/minhash.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #2943      +/-   ##
==========================================
+ Coverage   86.71%   86.74%   +0.03%     
==========================================
  Files         135      136       +1     
  Lines       15396    15509     +113     
  Branches     2624     2624              
==========================================
+ Hits        13350    13453     +103     
- Misses       1745     1755      +10     
  Partials      301      301              
Flag Coverage Δ
hypothesis-py 25.56% <ø> (ø)
python 92.85% <ø> (ø)
rust 61.06% <87.14%> (+1.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bluegenes
Copy link
Contributor Author

There are a number of tests with missing coverage, is that intentional or a bug in coverage calculations?

Lost track of a few fns that needed testing. Added remaining tests.

There are a few codecov lines that I don't fully understand -- hoping they get updated as covered.

- change signature for remove_many, takes an iterator now
- implement PartialEq for SigStore
@luizirber luizirber self-requested a review February 21, 2024 06:16
Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so cool, thanks Tessa!

@luizirber
Copy link
Member

I piggybacked some commits replacing Signature with SigStore in GatherResults, just to find out that the .unload() method on SigStore is not implemented yet, oops! But now things are in place for it to work =]

I also changed .remove_many() on KmerMinHash to take an Iterator, this way we don't need to allocate a new vec to remove hashes...

The last commit is for some clippy fixes, I suspect CI is not running clippy with all features enabled, and some got away

@bluegenes bluegenes merged commit 297ff0b into latest Feb 21, 2024
41 checks passed
@bluegenes bluegenes deleted the upd-gatherresult branch February 21, 2024 20:46
ctb added a commit that referenced this pull request Feb 23, 2024
)

Updates `src/core/CHANGELOG.md` with release notes; see below.

## Release checklist

from
#2987 (comment):

- [x] verify minimum supported rust version (MSRV) for writing release
notes (see #2988 for an
example); MSRV is checked by CI in `.github/workflows/rust.yml`,
`minimum_rust_version`
- [x] write release notes using `git log --oneline r0.12.0..latest
src/core | cut -d\ -f2- > /tmp/out.txt`
- [x] verify that the ChangeLog is up to date:
https://github.com/sourmash-bio/sourmash/blob/latest/src/core/CHANGELOG.md
- [x] bump version in `src/core/Cargo.toml`

## Release notes for r0.13.0

MSRV: 1.65

Changes/additions:

* Calculate all gather stats in rust; use for rocksdb gather (#2943)
* adjust protein ksize for record/manifest (#3019)
* Allow changing storage location for a collection in RevIndex (#3015)
* make core Manifest booleans python compatible (core) (#3007)

Updates:
* Bump roaring from 0.10.2 to 0.10.3 (#3014)
* Bump histogram from 0.9.0 to 0.9.1 (#3002)
* Bump chrono from 0.4.33 to 0.4.34 (#3000)
* Bump web-sys from 0.3.67 to 0.3.68 (#2998)
* Bump num-iter from 0.1.43 to 0.1.44 (#2997)
* Bump wasm-bindgen-test from 0.3.40 to 0.3.41 (#2996)
bluegenes added a commit to sourmash-bio/sourmash_plugin_branchwater that referenced this pull request Feb 26, 2024
Adds ANI columns to pairwise and multisearch output, building off of @mr-eyes's ANI translations (#188) which got streamlined + added into sourmash core in sourmash-bio/sourmash#2943

Split from #234 to make things more concise/simpler.

## benchmark summary

## 12k ICTV viral genomes, scaled=200

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI     | 21s     | 
| PR | with ANI     | 20s     |
| v0.9.0 | no ANI | 21s |


## 12k ICTV viral genomes, scaled=10

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI     | 14m 0s     | 
| PR | with ANI     | 14m 6s     | 
| main branch (~v0.9.0) | no ANI | 14m 47s |
@ctb ctb mentioned this pull request Mar 21, 2024
ctb added a commit that referenced this pull request Mar 21, 2024
# Release notes for sourmash v4.8.7

Note: This release changes the way `sourmash multigather` names output
files in some situations. Please see
#2722 for details.

Minor new features:

* support proper manifest creation with `--relpath` for `sig check` and
`sig collect` (#3054)
* fix `multigather` output by adding md5sum along with
`-U/--output-add-query-md5sum` (#2722)
* enable loading lineages from annotated gather with match_name instead
of name (#3078)

Bug fixes:

* fix output for `sketch ... --singleton` (#3066)
* fix `calculate_gather_stats` `threshold=0` bug (#3052)

Cleanup and documentation updates:

* adjust protein ksize for record/manifest (#3019)
* Resolve `sourmash gather --help` issue (#3032)
* rework the manifest documentation; do misc cleanup (#3027)
* add branchwater web to docs (#3018)

Developer updates:

* make core Manifest booleans python compatible (core) (#3007)
* safer ksize selection while still accommodating k=k*3 (#3028)
* fix clippy beta issues (#3088)
* tell dependabot to ignore upgrades to `byteorder`, `chrono`,
`once_cell`, and `wasm-bindgen` (#3065)
* update rust changelog for r0.13.0 in preparation for release (#3033)
* Allow changing storage location for a collection in RevIndex (#3015)
* Fix tox and nix configs so all tox tests execute correctly (#2992)
* Calculate all gather stats in rust; use for rocksdb gather (#2943)
* bump screed req to 1.1.3 (#3067)
* bump to v4.8.7-dev (#2989)

Dependabot updates:

* Bump DeterminateSystems/magic-nix-cache-action from 1 to 3 (#2994)
* Bump DeterminateSystems/magic-nix-cache-action from 3 to 4 (#3085)
* Bump DeterminateSystems/nix-installer-action from 4 to 9 (#2995)
* Bump DeterminateSystems/nix-installer-action from 9 to 10 (#3083)
* Bump chrono from 0.4.33 to 0.4.34 (#3000)
* Bump conda-incubator/setup-miniconda from 3.0.1 to 3.0.2 (#3046)
* Bump conda-incubator/setup-miniconda from 3.0.2 to 3.0.3 (#3057)
* Bump histogram from 0.9.0 to 0.9.1 (#3002)
* Bump itertools from 0.12.0 to 0.12.1 (#3043)
* Bump log from 0.4.20 to 0.4.21 (#3062)
* Bump num-iter from 0.1.43 to 0.1.44 (#2997)
* Bump pypa/cibuildwheel from 2.16.5 to 2.17.0 (#3084)
* Bump rayon from 1.8.1 to 1.9.0 (#3058)
* Bump roaring from 0.10.2 to 0.10.3 (#3014)
* Bump serde from 1.0.196 to 1.0.197 (#3045)
* Bump serde_json from 1.0.113 to 1.0.114 (#3044)
* Bump tempfile from 3.10.0 to 3.10.1 (#3059)
* Bump thiserror from 1.0.56 to 1.0.57 (#2999)
* Bump thiserror from 1.0.57 to 1.0.58 (#3082)
* Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060)
* Bump wasm-bindgen-test from 0.3.40 to 0.3.41 (#2996)
* Bump wasm-bindgen-test from 0.3.41 to 0.3.42 (#3063)
* Bump web-sys from 0.3.67 to 0.3.68 (#2998)
* Bump web-sys from 0.3.68 to 0.3.69 (#3061)
* Revert "Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060)" (#3064)
* Update asv to 0.6.2 (#3025)
* Update pytest requirement from <8.1.0,>=6.2.4 to >=6.2.4,<8.2.0
(#3075)
bluegenes pushed a commit that referenced this pull request May 10, 2024
This PR fixes an issue introduced in #2943 where we introduced a subtly
broken calculation that uses the _current_ size of the query metagenome
as the denominator for the `f_unique_to_query` calculation.

Fixes #3137

This PR also adds some commented-out test code that demonstrates
#3139 /
sourmash-bio/sourmash_plugin_branchwater#322.
That's something I haven't been able to debug, so I'd suggest fixing
that independently - I'd rather fix _a_ problem _now_, rather than
waiting until we can fix _multiple_ problems at some later indeterminate
time :).

## Notes

- [x] do we need to fix same problem in `linear.rs`? or just rename
things per #3137?
- [x] we should add some tests for this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants