MRG: fix RocksDB-based gather & other rust-based infelicities revealed by plugins #3193

Merged: ctb merged 14 commits into latest from debug_rust_gather on Jun 9, 2024

ctb
Contributor

@ctb ctb commented Jun 8, 2024

This PR fixes a bug in disk_revindex.rs::RevIndexOps::gather where RocksDB-based gather was
incorrectly subtracting hashes multiple times from the counter in situations of high redundancy.

For example, consider this Venn diagram of the 3-way intersection between three sketches:

[image: Venn diagram of the three sketches; the central three-way intersection contains 10 hashes]

When a metagenome contains the union of all three of these sketches, the broken implementation would subtract the 10 at the center multiple times. This was caused by removing hashes from the matches, rather than the intersection, each pass through the counter.
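
To make the failure mode concrete, here is a toy Python sketch of a counter-driven greedy gather. It is not the Rust code in disk_revindex.rs: the sketch names, their sizes, and the gather_counts helper are made up for illustration. Only the choice between subtracting the full match and subtracting the intersection with the remaining query mirrors the fix described above, and the `> 0` guard imitates a clamp-at-zero decrement.

```python
from collections import Counter


def gather_counts(query, sketches, subtract_full_match):
    """Greedy gather driven by a per-sketch counter of shared hashes."""
    # Reverse index: hash -> names of the sketches that contain it.
    revindex = {}
    for name, hashes in sketches.items():
        for h in hashes:
            revindex.setdefault(h, set()).add(name)

    # Counter: how many query hashes each sketch contains.
    counter = Counter()
    for h in query:
        for name in revindex.get(h, ()):
            counter[name] += 1

    remaining = set(query)
    results = []
    while counter:
        name, count = counter.most_common(1)[0]
        if count == 0:
            break  # nothing left that explains any remaining hashes
        results.append((name, count))
        match = sketches[name]
        isect = match & remaining
        # Buggy variant: walk every hash in the match.
        # Fixed variant: walk only hashes still present in the query.
        to_subtract = match if subtract_full_match else isect
        for h in to_subtract:
            for other in revindex[h]:
                if counter[other] > 0:  # clamp at zero
                    counter[other] -= 1
        del counter[name]
        remaining -= match
    return results


# Hypothetical sketches: a 10-hash core shared by all three, plus unique hashes.
core = set(range(10))
sketches = {
    "A": core | set(range(10, 40)),
    "B": core | set(range(40, 60)),
    "C": core | set(range(60, 70)),
}
query = set().union(*sketches.values())

fixed = gather_counts(query, sketches, subtract_full_match=False)
buggy = gather_counts(query, sketches, subtract_full_match=True)
print("fixed:", fixed)
print("buggy:", buggy)
```

In this toy setup the intersection-based version reports all three matches (40, 20, and 10 hashes), while the full-match version subtracts the shared core a second time, drives the third sketch's count to zero, and drops it.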

Of note, this made RocksDB-based fastmultigather return incorrect results, ref sourmash-bio/sourmash_plugin_branchwater#322; first discovered in #3138 (comment).

The PR fixes this, and adds a more complete pair of tests, based on test_gather_metagenome_num_results in the Python tests for sourmash.

This PR also adjusts the hash function string for DNA sketches in Rust to be uppercase DNA rather than lowercase dna, ref sourmash-bio/sourmash_plugin_directsketch#49

And remember, it's not just the destination - it's the friends you make along the way, like env_logger.

For consideration:

Right now we are calculating the intersection twice, once in disk_revindex.rs and once in calculate_gather_stats in revindex/mod.rs. This is unnecessary, of course. But the function signature for calculate_gather_stats would need to change to take the intersection as an argument. We could:

  • keep calculating it twice, just for simplicity;
  • change calculate_gather_stats to take an optional intersection, and calculate it if not provided;
  • change calculate_gather_stats to require an intersection.
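
The second option can be sketched in Python as follows. This is only an illustration of the calling convention: gather_stats is a hypothetical stand-in, not the real calculate_gather_stats in revindex/mod.rs, and the returned fields are limited to the intersection size and f_match as defined in the thread (intersection size divided by match size).

```python
from typing import Optional


def gather_stats(query: set, match: set, isect: Optional[set] = None) -> dict:
    """Compute simple gather stats, reusing a precomputed intersection if given."""
    if isect is None:
        # Caller did not supply the intersection, so compute it here.
        isect = query & match
    return {
        "intersect_hashes": len(isect),
        "f_match": len(isect) / len(match),
    }


query = {1, 2, 3}
match = {2, 3, 4, 5}

# Caller lets the function compute the intersection.
computed = gather_stats(query, match)

# Caller already has the intersection and passes it in to avoid recomputing.
reused = gather_stats(query, match, isect=query & match)
print(computed, reused)
```

The optional form keeps existing callers working unchanged, at the cost of trusting that any intersection passed in was computed against the current (reduced) query.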

TODO:

  • add at least one explicit test for the moltype fix

Other notes:

  • there is a discrepancy between the Python (sourmash gather) and Rust (sourmash_plugin_branchwater) calculations for remaining_bp. It seems to me like the Python one is definitely wrong; not yet sure about Rust. See "gather is calculating remaining_bp incorrectly", #3194.


codecov bot commented Jun 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.69%. Comparing base (a133e68) to head (231a134).
Report is 78 commits behind head on latest.

Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3193      +/-   ##
==========================================
+ Coverage   86.67%   86.69%   +0.01%     
==========================================
  Files         136      136              
  Lines       15832    15840       +8     
  Branches     2716     2716              
==========================================
+ Hits        13722    13732      +10     
+ Misses       1800     1798       -2     
  Partials      310      310              
Flag Coverage Δ
hypothesis-py 25.35% <ø> (ø)
python 92.33% <ø> (ø)
rust 62.26% <100.00%> (+0.16%) ⬆️


@ctb
Contributor Author

ctb commented Jun 9, 2024

The situation so far:

from sourmash-bio/sourmash_plugin_branchwater#322 (comment),

in #3138 (comment), commit 10d5ee8, I add an analog to the Python test test_gather_metagenome. This demonstrates what is hopefully 😭 the same problem - we're getting only 6 matches in the Rust code (instead of 11), and the 6th match starts to diverge from the values we see in the Python implementation.

I've narrowed this down further to running gather on combined.sig against three sketches, the ones for NC_000853.1, NC_011978.1, and NC_009486.1:

signature: NC_009486.1 Thermotoga petrophila RKU-1, complete genome
signature: NC_011978.1 Thermotoga neapolitana DSM 4359, complete genome
signature: NC_000853.1 Thermotoga maritima MSB8 chromosome, complete genome

With fastmultigather + RocksDB, the f_match column for these three is wrong:
[image: fastmultigather + RocksDB results table]

vs

the correct (I'm assuming) set of matches from sourmash gather:

[image: sourmash gather results table]

Note the last entry: 0.484 from sourmash gather vs. 0.431 from fastmultigather + RocksDB.

These are the first three matches that have a mutual overlap among all three:

[image: Venn diagram of the overlap among the three matches]

so my spidey sense is tingling about that, but I don't have a clear idea of exactly how it would result in what I'm seeing yet ;).

@ctb
Contributor Author

ctb commented Jun 9, 2024

And I've confirmed with simple Python code that the f_match numbers returned by sourmash gather are correct, at least to my understanding of the gather algorithm 😆

import sourmash


def load_hashes(path, ksize=21):
    """Return the hashes of the first signature in `path` at this ksize, as a set."""
    sig = list(sourmash.load_file_as_signatures(path, ksize=ksize))[0]
    return set(sig.minhash.hashes)


combined_set = load_hashes('combined.sig')
match1_set = load_hashes('match1_NC_000853.1.sig')
match2_set = load_hashes('match2_NC_011978.1.sig')
match5_set = load_hashes('match5_NC_009486.1.sig')

# f_match: fraction of the match's hashes found in the (remaining) query.
f_match_1 = len(match1_set.intersection(combined_set)) / len(match1_set)
print('1', f_match_1)

# Remove the first match from the query before scoring the next one.
combined_set -= match1_set

f_match_2 = len(match2_set.intersection(combined_set)) / len(match2_set)
print('2', f_match_2)

# Remove the second match as well.
combined_set -= match2_set

f_match_5 = len(match5_set.intersection(combined_set)) / len(match5_set)
print('5', f_match_5)

yields:

1 1.0
2 0.898936170212766
5 0.4842105263157895

which matches the sourmash gather f_match column. Excellent!

@ctb
Contributor Author

ctb commented Jun 9, 2024

EUREKA!

@ctb
Contributor Author

ctb commented Jun 9, 2024

tl;dr it looks like this line:

is double-subtracting hashes that were already removed previously, because they are in the (complete) match but no longer in the intersection with the reduced query.

If you look at the Venn diagram above, there are 10 hashes that are at the intersection of all three - these 10 hashes are being removed from the counter EACH time through the loop.

@ctb ctb changed the title WIP: debug rust gather & other rust-based plugin infelicities WIP: fix RocksDB-based gather & other rust-based infelicities revealed by plugins Jun 9, 2024
@ctb ctb changed the title WIP: fix RocksDB-based gather & other rust-based infelicities revealed by plugins MRG: fix RocksDB-based gather & other rust-based infelicities revealed by plugins Jun 9, 2024
@ctb
Contributor Author

ctb commented Jun 9, 2024

This is ready for review @luizirber @bluegenes. Note duplicate calculation of intersection; thoughts welcome.

@ctb
Contributor Author

ctb commented Jun 9, 2024

oh, and should we bump the sourmash-rs core version as part of this PR?

Member

@luizirber luizirber left a comment

👏

And +1 on doing a r0.14.0 release

Comment on lines +382 to +384
isect
.0
.iter()
Member

Nice! And with this change there is a block below:

counter.entry(dataset).and_modify(|e| {
    if *e > 0 {
        *e -= 1
    }
});

that can be

counter.entry(dataset).and_modify(|e| { *e -= 1 });

instead, and will panic if e goes below 0.

With this change, the panic triggers on latest in the index::revindex::test::revindex_load_and_gather_2 test, and does not trigger in this PR.
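
The trade-off between the two forms can be sketched in Python. This is an analogy rather than the Rust code: decrement_guarded and decrement_strict are hypothetical helpers, and the explicit OverflowError stands in for the panic Rust raises on unsigned subtraction when overflow checks are enabled (the default in debug and test builds), assuming the counter values are unsigned integers.

```python
def decrement_guarded(counts, dataset):
    """Clamp at zero: an extra decrement is silently ignored."""
    if counts.get(dataset, 0) > 0:
        counts[dataset] -= 1


def decrement_strict(counts, dataset):
    """Fail loudly on an extra decrement instead of hiding it."""
    if dataset not in counts:
        return  # like and_modify: absent entries are left alone
    if counts[dataset] == 0:
        raise OverflowError(f"count for {dataset!r} would drop below zero")
    counts[dataset] -= 1


# Guarded: the second, spurious decrement goes unnoticed.
guarded = {"x": 1}
decrement_guarded(guarded, "x")
decrement_guarded(guarded, "x")

# Strict: the same spurious decrement surfaces as an error.
strict = {"x": 1}
decrement_strict(strict, "x")
try:
    decrement_strict(strict, "x")
    raised = False
except OverflowError:
    raised = True
print(guarded, strict, raised)
```

The guarded form is what let the double subtraction go unnoticed; the strict form turns any future over-subtraction into a test failure.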

Contributor Author

added in 231a134, thanks!

@ctb
Contributor Author

ctb commented Jun 9, 2024

> And +1 on doing a r0.14.0 release

version bumped in 231a134. I'll merge this once tests pass, but wait a day or two to cut an actual release, in case @bluegenes has comments.

@ctb ctb enabled auto-merge (squash) June 9, 2024 19:48
@ctb ctb merged commit 8c6b58c into latest Jun 9, 2024
38 of 40 checks passed
@ctb ctb deleted the debug_rust_gather branch June 9, 2024 20:05
ctb added a commit that referenced this pull request Jun 9, 2024
ctb added a commit that referenced this pull request Jun 10, 2024
#3199)

## [0.14.0] - 2024-06-10

MSRV: 1.65

Changes/additions:

* fix cargo fmt for updated `disk_revindex.rs` code (#3197)
* fix RocksDB-based gather & other rust-based infelicities revealed by
plugins (#3193)
* use correct denominator in f_unique_to_query (#3138)
* fix clippy warnings about max_value (#3146)
* allow get/set record.filename (#3121)

Updates:

* Bump statrs from 0.16.0 to 0.16.1 (#3186)
* Bump serde from 1.0.202 to 1.0.203 (#3175)
* Bump ouroboros from 0.18.3 to 0.18.4 (#3176)
* Bump itertools from 0.12.1 to 0.13.0 (#3166)
* Bump camino from 1.1.6 to 1.1.7 (#3169)
* Bump serde from 1.0.201 to 1.0.202 (#3168)
* Bump serde_json from 1.0.116 to 1.0.117 (#3159)
* Bump serde from 1.0.200 to 1.0.201 (#3160)
* Bump roaring from 0.10.3 to 0.10.4 (#3142)
* Bump histogram from 0.10.0 to 0.10.1 (#3141)
* Bump num-iter from 0.1.44 to 0.1.45 (#3140)
* Bump serde from 1.0.199 to 1.0.200 (#3144)
* Bump serde from 1.0.198 to 1.0.199 (#3130)
* Bump serde_json from 1.0.115 to 1.0.116 (#3124)
* Bump serde from 1.0.197 to 1.0.198 (#3122)
* Bump histogram from 0.9.1 to 0.10.0 (#3109)
* Bump enum_dispatch from 0.3.12 to 0.3.13 (#3102)
* Bump serde_json from 1.0.114 to 1.0.115 (#3101)
* Bump rayon from 1.9.0 to 1.10.0 (#3098)
ctb added a commit that referenced this pull request Jun 11, 2024

Minor new features:

* add `--set-name` to `sig intersect` and `sig subtract` (#3162)
* upgrade `sig overlap` and `sig subtract` to load more than JSON
signatures (#3153)
* force continue past `tax genome` classification errors (#3100)

Bug fixes:

* fix `remaining_bp` output from sourmash gather (#3195)
* fix RocksDB-based gather & other rust-based infelicities revealed by
plugins (#3193, #3197)
* use correct denominator in f_unique_to_query (#3138)

Cleanup and documentation updates:

* update JOSS for sourmash v4 (#3114, #3203, #3209)
* fix links to taxonomy spreadsheets (#3119)
* fix description of `f_unique_weighted` (#3164)

Developer updates:

* transition internal signature loading functions (#3161)
* allow get/set record.filename (#3121)
* round a number that is losing precision in 15th place in
`test_distance_utpy` (#3126)
* disable ppc64le wheel building (#3127)
* prepare to remove `sourmash compute` for sourmash v5.0 (#3103)
* add rustup target x86_64-apple-darwin (#3148)
* mv `.cargo/config` to `config.toml` (#3147)
* fix clippy warnings about max_value (#3146)
* bump to v4.8.9-dev (#3135)
* update src/core/CHANGELOG.md for sourmash-rs core release r0.14.0
(#3199)

Dependabot updates:

* Bump DeterminateSystems/nix-installer-action from 11 to 12 (#3184)
* Bump DeterminateSystems/magic-nix-cache-action from 6 to 7 (#3185)
* Bump statrs from 0.16.0 to 0.16.1 (#3186)
* Bump serde from 1.0.202 to 1.0.203 (#3175)
* Bump ouroboros from 0.18.3 to 0.18.4 (#3176)
* Bump itertools from 0.12.1 to 0.13.0 (#3166)
* Bump camino from 1.1.6 to 1.1.7 (#3169)
* Bump serde from 1.0.201 to 1.0.202 (#3168)
* Bump thiserror from 1.0.60 to 1.0.61 (#3167)
* Bump pypa/cibuildwheel from 2.18.0 to 2.18.1 (#3165)
* Bump DeterminateSystems/magic-nix-cache-action from 4 to 6 (#3157)
* Bump DeterminateSystems/nix-installer-action from 10 to 11 (#3156)
* Bump pypa/cibuildwheel from 2.17.0 to 2.18.0 (#3155)
* Bump serde_json from 1.0.116 to 1.0.117 (#3159)
* Bump thiserror from 1.0.59 to 1.0.60 (#3158)
* Bump serde from 1.0.200 to 1.0.201 (#3160)
* Bump roaring from 0.10.3 to 0.10.4 (#3142)
* Bump histogram from 0.10.0 to 0.10.1 (#3141)
* Bump getrandom from 0.2.14 to 0.2.15 (#3143)
* Bump num-iter from 0.1.44 to 0.1.45 (#3140)
* Bump jinja2 from 3.1.3 to 3.1.4 (#3145)
* Bump serde from 1.0.199 to 1.0.200 (#3144)
* Bump serde from 1.0.198 to 1.0.199 (#3130)
* Bump conda-incubator/setup-miniconda from 3.0.3 to 3.0.4 (#3131)
* Update pytest requirement from <8.2.0,>=6.2.4 to >=6.2.4,<8.3.0
(#3132)
* Bump myst-parser from 2.0.0 to 3.0.1 (#3133)
* Bump thiserror from 1.0.58 to 1.0.59 (#3123)
* Bump serde_json from 1.0.115 to 1.0.116 (#3124)
* Bump serde from 1.0.197 to 1.0.198 (#3122)
* Update docutils requirement from <0.21,>=0.17.1 to >=0.17.1,<0.22
(#3116)

Successfully merging this pull request may close these issues.

  • DNA moltype output in manifest is dna not DNA
  • RocksDB-based RevIndex gather functionality is broken.