MRG: fix RocksDB-based gather & other rust-based infelicities revealed by plugins #3193
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## latest #3193 +/- ##
==========================================
+ Coverage 86.67% 86.69% +0.01%
==========================================
Files 136 136
Lines 15832 15840 +8
Branches 2716 2716
==========================================
+ Hits 13722 13732 +10
+ Misses 1800 1798 -2
Partials 310 310
The situation so far: from sourmash-bio/sourmash_plugin_branchwater#322 (comment),
I've narrowed this down further to running gather on
These three have vs the correct (I'm assuming) set of matches from note the last entry, .484 vs .431. These are the first three matches that have a mutual overlap among all three: so my spidey sense is tingling about that, but I don't have a clear idea of exactly how it would result in what I'm seeing yet ;). |
And I've confirmed with simple Python code that the

import sourmash

# Load the combined (query) sketch and keep its hashes as a plain set.
combined_sig = list(sourmash.load_file_as_signatures('combined.sig',
                                                     ksize=21))[0]
combined_mh = combined_sig.minhash
combined_set = set(combined_mh.hashes)

# Load the three matches that share a mutual overlap.
match1_sig = list(sourmash.load_file_as_signatures('match1_NC_000853.1.sig',
                                                   ksize=21))[0]
match1_mh = match1_sig.minhash
match1_set = set(match1_mh.hashes)

match2_sig = list(sourmash.load_file_as_signatures('match2_NC_011978.1.sig',
                                                   ksize=21))[0]
match2_mh = match2_sig.minhash
match2_set = set(match2_mh.hashes)

match5_sig = list(sourmash.load_file_as_signatures('match5_NC_009486.1.sig',
                                                   ksize=21))[0]
match5_mh = match5_sig.minhash
match5_set = set(match5_mh.hashes)

# Fraction of each match found in the remaining query, removing each
# match's hashes from the query before scoring the next one.
f_match_1 = len(match1_set.intersection(combined_set)) / len(match1_set)
print('1', f_match_1)
combined_set -= match1_set

f_match_2 = len(match2_set.intersection(combined_set)) / len(match2_set)
print('2', f_match_2)
combined_set -= match2_set

f_match_5 = len(match5_set.intersection(combined_set)) / len(match5_set)
print('5', f_match_5)

yields:
which matches the |
EUREKA! |
tl;dr it looks like this line:
is double-subtracting hashes that were already removed previously, because they are in the (complete) match but no longer in the intersection with the reduced query. If you look at the Venn diagram above, there are 10 hashes that are at the intersection of all three - these 10 hashes are being removed from the counter EACH time through the loop. |
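The double subtraction described above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the sourmash Rust code: `gather_order`, the `subtract_full_match` flag, and the tiny hash sets are all hypothetical, chosen so that a single hash (`0`) sits at the intersection of all three sketches.

```python
from collections import Counter


def gather_order(sketches, query, subtract_full_match):
    """Toy greedy gather over a hash -> datasets reverse index.

    Hypothetical illustration only. Returns [(name, count), ...] in pick
    order, where count is the counter value when the dataset was picked.
    """
    # Reverse index: which datasets contain each hash.
    revindex = {}
    for name, hashes in sketches.items():
        for h in hashes:
            revindex.setdefault(h, set()).add(name)

    # Counter: number of query hashes found in each dataset.
    counter = Counter()
    for h in query:
        for name in revindex.get(h, ()):
            counter[name] += 1

    remaining = set(query)
    results = []
    while counter:
        name, count = counter.most_common(1)[0]
        if count <= 0:
            break
        results.append((name, count))
        del counter[name]

        # Hashes of this match that are still present in the reduced query.
        isect = sketches[name] & remaining
        # Buggy variant: walk the complete match, including hashes that an
        # earlier round already removed from the query.
        to_subtract = sketches[name] if subtract_full_match else isect
        for h in to_subtract:
            for other in revindex[h]:
                if other in counter:
                    counter[other] -= 1
        remaining -= isect
    return results


# Hash 0 is shared by all three toy sketches; everything else is unique.
sketches = {
    "A": {0, 1, 2, 3, 4, 5},
    "B": {0, 6, 7, 8, 9},
    "C": {0, 10, 11, 12},
}
query = set().union(*sketches.values())

print(gather_order(sketches, query, subtract_full_match=False))
# [('A', 6), ('B', 4), ('C', 3)]
print(gather_order(sketches, query, subtract_full_match=True))
# [('A', 6), ('B', 4), ('C', 2)]
```

In the buggy variant the shared hash is charged against `C` twice, once when `A` is removed and again when `B` is removed, so only the third match comes out too low. That is the same shape as the discrepancy seen in the last entry above.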
This is ready for review @luizirber @bluegenes. Note duplicate calculation of intersection; thoughts welcome. |
oh, and should we bump the sourmash-rs core version as part of this PR? |
👏
And +1 on doing a `r0.14.0` release
isect
    .0
    .iter()
Nice! And with this change there is a block below:

counter.entry(dataset).and_modify(|e| {
    if *e > 0 {
        *e -= 1
    }
});

that can be

counter.entry(dataset).and_modify(|e| { *e -= 1 });

instead, and will panic if `e` goes below 0.
With this change the error triggers in `latest` in the `index::revindex::test::revindex_load_and_gather_2` test, and doesn't trigger in this PR
added in 231a134, thanks!
version bumped in 231a134. I'll merge this once tests pass, but wait a day or two to cut an actual release, in case @bluegenes has comments. |
#3199)

## [0.14.0] - 2024-06-10

MSRV: 1.65

Changes/additions:

* fix cargo fmt for updated `disk_revindex.rs` code (#3197)
* fix RocksDB-based gather & other rust-based infelicities revealed by plugins (#3193)
* use correct denominator in f_unique_to_query (#3138)
* fix clippy warnings about max_value (#3146)
* allow get/set record.filename (#3121)

Updates:

* Bump statrs from 0.16.0 to 0.16.1 (#3186)
* Bump serde from 1.0.202 to 1.0.203 (#3175)
* Bump ouroboros from 0.18.3 to 0.18.4 (#3176)
* Bump itertools from 0.12.1 to 0.13.0 (#3166)
* Bump camino from 1.1.6 to 1.1.7 (#3169)
* Bump serde from 1.0.201 to 1.0.202 (#3168)
* Bump serde_json from 1.0.116 to 1.0.117 (#3159)
* Bump serde from 1.0.200 to 1.0.201 (#3160)
* Bump roaring from 0.10.3 to 0.10.4 (#3142)
* Bump histogram from 0.10.0 to 0.10.1 (#3141)
* Bump num-iter from 0.1.44 to 0.1.45 (#3140)
* Bump serde from 1.0.199 to 1.0.200 (#3144)
* Bump serde from 1.0.198 to 1.0.199 (#3130)
* Bump serde_json from 1.0.115 to 1.0.116 (#3124)
* Bump serde from 1.0.197 to 1.0.198 (#3122)
* Bump histogram from 0.9.1 to 0.10.0 (#3109)
* Bump enum_dispatch from 0.3.12 to 0.3.13 (#3102)
* Bump serde_json from 1.0.114 to 1.0.115 (#3101)
* Bump rayon from 1.9.0 to 1.10.0 (#3098)
Minor new features:

* add `--set-name` to `sig intersect` and `sig subtract` (#3162)
* upgrade `sig overlap` and `sig subtract` to load more than JSON signatures (#3153)
* force continue past `tax genome` classification errors (#3100)

Bug fixes:

* fix `remaining_bp` output from sourmash gather (#3195)
* fix RocksDB-based gather & other rust-based infelicities revealed by plugins (#3193, #3197)
* use correct denominator in f_unique_to_query (#3138)

Cleanup and documentation updates:

* update JOSS for sourmash v4 (#3114, #3203, #3209)
* fix links to taxonomy spreadsheets (#3119)
* fix description of `f_unique_weighted` (#3164)

Developer updates:

* transition internal signature loading functions (#3161)
* allow get/set record.filename (#3121)
* round a number that is losing precision in 15th place in `test_distance_utpy` (#3126)
* disable ppc64le wheel building (#3127)
* prepare to remove `sourmash compute` for sourmash v5.0 (#3103)
* add rustup target x86_64-apple-darwin (#3148)
* mv `.cargo/config` to `config.toml` (#3147)
* fix clippy warnings about max_value (#3146)
* bump to v4.8.9-dev (#3135)
* update src/core/CHANGELOG.md for sourmash-rs core release r0.14.0 (#3199)

Dependabot updates:

* Bump DeterminateSystems/nix-installer-action from 11 to 12 (#3184)
* Bump DeterminateSystems/magic-nix-cache-action from 6 to 7 (#3185)
* Bump statrs from 0.16.0 to 0.16.1 (#3186)
* Bump serde from 1.0.202 to 1.0.203 (#3175)
* Bump ouroboros from 0.18.3 to 0.18.4 (#3176)
* Bump itertools from 0.12.1 to 0.13.0 (#3166)
* Bump camino from 1.1.6 to 1.1.7 (#3169)
* Bump serde from 1.0.201 to 1.0.202 (#3168)
* Bump thiserror from 1.0.60 to 1.0.61 (#3167)
* Bump pypa/cibuildwheel from 2.18.0 to 2.18.1 (#3165)
* Bump DeterminateSystems/magic-nix-cache-action from 4 to 6 (#3157)
* Bump DeterminateSystems/nix-installer-action from 10 to 11 (#3156)
* Bump pypa/cibuildwheel from 2.17.0 to 2.18.0 (#3155)
* Bump serde_json from 1.0.116 to 1.0.117 (#3159)
* Bump thiserror from 1.0.59 to 1.0.60 (#3158)
* Bump serde from 1.0.200 to 1.0.201 (#3160)
* Bump roaring from 0.10.3 to 0.10.4 (#3142)
* Bump histogram from 0.10.0 to 0.10.1 (#3141)
* Bump getrandom from 0.2.14 to 0.2.15 (#3143)
* Bump num-iter from 0.1.44 to 0.1.45 (#3140)
* Bump jinja2 from 3.1.3 to 3.1.4 (#3145)
* Bump serde from 1.0.199 to 1.0.200 (#3144)
* Bump serde from 1.0.198 to 1.0.199 (#3130)
* Bump conda-incubator/setup-miniconda from 3.0.3 to 3.0.4 (#3131)
* Update pytest requirement from <8.2.0,>=6.2.4 to >=6.2.4,<8.3.0 (#3132)
* Bump myst-parser from 2.0.0 to 3.0.1 (#3133)
* Bump thiserror from 1.0.58 to 1.0.59 (#3123)
* Bump serde_json from 1.0.115 to 1.0.116 (#3124)
* Bump serde from 1.0.197 to 1.0.198 (#3122)
* Update docutils requirement from <0.21,>=0.17.1 to >=0.17.1,<0.22 (#3116)
This PR fixes a bug in `disk_revindex.rs::RevIndexOps::gather` where RocksDB-based `gather` was incorrectly subtracting hashes multiple times from the counter in situations of high redundancy.
For example, consider this Venn diagram of the 3-way intersection between three sketches:
When a metagenome contains the union of all three of these sketches, the broken implementation would subtract the `10` at the center multiple times. This was caused by removing hashes from the matches, rather than the intersection, each pass through the counter.

Of note, this made RocksDB-based `fastmultigather` return incorrect results, ref sourmash-bio/sourmash_plugin_branchwater#322; first discovered in #3138 (comment).

The PR fixes this, and adds a more complete pair of tests, based on `test_gather_metagenome_num_results` in the Python tests for sourmash.

This PR also adjusts the hash function string for DNA sketches in Rust to be uppercase `DNA` rather than lowercase `dna`, ref sourmash-bio/sourmash_plugin_directsketch#49

And remember, it's not just the destination - it's the friends you make along the way, like `env_logger`.

* `RevIndex` gather functionality is broken. #3139
* `dna` not `DNA` sourmash_plugin_directsketch#49

For consideration:
Right now we are calculating the intersection twice, once in `disk_revindex.rs` and once in `calculate_gather_stats` in `revindex/mod.rs`. This is unnecessary, of course. But the function signature for `calculate_gather_stats` would need to change to take the intersection as an argument. We could:

* change `calculate_gather_stats` to take an optional intersection, and calculate it if not provided;
* change `calculate_gather_stats` to require an intersection.

TODO:
Other notes:
* The Python (`sourmash gather`) and Rust (`sourmash_plugin_branchwater` results) calculations for `remaining_bp` disagree. It seems to me like the Python one is definitely wrong; not yet sure about Rust. Viz gather is calculating `remaining_bp` incorrectly #3194.
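
The first option listed under "For consideration" (an optional intersection that is computed only when the caller does not supply one) could look roughly like the sketch below. It is written in Python for brevity and is purely illustrative: `gather_stats_sketch`, its parameters, and the returned fields are hypothetical and do not mirror the signature of the Rust `calculate_gather_stats`.

```python
from typing import Optional, Set


def gather_stats_sketch(
    query: Set[int],
    match: Set[int],
    isect: Optional[Set[int]] = None,
) -> dict:
    """Hypothetical stand-in for a gather-stats function.

    If the caller already computed the intersection, reuse it; otherwise
    fall back to computing it here.
    """
    if isect is None:
        isect = query & match
    f_match = len(isect) / len(match) if match else 0.0
    return {"intersect_hashes": len(isect), "f_match": f_match}


query = {1, 2, 3, 4}
match = {3, 4, 5, 6}

# The caller that already has the intersection passes it in and avoids recomputing it.
precomputed = query & match
with_isect = gather_stats_sketch(query, match, isect=precomputed)
without_isect = gather_stats_sketch(query, match)
print(with_isect == without_isect)
```

The trade-off is the usual one: the optional form keeps existing callers working, while requiring the intersection would make the duplicate calculation impossible by construction.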