Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] debias all containment values #2243

Merged
merged 18 commits into from
Dec 7, 2022
Merged

[MRG] debias all containment values #2243

merged 18 commits into from
Dec 7, 2022

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Aug 29, 2022

Alternative to #2057.

Instead of adding new functions to specifically use the debiased containment values, just debias within the original functions.

This does alter the containment values slightly, but not enough to screw up any tests. The only test I had to change was one where I was (improperly) asserting that the containment between downsampled minhashes was equal to the containment between those original minhashes. It also seems to affect ANI confidence intervals slightly.

But perhaps this is a big enough change to be for 5.0 instead of 4.x?

@codecov
Copy link

codecov bot commented Aug 29, 2022

Codecov Report

Merging #2243 (0a80d9f) into latest (b2d2980) will increase coverage by 8.11%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2243      +/-   ##
==========================================
+ Coverage   84.10%   92.22%   +8.11%     
==========================================
  Files         130      101      -29     
  Lines       15065    11500    -3565     
  Branches     2211     2208       -3     
==========================================
- Hits        12671    10606    -2065     
+ Misses       2098      598    -1500     
  Partials      296      296              
Flag Coverage Δ
python 92.22% <100.00%> (-0.02%) ⬇️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/minhash.py 93.88% <100.00%> (-0.18%) ⬇️
src/sourmash/search.py 97.95% <100.00%> (ø)
src/core/src/signature.rs
src/core/src/index/sbt/mod.rs
src/core/src/index/sbt/mhbt.rs
src/core/src/storage.rs
src/core/src/ffi/utils.rs
src/core/src/ffi/index/mod.rs
src/core/src/encodings.rs
src/core/src/lib.rs
... and 21 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bluegenes bluegenes changed the title debias all containment values [MRG] debias all containment values Aug 29, 2022
@bluegenes
Copy link
Contributor Author

bluegenes commented Aug 29, 2022

@dkoslicki @mahmudur is it expected that using the bias factor has such a small effect on containment and ANI values?

@ctb
Copy link
Contributor

ctb commented Aug 30, 2022

To me, this is an improvement/bugfix, and I do not think it would/should require a major version bump.

@dkoslicki
Copy link
Collaborator

@bluegenes I wouldn't expect the de-biasing to cause large changes. It really only comes into play when the underlying sequence set is very small.

And agreed this seems suitable for a 4.x, especially since the CLI didn't change with the bias term

@ctb
Copy link
Contributor

ctb commented Aug 30, 2022

hot take - let's see what happens after merging #2057 first?

@ctb
Copy link
Contributor

ctb commented Nov 25, 2022

how are we feeling about this one? I have no memory of this :)

@dkoslicki
Copy link
Collaborator

IIRC, as long as @bluegenes says it's all good, I'm good to review. Though I don't know if this incorporates the better set size estimation from #2268

@bluegenes
Copy link
Contributor Author

bluegenes commented Dec 1, 2022

@ctb you had wanted to wait to see how it went with #2057 🤷🏻‍♀️

@dkoslicki this shouldn't mess with any of the changes from #2268 (aka, they should still be in place). Could you review when you get a chance, please? Or @ctb?

Primarily, this PR replaces the functionality in "contained_by" with the functionality in "contained_by_debiased" and removes the (now redundant) "debiased" function. This means we can remove duplicated/unnecessary tests.

This only contains changes for containment - was there anything we needed to do with jaccard?

@ctb
Copy link
Contributor

ctb commented Dec 2, 2022

@ctb you had wanted to wait to see how it went with #2057 🤷🏻‍♀️

yep, as a matter of general conservatism :). sourmash 4.5.0 included #2057 and nothing's blown up, so I think we're ok!

@ctb ctb enabled auto-merge (squash) December 7, 2022 19:11
@ctb ctb merged commit 12b5e19 into latest Dec 7, 2022
@ctb ctb deleted the debias-everything branch December 7, 2022 19:26
@dkoslicki
Copy link
Collaborator

was there anything we needed to do with jaccard?

Just noticed this detail. Yes, those need to be de-biased too, in broadly the same way. @bluegenes, the formula is the last theorem on page 20 of our manuscript. I'll get around to a PR/issue eventually unless @mahmudhera beats me to it.
Doesn't affect this PR at all though

@bluegenes bluegenes mentioned this pull request Dec 12, 2022
@ctb ctb mentioned this pull request Mar 3, 2023
ctb added a commit that referenced this pull request Mar 3, 2023
# sourmash release 4.7.0

Major new features:

* provide an initial plugin architecture for sourmash that supports new
signature saving & loading mechanisms (#2428)
* add plugin support for new command-line subcommands (#2438)
* debias all containment values (#2243)

Minor new features:

* Use RankLineageInfo to simplify reading lineages (#2467)
* store taxids in lineageDB (#2466)
* Use new tax classes for taxonomic summarization (#2443)
* add tax summarization dataclasses for safety and flexibility (#2439)
* add `--scaled` to sourmash compare (#2414)
* replace `lca_utils.LineagePair` with `tax_utils.LineagePair` (#2441)
* Add new classes for lineage manipulation (#2437)

Cleanup and documentation updates:

* ReadTheDocs updates (#2445)
* update `sourmash compare` command-line docs (#2400)

Developer updates:

* fix python tests by bumping tox and pip cache versions (#2424)
* Update sphinx requirement from <6,>=4.4.0 to >=4.4.0,<7 (#2430)
* Build: replace milksnake with maturin (#2393)
* importlib_metadata is a dependency on old Python versions (#2484)
* Release docs: use two separate sed commands (#2483)
* minor fixes to release behavior (#2479)
* Use screed and maturin from nixpkgs in `flake.nix` (#2481)
* update release procedure after v4.6.0 and v4.6.1 (#2386)
* Update makefile and docs (#2432)

Dependabot updates:

* Bump once_cell from 1.17.0 to 1.17.1 (#2488)
* Bump ouroboros from 0.15.5 to 0.15.6 (#2487)
* Bump memmap2 from 0.5.8 to 0.5.9 (#2486)
* Bump supercharge/redis-github-action from 1.4.0 to 1.5.0 (#2485)
* Bump proptest from 1.0.0 to 1.1.0 (#2460)
* Bump web-sys from 0.3.60 to 0.3.61 (#2461)
* Bump serde_json from 1.0.91 to 1.0.93 (#2471)
* Bump wasm-bindgen-test from 0.3.33 to 0.3.34 (#2463)
* Bump cachix/install-nix-action from 18 to 19 (#2459)
* Bump wasm-bindgen from 0.2.83 to 0.2.84 (#2464)
* Bump typed-builder from 0.11.0 to 0.12.0 (#2451)
* Bump bumpalo from 3.9.1 to 3.12.0 (#2450)
* Bump pypa/cibuildwheel from 2.11.4 to 2.12.0 (#2447)
* Bump bzip2 from 0.4.3 to 0.4.4 (#2444)
* Bump once_cell from 1.14.0 to 1.17.0 (#2429)
* Bump serde from 1.0.151 to 1.0.152 (#2423)
* Bump pypa/cibuildwheel from 2.11.3 to 2.11.4 (#2422)
* Bump serde_json from 1.0.89 to 1.0.91 (#2418)
* Bump serde from 1.0.150 to 1.0.151 (#2419)
* Bump thiserror from 1.0.37 to 1.0.38 (#2417)
* Bump finch from 0.4.3 to 0.5.0 (#2416)
* Bump rayon from 1.6.0 to 1.6.1 (#2404)
* Bump serde from 1.0.149 to 1.0.150 (#2403)
* Bump pypa/cibuildwheel from 2.11.2 to 2.11.3 (#2402)
* Bump serde from 1.0.148 to 1.0.149 (#2397)
* Bump capnp from 0.14.5 to 0.14.11 (#2396)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants