1798: implementation of function to test if cardinality estimate is accurate wrt scale size #2031

Merged 6 commits on May 5, 2022


dkoslicki
Collaborator

Fixes #1798 in part: implements a function that tells you whether your sketch is large enough to trust num_hashes * scale as an estimate of the number of distinct k-mers.

This is placed in the Utils folder since @bluegenes will likely want to move it elsewhere. Tests are in the script itself; they run in main.

Note that this does not fully fix #1798 since we still need to be sure the de-biasing term is included. A future PR will address this.

My ORCID: 0000-0002-0640-954X.
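
For readers following along, here is a minimal sketch of the kind of check described above. It is not the code merged in this PR. It assumes each distinct k-mer lands in the sketch independently with probability 1/scaled and applies a standard two-sided multiplicative Chernoff bound. The function names, the default 5% relative error, and the 95% confidence threshold are all hypothetical choices for illustration.

```python
import math


def prob_estimate_within_error(set_size, scaled, relative_error=0.05):
    """Lower bound on P(|num_hashes * scaled - set_size| <= relative_error * set_size).

    Assumption (not from the PR): each of the set_size distinct k-mers is kept
    independently with probability 1/scaled, so the expected sketch size is
    set_size / scaled and the two-sided Chernoff bound
        P(|X - mu| >= d * mu) <= 2 * exp(-d**2 * mu / 3),  0 < d < 1
    applies to the sketch size X.
    """
    expected_hashes = set_size / scaled
    bound = 1.0 - 2.0 * math.exp(-(relative_error ** 2) * expected_hashes / 3.0)
    # The bound is vacuous (negative) for small sketches; clamp it to 0.
    return max(0.0, bound)


def estimate_is_trustworthy(num_hashes, scaled, relative_error=0.05, confidence=0.95):
    """Return True if num_hashes * scaled is likely within relative_error of the truth.

    Approximation: the unknown true set size is replaced by the estimate itself.
    """
    estimated_size = num_hashes * scaled
    return prob_estimate_within_error(estimated_size, scaled, relative_error) >= confidence


if __name__ == "__main__":
    print(estimate_is_trustworthy(10, 1000))       # tiny sketch
    print(estimate_is_trustworthy(100_000, 1000))  # large sketch
```

Under these assumptions the bound depends only on the expected number of hashes in the sketch, so a small sketch fails the check no matter how large the scaled value is.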

@bluegenes
Contributor

Thanks @dkoslicki!! Quick question -- how does np.floor impact the set_size compared with just doing num_hashes * scaled?

@dkoslicki
Collaborator Author

@bluegenes It shouldn't impact things at all. As I understand it, sourmash uses the denominator of the fraction (so 100 means 1/100th of the size of the original set), whereas I use the actual fraction (so 1/100). My way can lead to non-integer estimates of set sizes (as can the sourmash approach, if it doesn't enforce int(scale)==scale), so the np.floor is there to make sure that doesn't happen.
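
To make the two conventions concrete, here is a small illustrative sketch. The function names are hypothetical, and math.floor stands in for np.floor so the example needs only the standard library (note that np.floor returns a float, whereas math.floor returns an int).

```python
import math


def size_from_scaled(num_hashes, scaled):
    """Denominator convention: scaled=100 means 1/100 of the hashes are kept."""
    return num_hashes * scaled


def size_from_fraction(num_hashes, fraction):
    """Fraction convention: fraction=1/100 means 1/100 of the hashes are kept.

    Dividing by a fraction can give a non-integer value, so the result is
    floored to keep the set-size estimate an integer.
    """
    return math.floor(num_hashes / fraction)


if __name__ == "__main__":
    # Integer scaled: the product is already an integer.
    print(size_from_scaled(500, 100))
    # A fraction whose reciprocal is not an integer (1/0.4 = 2.5): floor is needed.
    print(size_from_fraction(3, 0.4))
```

With an integer scaled value the two conventions describe the same quantity; the floor only matters when the reciprocal of the fraction (or a non-integer scaled value) would make the product fractional.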

@bluegenes
Contributor

ah, makes sense -- thanks!

@bluegenes
Contributor

bluegenes commented May 4, 2022

tagging @ctb as requested -- cardinality estimate accuracy eqns (🎉 !)

@codecov

codecov bot commented May 4, 2022

Codecov Report

Merging #2031 (f75d872) into latest (8136258) will increase coverage by 7.52%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           latest    #2031      +/-   ##
==========================================
+ Coverage   84.15%   91.67%   +7.52%     
==========================================
  Files         129       98      -31     
  Lines       15087    10807    -4280     
  Branches     2119     2119              
==========================================
- Hits        12696     9907    -2789     
+ Misses       2095      604    -1491     
  Partials      296      296              
Flag Coverage Δ
python 91.67% <ø> (ø)
rust ?


Impacted Files Coverage Δ
src/core/src/storage.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/index/sbt/mod.rs
src/core/src/ffi/index/revindex.rs
src/core/src/cmd.rs
src/core/src/signature.rs
src/core/src/ffi/nodegraph.rs
src/core/src/ffi/signature.rs
src/core/src/index/revindex.rs
... and 21 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ctb
Contributor

ctb commented May 5, 2022

hi @bluegenes I kind of like having these in utils as well as in the running code base as in #2032 (comment). So I'm good for merge :).

Contributor

@bluegenes bluegenes left a comment


lgtm, thanks @dkoslicki!

@bluegenes bluegenes merged commit 7d28f5b into sourmash-bio:latest May 5, 2022
bluegenes added a commit that referenced this pull request May 13, 2022
…ate (#2032)

* integrate eqn from #2031

* init changes for ignoring ANI from inaccurate sigs

* zero out ani if size may be inaccurate

* compare 1s, ani None, etc

Co-authored-by: C. Titus Brown <titus@idyll.org>

Successfully merging this pull request may close these issues.

debiasing FracMinHash - plans and progress
3 participants