1798: implementation of function to test if cardinality estimate is accurate wrt scale size #2031

dkoslicki · 2022-05-04T18:24:13Z

Fixes #1798 in part: implements a function that tells you whether your sketch is large enough to trust num_hashes * scale as an estimate of the number of distinct k-mers.

This is placed in the Utils folder since @bluegenes will likely want to move it elsewhere. Tests are in the script itself; they run in main.

Note that this does not fully fix #1798 since we still need to be sure the de-biasing term is included. A future PR will address this.
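For readers who want to see the shape of such a check, here is a minimal sketch. It assumes a two-sided Chernoff bound on a binomially distributed sketch size; the function names, the default `relative_error` and `confidence` values, and the use of the plug-in estimate num_hashes * scaled in place of the unknown true set size are all illustrative assumptions, not necessarily what this PR implements.

```python
import math


def chernoff_lower_bound(set_size, fraction, relative_error=0.05):
    """Lower-bound the probability that a sketch size is within
    relative_error of its expectation.

    Assumes the sketch size X is Binomial(set_size, fraction), so the
    two-sided Chernoff bound gives
        P(|X - mu| >= d * mu) <= 2 * exp(-d**2 * mu / 3),  mu = set_size * fraction.
    """
    expected_hashes = set_size * fraction
    return 1.0 - 2.0 * math.exp(-(relative_error ** 2) * expected_hashes / 3.0)


def estimate_is_trustworthy(num_hashes, scaled, relative_error=0.05, confidence=0.95):
    """Hypothetical helper: decide whether num_hashes * scaled can be trusted.

    Uses the plug-in estimate of the set size, because the true number of
    distinct k-mers is unknown when only the sketch is available.
    """
    fraction = 1.0 / scaled
    estimated_set_size = num_hashes * scaled
    bound = chernoff_lower_bound(estimated_set_size, fraction, relative_error)
    return bound >= confidence


if __name__ == "__main__":
    print(estimate_is_trustworthy(5000, 1000))
    print(estimate_is_trustworthy(1000, 1000))
```

Under these assumptions the bound depends only on the expected number of hashes in the sketch, so a small sketch fails the check regardless of the scaled value.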

If you are a new contributor, please provide
My ORCID: 0000-0002-0640-954X.

bluegenes · 2022-05-04T21:17:15Z

Thanks @dkoslicki!! Quick question -- how does np.floor impact the set_size compared with just doing num_hashes * scaled?

dkoslicki · 2022-05-04T22:16:27Z

@bluegenes It shouldn't impact things at all. As I understand it, sourmash uses the denominator of the fraction (so 100 means 1/100th of the size of the original set), whereas I use the actual fraction (so 1/100). My way can lead to non-integer estimates of set sizes (as can the sourmash approach, if it doesn't enforce int(scale)==scale), so the np.floor is there to make sure that doesn't happen.
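The two conventions can be put side by side in a short sketch. The helper names are hypothetical, and `math.floor` stands in for `np.floor` so the snippet needs only the standard library.

```python
import math


def set_size_from_scaled(num_hashes, scaled):
    """Denominator convention: scaled=100 keeps roughly 1/100 of the hashes."""
    return num_hashes * scaled


def set_size_from_fraction(num_hashes, fraction):
    """Fraction convention: fraction=0.01 keeps roughly 1/100 of the hashes.

    The division can yield a non-integer, so the result is floored.
    """
    return int(math.floor(num_hashes / fraction))


if __name__ == "__main__":
    # Integer denominator: the product is already an integer.
    print(set_size_from_scaled(500, 100))
    # A fraction that is not the reciprocal of an integer needs the floor.
    print(set_size_from_fraction(10, 0.003))
```

When the fraction is exactly 1/scaled for an integer scaled, the two estimates should agree up to floating-point rounding in the division.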

bluegenes · 2022-05-04T22:17:38Z

ah, makes sense -- thanks!

bluegenes · 2022-05-04T22:20:35Z

tagging @ctb as requested -- cardinality estimate accuracy eqns (🎉 !)

codecov · 2022-05-04T22:22:28Z

Codecov Report

Merging #2031 (f75d872) into latest (8136258) will increase coverage by 7.52%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           latest    #2031      +/-   ##
==========================================
+ Coverage   84.15%   91.67%   +7.52%     
==========================================
  Files         129       98      -31     
  Lines       15087    10807    -4280     
  Branches     2119     2119              
==========================================
- Hits        12696     9907    -2789     
+ Misses       2095      604    -1491     
  Partials      296      296

Flag	Coverage Δ
python	`91.67% <ø> (ø)`
rust	`?`

Impacted Files	Coverage Δ
src/core/src/storage.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/index/sbt/mod.rs
src/core/src/ffi/index/revindex.rs
src/core/src/cmd.rs
src/core/src/signature.rs
src/core/src/ffi/nodegraph.rs
src/core/src/ffi/signature.rs
src/core/src/index/revindex.rs
... and 21 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ctb · 2022-05-05T13:30:44Z

hi @bluegenes I kind of like having these in utils as well as in the running code base as in #2032 (comment). So I'm good for merge :).

bluegenes

lgtm, thanks @dkoslicki!

…ate (#2032)
* integrate eqn from #2031
* init changes for ignoring ANI from inaccurate sigs
* zero out ani if size may be inaccurate
* compare 1s, ani None, etc

Co-authored-by: C. Titus Brown <titus@idyll.org>

dkoslicki added 6 commits May 3, 2022 14:53

add file for chernoff and cardinality estimate

7444f1f

add basic functions, waiting to hear about where to pull the actual s…

729eaf8

…cale and num_sketches

complete functions, add tests

d70158a

change file name to be more accurate

d726b44

update readme

d025790

Merge branch 'latest' into 1798

f75d872

dkoslicki mentioned this pull request May 4, 2022

debiasing FracMinHash - plans and progress #1798

Open

bluegenes added a commit that referenced this pull request May 4, 2022

integrate eqn from #2031

75aa6b2

bluegenes mentioned this pull request May 4, 2022

[MRG] prevent ANI estimation when sketch size estimate may be inaccurate #2032

Merged

2 tasks

bluegenes approved these changes May 5, 2022

bluegenes merged commit 7d28f5b into sourmash-bio:latest May 5, 2022

ctb mentioned this pull request May 13, 2022

Draft release notes for sourmash v4.4.0 #1968

Closed
