[MRG] prevent ANI estimation when sketch size estimate may be inaccurate #2032

bluegenes · 2022-05-04T22:09:31Z

Using equations from @dkoslicki's #2031

integrate the equations / create a minhash method
when doing ANI comparisons, check set size accuracy; return "" when sizes are insufficient
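
A minimal sketch of the guard described above, not the code in this PR. The function names are hypothetical, the accuracy check is passed in as a plain boolean rather than computed from the #2031 equations, and the ANI point estimate `containment ** (1 / ksize)` is an assumption made for illustration.

```python
from typing import Optional


def ani_from_containment(containment: float, ksize: int) -> float:
    """Assumed point estimate: ANI = containment ** (1 / ksize)."""
    return containment ** (1.0 / ksize)


def guarded_ani(containment: float, ksize: int,
                size_is_accurate: bool) -> Optional[float]:
    """Return an ANI estimate only when the size estimate is trusted.

    `size_is_accurate` is a hypothetical stand-in for the result of the
    set-size accuracy check; this sketch does not implement that check.
    """
    if not size_is_accurate:
        return None  # callers may write this out as "" in tabular output
    return ani_from_containment(containment, ksize)


def format_ani(ani: Optional[float]) -> str:
    """Render a missing ANI as an empty string, as the description suggests."""
    return "" if ani is None else str(ani)
```

Returning `None` internally and converting to `""` only at output time keeps "no estimate" distinct from a real ANI of 0.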

Notes and concerns:

Adding the bias term changes the containment for small sketches, which we probably only want to do for major releases. If HLL cardinality estimation will be added soon, I'm not sure it's a good idea to do this, only to change it back for HLL. Containment values will change then too, but at least it will just be once? Ref: "add input number of k-mers (before sketching) to signature format" #2030.

codecov · 2022-05-04T22:16:31Z

Codecov Report

Merging #2032 (a249b95) into latest (6c7b3a8) will increase coverage by 7.50%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2032      +/-   ##
==========================================
+ Coverage   84.15%   91.65%   +7.50%     
==========================================
  Files         129       98      -31     
  Lines       15087    10854    -4233     
  Branches     2119     2133      +14     
==========================================
- Hits        12696     9948    -2748     
+ Misses       2095      607    -1488     
- Partials      296      299       +3

Flag	Coverage Δ
python	`91.65% <100.00%> (-0.02%)`	⬇️
rust	`?`


Impacted Files	Coverage Δ
src/sourmash/compare.py	`100.00% <100.00%> (ø)`
src/sourmash/distance_utils.py	`99.39% <100.00%> (+0.03%)`	⬆️
src/sourmash/minhash.py	`94.17% <100.00%> (+0.18%)`	⬆️
src/sourmash/signature.py	`91.81% <100.00%> (+0.15%)`	⬆️
src/sourmash/sketchcomparison.py	`95.23% <100.00%> (-4.77%)`	⬇️
src/core/src/lib.rs
src/core/src/ffi/nodegraph.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/encodings.rs
... and 27 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

bluegenes · 2022-05-04T22:21:17Z

@ctb here's where/how I'm thinking of integrating the equations from #2031

bluegenes · 2022-05-12T20:35:08Z

@ctb - Now that we zero/null out ANI when the size estimation may be inaccurate, should self x self be a special case where we return 1? Or still avoid returning ANI?

ctb · 2022-05-12T20:40:45Z

On Thu, May 12, 2022 at 01:35:23PM -0700, Tessa Pierce Ward wrote:
> @ctb - Now that we zero/null out ANI when the size estimation may be inaccurate, should self x self be a special case where we return 1? Or still avoid returning ANI?

I think avoid, is simplest.

bluegenes · 2022-05-12T21:14:48Z

> On Thu, May 12, 2022 at 01:35:23PM -0700, Tessa Pierce Ward wrote:
> > @ctb - Now that we zero/null out ANI when the size estimation may be inaccurate, should self x self be a special case where we return 1? Or still avoid returning ANI?
>
> I think avoid, is simplest.

agreed! Bit of an issue though -- `compare` automatically populates diagonal with ones.

e.g. -- https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/compare.py#L34

similarities = np.ones((n, n))

ctb · 2022-05-12T21:20:48Z

On Thu, May 12, 2022 at 02:15:03PM -0700, Tessa Pierce Ward wrote:
> > On Thu, May 12, 2022 at 01:35:23PM -0700, Tessa Pierce Ward wrote:
> > > @ctb - Now that we zero/null out ANI when the size estimation may be inaccurate, should self x self be a special case where we return 1? Or still avoid returning ANI?
> >
> > I think avoid, is simplest.
>
> agreed! Bit of an issue though -- `compare` automatically populates diagonal with ones.

aieee

bluegenes · 2022-05-12T21:56:11Z

Just to put the answer somewhere on here..

`compare` using Jaccard keeps the 1s from the original matrix on the diagonal because it uses `iterator = itertools.combinations(range(n), 2)` to get the unique pairs to compare, since comparisons are not directional. `compare` with `--containment` instead loops through every i/j in the matrix, so it calculates containment/ANI for every pair, including self vs self.

To fix, I am just always letting self vs self be 1. I'm also using `itertools.combinations` for max_containment to avoid recalculating identical values.
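
To make the difference concrete, here is a rough sketch of the two looping strategies, not the actual `compare.py` code. The function names are hypothetical, plain Python sets stand in for sketches, and a nested list of 1.0 values stands in for `np.ones((n, n))`.

```python
import itertools


def jaccard(a, b):
    return len(a & b) / len(a | b)


def containment(a, b):
    # Directional: fraction of `a` that is found in `b`.
    return len(a & b) / len(a)


def symmetric_matrix(items, similarity):
    """Fill only unique pairs; the diagonal keeps its initial 1s."""
    n = len(items)
    result = [[1.0] * n for _ in range(n)]  # stand-in for np.ones((n, n))
    for i, j in itertools.combinations(range(n), 2):
        result[i][j] = result[j][i] = similarity(items[i], items[j])
    return result


def directional_matrix(items, similarity):
    """Visit every ordered pair, but leave self vs self at 1."""
    n = len(items)
    result = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # never recompute the diagonal
            result[i][j] = similarity(items[i], items[j])
    return result


sketches = [{1, 2, 3, 4}, {3, 4}, {5}]
jaccard_matrix = symmetric_matrix(sketches, jaccard)
containment_matrix = directional_matrix(sketches, containment)
```

With the diagonal skipped in both loops, a self vs self comparison can never overwrite the initial 1 with a missing ANI value.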

ctb

LGTM!

integrate eqn from #2031

75aa6b2

init changes for ignoring ANI from inaccurate sigs

6abbe2d

ctb mentioned this pull request May 5, 2022

1798: implementation of function to test if cardinality estimate is accurate wrt scale size #2031

Merged

bluegenes and others added 2 commits May 5, 2022 10:08

Merge branch 'latest' into card-estimate-scaled

a49ac27

zero out ani if size may be inaccurate

dacf601

bluegenes and others added 2 commits May 12, 2022 15:21

compare 1s, ani None, etc

ad6769a

Merge branch 'latest' into card-estimate-scaled

ca626b8

bluegenes changed the title ~~[WIP] prevent ANI estimation when sketch size estimate is likely to be inaccurate~~ [MRG] prevent ANI estimation when sketch size estimate is likely to be inaccurate May 12, 2022

bluegenes mentioned this pull request May 12, 2022

debiasing FracMinHash - plans and progress #1798

Open

Merge branch 'latest' into card-estimate-scaled

a249b95

ctb approved these changes May 13, 2022

View reviewed changes

bluegenes changed the title ~~[MRG] prevent ANI estimation when sketch size estimate is likely to be inaccurate~~ [MRG] prevent ANI estimation when sketch size estimate may be inaccurate May 13, 2022

bluegenes merged commit 06989d7 into latest May 13, 2022

bluegenes deleted the card-estimate-scaled branch May 13, 2022 01:24

ctb mentioned this pull request May 13, 2022

Draft release notes for sourmash v4.4.0 #1968

Closed

bluegenes mentioned this pull request May 13, 2022

what to do about ANI estimate for two very small scaled sketches? #2003

Open
