Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EXP] add sourmash distance estimation #1788

Closed
wants to merge 46 commits into from
Closed

[EXP] add sourmash distance estimation #1788

wants to merge 46 commits into from

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Jan 13, 2022

Implement a minimal version of https://github.com/KoslickiLab/mutation-rate-ci-calculator to estimate distance from scaled containment and jaccard.

See https://www.biorxiv.org/content/10.1101/2022.01.11.475870v1

May fix #1242

  • actually use distance utilities within sourmash:
    • add signature methods for jaccard, containment, max_containment --> ANI
    • compare
    • prefetch
    • gather
    • search

add'l to do:

  • Add functions for edge case detection/reporting (no sketch overlap, 100% sketch overlap by chance).
  • decide when/where to report confidence intervals.
    • Report in search, prefetch, gather, NOT in compare. This maintains compatibility with sourmash plot.
  • handle sketch downsampling in signature ANI functions
  • add confidence param to minhash, signature ANI functions to allow tuning confidence level. Defaults to 0.95.
  • add --ani-threshold to tax genome
  • ensure @dkoslicki and @mahmudhera are in sourmash authors list (for JOSS papers)
  • understand benchmarking issues

Questions/Thoughts:

  • Do we also want to report additional metrics, e.g. fraction genome matched?
  • Is there any issue with directly reporting the estimated ANI's (1-distance)? Should we enable both "distance" and ANI and let user choose?

NOTE: ANI estimation is not backwards compatible with previous prefetch/gather results (needs additional columns).

@codecov
Copy link

codecov bot commented Jan 13, 2022

Codecov Report

Merging #1788 (e8c9997) into latest (6f7eb06) will decrease coverage by 62.68%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           latest    #1788       +/-   ##
===========================================
- Coverage   83.17%   20.48%   -62.69%     
===========================================
  Files         126      124        -2     
  Lines       13954    13666      -288     
  Branches     1910     1871       -39     
===========================================
- Hits        11606     2800     -8806     
- Misses       2075    10842     +8767     
+ Partials      273       24      -249     
Flag Coverage Δ
python 0.11% <0.00%> (-91.03%) ⬇️
rust 65.16% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/cli/compare.py 0.00% <0.00%> (-100.00%) ⬇️
src/sourmash/cli/utils.py 0.00% <0.00%> (-100.00%) ⬇️
src/sourmash/commands.py 0.00% <0.00%> (-88.56%) ⬇️
src/sourmash/compare.py 0.00% <0.00%> (-100.00%) ⬇️
src/sourmash/search.py 0.00% <0.00%> (-97.01%) ⬇️
src/sourmash/tax/__main__.py 0.00% <0.00%> (-87.57%) ⬇️
src/sourmash/tax/tax_utils.py 0.00% <0.00%> (-97.78%) ⬇️
src/sourmash/__main__.py 0.00% <0.00%> (-100.00%) ⬇️
src/sourmash/cli/info.py 0.00% <0.00%> (-100.00%) ⬇️
src/sourmash/cli/plot.py 0.00% <0.00%> (-100.00%) ⬇️
... and 91 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f7eb06...e8c9997. Read the comment docs.

@ctb
Copy link
Contributor

ctb commented Jan 14, 2022

lookin' good. do you want comments now or do you want to wait until it's more fully baked?

@bluegenes
Copy link
Contributor Author

bluegenes commented Jan 14, 2022

@ctb Comments now would be good! I'm done adding things for now, just gonna go take it for a spin :)

src/sourmash/commands.py Outdated Show resolved Hide resolved
src/sourmash/commands.py Outdated Show resolved Hide resolved
src/sourmash/commands.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
src/sourmash/compare.py Outdated Show resolved Hide resolved
@ctb
Copy link
Contributor

ctb commented Jan 14, 2022

Questions/Thoughts:

* Currently not reporting confidence interval in `compare`, as the results are stored as a single np matrix. How can we store + report these? Larger CSV output, rather than symmetric matrix? This would elim compatibility with current`plot`. What other issues would larger CSV bring about?

I think leave it out for now. Would it be included in search output?

* Do we also want to report additional metrics, e.g. fraction genome matched? If yes, same issues as ^.

Same answer :). But we might want to NA or zero out the numbers where they are particularly sketch-y (heh)

* Is there any issue with directly reporting the estimated ANI's (1-distance)?  Should we enable both "distance" and ANI and let user choose?

ANI good, IMO!

Co-authored-by: C. Titus Brown <titus@idyll.org>
@mahmudhera
Copy link

mahmudhera commented Feb 18, 2022

Excellent! The point estimate formula would essentially stay the same, I just want to do some theoretical analysis first, so that we have justification for using the formula. I will do some analyses in the next couple of days and give you an update on this.

@mahmudhera
Copy link

It looks like the point estimate that we have right now: it would only be reasonable to use when the ANI is large. Otherwise, the formula may work well by chance, but there is no theoretical guarantee. The exact range of ANI depends on other parameters.

We can approach this in one of two ways: we can either keep the code as is, and besides the point estimate, we can return an estimated error. If the error is above some threshold, we can report that Jaccard->mutation_rate code cannot be trusted. Or, we can try to improve the error rate by solving for the point estimate differently, which may take a bit more work. I will try both locally, let me know which one of these you would prefer.

Meanwhile, do you think you can send the exact settings where the code failed? Sending the values of L, k, p, s for the cases where you got this RuntimeWarning would suffice.

@bluegenes
Copy link
Contributor Author

thanks @mahmudhera!

It looks like the point estimate that we have right now: it would only be reasonable to use when the ANI is large. Otherwise, the formula may work well by chance, but there is no theoretical guarantee. The exact range of ANI depends on other parameters.

ok, good to know!

We can approach this in one of two ways: we can either keep the code as is, and besides the point estimate, we can return an estimated error. If the error is above some threshold, we can report that Jaccard->mutation_rate code cannot be trusted. Or, we can try to improve the error rate by solving for the point estimate differently, which may take a bit more work. I will try both locally, let me know which one of these you would prefer.

I think if it's possible to generate the jaccard ANI point estimate differently without too much extra work, that would be great. If not, the other option is ok -- I would probably just zero out the ANI estimate when the error is too high (using some default, user-modifiable threshold). If the error is much higher, we can recommend only containment --> ANI and provide jaccard equations with warnings. Alternatively, we could recommend using containment only.

Meanwhile, do you think you can send the exact settings where the code failed? Sending the values of L, k, p, s for the cases where you got this RuntimeWarning would suffice.

I need to do some digging to get the exact parameters for when the warnings appeared -- will send over when I have them!

@bluegenes
Copy link
Contributor Author

bluegenes commented Feb 23, 2022

Relevant portion of a traceback for the RuntimeWarnings encountered w/ jaccard --> ANI

  File "/home/ntpierce/miniconda3/8d22aeebeaab4ce824f4fd7c3e1b98c6/lib/python3.8/site-packages/sourmash/distance_utils.py", line 148, in jaccard_to_distance
    sol1 = brentq(f1, 0.0000001, 0.9999999)
  File "/home/ntpierce/miniconda3/8d22aeebeaab4ce824f4fd7c3e1b98c6/lib/python3.8/site-packages/scipy/optimize/_zeros_py.py", line 783, in brentq
    r = _zeros._brentq(f, a, b, xtol, rtol, maxiter, args, full_output, disp)
  File "/home/ntpierce/miniconda3/8d22aeebeaab4ce824f4fd7c3e1b98c6/lib/python3.8/site-packages/sourmash/distance_utils.py", line 145, in <lambda>
    f1 = lambda pest: 2.0/(2- (1-pest)**ksize ) - 1 + z_alpha * sqrt(var_direct(pest)) - jaccard
RuntimeWarning: invalid value encountered in sqrt

Note, when this shows up, it shows up for both f1 and f2, never just one.

I re-ran a number of analyses and recorded incidence of errors. Sent csv via email!

@mahmudhera
Copy link

mahmudhera commented Mar 4, 2022

I did this PR #1860, which adds codes for jaccard to point-estimate into your branch. I tested the obvious corner cases, seem to be passing. Just to note: this code does have some error, and therefore also returns an error measure with the point-estimate. More details are in the code docstrings. Let me know if you have any questions. I also added code to calculate the likelihood that nothing may be common in sketches. Look at dist_utils.py for more details.

@bluegenes bluegenes changed the base branch from latest to add-distance-utils-only April 15, 2022 00:39
Base automatically changed from add-distance-utils-only to latest April 15, 2022 17:18
@bluegenes bluegenes changed the title [WIP] add distance estimation utilities [WIP] add sourmash distance estimation Apr 15, 2022
@bluegenes bluegenes changed the title [WIP] add sourmash distance estimation [EXP] add sourmash distance estimation Apr 27, 2022
bluegenes added a commit that referenced this pull request Apr 27, 2022
bluegenes added a commit that referenced this pull request Apr 27, 2022
@bluegenes
Copy link
Contributor Author

punted tax ANI to #2005 and JOSS updates to #2006. I think this can safely be closed now!

@bluegenes bluegenes closed this Apr 27, 2022
ctb pushed a commit that referenced this pull request Jul 24, 2022
* init with tax code from #1788

* tax threshold arg from #1788

* add back sql changes from latest

* make sure we can still use tax without ANI col

* dont warn about ignoring ANI more than once

* check query stats

* fix gather test

* add sentence on ani thresh

* apply sugg from code review
@ctb ctb deleted the add-dist-est branch August 20, 2022 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement pairwise evolutionary distance/ANI output from sourmash
3 participants