Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sourmash ANI estimate in some cases does not match manual computation, although using the same sketch signature #1

Closed
mahmudhera opened this issue Jun 1, 2022 · 3 comments

Comments

@mahmudhera
Copy link
Owner

mahmudhera commented Jun 1, 2022

Running the script python main.py inputs/ecoli.fasta --seed 0 --scalef 0.01 produces the results in the file ani_comparison_results. The results show a disagreement between manual calculation of the point estimate, and the sourmash estimate when the true ANI is <= 66%.

@mahmudhera
Copy link
Owner Author

@bluegenes is this because of some checks of the sketch size or the likelihood of corner cases? Please note that I do not use the latest branch of sourmash, rather compute the ANI using sourmash compare --containment --estimate-ani (please take a look at this), and then take an average of the two ANI values.
@dkoslicki these may be of interest to you.

@bluegenes
Copy link

bluegenes commented Jun 2, 2022

Hi @mahmudhera,

The latest sourmash addresses some issues with thresholding from your original equations that were affecting results (see sourmash-bio/sourmash#2060). While I would recommend switching to latest to take advantage of this fix, I don't think that's the crux of the issue here.

I think the issue you're seeing is related to sourmash-bio/sourmash#2003, where we zero out the ANI when the sketch size estimation may be inaccurate. I've been noticing the same thing that you're seeing here -- this is happening quite often (see sourmash-bio/sourmash#2058 to see the original verbose output from these checks).

Are we being too strict with size accuracy estimation checks?
https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/minhash.py#L942-L956

@mahmudhera
Copy link
Owner Author

mahmudhera commented Jun 7, 2022

Hi @bluegenes, I have rerun the script with the latest branch that I installed a few days ago for Phylo-ani. Strangely, there is now a perfect agreement between our estimate and the sourmash estimate of ANI (see here). I am currently using the version 4.4.1.dev3+g99c3997 now, which uses 0.95 and 0.05 as parameters in the cardinality estimation function.

Therefore, just for the discrepancies in this repository (which can be seen here), I believe the issue is the hardcoded thresholds, not the size estimation being too stringent).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants