mamba env for installations; use compare --avg-containment --ani #1

Merged
merged 1 commit into from
May 17, 2022

bluegenes

Adds an environment.yml file that installs sourmash from latest, so we can use --avg-containment for ANI estimation.

Install the environment with:

mamba env create -f environment.yml

Activate it with:

conda activate phylo-ani

The matrices from sourmash compare and from sourmash's internal distance function now at least look identical. There are still discrepancies between sourmash and @mahmudhera's method.

@dkoslicki dkoslicki merged commit 77196cd into dkoslicki:main May 17, 2022
@mahmudhera

This has been fixed; the three matrices now agree. The problem was that the seed used when computing sketches with sourmash was not the same as the seed in my manual approach. Passing the same seed to sourmash compute solved the problem.
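To illustrate why a seed mismatch shifts the numbers, here is a minimal sketch of FracMinHash-style containment in plain Python. It is not the sourmash implementation: `blake2b` keyed by the seed stands in for the seeded hash that sourmash uses, canonical k-mers are ignored, and the sequences, `scaled` value, and function names are all hypothetical.

```python
import hashlib
import random


def fmh_sketch(seq, k=21, scaled=10, seed=42):
    """Keep the k-mer hashes that fall below 2**64 / scaled (FracMinHash idea)."""
    max_hash = 2**64 // scaled
    key = seed.to_bytes(8, "little")  # the seed selects the hash function
    sketch = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8, key=key).digest()
        value = int.from_bytes(digest, "little")
        if value < max_hash:
            sketch.add(value)
    return sketch


def containment(a, b):
    """Fraction of the hashes in sketch a that also occur in sketch b."""
    return len(a & b) / len(a) if a else 0.0


def mutate(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


# Toy data (made up for illustration): a random sequence and a mutated copy.
rng = random.Random(0)
genome_a = "".join(rng.choice("ACGT") for _ in range(5000))
genome_b = mutate(genome_a, 0.02, rng)

# Same seed on both runs: the estimate is reproducible.
c_seed42 = containment(fmh_sketch(genome_a, seed=42), fmh_sketch(genome_b, seed=42))
c_seed42_again = containment(fmh_sketch(genome_a, seed=42), fmh_sketch(genome_b, seed=42))

# A different seed samples a different subset of k-mers, so the estimate can shift.
c_seed1 = containment(fmh_sketch(genome_a, seed=1), fmh_sketch(genome_b, seed=1))

print(c_seed42, c_seed42_again, c_seed1)
```

Two pipelines that agree on everything except the seed therefore compare different random subsets of k-mers, which is enough to produce slightly different matrices.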

@dkoslicki
Owner

Interesting! I'm surprised that the seed could cause such variability. @mahmudhera can we quantify when this will happen (say, via variance)?

@mahmudhera

I think the variability is captured by the variance of containment_by_fmh (and therefore by the confidence interval), if I am not wrong. The variance in the product space (detailed in the paper) accounts for both the sketching and the mutation random processes. The seed used to create the sketch belongs to the former of these two.

@dkoslicki
Owner

Can you verify this for those entries that vary quite a bit (as a sanity check)? We can then consider using the size of the confidence interval to either report to the user or warn them that subsequent runs (with different seeds) can result in vastly different ANI estimates.

@mahmudhera

I am a bit unsure about the first line; can you kindly elaborate? Do you want me to run with multiple seeds to see whether there is variability? For all entries in the matrix, there are tiny disagreements (on the order of 10e-2 for ANI) if we use different seeds at different times.
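A multi-seed run of the kind described above could be sketched as follows. This is an illustration under stated assumptions, not the code used in this PR: it uses a `blake2b`-based stand-in for the sketch hash, toy random sequences, and the point estimate ANI ≈ C^(1/k) from the containment C; all names and parameter values are hypothetical.

```python
import hashlib
import random
import statistics

K = 21  # k-mer size (assumed for this toy example)


def fmh_sketch(seq, seed, k=K, scaled=10):
    """FracMinHash-style sketch: keep hashes below 2**64 / scaled."""
    max_hash = 2**64 // scaled
    key = seed.to_bytes(8, "little")
    sketch = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8, key=key).digest()
        value = int.from_bytes(digest, "little")
        if value < max_hash:
            sketch.add(value)
    return sketch


def ani_estimate(seq_a, seq_b, seed):
    """Containment of a in b, converted to ANI via C ** (1 / k)."""
    a = fmh_sketch(seq_a, seed)
    b = fmh_sketch(seq_b, seed)
    c = len(a & b) / len(a) if a else 0.0
    return c ** (1.0 / K)


# Toy pair: a random sequence and a copy with roughly 2% substitutions.
rng = random.Random(1)
genome_a = "".join(rng.choice("ACGT") for _ in range(10000))
genome_b = "".join(
    rng.choice("ACGT".replace(base, "")) if rng.random() < 0.02 else base
    for base in genome_a
)

# Re-sketch the same pair under several seeds and summarize the spread.
estimates = [ani_estimate(genome_a, genome_b, seed) for seed in range(10)]
mean_ani = statistics.mean(estimates)
spread = statistics.stdev(estimates)

print(f"mean ANI = {mean_ani:.4f}, std across seeds = {spread:.4f}")
```

The standard deviation across seeds only measures the sketching randomness for one fixed pair of sequences; it does not include the mutation process that the confidence interval also accounts for.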

@dkoslicki
Owner

When there was some discrepancy (due to seed settings), I meant taking the entries with large discrepancy (e.g., I think there were a few with over 50% relative difference) and looking at their CI width. Does that help clarify?
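The check described here could look roughly like the sketch below. The function name, the definition of relative difference (absolute difference divided by the larger of the two values), and all matrix values are assumptions made up for illustration; they are not results from this PR.

```python
def flag_unstable(ani_a, ani_b, ci_low, ci_high, rel_threshold=0.5):
    """Return (i, j, relative difference, CI width) for entries that disagree strongly."""
    flagged = []
    for i in range(len(ani_a)):
        for j in range(len(ani_a[i])):
            a, b = ani_a[i][j], ani_b[i][j]
            denom = max(abs(a), abs(b))
            rel_diff = abs(a - b) / denom if denom else 0.0
            if rel_diff > rel_threshold:
                flagged.append((i, j, rel_diff, ci_high[i][j] - ci_low[i][j]))
    return flagged


# Hypothetical ANI matrices from two runs that used different seeds.
ani_run1 = [[1.0, 0.9], [0.9, 1.0]]
ani_run2 = [[1.0, 0.4], [0.9, 1.0]]

# Hypothetical confidence-interval bounds for the same entries.
ci_low = [[1.0, 0.35], [0.85, 1.0]]
ci_high = [[1.0, 0.95], [0.95, 1.0]]

flagged = flag_unstable(ani_run1, ani_run2, ci_low, ci_high)
for i, j, rel_diff, width in flagged:
    print(f"entry ({i}, {j}): relative difference {rel_diff:.2f}, CI width {width:.2f}")
```

If the entries with large relative differences also have wide confidence intervals, the CI width is a reasonable signal to report or warn on.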


3 participants