mamba env for installations; use compare --avg-containment --ani #1

bluegenes · 2022-05-16T22:25:29Z

Contains an environment.yml file that installs sourmash from latest, so we can use --avg-containment for ANI estimation.

Install the environment with:

mamba env create -f environment.yml

Activate it with:

conda activate phylo-ani

The matrices produced by compare and by the internal distance function from sourmash now at least look identical. There are still discrepancies between sourmash and @mahmudhera's method.

mahmudhera · 2022-05-31T19:00:08Z

This has been fixed; the three matrices now agree. The problem was that when computing sketches with sourmash, the seed was not the same as the seed in my manual approach. Passing the same seed to sourmash compute solved the problem.
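For readers following along, the seed effect can be reproduced with a minimal sketch. This is not sourmash's implementation: the hash is a seeded BLAKE2b standing in for sourmash's hash function, the sequences are synthetic, and the names fracminhash, containment, K, and SCALED are hypothetical choices made for illustration. Sketching the same pair of sequences under two different seeds gives two slightly different containment estimates, while repeating a seed reproduces the sketch exactly.

```python
import hashlib
import random

K = 21        # k-mer size (assumed for illustration)
SCALED = 10   # keep roughly 1/SCALED of all k-mer hashes (assumed)


def seeded_hash(kmer, seed):
    """Seeded 64-bit hash; a stand-in for the hash used by a real sketching tool."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def fracminhash(seq, k, scaled, seed):
    """Keep every k-mer hash that falls below 2**64 / scaled."""
    threshold = 2**64 // scaled
    hashes = (seeded_hash(seq[i:i + k], seed) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(sketch_a, sketch_b):
    """Fraction of sketch_a's hashes that also appear in sketch_b."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0


def mutate(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
mutated = mutate(genome, 0.01, rng)

# Same pair of sequences, sketched consistently under two different seeds.
c_seed_42 = containment(fracminhash(genome, K, SCALED, 42),
                        fracminhash(mutated, K, SCALED, 42))
c_seed_1 = containment(fracminhash(genome, K, SCALED, 1),
                       fracminhash(mutated, K, SCALED, 1))

print(f"containment with seed 42: {c_seed_42:.4f}")
print(f"containment with seed 1:  {c_seed_1:.4f}")
```

Each seed selects a different random subset of k-mers, so two pipelines that use different seeds are expected to disagree slightly even when both are correct.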

dkoslicki · 2022-05-31T20:03:41Z

Interesting! I'm surprised that the seed could cause such variability. @mahmudhera can we quantify when this will happen (say, via variance)?

mahmudhera · 2022-05-31T20:09:29Z

I think the variability is captured by the variance of containment_by_fmh (and therefore, by the confidence interval) if I am not wrong. The variance in the product space (detailed in the paper) accounts for both the sketching and the mutation random processes. The seed for creating the sketch is the former of these two.

dkoslicki · 2022-05-31T21:45:14Z

Can you verify this for those entries that vary quite a bit (as a sanity check)? We can then consider using the size of the confidence interval to either report to the user (or warn them) that subsequent runs (with different seeds) can result in vastly different ANI estimates.

mahmudhera · 2022-05-31T22:11:28Z

I am a bit unsure about the first line; can you kindly elaborate? Do you want me to run with multiple seeds to see whether there is variability? For all entries in the matrix, there are tiny disagreements (on the order of 10e-2 for ANI) if we use different seeds at different times.
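A multi-seed run of the kind asked about here could be sketched as follows. This is an illustration under stated assumptions, not the method from the paper or from sourmash: the hash is a seeded BLAKE2b, the sequences are synthetic, the helper names sketch and ani_estimate are hypothetical, and ANI is taken as the simple point estimate C^(1/k) from the containment C, without any confidence interval.

```python
import hashlib
import random
import statistics


def sketch(seq, k, scaled, seed):
    """FracMinHash-style sketch: keep k-mer hashes below 2**64 / scaled."""
    threshold = 2**64 // scaled
    salt = seed.to_bytes(8, "little")
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(),
                                 digest_size=8, salt=salt).digest()
        value = int.from_bytes(digest, "little")
        if value < threshold:
            hashes.add(value)
    return hashes


def ani_estimate(seq_a, seq_b, k, scaled, seed):
    """Point estimate ANI = C**(1/k), where C is the sketch containment."""
    a = sketch(seq_a, k, scaled, seed)
    b = sketch(seq_b, k, scaled, seed)
    if not a:
        return 0.0
    c = len(a & b) / len(a)
    return c ** (1.0 / k)


rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(10000))
# Substitute roughly 2% of positions, so the true identity is about 0.98.
mutated = "".join(
    rng.choice([b for b in "ACGT" if b != base]) if rng.random() < 0.02 else base
    for base in genome
)

# Repeat the estimate under several seeds and summarize the spread.
seeds = range(10)
anis = [ani_estimate(genome, mutated, k=21, scaled=10, seed=s) for s in seeds]
mean_ani = statistics.mean(anis)
stdev_ani = statistics.stdev(anis)

print(f"mean ANI over {len(anis)} seeds: {mean_ani:.4f}")
print(f"sample standard deviation:      {stdev_ani:.5f}")
```

The spread across seeds reflects only the sketching randomness for one fixed pair of sequences; the variance discussed above also covers the mutation process.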

dkoslicki · 2022-05-31T23:04:07Z

When there was some discrepancy (due to seed settings), I meant to take the entries with large discrepancy (e.g., I think there were a few with over 50% relative difference) and look at their CI width. Does that help clarify?
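The check described here could look roughly like the sketch below. The function name flag_discrepant_entries, the 0.5 threshold default, and the choice to normalize by the larger of the two values are assumptions made for illustration, and the small matrices are placeholder numbers used only to exercise the function, not results from this project.

```python
def flag_discrepant_entries(matrix_a, matrix_b, ci_low, ci_high, rel_threshold=0.5):
    """Return (i, j, relative_difference, ci_width) for entries that disagree strongly.

    The relative difference is |a - b| / max(|a|, |b|); entries where both
    values are zero are skipped.
    """
    flagged = []
    for i, (row_a, row_b) in enumerate(zip(matrix_a, matrix_b)):
        for j, (a, b) in enumerate(zip(row_a, row_b)):
            denom = max(abs(a), abs(b))
            if denom == 0:
                continue
            rel_diff = abs(a - b) / denom
            if rel_diff > rel_threshold:
                ci_width = ci_high[i][j] - ci_low[i][j]
                flagged.append((i, j, rel_diff, ci_width))
    return flagged


# Placeholder matrices (made-up values, only to show the call).
run_a = [[0.0, 0.10], [0.12, 0.0]]
run_b = [[0.0, 0.25], [0.11, 0.0]]
ci_low = [[0.0, 0.05], [0.10, 0.0]]
ci_high = [[0.0, 0.30], [0.13, 0.0]]

flagged = flag_discrepant_entries(run_a, run_b, ci_low, ci_high)
for i, j, rel_diff, width in flagged:
    print(f"entry ({i}, {j}): relative difference {rel_diff:.2f}, CI width {width:.2f}")
```

If the flagged entries also have wide confidence intervals, the CI width would be a reasonable signal to surface to users as a warning.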

mamba env for installations; use compare --avg-containment --ani

4f6c699

bluegenes mentioned this pull request May 16, 2022

add script that computes and prints the three different ways to make … mahmudhera/phylogenetic-tree-using-fracminhash#2

Merged

dkoslicki approved these changes May 17, 2022

dkoslicki merged commit 77196cd into dkoslicki:main May 17, 2022
