-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How exactly does the p-value calculation work? #87
Comments
Hi Miquel, Q1) Q2)
So you are absolutely correct that "in each resample, you permute the genotype transitions across nodes without replacement, with an uneven probability that is proportional to the branch length of each node." I'm not confident I understand the rest of your question but I will try to answer: The next step in the algorithm (Line 346) is to record in a matrix which tree edges were assigned to have these randomized genotype transitions. In this matrix, redistributed_hits, each column corresponds to a tree edge and each row corresponds to a permutation. A value of 1 means the tree edge is a randomized genotype transition, 0 means no transition. The next step is to compare redistributed_hits (the matrix of randomized genotypes) to the phenotype matrix (either phenotype presence/absence in phyc or phenotype transitions in the synchronous test). This is done in the function If I didn't answer your question, please be welcome reopen the issue to ask follow ups or rephrase. Please let me know if you have other questions or run into bugs. All the best, |
Good morning, Thanks for the answer, it clarifies many things that will be useful to me. I am still curious about how permuted_geno_trans_mat is ordered. I understand that each column of permuted_geno_trans_mat corresponds to a high-confidence node in the tree. How is the order of these nodes (columns) decided? I ask this because the first columns (nodes) will tend to have the genotype transition values (1 or 0) from longer branches (since the the sampling is without replacement and biased by branch length). I wonder if this could generate some bias in some trees with variable branch lengths, which is the case for my data. For example If the order comes from traversing the tree, the nodes firstly traversed will tend to have the genotype transition values from long branches right? Would this be a correct modelling of the null hypothesis? I thought that one way to correct for such a bias (and maybe you are already doing it) could be to sort the columns by the edge length of the corresponding node (in a descending way), so that the longer edges will more likely have resampled genotype transitions from long edges. Is this the case? If not, would this make sense to you? Thanks for your time, Miquel Àngel Schikora |
Thank you for restating the question, Miquel! Your observation has made it clear to me that hogwash could calculate a far more accurate null hypothesis! As currently written this line: I appreciate the solution you proposed. Here's the solution I am currently leaning towards: |
I made a quick (and ugly) plot to demonstrate the differences between the current version of the null distribution generation and the version I proposed as the fix. I generated some test data and ran either the current or prosed version of the sampling step to generate the null distribution. Below are plots of the null distribution generated from one test genotype. In both plots the y axis = count and x axis = the tree edge (in the ape default order). A black "x" indicates the edges with dummy low confidence genotype ancestral reconstruction and a grey caret indicates hi confidence. Orange circles are the expected values of the null distribution given the probability of choosing each edge based on relative edge length when low confidence edges are excluded. Code to generate these two figures is here. |
Good morning, Thanks for this quick response. I think that your improvement is great, so that (if I understand correctly) you only re-sample the values of high-confidence nodes. However, I still wonder how this default ordering of the columns by appearance in the tree (in permuted_geno_trans_mat ) could affect the results. I understand that this is not changed in this PROPOSED NEW VERSION rigth? I think that I did not phrase my question properly, and I will illustrate it with an example. For example, imagine the tree like the one below: The names of the nodes are 1-6 (ordered as they appear in the tree), and 1 or 0 indicates whether the node has a genotype transition (GT). I understand that permuted_geno_trans_mat would have columns for the nodes 2 - 6 (ordered 2,3,4,5,6), and node 1 would be discarded because it has a long branch. In most resamples, node 2 (which picks the first resampled GT) will get a GT of 1 coming from node 4 (which has the longer branch and it has the highest probability of getting picked first). However, if you use another ordering (i.e.: changing the type of tree traversal method) the results will be very different, and the first column will mostly get the value of node 4. I understand that the point of re-sampling with a probability proportional to branch length is to prioritize the GT values of longer branches right? The problem I see with the current method is that the GT values of longer branches will mostly be assigned to the first nodes in the columns, and I maybe this can generate biased null distributions. I thought that one way to correct for such a bias could be to sort the columns by the edge length of the corresponding node (in a descending way), so that the longer edges will more likely have resampled GTs from long edges. This would be a way that short branches get the GT values of short branches and vice-versa. Would this make sense to you? Thanks, Miquel Àngel Schikora |
Good morning,
Thanks for developing this great tool. I am trying to understand more deeply how the phyC and syncronous methods work to be sure that it can be used on my data.
Reading your paper I came across this sentence (in the PhyC section): "For permutation tests to determine the significance of associations genotype transitions are randomized on the tree with probability proportional to the branch length. The number of edges where the permuted genotype mutation intersects with phenotype presence edges is recorded for each permutation;". My first question is:
Q1) Does the syncronous test also randomize genotype transitions with probability proportional to the branch length? This is not clear to me from the paper, although looking at the code I think that this is the case. Could you confirm this?
In addition, I think that I do not fully understand how exactly is this randomization done. After going over your code, I understood that, in each resample, you permute the genotype transitions across nodes without replacement, with an uneven probability that is proportional to the branch length of each node. This means that in each resample the array of permuted transitions will have the values from longer branches at the beginning (since they are picked first). I understand that you then map these resampled transitions to an array of nodes (each with a given phenotype transition), which will be used to calculate the empirical distribution of N. I have a question about this:
Q2) How exactly do you map this array of resampled transitions to the array of nodes (which I assume it is always the same)? Is it possible that you sort the array of nodes by edge length (in a descending way), so that the longer edges will more likely have resampled transitions from long edges?
I would really appreciate if you may clarify these questions. Thanks.
Miquel Àngel Schikora
The text was updated successfully, but these errors were encountered: