-
Notifications
You must be signed in to change notification settings - Fork 2
How to use Tree Model Adequacy
5 July 2017
In this tutorial we will assess the adequacy of models that describe branching process of phylogenetic trees. We will focus on the Exponential-growth, and Constant-size coalescent models. To do this, we will use the Tree Model Adequacy (TMA) package in BEAST2. To install please follow the instructions here. We will also use FigTree, Tracer, and R to visualise the results.
The TMA package is used to assess the adequacy of tree branching models. Model adequacy differs from model selection methods in that the goal is not to select the best model from a set of competing models, but to determine whether the data at hand could have been generated by a given model. In this respect, the model is treated as a hypothesis that can be rejected on some statistical grounds. In the context of branching models, we can use model adequacy to determine whether the tree shape that we observe in an empirical tree could have been generated by a constant Birth-Death process. This is important to interpret the inferences from the model and to obtain better insight about the biological process that generated the data. For the purpose of tree model adequacy, we will treat the phylogenetic tree with branch lengths in units of time (i.e. a chronogram) as the "empirical data".
Bayesian model adequacy consists of analysing the data under the model in question to estimate the posterior. The posterior is then used to simulate data sets, which represent hypothetical future observations under the model, and which are known as posterior predictive simulations. The model is considered adequate if the posterior predictive simulations are similar to the empirical data according to some test statistic. The choice of test statistic can be driven by some expectation about the model. For example, the exponential-growth coalescent model typically generates trees with long external branches (for population sizes that are increasing exponentially), such that the ratio of external to internal branch lengths is a useful test statistic. Similarly, the Birth-Death model with sampling explicitly models the sampling process through time, so we could also use the sampling times (i.e. ages of the tips) as a test statistic. The values of the test statistic from the posterior predictive simulations represent the expectation of values under the model, which is known as the posterior predictive distribution. The posterior predictive model adequacy framework is outlined in Fig. 1.
Fig. 1 Posterior predictive model adequacy using the TMA package. The empirical tree is analysed in BEAST using a given model, M, to estimate the posterior distribution of the model's parameters (denoted with the greek letter eta). The posterior distribution of the parameters given the model and empirical tree are used to simulate trees, known as posterior predictive simulations, using MASTER (Vaughan and Drummond 2013). TreeStat (part of the BEAST software) is then used to estimate a summary statistic, Ta, for the empirical tree and the posterior predictive simulations. The pB value corresponds to the number of times that Ta for the empirical data is larger than that for the posterior predictive simulations (this is similar to the frequentist p-value, but not quite thesame). The orange arrow represents the initial analysis to approximate the posterior distribution, also shown in orange. The green arrow corresponds to the simulation step using MASTER. The final step, which consists in calculating test statistics using TreeStat, is shown with the blue arrow for the posterior predictive simulations, and with the red arrow for the empirical data.
We will use a tree that was estimated for Ebola samples collected in West Africa at the early stages of the 2013-2015 outbreak (Gire et al. 2014; Stadler et al. 2014). Please download the tree here (right click and Save File).
Open the tree in FigTree by double-clicking on it. Otherwise, open FigTree and go to File, Open, and find the tree. The display should look like Fig 2.
Fig. 2 Ebola tree for West African samples collected in 2014. The branch lengths and scale bar are in units of years.
We will also need some MASTER templates, which are available in the TMA directory. To access them easily, set the working directory in BEAUTI by selecting File, Set working dir, and clicking on TMA.
- We will use BEAUTI to set up the tree model adequacy analysis. Open BEAUTI, and click File, Template and select TreeModelAdequacy (see Fig. 3). If you do not see this menu, please check that the TMA package and its dependencies are correctly installed.
Fig. 3 Selecting the TreeModelAdequacy template in BEAUTI.
- After the TreeModelAdequacy template is loaded, you should see three tabs at the top of the BEAUTI window; Partitions, Priors, and MCMC. Click on File and Import Fixed Tree. Find the Ebola virus tree downloaded above. and click Open. You should see that the tree is loaded as shown in Fig. 4.
Fig. 4 Ebola virus tree loaded in BEAUTI.
- Now we need to set up the model to assess. We will set up the constant size coalescent first. To do this, click on the Priors tab and ensure that the Coalescent Constant Population is chosen. This model assumes that the virus population has been constant over time, and its only parameter is the effective population size (referred to as popSize in BEAST). In the context of an infectious disease, the effective population size is the number of infected hosts (Φ) divided by the coalescent rate (λ) times 2:
- popSize = Φ / 2 * λ
The default prior and initial value are fine for this analysis.
- Select the MCMC tab. You should see the display on Fig. 5.
Fig. 5 MCMC display in BEAUTI.
- There are several options that we need to set to conduct the analysis:
- Nr Of Trees This is the number of posterior predictive simulations, which are trees simulated under the constant size coalescent. The effective population size parameter used to generate the trees will be sampled from the posterior PopSize parameter from the Ebola virus tree (i.e. the empirical tree). Set this option to 100.
- Rootdir The directory where we will store the posterior predictive simulations and BEAST xml scripts used to generate them. Type tmp_CC here to make a file in the current directory to save the output from the constant size coalescent.
- Value, Hosts, and Do Not Run are options to obtain the posterior predictive simulations independently from the MCMC of the empirical data. Ignore them for now, and in a later tutorial we will demonstrate how to set these up.
- Burn in Percentage The posterior predictive simulations will be obtained taking 100 random samples from the stationary distribution of the MCMC, such that we need to discard the burn-in. For this tutorial, we will discard 10% of the chain. Set this value to 10.
- Master TMA requires an xml template that it will use to generate the posterior predictive simulations. The template should match the model set up in the Priors tab, which is the constant size coalescent in this example. Click on browse, select the folder called master and select a file called CC_hetero_sim.xml.
- Edit statistics In this option we can select from a large range of tree statistics. Select a test statistic and use the green arrow to drag it from the left to the right column. Test statistics in the right column will be computed in the analysis. For this example, select Tree Height, Time of maximum lineages, Slope ratio for LTT plot, Colless tree-imbalance, and External/Internal ratio (Fig. 7). Click OK. These test statistics measure the following aspects of phylogenetic trees:
>- Tree Height - The distance from the root of the tree to the youngest tip.
>- Time of maximum lineages - The time when the maximum number of lineages is reached. This is also the time of the peak of a lineages through time plot.
>- Slope ratio for LTT plot - To compute this statistic we consider a lineages through time plot. We then draw two lines, from the origin of the tree to the maximum number of lineages and from the maximum number of lineages to the end of the tree. This test statistic is the ratio of the slopes of these two lines.
>- Colless tree-imbalance - This is a measure of phylogenetic tree imbalance. For example, comb-like trees are extremely imbalanced.
>- External/Internal ratio - This statistic is the sum of external branch lengths divided by the sum of internal branch lengths.
Fig. 6 Selecting tree statistics in BEAUTI.
- Out This is the name of the output file where we will store the posterior predictive distributions of the tests statistics defined above. Click browse and select a location (such as your Desktop), and call the file out_cc.log.
- Chain Length This is the length of the MCMC chain. For this analysis 10,000,000 steps is sufficient.
- Leave the remaining default settings and go to File, Save As, and save it in a directory that is easy to find. Call the file ebola_CC.xml.
- We are now ready to run BEAST. Open BEAST by double clicking the icon. You should see a window like that in Fig. 7. Click Choose File... and select the ebola_CC.xml file. The rest of the options are fine for now. Click Run.
Fig. 7
- BEAST will now start running. It will conduct the following steps:
- Estimate the effective population size (PopSize) under the constant size coalescent. This MCMC will run for 10,000,000 steps.
- Sample 100 random values from the chain, after discarding 10% of burn-in.
- Use the samples from the previous steps to generate 100 trees, stored in tmp_CC. These are the posterior predictive simulations. To generate them, BEAST will read the MASTER template defined in step 5.1, above.
- For each posterior predictive simulation, calculate Tree Height, Time of maximum lineages, Slope ratio for LTT plot, Colless tree-imbalance, and External/Internal ratio. These test statistics are also calculated for the empirical tree. They will be stored in out_cc.log, where the first row contains the tests statistics for the empirical tree, and the rest are for the posterior predictive simulations.
- After BEAST has run, a table will appear at the end of the run. It contains the pB values for the test statistics, their quantiles and values computed for the empirical data. The pB value might vary slightly because we have only obtained 100 posterior predictive simulations, but they will give us a reasonable overview of the performance of this model (in practice we might want to obtain some 1000 posterior predictive simulations). In my analysis, the pB values are; 0.06 for Tree Height, 0.76 for Time of maximum lineages, 0.96 for Slope ratio for LTT plot, 0.95 for Colless tree imbalance and 0.99 for External/Internal ratio. The test statistics for the empirical data do not depend on the posterior predictive distributions. They are shown under the observed column, and they are; 0.1068 for Tree Height, 0.008537 for Time of maximum lineages, 0.0867 for Slope ratio for LTT plot, 0.2354 for Colless tree imbalance and 1.1937 for External/Internal ratio.
If the model is adequate, we expect that the test statistic for the empirical data should fall somewhere in the centre of the posterior predictive distribution for a given test statistic (pB value of around 0.5). We could also use a frequentist approach to determine that P should range between 0.05 and 0.95 for an adequate model according to a given test statistic. In this case, the ratio of external to internal branch lengths, the slope of the LTT plot and the Colless tree imbalance from the empirical tree are implausible under the constant size coalescent.
The posterior predictive distributions distributions can also be visualised in R or Excel. We will do this at the end of the next section.
Some aspects of the Ebola virus tree are implausible under the constant size coalescent. This is unsurprising because pathogen populations should increase over an infectious outbreak, such that a constant size model is inappropriate. In contrast, the exponential growth coalescent assumes that the pathogen population is increasing exponentially. This model has two parameters, the effective population size at present and the growth rate, known as ePopSize and growthRate in BEAST. For infectious diseases, these parameters correspond to:
ePopSize = Φ / 2 * λ
growthRate = λ - δ
where the δ is the rate of becoming uninfectious, which occurs due to death or recovery, and λ and Φ are the same as for the constant size coalescent. This parameterisation if the exponential coalescent is useful in an epidemilogical context because one can estimate the basic reproductive number, R0, which is the mean number of secondary cases per infected individual assuming a fully susceptible population:
R0 = λ / δ
Note that the coalescent rate is analogous to the birth rate in Birth-Death models.
However, because ePopSize and growthRate are compound parameters, it is necessary to know the number of infected individuals at present to infer R0. We will assess the adequacy of this model in BEAST using the same test statistics from the previous section.
-
Follow steps 1. and 2. to load the TMA template and Ebola virus tree in BEAUTI.
-
Click on the Priors tab and select the Coalescent Exponential Population. The display in BEAUTI should look like Fig. 8.
Fig. 8 Display of coalescent exponential in BEAUTI.
- The Ebola virus tree has branch lengths in units of years. This means that the growth rate can be very large. Leave the default prior for the ePopSize parameter. For the growthRate the default prior is inappropriate. Set a Uniform distribution and an initial value of 100. To do this click on the dropdown menu highlighted in Fig. 9, and click the initial button. Set the Value to 100 (Fig. 10) and click OK.
Fig. 9 Setting the starting value for the growthRate.
- Click on the MCMC tab and follow the following settings:
- Nr Of Trees Set this to 100.
- Rootdir Type in tmp_CE.
- Leave Value and Hosts empty.
- Burn in Percentage Set this to 10.
- Master Clock on browse and find the master file called CE_hetero_sim.xml.
- Edit statistics As in step 5. above, select Tree Height, Time of maximum lineages, Slope ratio for LTT plot, Colless tree-imbalance, and External/Internal ratio (Fig. 7). Click OK.
- Chain Length Set this to 10,000,000.
- Leave everything else with the defaults. Except for tracelog. Click on the pencil next to this option and change the File Name to ebola_2014_ce.log and click OK.
Click on File, Save as and save the file as ebola_CE.xml.
-
Follow step 6. to run BEAST.
-
After the analysis has run, the pB values and test statistics will be displayed. I obtained the following values; 0.7 for Tree Height, 0.52 for Time of maximum lineages, 0.24 for Slope ratio for LTT plot, 0.93 for Colless tree imbalance and 0.88 for External/Internal ratio. The test statistics for the empirical data are the same as in the previous example because they do not depend on the model chosen or on the posterior predictive distribution. If we used the frequentist we would consider that the aspects of the Ebola virus tree that are measured by these test statistics are plausible under the exponential growth coalescent. This is expected because these ebola samples were collected at the very early stages of the 2014 - 2016 West African outbreak. In fact, all the samples were obtained in mid-2014 such that the virus population was probably still growing exponentially.
The main output from TMA is a log file with test statistics for the posterior predictive simulations and for the empirical data, which is used to compute the pB values at the end of the analysis. Although this is all the information we need, it is sometimes useful to visualise the posterior predictive distributions. For the analyses above, the log files have names out_cc.log and out_ce.log for the constant size coalescent and the exponential growth coalescent models, respectively.
Tracer is a program to visualise MCMC runs (see the start of the tutorial for the download link).
Open Tracer and drag and drop out_cc.log and out_ce.log to the left panel of the Tracer window. Double-click the Burn-in cell for each log file (top left of the window) and change the default value of 10 to 1. The reason we do this is that the first step in the log files from TMA is the test statistic value for the empirical data, which we want to skip because it is not part of the posterior predictive distribution. Select the out_cc.log file, the Tree Height trace, and the Marginal Prob. Distribution (top tabs). The output should look like that in Fig 10.1.
Fig. 10.1 Tracer with log files out_cc.log and out_ce.log loaded.
Select other test statistics in the Traces panel. Some look quite smooth, but others might have a few peaks, which indicates that we might want to obtain more posterior predictive samples. To do this, we can run the analyses again, but setting the option Nr Of Trees to a larger number. An other interesting aspect to inspect about the posterior predictive distributions is that they can differ between the models. To illustrate this, select the two log files in the left-hand panel, and select the External/Internal ratio test statistic. Then, select the options Legend: Top-right and Colour by: Trace File. Note that the exponential-growth coalescent (file out_ce.log) tends to result in longer external branches (relative to internal branches) than the constant-size coalescent (Fig 10.2).
Fig. 10.2 Tracer with log files out_cc.log and out_ce.log loaded, and the External/Internal ratio test statistic selected for both models. The density plot in blue corresponds to the exponential-growth coalescent, where as the constant-size coalescent is shown with the grey density.
Although the visualisation in Tracer is very useful, it does not currently display where the test statistics for the empirical data are located relative to the posterior predictive distribution. One solution to this is to use other software, such as R. We have provided an R script to make some more detailed plots, available here. Save it the same folder where out_cc.log and out_ce.log. In UNIX machines you can run this script by opening Terminal (in mac go to Applications, Utilities, Terminal). Type "cd" and drag and drop the folder where out_cc.log and out_ce.log are contained, and hit Enter. In my terminal it looks like this:
5310-b127601:tree_model_adequacy sduchene$ cd /Users/sduchene/Dropbox/projects_WORKING/phylodynamics/GIT_Project/tree_model_adequacy/
If you have R installed, type the following in Terminal:
Rscript plot_pps.R out_cc.log out_cc.pdf
This will produce a pdf file, called out_cc.pdf, which contains the results for the constant size coalescent. Before looking at it, use the same command to obtain the same plot for the exponential growth:
Rscript plot_pps.R out_ce.log out_ce.pdf
Open the pdf files. The blue histograms correspond to the posterior predictive distribution for the five test statistics calculated here (i.e. those calculated using the trees simulated under each of the models). The red lines correspond to the test statistic for the empirical tree. The pB values for each test statistic are also shown. The bottom right plot was obtained by combining all the test statistics into two components, similar to a principal component analysis (called singular value decomposition). The blue dots represent posterior predictive simulations, and the red dot is the empirical data (Fig. 11 and Fig. 12).
Fig. 11 Posterior predictive distributions for the four test statistics under the constant size coalescent (blue histograms) and values for empirical test statistics (red lines). The lower right panel shows singular value decomposition of the four test statistics, where blue points correspond to posterior predictive simulations and the red point is the Ebola virus tree.
Fig. 12 Posterior predictive distributions for the four test statistics under the exponential growth coalescent.
As we can see, the test statistics for the empirical data tend to fall closer to the centre of the posterior predictive distributions under the exponential growth coalescent than under the constant size coalescent. This is consistent with the trees generated under each model. Fig. 13 shows the empirical Ebola virus tree and one posterior predictive simulation for both models. These are not sufficient posterior predictive simulations to make a meaningful comparison, but there are several key differences that can be spotted by eye. First, the root height of the constant size coalescent is almost two fold larger than empirical tree and for the exponential growth coalescent. Second, the constant size coalescent tree has internal branches that are much longer than external branches, compared to the other two trees. These two features are captured by the Tree Height and External/Internal ratio test statistics.
Fig. 13 Trees generated under the constant size coalescent, the exponential growth coalescent, and the empirical Ebola virus tree. The numbers at the root of each tree correspond to the tree height.
The TMA package has a range of phylodynamic models available, including the Birth-Death with sampling through time, and the Birth-Death SIR. These models can be used similarly to those used in this tutorial. Other phylodynamic models in BEAST will also be implemented in TMA.
Gire, S. K., Goba, A., Andersen, K. G., Sealfon, R. S., Park, D. J., Kanneh, L., ... & Wohl, S. (2014). Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. science, 345(6202), 1369-1372.
Stadler, T., Kühnert, D., Rasmussen, D. A., & du Plessis, L. (2014). Insights into the early epidemic spread of Ebola in Sierra Leone provided by viral sequence data. PLoS currents, 6.
Vaughan, T. G. and Drummond, A. J., (2013) A Stochastic Simulator of Birth–Death Master Equations with Application to Phylodynamics, Mol. Biol. Evol. 30(6), 1480.