Skip to content

Substitution models

sabifo4 edited this page Nov 13, 2024 · 4 revisions

Substitution models

Nucleotide substitution models

For detailed descriptions of Markov models of nucleotide substitution, see Whelan et al. (2001), Felsenstein (2004). or Yang (2006: Chapter 1).

Models used in PAML include JC69 (Jukes and Cantor 1969), K80 (Kimura 1980), F81 (Felsenstein 1981), F84 (Felsenstein 2005, DNAML program since 1984, PHYLIP Version 2.6), HKY85 (Hasegawa et al. 1984; Hasegawa et al. 1985), T92 (Tamura 1992), TN93 (Tamura and Nei 1993), and REV, also known as the GTR model for general-time-reversible (Yang 1994c; Zharkikh 1994). The rate matrices are parameterised as follows:

JC69

$$\mathrm{Q=}\left(\begin{array}{cccc} . & 1 & 1 & 1\\ 1 & . & 1 & 1\\ 1 & 1 & . & 1\\ 1 & 1 & 1 & . \end{array}\right)$$

K80

$$\mathrm{Q=}\left(\begin{array}{cccc} . & \kappa & 1 & 1\\ \kappa & . & 1 & 1\\ 1 & 1 & . & \kappa\\ 1 & 1 & \kappa & . \end{array}\right)$$

F81

$$\mathrm{Q=}\left(\begin{array}{cccc} . & \pi_{C} & \pi_{A} & \pi_{G}\\ \pi_{T} & . & \pi_{A} & \pi_{G}\\ \pi_{T} & \pi_{C} & . & \pi_{G}\\ \pi_{T} & \pi_{C} & \pi_{A} & . \end{array}\right)$$

F84

$$\mathrm{Q=}\left(\begin{array}{cccc} . & (1+\kappa/\pi_{Y})/\pi_{C} & \pi_{A} & \pi_{G}\\ (1+\kappa/\pi_{Y})/\pi_{T} & . & \pi_{A} & \pi_{G}\\ \pi_{T} & \pi_{C} & . & (1+\kappa/\pi_{R})/\pi_{G}\\ \pi_{T} & \pi_{C} & (1+\kappa/\pi_{R})/\pi_{A} & . \end{array}\right)$$ $$\textrm{with }\pi_{Y}=\pi_{T}+\pi_{C}\textrm{and}\pi_{R}=\pi_{A}+\pi_{G}.$$

HKY85

$$\mathrm{Q=}\left(\begin{array}{cccc} . & \kappa\pi_{C} & \pi_{A} & \pi_{G}\\ \kappa\pi_{T} & . & \pi_{A} & \pi_{G}\\ \pi_{T} & \pi_{C} & . & \kappa\pi_{G}\\ \pi_{T} & \pi_{C} & \kappa\pi_{A} & . \end{array}\right)$$

T92

$$\mathrm{Q=}\left(\begin{array}{cccc} . & \kappa(1-\pi_{GC})/2 & \pi_{GC}/2 & \pi_{GC}/2\\ \kappa(1-\pi_{GC})/2 & . & \pi_{GC}/2 & \pi_{GC}/2\\ (1-\pi_{GC})/2 & (1-\pi_{GC})/2 & . & \kappa\pi_{GC}/2\\ (1-\pi_{GC})/2 & (1-\pi_{GC})/2 & \kappa\pi_{GC}/2 & . \end{array}\right)$$

TN93

$$\mathrm{Q=}\left(\begin{array}{cccc} . & \kappa_{1}\pi_{C} & \pi_{A} & \pi_{G}\\ \kappa_{1}\pi_{T} & . & \pi_{A} & \pi_{G}\\ \pi_{T} & \pi_{C} & . & \kappa_{2}\pi_{G}\\ \pi_{T} & \pi_{C} & \kappa_{2}\pi_{A} & . \end{array}\right)$$

REV (GTR)

$$\mathrm{Q=}\left(\begin{array}{cccc} . & a\pi_{C} & b\pi_{A} & c\pi_{G}\\ a\pi_{T} & . & d\pi_{A} & e\pi_{G}\\ b\pi_{T} & d\pi_{C} & . & \pi_{G}\\ c\pi_{T} & e\pi_{C} & \pi_{A} & . \end{array}\right)$$

UNREST

$$\mathrm{Q=}\left(\begin{array}{cccc} . & q_{TC} & q_{TA} & q_{TG}\\ q_{CT} & . & q_{CA} & q_{CG}\\ q_{AT} & q_{AC} & . & q_{AG}\\ q_{GT} & q_{GC} & q_{GA} & . \end{array}\right)=\left(\begin{array}{cccc} . & a & b & c\\ d & . & e & f\\ g & h & . & i\\ j & k & l & . \end{array}\right)$$

Codon substitution models

There is now a large collection of codon substitution models, please see Yang and Bielawski (2000), Yang (2002) and Yang (2006: Chapter 8) for detailed discussions. The basic model is a simplified version of the model of Goldman and Yang (1994) and specifies the substitution rate from codon $i$ to codon $j$ as

$$q_{ij}=\begin{cases} 0, & \textrm{if the two codons differ at more than one position,}\\ \pi_{j}, & \textrm{for synonymous transversion,}\\ \kappa\pi_{j}, & \textrm{for synonymous transition,}\\ \omega\pi_{j}, & \textrm{for nonsynonymous transversion,}\\ \omega\kappa\pi_{j}, & \textrm{for nonsynonymous transition;} \end{cases}$$

(Yang et al. 1998). The equilibrium frequency of codon $j$ ($\pi_{j}$) can be considered a free parameter, but can also be calculated from the nucleotide frequencies at the three codon positions (control variable CodonFreq). Under this model, the relationship holds that $\omega=d_{N}/d_{S}$, the ratio of nonsynonymous/synonymous substitution rates. This basic model is fitted by specifying model = 0 and NSsites = 0, in the control file to run CODEML (e.g., codeml.ctl).

The $\omega$ ratio is a measure of natural selection acting on the protein. Simplistically, values of $\omega < 1, = 1, \textrm{and} > 1$ means negative purifying selection, neutral evolution, and positive selection, respectively. However, the ratio averaged over all sites and all lineages is almost never $> 1$, since positive selection is unlikely to affect all sites over prolonged time. Thus, interest has been focused on detecting positive selection that affects only some lineages or some sites.

Branch models

The branch models allow the $\omega$ ratio to vary among branches in the phylogeny and are useful for detecting positive selection acting on particular lineages (Yang 1998;Yang and Nielsen 1998). They are specified using option model. Specifically, when setting model = 1 the so-called free-ratios model is enabled, which assumes an independent $\omega$ ratio for each branch. This model is very parameter-rich and its use is discouraged. On the other hand, model = 2 allows you to have several $\omega$ ratios. Note that you have to specify how many ratios and which branches should have which rates in the tree file by using branch labels. To better understand the notation you need to use for this purpose, please see “Branch or node labels” under section “Tree file format” when you navigate through the "Data formatting" markdown document. The lysozyme example data files are included in the examples/lysozyme/ folder; check the README.txt file.

Site models

The site models allow the $\omega$ ratio to vary among sites (among codons or amino acids in the protein) (Nielsen and Yang 1998; Yang et al. 2000b). A number of such models are implemented in codeml using the variable Nssites together with option model = 0. You can run several Nssites models in one go (i.e., batch run), by specifying several values for NSsites. For example, NSsites = 0 1 2 7 8 will fit 5 site models to the same data in one go.

Site models have been used in real data analyses and evaluated in computer simulation studies (Anisimova et al. 2001, 2002; Anisimova et al. 2003; Wong et al. 2004). Two pairs of models appear to be particularly useful, forming two likelihood ratio tests of positive selection. The first one compares M1a (NearlyNeutral) and M2a (PositiveSelection), while the second one compares M7 (beta) and M8 (beta&$\omega$).

Important

Please note that models M1a (NearlyNeutral) and M2a (PositiveSelection) are slight modifications of models M1 (neutral) and M2 (selection) in Nielsen and Yang (1998). The old models M1 and M2 fix $\omega_{0}=1$ and $\omega_{1}=1$, and are unrealistic as they do not account for sites with $0<\omega < 1$. In the new models M1a and M2a, described in Wong et al. (2004) and Yang et al. (2005) and implemented in PAML since v3.14 (i.e., since 2004), $0< \omega_{0} < 1$ is estimated from the data, while $\omega_{1}=1$ is fixed.

The naïve empirical Bayes (NEB) (Nielsen and Yang 1998; Yang et al. 2000b) and the Bayes empirical Bayes (BEB) (Yang et al. 2005, implemented since v3.14) are available for calculating the posterior probabilities for site classes, and can be used to identify sites under positive selection if the likelihood ratio test is significant. NEB uses the MLEs of parameters (such as the proportions and $\omega$ ratios) but do not account for their sampling errors, while BEB deals with the sampling errors by applying a Bayesian prior. In both tests, comparing M2a against M1a or M8 against M7, $df = 2$ may be used (i.e., the degree of freedom, $df$, in the chi-square test is the difference in the number of free parameters between the two models being compared). The M1a-M2a comparison appears to be more stringent than the M7-M8 comparison. BEB is implemented under models M2a and M8 only.

Important

Please note that the NEB results you shall see in the output files generated by the CODEML program are to be disregarded; you should only consider the BEB results (see Yang et al. 2005 and Álvarez-Carretero et al. 2023).


The BEB output has the following format:

Prob(w>1) mean w
135 K 0.983* 4.615 +- 1.329

Interpretation: 4.615 is the approximate mean of the posterior distribution for $\omega$, and 1.329 is the square root of the variance in the posterior distribution for $\omega$. The program prints out an * if the posterior probability is >95%, and ** if the probability is > 99%.


A third test compares the null hypothesis M8a (beta&$\omega_{s}$ =1) and M8 (Swanson et al. 2003; Wong et al. 2004). M8a is specified by setting the following options in the control file: NSsites = 8, fix_omega = 1, omega = 1. The null distribution is the 50:50 mixture of point mass 0 and $\chi_{1}^{2}$ (Self and Liang 1987). The critical values are 2.71 at 5% and 5.41 at 1% (as opposed to 3.84 for 5% and 6.63 for 1% for $\chi_{1}^{2}$). Note that the $p$ value for a 50:50 mixture of $\chi_{j}^{2}$ and $\chi_{k}^{2}$ is just the average of the two $p$ values from the two distributions. Un the case of M8a-M8 comparison, you get the $p$ value from $\chi_{1}^{2}$ and then half it to get the $p$ value for the mixture distribution. You can also use $\chi_{1}^{2}$ (Wong et al. 2004).

We suggest that the M0-M3 comparison should be used as a test of variable $\omega$ among sites rather than a test of positive selection. However, the model of a single $\omega$ for all sites is probably wrong in every functional protein, so there is little point of testing.

For more details on the site models, please see the table below:

Table 2. Parameters in the site models.

Model NSsites #p Parameters Note
M0 (one ratio) 0 1 $\omega$ (Goldman and Yang 1994; Yang and Nielsen 1998)
M1a (neutral) 1 2 $p_{0}$ $(p_{1}= 1 – p_{0})$, $\omega_{0}< 1$, $\omega_{1}=1$ (Nielsen and Yang 1998; Yang et al. 2005)
M2a (selection) 2 4 $p_{0}$, $p_{1}$ $(p_{2} = 1 – p_{0} – p_{1})$, $\omega_{0}< 1$, $w_{1}=1$, $\omega_{2} > 1$ (Nielsen and Yang 1998; Yang et al. 2005)
M2a_rel 22 4 $p_{0}$, $p_{1}$ $(p_{2} = 1 – p_{0} – p_{1})$, $\omega_{0}< 1$, $w_{1}=1$, $\omega_{2} > 0$ $\omega_{2}>0$ for use as null for testing the clade model (Weadick and Chang 2012)
M3 (discrete) 3 5 $p_{0}$, $p_{1}$ $(p_{2} = 1 – p_{0} – p_{1})$ $\omega_0$, $\omega{1}$, $\omega_{2}$ (Yang et al. 2000b)
M7 (beta) 7 2 $p$, $q$ (Yang et al. 2000b)
M8 (beta&$\omega$) 8 4 $p_{0}$ $(p_{1}= 1 – p_{0})$, $p$, $q$, $\omega_{s} > 1$ (Yang et al. 2000b)

NOTE⎯ #p is the number of free parameters in the $\omega$ distribution. Parameters in parentheses are not free and should not be counted. For example, in M1a, $p_{1}$ is not a free parameter as $p_{1} = 1 – p_{0}$. In both likelihood ratio tests comparing M1a against M2a and M7 against M8, $df = 2$. The site models are specified via option NSsites in the control file that executes CODEML.

Suzuki and Gojobori’s (1999) method for detecting sites under positive selection.

In the terminology used here, the SG99 method tests whether the $\omega$ ratio for each site is $>1$ or $<1$. This approach uses parsimony to reconstruct ancestral sequences and then, for each site, counts the numbers of synonymous and nonsynonymous differences and the numbers of synonymous and nonsynonymous sits. Afterwards, the method tests whether the $\omega$ ratio at the site is significantly different from 1. Errors in the ancestral sequence reconstruction are ignored. Suzuki has a program called AdaptSite that implements the test. This is an intuitive test, and is known to lack power.

In CODEML, a test of this kind is implemented as a by-product of ancestral sequence reconstruction in both CODEML and BASEML, PAML programs that use ML to reconstruct ancestral sequences. If you want to enable that feature, please use option RateAncestor = 1. The choice of BASEML versus CODEML and also the choice of substitution model for each program affects ancestral sequence reconstruction only. The later steps are the same, and follow Suzuki & Gojobori (1999). For CODEML, you can use M0 (i.e., specifying options NSsites = 0 and model = 0 in the control file). If you want, you can try some other models, such as NSsites = 2 or 8. The models for ancestral reconstruction typically make little difference. For BASEML, you should have GC on the first line of the sequence data file to indicate that the sequences are protein coding (i.e., please check section "Option G in the "Data formatting" markdown file to familiarise with the notation). Use option icode (set to 0 for the universal code or 1 for vertebrate mitochondrial code) in the control file to specify the genetic code, as in CODEML. The following “multiple-gene” model is close to M0: you need to specify model = 4 and Mgene = 4 in the control file (see Yang 1996b).

By specifying NSsites = 22, the site model M2a_rel of Weadick & Chang (2012). This is the same model as M2a except for the fact that $\omega_{2} > 0$ (as mentioned above, model M2a is enabled by setting NSsites = 2 and has $\omega_{2} > 1$). Note that M2a_rel is the null model for the likelihood ratio test based on clade model C (Weadick and Chang 2012).

Branch-site models

The branch-site models allow $\omega$ to vary both among sites in the protein and across branches on the tree and aim to detect positive selection affecting a few sites along particular lineages (called foreground branches). Initially, Yang and Nielsen (2002) implemented the branch-site model A (i.e., enabled by specifying in the control file options model = 2 and NSsites = 2) and the branch-site model B (i.e., specifying in the control file options model = 2 and NSsites = 3). The tests did not work well in simulations (Zhang 2004), so a change was introduced to model A (see Table 3 below, also Yang et al. 2005; Zhang et al. 2005), with two tests constructed. Test 2, also known the branch-site test of positive selection, is the recommended test. Specifically, tis test compares the modified model A with the corresponding null model with $\omega_{2}=1$ fixed (i.e., you can enable the null model by specifying in the control file options fix_omega = 1 and omega = 1). The null distribution is the 50:50 mixture of point mass 0 and $\chi_{1}^{2}$, with critical values 2.71 at 5% and 5.41 at 1%. To calculate the $p$ value based on this mixture distribution, you calculate the $p$ value using $\chi_{1}^{2}$, and then divide it by 2. Suppose your $2\Delta\ell=2.71$, you will get 0.10 from $\chi_{1}^{2}$, the the $p$ value for the mixture is 0.10/2 = 5%. We recommend that you use $\chi_{1}^{2}$ (with critical values 3.84 and 5.99) instead of the mixture to guide against violations of model assumptions. Similarly, both the NEB and BEB methods for calculating posterior probabilities for site classes are implemented for the modified branch-site model A (not for model B).

Important

You should use the branch-site model A in combination with the BEB procedure and ignore the NEB output.

Table 3. Parameters in branch-site model A ($np = 4$).

Site class Proportion Background Foreground
0 $p_{0}$ $0 < \omega_{0} < 1$ $0 < \omega_{0} < 1$
1 $p_{1}$ $\omega_{1} = 1$ $\omega_{1} = 1$
2a $(1 – p_{0} – p_{1}) p_{0}/(p_{0} + p_{1})$ $0 < \omega_{0} < 1$ $\omega_{2}\geq1$
2b $(1 – p_{0} – p_{1}) p_{1}/(p_{0} + p_{1})$ $\omega_{1} = 1$ $\omega_{2}\geq1$

NOTE⎯ Branch-site model A is specified using options model = 2 and NSsites = 2 in the control file. This is the alternative model in the branch-site test of positive selection. The null model fixes $\omega_{2} = 1$. The likelihood ratio test has df = 1 (see text).

Clade models

Clade model C is specified by setting in the control file options model = 3 and Nssites = 2, while clade model D is specified by setting options model = 3 and NSsites = 3 (Bielawski and Yang 2004; see also Forsberg and Christiansen 2003). In both models, the number of site classes (option ncatG in the control file) is fixed at 3 (i.e., ncatG = 3).

Clade model C went through a modification, in a similar way to branch-site model A. The new model C replaces $\omega_{0} = 0$ by $0 < \omega{0} < 1$ and has 5 parameters in the $\omega$ distribution (if there are two clades or branch types): $p_{0}$, $p_{1}$, $\omega_{0}$, $\omega_{2}$, and $\omega_{3}$. The new model C can be compared with the new M1a (NearlyNeutral), which has 2 parameters, with $df=3$. A more powerful test is to use the site model M2a_rel of Weadick & Chang (2012) as the null model, with $df = 1$.

Table 4. Parameters in clade models C or D, with more than two clades (branch types)

Site class Proportion Clade 1 Clade 2 Clade 3 Clade 4
0 $p_{0}$ $\omega_{0}$ $\omega_{0}$ $\omega_{0}$ $\omega_{0}$
1 $p_{1}$ $\omega_{1}$ $\omega_{1}$ $\omega_{1}$ $\omega_{1}$
2 $p_{2} = 1 – p_{0} – p_{1}$ $\omega_{2}$ $\omega_{3}$ $\omega_{4}$ $\omega_{5}$

In clade model C, $\omega_{1} = 1$ is fixed, while in clade model D, $\omega_{1}$ is estimated as a free parameter. In both models, $\omega_{0}$ is estimated under the constraint $0 < \omega_{0} < 1$. The BEB procedure is implemented for clade models C and D. As aforementioned, please ignore the results output by the NEB, only those output by the BEB are to be used. An extension has been made to clade models C and D to allow for more than two clades or branch types. The branch types are specified using labels in the tree file. For instance, if you have four branch types, you would use labels #0, #1, #2, #3 to identify each branch type.

When there are more than two clades, you can see that site class 2 (Table 4) has $\omega_{2}$, $\omega_{3}$, $\omega_{4}$, and $\omega_{5}$ as independent parameters, optimised in the range $(0, \infty)$. The BEB calculation under clade model C is very expensive (with each additional branch type increasing the computation by 10 folds), so that the model should be used with just a few branch types. The maximum number of clades is set by the variable NBTYPE near the beginning of the source file codeml.c. Note that the BEB calculation may be too expensive if you have more than 5 clades. As of February 2024, this variable is set to 8 so, if you want to reduce/increase this number, you can edit the source file codeml.c and recompile again.

Mutation-selection model

The mutation-selection models of Yang and Nielsen (2008) are implemented using the following control variables

CodonFreq = 7    * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
                 * 4:F1x4MG, 5:F3x4MG, 6:FMutSel0, 7:FMutSel
estFreq = 0      * 0: use observed freqs; 1: estimate freqs by ML

where CodonFreq = 6 specifies the "FMutSel0" model and CodonFreq = 7 specifies the "FmutSel" model (Yang and Nielsen 2008). The "FmutSel" model assigns a fitness to every codon with 60 (= 61 – 1) codon fitness parameters for the universal genetic code. The "FmutSel0" model is a special case of "FmutSel" and assigns the same fitness value for all synonymous codons, so that only 19 (= 20 – 1) amino acid fitness parameters are used. This model assumes that the amino acid frequencies are determined by the functional requirements of the protein, and given the amino acid frequencies, the relative frequencies of synonymous codons are determined solely by the mutational-bias parameters.

If estFreq = 1 is specified in the control file, then the frequency/fitness parameters are estimated by ML from the data, while if estFreq = 0 is specified, they are calculated using the observed frequencies in the data. See Yang and Nielsen (2008) for details of the model. You can also take a look at the README.txt file in the examples/mtCDNAape/ folder to see how to duplicate the results reported in Table 1 in Yang and Nielsen (2008).

Amino acid substitution models

I made a distinction between empirical and mechanistic models of amino acid substitution ((Yang et al. 1998; 2006: Chapter 2). Empirical models implemented in codeml include Dayhoff (Dayhoff et al. 1978), JTT (Jones et al. 1992), WAG (Whelan and Goldman 2001), mtMam (Yang et al. 1998), mtREV (Adachi and Hasegawa 1996b), etc. These are estimates of substitution rate parameters under the general time reversible model from real datasets.

Mechanistic models are formulated by considering the mutational distance between the amino acids as determined by the locations of their encoding codons in the genetic code, and the filtering of mutations by natural selection operating on the protein level (Yang et al. 1998). The program aaml (i.e., which is now launched with CODEML when selecting seqtype = 2) implements a few such models, specified by the variable aaDist.

When specifying options seqtype = 2 and model = 6 (FromCodon1), the mechanistic amino acid substitution model of Yang et al. (1998; Table 3) is enabled, which uses the Markov chain for codons and aggregates the synonymous codons into one state (the coded amino acid) to construct a model of amino acid substitution. This is an approximate formulation since the process after state aggregation is not Markovian. The model involves the parameter $\kappa$, but $\omega$ is not estimable. Branch lengths are in the expected number of amino acid substitutions per amino acid.

When specifying options seqtype = 2 and model = 5 (FromCodon0), another codon-based amino acid substitution model is enabled, which treats amino acids as ambiguities codons. This is an exact formulation. This model involves both parameters $\kappa$ and $\omega$, but because $\omega$ is weakly identifiable, it cannot be reliably estimated. As a result, branch lengths are not reliably estimated either. Branch lengths are in the expected number of nucleotide substitutions per codon.