-
Notifications
You must be signed in to change notification settings - Fork 23
Substitution models
For detailed descriptions of Markov models of nucleotide substitution, see Whelan et al. (2001), Felsenstein (2004). or Yang (2006: Chapter 1).
Models used in PAML
include JC69 (Jukes and Cantor 1969), K80 (Kimura 1980), F81 (Felsenstein 1981), F84 (Felsenstein 2005, DNAML program since 1984, PHYLIP Version 2.6), HKY85 (Hasegawa et al. 1984; Hasegawa et al. 1985), T92 (Tamura 1992), TN93 (Tamura and Nei 1993), and REV, also known as the GTR model for general-time-reversible (Yang 1994c; Zharkikh 1994). The rate matrices are parameterised as follows:
JC69
K80
F81
F84
HKY85
T92
TN93
REV (GTR)
UNREST
There is now a large collection of codon substitution models, please see Yang and Bielawski (2000), Yang (2002) and Yang (2006: Chapter 8) for detailed discussions. The basic model is a simplified version of the model of Goldman and Yang (1994) and specifies the substitution rate from codon
(Yang et al. 1998). The equilibrium frequency of codon CodonFreq
). Under this model, the relationship holds that model = 0
and NSsites = 0
, in the control file to run CODEML
(e.g., codeml.ctl
).
The
The branch models allow the model
. Specifically, when setting model = 1
the so-called free-ratios model is enabled, which assumes an independent model = 2
allows you to have several examples/lysozyme/
folder; check the README.txt
file.
The site models allow the codeml
using the variable Nssites
together with option model = 0
. You can run several Nssites
models in one go (i.e., batch run), by specifying several values for NSsites
. For example, NSsites = 0 1 2 7 8
will fit 5 site models to the same data in one go.
Site models have been used in real data analyses and evaluated in computer simulation studies (Anisimova et al. 2001, 2002; Anisimova et al. 2003; Wong et al. 2004). Two pairs of models appear to be particularly useful, forming two likelihood ratio tests of positive selection. The first one compares M1a
(NearlyNeutral) and M2a
(PositiveSelection), while the
second one compares M7
(beta) and M8
(beta&$\omega$).
Important
Please note that models M1a
(NearlyNeutral) and M2a
(PositiveSelection) are slight modifications of models M1
(neutral) and M2
(selection) in Nielsen and Yang (1998). The old models M1
and M2
fix M1a
and M2a
, described in Wong et al. (2004) and Yang et al. (2005) and implemented in PAML
since v3.14 (i.e., since 2004),
The naïve empirical Bayes (NEB) (Nielsen and Yang 1998; Yang et al. 2000b) and the Bayes empirical Bayes (BEB) (Yang et al. 2005, implemented since v3.14) are available for calculating the posterior probabilities for site classes, and can be used to identify sites under positive selection if the likelihood ratio test is significant. NEB uses the MLEs of parameters (such as the proportions and M2a
against M1a
or M8
against M7
, M1a
-M2a
comparison appears to be more stringent than the M7
-M8
comparison. BEB is implemented under models M2a
and M8
only.
Important
Please note that the NEB results you shall see in the output files generated by the CODEML
program are to be disregarded; you should only consider the BEB results (see Yang et al. 2005 and Álvarez-Carretero et al. 2023).
The BEB output has the following format:
Prob(w>1) mean w
135 K 0.983* 4.615 +- 1.329
Interpretation: 4.615 is the approximate mean of the posterior distribution for
$\omega$ , and 1.329 is the square root of the variance in the posterior distribution for$\omega$ . The program prints out an*
if the posterior probability is >95%, and**
if the probability is > 99%.
A third test compares the null hypothesis M8a
(beta&$\omega_{s}$ =1) and M8
(Swanson et al. 2003; Wong et al. 2004). M8a
is specified by setting the following options in the control file: NSsites = 8
, fix_omega = 1
, omega = 1
. The null distribution is the
50:50 mixture of point mass 0 and M8a
-M8
comparison, you get the
We suggest that the M0
-M3
comparison should be used as a test of variable
For more details on the site models, please see the table below:
Table 2. Parameters in the site models.
Model | NSsites |
#p | Parameters | Note |
---|---|---|---|---|
M0 (one ratio) |
0 | 1 | (Goldman and Yang 1994; Yang and Nielsen 1998) | |
M1a (neutral) |
1 | 2 |
|
(Nielsen and Yang 1998; Yang et al. 2005) |
M2a (selection) |
2 | 4 |
|
(Nielsen and Yang 1998; Yang et al. 2005) |
M2a_rel |
22 | 4 |
|
|
M3 (discrete) |
3 | 5 |
|
(Yang et al. 2000b) |
M7 (beta) |
7 | 2 |
|
(Yang et al. 2000b) |
M8 (beta&$\omega$) |
8 | 4 |
|
(Yang et al. 2000b) |
NOTE⎯ #p is the number of free parameters in the
$\omega$ distribution. Parameters in parentheses are not free and should not be counted. For example, inM1a
,$p_{1}$ is not a free parameter as$p_{1} = 1 – p_{0}$ . In both likelihood ratio tests comparingM1a
againstM2a
andM7
againstM8
,$df = 2$ . The site models are specified via optionNSsites
in the control file that executesCODEML
.
In the terminology used here, the SG99
method tests whether the AdaptSite
that implements the test. This is an intuitive test, and is known to lack power.
In CODEML
, a test of this kind is implemented as a by-product of ancestral sequence reconstruction in both CODEML
and BASEML
, PAML
programs that use ML to reconstruct ancestral sequences. If you want to enable that feature, please use option RateAncestor = 1
. The choice of BASEML
versus CODEML
and also the choice of substitution model for each program affects ancestral sequence reconstruction only. The later steps are the same, and follow Suzuki & Gojobori (1999). For CODEML
, you can use M0
(i.e., specifying options NSsites = 0
and model = 0
in the control file). If you want, you can try some other models, such as NSsites = 2
or 8
. The models for ancestral reconstruction typically make little difference. For BASEML
, you should have GC
on the first line of the sequence data file to indicate that the sequences are protein coding (i.e., please check section "Option G in the "Data formatting" markdown file to familiarise with the notation). Use option icode
(set to 0
for the universal code or 1
for vertebrate mitochondrial code) in the control file to specify the genetic code, as in CODEML
. The following “multiple-gene” model is close to M0
: you need to specify model = 4
and Mgene = 4
in the control file (see Yang 1996b).
By specifying NSsites = 22
, the site model M2a_rel
of Weadick & Chang (2012). This is the same model as M2a
except for the fact that M2a
is enabled by setting NSsites = 2
and has M2a_rel
is the null model for the likelihood ratio test based on clade model C (Weadick and Chang 2012).
The branch-site models allow model = 2
and NSsites = 2
) and the branch-site model B (i.e., specifying in the control file options model = 2
and NSsites = 3
). The tests did not work well in simulations (Zhang 2004), so a change was introduced to model A (see Table 3 below, also Yang et al. 2005; Zhang et al. 2005), with two tests constructed. Test 2, also known the branch-site test of positive selection, is the recommended test. Specifically, tis test compares the modified model A with the corresponding null model with fix_omega = 1
and omega = 1
). The null distribution is the 50:50 mixture of point mass 0 and
Important
You should use the branch-site model A in combination with the BEB procedure and ignore the NEB output.
Table 3. Parameters in branch-site model A (
$np = 4$ ).
Site class | Proportion | Background | Foreground |
---|---|---|---|
0 | |||
1 | |||
2a | |||
2b |
NOTE⎯ Branch-site model A is specified using options
model = 2
andNSsites = 2
in the control file. This is the alternative model in the branch-site test of positive selection. The null model fixes$\omega_{2} = 1$ . The likelihood ratio test has df = 1 (see text).
Clade model C is specified by setting in the control file options model = 3
and Nssites = 2
, while clade model D is specified by setting options model = 3
and NSsites = 3
(Bielawski and Yang 2004; see also Forsberg and Christiansen 2003). In both models, the number of site classes (option ncatG
in the control file) is fixed at 3
(i.e., ncatG = 3
).
Clade model C went through a modification, in a similar way to branch-site model A. The new model C replaces M1a
(NearlyNeutral), which has 2 parameters, with M2a_rel
of Weadick & Chang (2012) as the null model, with
Table 4. Parameters in clade models C or D, with more than two clades (branch types)
Site class | Proportion | Clade 1 | Clade 2 | Clade 3 | Clade 4 |
---|---|---|---|---|---|
0 | |||||
1 | |||||
2 |
In clade model C, #0
, #1
, #2
, #3
to identify each branch type.
When there are more than two clades, you can see that site class 2 (Table 4) has NBTYPE
near the beginning of the source file codeml.c
. Note that the BEB calculation may be too expensive if you have more than 5 clades. As of February 2024, this variable is set to 8 so, if you want to reduce/increase this number, you can edit the source file codeml.c
and recompile again.
The mutation-selection models of Yang and Nielsen (2008) are implemented using the following control variables
CodonFreq = 7 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
* 4:F1x4MG, 5:F3x4MG, 6:FMutSel0, 7:FMutSel
estFreq = 0 * 0: use observed freqs; 1: estimate freqs by ML
where CodonFreq = 6
specifies the "FMutSel0" model and CodonFreq = 7
specifies the "FmutSel" model (Yang and Nielsen 2008). The "FmutSel" model assigns a fitness to every codon with 60 (= 61 – 1) codon fitness parameters for the universal genetic code. The "FmutSel0" model is a special case of "FmutSel" and assigns the same fitness value for all synonymous codons, so that only 19 (= 20 – 1) amino acid fitness parameters are used. This model assumes that the amino acid frequencies are determined by the functional requirements of the protein, and given the amino acid frequencies, the relative frequencies of synonymous codons are determined solely by the mutational-bias parameters.
If estFreq = 1
is specified in the control file, then the frequency/fitness parameters are estimated by ML from the data, while if estFreq = 0
is specified, they are calculated using the observed frequencies in the data. See Yang and Nielsen (2008) for details of the model. You can also take a look at the README.txt
file in the examples/mtCDNAape/
folder to see how to duplicate the results reported in Table 1 in Yang and Nielsen (2008).
I made a distinction between empirical and mechanistic models of amino acid substitution ((Yang et al. 1998; 2006: Chapter 2). Empirical models implemented in codeml include Dayhoff
(Dayhoff et al. 1978), JTT
(Jones et al. 1992), WAG
(Whelan and Goldman 2001), mtMam
(Yang et al. 1998), mtREV
(Adachi and Hasegawa 1996b), etc. These are estimates of substitution rate parameters under the general time reversible model from real datasets.
Mechanistic models are formulated by considering the mutational distance between the amino acids as determined by the locations of their encoding codons in the genetic code, and the filtering of
mutations by natural selection operating on the protein level (Yang et al. 1998). The program aaml
(i.e., which is now launched with CODEML
when selecting seqtype = 2
) implements a few such models, specified by the variable aaDist
.
When specifying options seqtype = 2
and model = 6
(FromCodon1
), the mechanistic amino acid substitution model of Yang et al. (1998; Table 3) is enabled, which uses the Markov chain for codons and aggregates the synonymous codons into one state (the coded amino acid) to construct a model of amino acid substitution. This is an approximate formulation since the process after state aggregation is not Markovian. The model involves the parameter
When specifying options seqtype = 2
and model = 5
(FromCodon0
), another codon-based amino acid substitution model is enabled, which treats amino acids as ambiguities codons. This is an exact formulation. This model involves both parameters
© Copyright 1993-2023 by Ziheng Yang
The software package is provided "as is" without warranty of any kind. In no event shall the author or their employer be held responsible for any damage resulting from the use of this software, including but not limited to the frustration that you may experience in using the package. The program package, including source codes, example data sets, executables, and this documentation is maintained by Ziheng Yang and distributed under the GNU GPL v3.
Ziheng Yang
Department of Genetics, Evolution, and Environment
University College London
Gower Street
WC1E 6BT, London, United Kingdom