Overview
Besides this manual, you can always consult the following additional resources:
- Ziheng Yang Lab's website: this website also has information about downloading and compiling PAML programs.
- PAML FAQ page: a document that compiles various FAQs since PAML 4 was released. Last update: 2005/01/05.
- PAML discussion group: if you have any questions about using PAML programs, please post them on this Google discussion group rather than opening new issues on this GitHub repository. The latter should be used strictly for technical problems with PAML programs.
The PAML package currently includes the following programs: BASEML, basemlg, CODEML, evolver, pamp, yn00, MCMCtree, and chi2. A brief overview of the most commonly used models and methods implemented in PAML is provided by Yang (2007). The book Yang (2006) describes the statistical and computational details. Examples of analyses that can be performed using the package include the following:
- Comparison and tests of phylogenetic trees (BASEML and CODEML).
- Estimation of parameters in sophisticated substitution models, including models of variable rates among sites and models for combined analysis of multiple genes or site partitions (BASEML and CODEML).
- Likelihood ratio tests (LRTs) of hypotheses through comparison of implemented models (BASEML, CODEML, chi2).
- Estimation of divergence times under global and local clock models (BASEML and CODEML).
- Likelihood (Empirical Bayes) reconstruction of ancestral sequences using nucleotide, amino acid, and codon models (BASEML and CODEML).
- Generation of datasets of nucleotide, codon, and amino acid sequences by Monte Carlo simulation (evolver).
- Estimation of synonymous and nonsynonymous substitution rates and detection of positive selection in protein-coding DNA sequences (yn00 and CODEML).
- Bayesian estimation of species divergence times incorporating uncertainties in fossil calibrations (MCMCtree).
The strength of PAML is its collection of sophisticated substitution models. The tree search algorithms implemented in BASEML and CODEML are rather primitive, so, except for very small datasets (say, fewer than 10 species), you are better off using other software such as raxml-ng, IQ-TREE, PhyloBayes, or MrBayes to infer the tree topology or topologies, which you can then supply to BASEML or CODEML as input trees for evaluation.
- BASEML and CODEML: The program BASEML is for maximum likelihood analysis of nucleotide sequences. The program CODEML was formed by merging two old programs: codonml, which implements the codon substitution model of Goldman and Yang (1994) for protein-coding DNA sequences, and aaml, which implements models for amino acid sequences. These two are now distinguished by the variable seqtype in the control file codeml.ctl, with 1 for codon sequences and 2 for amino acid sequences. In this document, I use codonml and aaml to refer to CODEML with seqtype = 1 and seqtype = 2, respectively. The programs BASEML and CODEML use similar algorithms to fit models by maximum likelihood, the main difference being that the unit of evolution in the Markov model, referred to as a "site" in the sequence, is a nucleotide, a codon, or an amino acid for BASEML, codonml, and aaml, respectively. Markov process models are used to describe substitutions between nucleotides, codons, or amino acids, with substitution rates assumed to be either constant or variable among sites.
- evolver: This program can be used to simulate sequences under nucleotide, codon, and amino acid substitution models. It also has some other options, such as generating random trees and calculating the partition distances (Robinson and Foulds 1981) between trees.
- basemlg: This program implements the (continuous) gamma model of Yang (1993). It is very slow and infeasible for data of more than 6 or 7 species. Instead, the discrete-gamma model in BASEML described in Yang (1994) should be used.
- MCMCtree: This program implements the Bayesian MCMC algorithm of Yang and Rannala (2006) and Rannala and Yang (2007) for estimating species divergence times.
- pamp: This program implements the parsimony-based analysis of Yang and Kumar (1996).
- yn00: This program implements the method of Yang and Nielsen (2000) for estimating synonymous and nonsynonymous substitution rates (dS and dN) in pairwise comparisons of protein-coding DNA sequences.
- chi2: This program calculates the $\chi^{2}$ critical value and p-value for conducting the likelihood ratio test. Run the program by typing its name: chi2. The program will then print out the critical values for different degrees of freedom (d.f.); for example, the 5% critical value with d.f. = 1 is 3.84. If you run the program with one command-line argument, it enters a loop that asks you to input the d.f. and the test statistic and then calculates the p-value. A third way of running the program is to give both the d.f. and the test statistic as command-line arguments. For instance:
```
chi2
chi2 p
chi2 1 3.84
```
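As a rough illustration of the calculation that chi2 reports, the following Python sketch computes the likelihood ratio test statistic from two log-likelihood values and converts it to a p-value under a $\chi^{2}$ distribution with integer d.f. This is not PAML's code: the function names and the log-likelihood values are hypothetical, and only the 5% critical value of 3.84 for d.f. = 1 comes from the text above.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form starting values of the upper regularized gamma function Q(a, y)
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def lrt(lnl_null, lnl_alt, df):
    """Return (test statistic, p-value) for two nested models."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf(stat, df)


if __name__ == "__main__":
    # Hypothetical log-likelihoods, chosen so that the statistic is 3.84
    stat, p = lrt(lnl_null=-1000.0, lnl_alt=-998.08, df=1)
    print(f"2*delta(lnL) = {stat:.2f}, p = {p:.4f}")
```

With d.f. = 1 and a statistic of 3.84, the p-value printed is close to 0.05, in line with the critical value quoted above.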
There are many things that you might well expect a phylogenetics package to do but that PAML cannot do. Below is a partial list of such limitations, provided in the hope that it helps you avoid wasting time.
- Sequence alignment: You should use some other programs such as Muscle5, mafft, or BAli-Phy (just to name a few, there are many more you can use!) to align the sequences automatically. Manual adjustment does not seem to have matured enough to be entirely trustworthy, so you should always do it with care. If you are constructing thousands of alignments in a genome-wide analysis, you should implement some quality control and, say, calculate some measure of sequence divergence as an indication of the unreliability of the alignment. For coding sequences, you might align the protein sequences and construct the DNA alignment based on the protein alignment. Note that, if cleandata = 0, both ambiguity characters and alignment gaps are treated as ambiguity characters in BASEML and CODEML. If cleandata = 1, all sites with ambiguity characters and alignment gaps are removed from all sequences before analysis.
- Gene prediction: The codon-based analysis implemented in CODEML (seqtype = 1) assumes that the sequences are pre-aligned exons, the sequence length is an exact multiple of 3, and the first nucleotide in the sequence is codon position 1. Introns, spacers, and other non-coding regions must be removed and the coding sequences must be aligned before running the program. The program cannot process sequences downloaded directly from GenBank, even though the CDS information is there, nor can it predict coding regions.
- Tree search in large data sets: As mentioned earlier, you should use another program to get a tree or some candidate trees and use them as user trees to fit models that might not be available in other packages.
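The mechanical requirements on codon data listed above (aligned sequences of equal length that are an exact multiple of 3) can be checked before running CODEML. The following Python sketch is one possible pre-flight check; the function name and the toy sequences are hypothetical, and the check cannot tell whether the first nucleotide really is codon position 1 or whether non-coding regions remain.

```python
def check_codon_alignment(seqs):
    """Return a list of problems found in a dict {name: aligned sequence}."""
    problems = []
    # Aligned sequences must all have the same length
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    # Each sequence must contain a whole number of codons
    for name, s in seqs.items():
        if len(s) % 3 != 0:
            problems.append(f"{name}: length {len(s)} is not a multiple of 3")
    return problems


if __name__ == "__main__":
    # Hypothetical toy alignment; real input would be read from a sequence file
    toy = {"sp1": "ATGAAA", "sp2": "ATGAA"}
    for msg in check_codon_alignment(toy):
        print(msg)
```

An empty list means that only these two length conditions hold; the remaining assumptions still need to be verified by other means.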
Before running a PAML program, please make sure that you have followed the installation instructions for your operating system. Once the PAML programs are on your system's path, you can run a program by typing its name on the command line. If your working directory is not the one containing your sequence file, tree file, and control file, you need to know the relative or absolute path to that folder. If you are inexperienced or have trouble exporting paths (see Installation.md for tips on how to do this on different operating systems), you may copy the relevant executable file to the folder containing your data files and run the PAML program from there.
Note
When running CODEML, please note that you may need a data file such as grantham.dat, dayhoff.dat, jones.dat, wag.dat, mtREV24.dat, mtmam.dat, etc. You should therefore copy these files into the directory where you have your input files and control file (and add the corresponding file name to the variable aaRatefile in the control file!). You can find these files in the dat directory, which is available on your file system once you clone the repository or download the latest release. Alternatively, you can always type the relative path to the file you want to use in the variable aaRatefile.
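A quick way to catch a missing rate file before launching CODEML is to read the aaRatefile entry from the control file and test whether that path exists relative to the working directory. The Python sketch below does this under two assumptions that are not stated in the text above: that control-file lines have the form "name = value" and that an asterisk starts a comment. The function names are hypothetical.

```python
from pathlib import Path


def read_ctl(text):
    """Parse 'name = value' lines into a dict (assumes '*' starts a comment)."""
    settings = {}
    for line in text.splitlines():
        line = line.split("*", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def find_rate_file(settings, workdir="."):
    """Return the path named by aaRatefile if it is an existing file, else None."""
    name = settings.get("aaRatefile")
    if name is None:
        return None
    path = Path(workdir) / name
    return path if path.is_file() else None


if __name__ == "__main__":
    # Hypothetical control-file fragment
    ctl = "seqtype = 2 * amino acid sequences\naaRatefile = dat/wag.dat\n"
    settings = read_ctl(ctl)
    if find_rate_file(settings) is None:
        print("rate file not found:", settings["aaRatefile"])
```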
Important
Some PAML programs produce result files such as rub, lnf, rst, or rates. You should not use these names (or other names that PAML programs use to create output files) for your own files. Otherwise, they will be overwritten!
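If you prepare input files with a script, a small guard against such name clashes may help. The Python sketch below flags file names that match the four output names mentioned above; the set is deliberately incomplete, because PAML programs write other output files too, and the function name is hypothetical.

```python
from pathlib import Path

# Only the output names mentioned in the text; PAML programs create others as well
RESERVED = {"rub", "lnf", "rst", "rates"}


def clashing_names(paths):
    """Return the paths whose file name matches a known PAML output name."""
    return [p for p in paths if Path(p).name in RESERVED]


if __name__ == "__main__":
    # Hypothetical input file names
    print(clashing_names(["data/rst", "lysin.nuc", "rates"]))
```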
The examples/ folder contains many example data sets. They were used in the original papers to test the new methods, and I included them so that you could duplicate our results in the papers. Sequence alignments, control files, and detailed readme files are included. They are intended to help you get familiar with the input data formats and with interpretation of the results, and also to help you discover bugs in the program. If you are interested in a particular analysis, get a copy of the paper that described the method and analyse the example dataset to duplicate the published results. This is particularly important because the manual, as it is written, describes the meanings of the control variables used by the programs but does not clearly explain how to set up the control file to conduct a particular analysis.
- examples/HIVNSsites/: This folder contains example data files for the HIV-1 env V3 region analysed in Yang et al. (2000b). The data set is for demonstrating the NSsites models described in that paper, that is, models of variable $\omega$ ratios among amino acid sites. Those models are called the “random-sites” models by Yang & Swanson (2002) since a priori we do not know which sites might be highly conserved and which under positive selection. They are also known as “fishing-expedition” models. The included data set is the 10th data set analysed by Yang et al. (2000b), and the results are in table 12 of that paper. Look at the README.txt file in that folder.
- examples/lysin/: This folder contains the sperm lysin genes from 25 abalone species analysed by Yang, Swanson & Vacquier (2000a) and Yang and Swanson (2002). The data set is for demonstrating both the “random-sites” models (as in Yang, Swanson & Vacquier (2000a)) and the “fixed-sites” models (as in Yang and Swanson (2002)). In the latter paper, we used structural information to partition amino acid sites in the lysin into the “buried” and “exposed” classes and assigned and estimated different $\omega$ ratios for the two partitions. The hypothesis is that the sites exposed on the surface are likely to be under positive selection. Look at the README.txt file in that folder.
- examples/lysozyme/: This folder contains the primate lysozyme c genes of Messier and Stewart (1997), re-analysed by Yang (1998). This is for demonstrating codon models that assign different $\omega$ ratios for different branches in the tree, useful for testing positive selection along lineages. Those models are sometimes called branch models or branch-specific models. Both the “large” and the “small” data sets in Yang (1998) are included. Those models require the user to label branches in the tree, and the readme file and included tree file explain the format in great detail. See also the section “Tree file and representations of tree topology” later about specifying branch/node labels. The lysozyme data set was also used by Yang and Nielsen (2002) to implement the so-called “branch-site” models, which allow the $\omega$ ratio to vary both among lineages and among sites. Look at the README.txt file to learn how to run those models.
- examples/MouseLemurs/: This folder includes the mtDNA alignment that Yang and Yoder (2003) analysed to estimate divergence dates in mouse lemurs. The data set is for demonstrating maximum likelihood estimation of divergence dates under models of global and local clocks. The most sophisticated model described in that paper uses multiple calibration nodes simultaneously, analyses multiple genes (or site partitions) while accounting for their differences, and also accounts for variable rates among branch groups. The README.txt file explains the input data format as well as model specification in detail. The README2.txt file explains the ad hoc rate smoothing procedure of Yang (2004).
- examples/mtCDNA/: This folder includes the alignment of 12 protein-coding genes on the same strand of the mitochondrial genome from seven ape species analysed by Yang, Nielsen, & Hasegawa (1998) under a number of codon and amino acid substitution models. The data set is the “small” data set referred to in that paper, and was used to fit both the “mechanistic” and empirical models of amino acid substitution as well as the “mechanistic” models of codon substitution. The model can be used, for example, to test whether the rates of conserved and radical amino acid substitutions are equal. See the README.txt file for details.
- examples/TipDate.HIV2/: This folder includes the alignment of 33 SIV/HIV-2 sequences, compiled and analysed by Lemey et al. (2003) and re-analysed by Stadler and Yang (2013). The README.txt file explains how to duplicate the ML and Bayesian results published in that paper. Note that the sample date is the last field in the sequence name.
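For the TipDate.HIV2 example, the note that the sample date is the last field in the sequence name can be turned into a small helper when preparing or checking your own data. The Python sketch below is only an illustration: the text above does not say which character separates the fields, so the separator is left as a required argument, and the sequence names shown are hypothetical. Consult the README.txt file in that folder for the actual naming convention.

```python
def sample_date(name, sep):
    """Return the last field of a sequence name as a number.

    `sep` is the field separator, which is not specified in the text above
    and must be supplied by the caller.
    """
    last_field = name.rsplit(sep, 1)[-1]
    return float(last_field)


if __name__ == "__main__":
    # Hypothetical sequence names using "_" as the field separator
    for name in ["isolateA_1995", "isolateB_2001.5"]:
        print(name, sample_date(name, "_"))
```

A name whose last field is not numeric raises a ValueError, which makes mislabelled sequences easy to spot.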
Some other data files are included in the package as well. The details follow:
- brown.nuc and brown.trees: the 895-bp mtDNA data of Brown et al. (1982), used in Yang et al. (1994) and Yang (1994b) to test models of variable rates among sites.
- mtprim9.nuc and 9s.trees: mitochondrial segment consisting of 888 aligned sites from 9 primate species (Hayasaka et al. 1988), used by Yang (1994b) to test the discrete-gamma model and Yang (1995) to test the auto-discrete-gamma models.
- abglobin.nuc and abglobin.trees: the concatenated $\alpha$- and $\beta$-globin genes, used by Goldman and Yang (1994) in their description of the codon model. abglobin.aa is the alignment of the translated amino acid sequences.
- stewart.aa and stewart.trees: lysozyme protein sequences of six mammals (Stewart et al. 1987), used by Yang et al. (1995b) to test methods for reconstructing ancestral amino acid sequences.
© Copyright 1993-2023 by Ziheng Yang
The software package is provided "as is" without warranty of any kind. In no event shall the author or their employer be held responsible for any damage resulting from the use of this software, including but not limited to the frustration that you may experience in using the package. The program package, including source codes, example data sets, executables, and this documentation is maintained by Ziheng Yang and distributed under the GNU GPL v3.
Ziheng Yang
Department of Genetics, Evolution, and Environment
University College London
Gower Street
WC1E 6BT, London, United Kingdom