Skip to content

Data formatting

sabifo4 edited this page Nov 13, 2024 · 5 revisions

Data formatting

Sequence data file format

Have a look at some of the example data files in the package (e.g., .nuc, .aa, .nex, etc.). As long as you get your data file into one of the formats, PAML programs should be able to read it. The “native” format is the PHYLIP format used in Joe Felsenstein’s PHYLIP package (Felsenstein 2005) (but see below). PAML has limited support for the NEXUS file format, and thus we highly encourage you to convert all your sequence data files into PHYLIP format. PAML does not deal with comment blocks in the sequence data block, so please avoid them.

Sequential and interleaved formats

Below is an example of the PHYLIP format. The first line contains the number of species and the sequence length (possibly followed by option characters). For codon sequences (CODEML with seqtype = 1), the sequence length in the sequence file refers to the number of nucleotides rather than the number of codons. The only options allowed in the sequence file are I, S, P, C, and G. The sequences may be in either interleaved format (option I, example data file abglobin.nuc), or sequential format (option S, example data file brown.nuc). The default option is S, so you don’t have to specify it. Option G is used for combined analysis of multiple gene data and is explained below. The following is an example data set in the sequential format. It has 4 sequences each of 60 nucleotides (or 20 codons):

4 60
sequence1
AAGCTTCACCGGCGCAGTCATTCTCATAAT
CGCCCACGGACTTACATCCTCATTACTATT
sequence2
AAGCTTCACCGGCGCAATTATCCTCATAAT
CGCCCACGGACTTACATCCTCATTATTATT
sequence3
AAGCTTCACCGGCGCAGTTGTTCTTATAAT
TGCCCACGGACTTACATCATCATTATTATT
sequence4
AAGCTTCACCGGCGCAACCACCCTCATGAT
TGCCCATGGACTCACATCCTCCCTACTGTT

Species/sequence names

Do not use the following special symbols in a species/sequence name: , : # ( ) $ = . These symbols are used for special purposes and may confuse the programs. The symbol @ can be used as part of and end of the sequence name to specify the date of determination of that sequence. E.g.: if you wrote virus1@1984, the @ symbol is considered part of the name and the program would understand that this sequence was determined in 1984. The maximum number of characters in a species name (LSPNAME) is specified at the beginning of the main programs (e.g., baseml.c, codeml.c, mcmctree.c, etc.). In PHYLIP, exactly 10 characters are used for a species name, which can be quite restrictive, and so the default value in PAML programs is 30 characters. To make this discrepancy less a problem, PAML considers two consecutive spaces as the end of a species name, so that the species name does not have to have exactly 30 (or 10) characters. In make this rule work, you should not have two consecutive spaces within a species name. Bear in mind that having spaces in the sequence names may cause problems, and thus we suggest that you use _ instead if you want to separate parts of your sequence names.

Note

If you want the file to be readable by both PHYLIP and PAML, you should limit the number of characters in the name to 10 and separate the name and the sequence by at least two spaces.

Some important information about the characters to be used:

  • In a sequence, T, C, A, G, U, t, c, a, g, u are recognised as nucleotides (for BASEML, BASEMLG and CODEML), while the standard one-letter codes (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V or their lowercase equivalents) are recognised as amino acids. Ambiguity characters (undetermined nucleotides or amino acids) are allowed as well.
  • Three special characters ., -, and ? are interpreted like this: a dot means the same character as in the first sequence, a dash means an alignment gap, and a question mark means an undetermined nucleotide or amino acid.
  • Non-alphabetic symbols such as > < ! " £ $ % & ^ [ ] ( ) { } 0 1 2 3 4 5 6 7 8 9 inside a sequence are simply ignored and can be freely used as signposts.
  • Lines do not have to be equally long and you can put the whole sequence on one line.
  • The way that ambiguity characters and alignment gaps are treated in BASEML and CODEML depends on the variable cleandata in the control file. In the maximum likelihood analysis, sites at which at least one sequence involves an ambiguity character are removed from all sequences before analysis if cleandata = 1, while if cleandata = 0, both ambiguity characters and alignment gaps are treated as ambiguity characters. In the pairwise distance calculation (the lower-diagonal distance matrix in the output), cleandata = 1 means “complete deletion”, with all sites involving ambiguity characters and alignment gaps removed from all sequences, while cleandata = 0 means “pairwise deletion”, with only sites which have missing characters in the pair removed. There are no models for insertions and deletions in the PAML programs, so an alignment gap is treated as an ambiguity (that is, a question mark ?). Note also that, for codon sequences, removal of any nucleotide means removal of the whole codon.

Notes may be placed at the end of the sequence file (i.e., after //) and will be ignored by the programs.

Option G

This option is for combined analyses of heterogeneous data sets such as data of multiple genes or data of the three codon positions. The sequences must be concatenated, and the option is used to specify which gene or codon position each site is from. There are three formats with this option.

Format 1

The first format is illustrated by an excerpt of a sequence file listed below. The example data of Brown et al. (1982) are an 895-bp segment from the mitochondrial genome, which codes for parts of two proteins (ND4 and ND5) at the two ends and three tRNAs in the middle. Sites in the sequence fall naturally into 4 classes: the three codon positions and the tRNA coding region.

  • The first line of the file contains the option character G: 5 895 G
  • The second line begins with a G at the first column, followed by the number of site classes: G 4
  • The following lines contain the site marks, one for each site in the sequence (or each codon in the case of CODEML): 3.
5 895 G
G 4
3
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
1231231231231231231231231231231231231
444444444444444444444444444444444444444444444444444444444444
444444444444444444444444444444444444444444444444444444444444
444444444444444444444444444444444444444444444444444444444444
444444444444444444
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
123123123123123123123123123123123123123123123123123123123123
12312312312312312312312312312312312312312312312312312312312
Human
AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGACTTACATCCTCATTACTATT
CTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCATAATC [...]
Chimpanzee

[...]

Important

The site mark specifies which class each site is from. E.g., if there are $g$ classes, the marks should be 1, 2, ..., $g$, and if $g &gt; 9$, the marks need to be separated by spaces. The total number of marks must be equal to the total number of sites in each sequence. In the example above, there are 4 classes as there are sites assigned site mark 1, 2, 3, and 4.

Format 2

The second format is useful if the data are concatenated sequences of multiple genes, shown below for an example data set:

5 1000 G
G 4 100 200 300 400
Sequence 1
TCGATAGATAGGTTTTAGGGGGGGGGGTAAAAAAAAA [...]
  • The first line shows the number of sequences, the length, and has flag G: 5 1000 G.
  • The second line specifies the data type. In the example below, you can see that this sequence has 1,000 nucleotides (first line) from 4 genes (second element in second line, 4), obtained from concatenating four genes with 100, 200, 300, and 400 nucleotides (third, fourth, fifth, and sixth element in second line, 100 200 300 400) from genes 1, 2, 3, and 4, respectively. The "lengths" for the genes must be on the line that starts with G, i.e., on the second line of the sequence file. Note that this requirement allows the program to determine which of the two formats is being used! The sum of the lengths for the genes should be equal to the number of nucleotides, amino acids, or codons in the combined sequence for BASEML (or BASEMLG), CODEML, respectively (i.e., the number input in the first line, 1000 in this example).
Format 3

The third format applies to protein-coding DNA sequences only (for BASEML). You use option characters GC on the first line instead of G alone (see first line in example below, 5 855 GC). The program will then treat the three codon positions differently in the nucleotide-based analysis. It is assumed that the sequence length (i.e., 855 in the example below, second element in the first line) is an exact multiple of three:

5  855  GC
human      GTG CTG TCT CCT [...]

Option G for codon sequences (CODEML with seqtype = 1)

The format is similar to the same option for BASEML, but note that the sequence length is in number of nucleotides while the gene lengths are in number of codons, which has repeatedly been a source of confusion. Below is an example:

5  300  G
G2   40  60

As per the example above of the header of the sequence data file, the data set would have 5 sequences, each of 300 nucleotides (100 codons), which are partitioned into two genes (G2, first element in second line), with the first gene having 40 codons (40, second element in second line) and the second gene 60 codons (60, second element in second line).

Site pattern counts

The sequence alignment can also be input in the form of site patterns and counts of sites having those site patterns. This format is specified by the option P on the first line of the input data file, as illustrated by the following example:

3  8  P

human   GTACTGCC
rabbit  GTACTACT
rat     GTACAGAC

100 200 40 50 11 12 13 14

The example above shows 3 sequences and 8 site patterns; being the flag P what indicates that the data are site patterns and not sites (i.e., first line 3 8 P). The P option is used in the same way as options I for interleaved format and S for sequential format (default, and hence why normally people do not type S after indicating number of taxa and number of characters).

The 8 numbers below the alignment (i.e., line with 100 200 40 50 11 12 13 14) are the number of sites having the 8 patterns displayed above. For example, there are 100 sites that have G in all three species (i.e., the column for site pattern 1 is GGG), and there are 200 sites that have T for all three species (i.e., second column is TTT), and so on. In total, there are 100 + 200 + 40 + … + 14 = 440 sites that follow these 8 patterns.

This example applies to BASEML and BASEMLG programs for nucleotide-based analysis. To specify multiple genes (site partitions), one may use option G together with option P:

3  10  PG
G 2    4 6

human   GTTA CATGTC
rabbit  GTCA CATATT
rat     GTTA CAAGTC

100 200 40 50 120 61 12 13 54 12

In the example above, there are 10 site patterns and 2 genes (i.e., 2 site partitions). The first 4 patterns are for the first gene (i.e., first site partition, 100 200 40 50) while the next 6 patterns are for the second gene (i.e., the second site partition, 120 61 12 13 54 12), with a total of 10 site patterns. For instance, in partition 1 there are 50 sites having A (i.e., nucleotide A is in all three species, column 4 is AAA), while in partition 2 there are 61 such sites (i.e., column 6, AAA).

The same format applies to protein sequences (CODEML with seqtype = 2), with amino acids replacing nucleotides in the examples above.

For codon sequences (CODEML with seqtype = 1), the format is as follows:

3  24  P G

human     GTG CTG TCT CCT GCC GAC AAG ACC 
rabbit    ... ... ... G.C ... ... ... T.. 
rat       ... ... ... ..C ..T ... ... ... 

    6 1 1 1 1 4 3 1

There are 3 species and 8 site patterns. The program requires that you use the number of nucleotide sites on the first line, and thus you see 24: $3\times 8=24$ as the unit of evolution is the triplet. This is strange but consistent with the sequential or interleaved sequence format, where the sequence length is specified in the number of nucleotides rather than number of codons (initially, I did this so that the same file can be read by both BASEML for nucleotide-based analysis and CODEML for codon based analysis). If you pay attention to the last line, you can see the site patterns. E.g., the first 6 in the last line corresponds to the first column of triplets, GTG in all three species, hence the ... for the rabbit and the rat in the first column. In that way, this means that there are 6 sites having the first site pattern, GTG, for all sequences.

To specify multiple genes for codon site patterns, see the following example:

3  24  P G
G 2      4   4

human     GTG CTG TCT CCT GCC GAC AAG ACC 
rabbit    ... ... ... G.C ... ... ... T.. 
rat       ... ... ... ..C ..T ... ... ... 

    6 1 1 1 1 4 3 1

Above, you can see that there are again 8 codon site patterns (i.e., 24, second element in the first line) in total, with the first 4 patterns for gene 1 and the next 4 patterns for gene 2. Note that G 2 in the second line is used to specify that there are two partitions in the alignment below, then the third and fourth elements in such line, both 4, indicate that there are 4 site patterns for the first site partition and another four patterns for the second site partition. Furthermore, variable P in the first line can be used together with variables I or S. For instance, PI means that the site patterns are listed using the interleaved format while PS means that the site patterns are listed using the sequential format. P without I or S uses the default sequential format. Having the whole sequence of all site patterns in one line conforms with both the I and S formats, so there is no need to specify I or S.

If you run BASEML and CODEML to read the sequential or interleaved formats of sequences, the output will include a print-out in this partitioned format. Look for the line that reads Printing out site pattern counts. You can move this block into a new file and then read that file instead, if it takes a long time to pack sites into patterns. Note that there are restrictions with the P format:

  • Some outputs are disabled for this option, including ancestral sequence reconstruction and posterior estimates of rates for sites (or site patterns), that you can get for sequences by using RateAncestor = 1.
  • Second, some of the calculations require the sequence length, which I set to the sum of the site pattern frequencies. If the site pattern frequencies are not counts of sites but are instead site pattern probabilities, calculations involving sequence length will not be correct. Such calculations would be the SEs for MLEs, the numbers of sites S and N in CODEML, etc.

Possible uses of this option

Sometimes, I use EVOLVER to simulate very long sequences (with >1M sites) and it can take minutes or hours to collapse sites into patterns when the maximum likelihood iteration takes a few seconds. A similar case involves analyses of large genomic data of long sequences with >100Mb sites. In this case, you can run BASEML or CODEML once, and then copy the pattern counts from the output file into a data file. Next time you run the program, you can read the new file. This way, the program skips the step of counting site patterns. Note that the pattern counts do not have to be integers. You can calculate the site pattern probabilities under a model and then read the probabilities for analysis using a different model to see whether the correct tree is still recovered. The site pattern probabilities under the model amount to a dataset of infinite size (infinitely many sites). This way, you can check whether the tree reconstruction method is still consistent. See Debry (1992) and Yang (1994a) for such analysis.

Tree file format and representations of tree topology

A tree structure file is used when runmode = 0 or 1. The file name is specified in the appropriate control file. The tree topology is typically specified using the parenthesis notation, although it is possible to use a branch representation, as described below.

Parenthesis notation

The first thing to do is getting familiar with the parenthesis representation that is used in most phylogenetic software: the Newick format. The species can be represented using either their names or their indexes corresponding to the order of their occurrences in the sequence data file. If species names are used, they have to match exactly those in the sequence data file, including spaces or strange characters.

The following is a possible tree structure file for a data set of four species (human, chimpanzee, gorilla, and orangutan, occurring in this order in the data file):

4 5 // 4 species, 5 trees
(1,2,3,4); // the star tree
((1,2),3,4); // species 1 and 2 are clustered together
((1,2),3,4); // commas are needed with more than 9 species
((human,chimpanzee),gorilla,orangutan);
((human:.1,chimpanzee:.2):.05,gorilla:.3,orangutan:.5); // branch lengths can be used, but beware of specific programs that do not enable their usage!

Note

Remember that anything written after // is ignored by the software.

The first tree is a star tree, while the next four trees are the same. If the tree has branch lengths, BASEML and CODEML allow you to use the branch lengths in the tree as starting values for maximum likelihood iteration. Please note that, while branch lengths are allowed, they should not be used when fixing the tree topology for divergence times estimation (MCMCtree), tests of positive selection (CODEML), or in other analyses where the there is no need for branch lengths to be used.

Whether you should use rooted or unrooted trees depends on the model, for example, on whether a molecular clock is assumed. Without the clock (clock = 0), unrooted trees should be used, such as ((1,2),3,4); or (1,2,(3,4));. With the clock or local-clock models, the trees should be rooted and these two trees are different from (((1,2),3),4);. In PAML, a rooted tree has a bifurcation at the root, while an unrooted tree has a trifurcation or multifurcation at the root. You should read the section Rooted versus unrooted trees of the supplementary material of Álvarez-Carretero et al., 2023, where we illustrate and discuss different scenarios for rooting or not the tree.

In the latest PAML versions, tree files should have a PHYLIP header indicating how many species and how many trees are present in the file. For instance, a header such as 4 1 would indicate that, in the next lines, one tree file with 4 species/tags will be read. E.g.:

4 1
(((a, b), c), d);

Tree files produced by PAUP and MacClade

PAML programs have only limited compatibility with the tree file generated by PAUP or MacClade. First, the “[&U]” notation for specifying an unrooted tree is ignored. For the tree to be accepted as an unrooted tree by PAML, you have to manually modify the tree file so that there is a trifurcation at the root, for example, by changing (((1,2),3),4); into ((1,2),3,4);. Second, the Translate keyword is ignored by PAML as well, and it is assumed that the ordering of the sequences in the tree file is exactly the same as the ordering of the sequences in the sequence data file.

Branch or node labels

Some models implemented in BASEML and CODEML allow several groups of branches on the tree, which are assigned different parameters of interest. For example, in the local clock models (clock = 2 or 3) in BASEML or CODEML, you can have, say, 3 branch rate groups, with low, medium, and high rates respectively. Also, the branch-specific codon models (model = 2 or 3 for CODEML) allow different branch groups to have different $\omega$s, leading to the so-called “two-ratios” and “three-ratios” models. All those models require branches or nodes in the tree to be labelled. Branch labels are specified in the same way as branch lengths except that the symbol # is used rather than :. The branch labels are consecutive integers starting from 0, which is the default and does not have to be specified. For example, the following tree...

7 1
((Hsa_Human, Hla_gibbon) #1, ((Cgu/Can_colobus, Pne_langur), Mmu_rhesus), (Ssc_squirrelM, Cja_marmoset)); 

... is based on the tree file available in examples/lysozyme/lysozymeSmall.trees, with a branch label for fitting models of different $\omega$ ratios for branches. The internal branch ancestral to human and gibbon has the ratio $\omega_{1}$ (i.e., the label #1 has been used), while all other branches (with the default label #0, no need to indicate this in the tree) have the background ratio $\omega_{0}$. This fits the model in table 1C for the small data set of lysozyme genes in Yang (1998). See the README.txt file in the examples/lysozyme/ folder.

If you want to label all branches within a clade on a big tree, it may be easy to use the clade label $. Clade $2 is equivalent to labelling all nodes/branches within the clade with #2. The following two trees are thus equivalent:

5 1
(((rabbit, rat) $1, human), goat_cow, marsupial);
5 1
(((rabbit #1, rat #1) #1, human), goat_cow, marsupial);

There is an important rule concerning nested clade labels: the symbol # takes precedence over the symbol $, and clade labels close to the tips take precedence over clade labels for ancestral nodes close to the root. The following two trees are equivalent:

5 1
((((rabbit, rat) $2, human #3), goat_cow) $1, marsupial);
5  1
((((rabbit #2, rat #2) #2, human #3) #1, goat_cow #1) #1, marsupial);

In the first tree above, $1 is first applied to the whole clade of placental mammals (except for the human lineage), and then $2 is applied to the rabbit-rat clade.

Note that, with this rule, it may make a difference whether or not you include a label $0.The tree below...

4 1
((a, b) $0, (c, d)) $1;

... labels the three branches on the left side (a, b, and both a and b) as #0, while the other branches are #1. However, the following would label all branches in the tree as #1:

4 1
((a, b), (c, d)) $1;

I have found it convenient to create the tree file with labels and read the tree using graphical software such as FigTree or TreeViewer to check that the labels are right.

Divergence date symbol @

Fossil calibration information is specified using the symbol @. This is used for the clock and local clock models in BASEML and CODEML. See the README.txt file in the examples/MouseLemurs/ folder. So in the following example, the human-chimpanzee divergence is fixed at 5MY:

4 1
((gorilla, (human, chimpanzee) '@0.05'), orangutan);

This kind of calibration information (point calibration) is not used by the Bayesian dating program MCMCtree (see descriptions later).

Branch representation of tree topology

A second way of representing the tree topology used in PAML is by enumerating its branches, each of which is represented by its starting and ending nodes.

This representation is also used in the result files for outputting the estimated branch lengths, but you can also use it in the tree file. For example, the tree ((1,2),3,4); can be specified by enumerating its 5 branches:

       5
     5 6    6 1    6 2    5 3    5 4

The nodes in the tree are indexed by consecutive natural numbers, with 1, 2, ..., $s$ representing the $s$ known sequences in the data, in the same order as in the sequence data file. A number larger than $s$ labels an internal node, at which the sequence is unknown. So in the above tree, node 5 is ancestral to nodes 6, 3, and 4, while node 6 is ancestral to nodes 1 and 2. This notation is convenient to specify a tree in which some sequences in the data are direct ancestors to some others. For example, the following tree for 5 sequences has 4 branches, with sequence 5 to be the common ancestor of sequences 1, 2, 3, and 4:

  4     
5 1     5 2     5 3     5 4

Caution

I did not try to make this tree representation work with all models implemented in BASEML and CODEML. If you use this representation, you should test the program carefully. If it does not work, you can let me know so that I will try to fix it.