-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy pathpamlHistory.txt
2076 lines (1541 loc) · 92.4 KB
/
pamlHistory.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
History and bug fixes of PAML (Phylogenetic Analysis by Maximum Likelihood)
Ziheng Yang
If you find problems or have questions, please visit the PAML
discussion site
https://groups.google.com/forum/#!forum/pamlsoftware.
paml v4.10.8, November 2024
(*) baseml/codeml Tipdate-clock option. The Tipdate-clock (clock = 1,
tipdate = 1 100) option in baseml/codeml is broken when multiple trees
are in the tree file, with the program analysing the first tree
multiple times. The error shows up when one changes the first line in
the file examples/TipDate.HIV2/HIV2ge.tre from 33 1 to 33 3.
I think the example data file was last edited around 2013, at the time
the paper with Tanja ( Stadler & Yang 2013 Syst Biol 62: 674-688) was
prepared. Based on my quick tests, the option is correct in versions
4.8a (August 2014) to version 4.9h (March 2018), but has been broken
in version 4.9i (September 2018) and 4.9j (February 2020). I haven't
tested the different updates since version 4.10.
I have fixed the bug. Note that some of the trees in the tree file
are bootstrap consensus trees which are multifurcating trees.
Thanks to jtmccr1 for identifying the bug:
https://github.com/abacus-gene/paml/issues/43#issuecomment-2342570454
paml v4.10.1, December 2021
(*) mcmctree. We decided to change the default MCMC sliding-window
algorithm from Bactrian (normal kernel with m = 0.95) to
BactrianLaplace with m = 0.90. This change is made at the same time
in bpp4 and bpp3 as well. It was noted that Bactrian (m = 0.95) was
too aggressive when used with reflection. For example, if 0 < x < 1,
and the current value is close to 0, after reflection most of the
newly proposed values are larger than the current value.
(*) mcmctree. The program is not reading the date information from the
sequences and print out 0 dates for sampled sequences. This is now fixed.
A bug caused the mixing step to reject all proposals. It affects
mixing but not the posterior results. Version 4.8a (July 2014) is
correct, but version 4.9 (March 2015) has this bug and is incorrect.
The bug was introducted when i added the conditional iid model. This
is now fixed.
(*) baseml & codeml. changed the way the option ndata is handled. In the
old versions, all trees are read from the tree file for each dataset.
Now for each dataset, one block is read from the tree file. This
allows the possibility that different datasets have different numbers
of sequences or species.
(*) pamp. I introduced quite a few bugs in version 4.10.1 when
editing the pamp.c, which broke the ancestral reconstruction
algorithms. The output file then had missing or clearly incorrect
results. There are also error messages on the screen like
"err:PathwayMP 0 != 4".
Versions 4.9j and earlier are correct. This is now fixed.
paml v4.10.0, September 2020
Edited code to remove compiler warnings
mcmctree. Changed the calculation of initial values for mcmctree so that the initial ages
are not too old.
baseml & codeml. Fixed a bug which causes the tip-dating analysis in
baseml and codeml to fail to print out a block of output including
substitution rate per time unit and absolute age estimates.
Version 4.9j, February 2020
(*) baseml and codeml, clock dating. When clock=1 and some internal
nodes are fixed at certain ages (by using the symbol "@" or "="),
baseml and codeml rescale the estimated relative node ages (measured
in expected numbers of substitutions per site) into absolute node
ages. In such cases there should be a block of output including
"Substitution rate is per time unit" and the absolute node ages below
"Detailed output identifying parameters".
In versions 4.9i, this output is missing. Versions 4.9h and earlier
seem to be correct. This bug is reported at the google group for
paml. I have now fixed this and added the printout. Note that this
way of clock dating assumes that the node ages specified in the tree
are known without errors or uncertainties and does not accommodate
uncertainties in fossil calibrations.
(*) mcmctree, initial values have ages that are too old, exceeding max
bounds. In theory bounds in mcmctree are always soft so that the node
ages will move to the area of posterior mode when burnin is long
enough. In practice the poor starting values are problematic and
requires long burn-in. I have rewritten the code for generating
initial values to respect the min and max bounds specified in the
fossil calibrations.
(*) codeml, clade labels in the tree file. A bug introduced in 4.9i
caused the clade labels ($) to be ignored. This affects the branch,
branch-site and clade models. If your tree has branch labels (#) only
and no clade models, everything will be fine. If you have used the
clade label ($) in the tree with or without the branch label (#),
either the program will abort or the results will be incorrect. The
clade label ($) is supposed to label all branches in the clade as well
as the branch itself, but all clade labels in the tree are ignored in
4.9i. Earlier versions are correct.
Thanks to Yonghua Wu for pointing out the bug.
Version 4.9i, March 2019
(*) mcmctree: I have added an option (duplication = 1) for dating a
tree with both speciations and gene duplications, so that some nodes
on the tree share divergence times. Nodes sharing ages are identified
using labels in the tree file: #1, #2, .... I have yet to update the
document about specification of the model.
(*) mcmctree: The TipDate option was written for one locus or
partition and never worked for more than two loci/partitions. I have
edited the code so that it works for multiple partitions, some of
which may be molecular and the others morphological.
(*) codeml: The option estFreq = 0 when codonFreq = 6 (FMutSel0) and 7
(FMutSel) is not working in versions 4.9g and 4.9h. This is fixed
now. This option uses the observed codon or amino acid frequencies
for the mutation-selection models of codon usage. Instead the program
estimates the frequencies using maximum likeihood, which is what the
option estFreq = 1 does. Look at the README file in the
examples/mtCDNAape/ folder.
(*) codeml clade model D: The bounds for the w (dN/dS) ratios in the
first site classes are set tp (0.0001, 0.5) for w0 and (0.5, 1.5) for
w1, in versions 4.9b,c,d,e,f,g, since I added the BEB calculation for
clade model D in 4.9b. The motivation for the bounds is that site
class 0 represents strong purifying selection with a small w0, while
site class 1 should include sites under weak purifying selection with
a larger w1. However the bounds are arbitrary. In some datasets, the
MLEs are found to be at the bounds, making the interpretation awkward.
I have changed the bounds to the following:
w0b[]={0.0001, 1.0}, w1b[]={0.01, 1.5}.
This means that the user should swap the estimates of w0 and w2 if w0 > w1.
Version 4.9h, March 2018
(*) mcmctree: gamma-Dirichlet versus conditional i.i.d. priors for
rates for loci. Since 4.9d, the program and the documentation are
inconsistent about the two priors, and which value (0 or 1) means
which prior. I have now checked the program and the documentation to
make sure that they are consistent:
prior = 0: gamma-Dirichlet (dos Reis 2014). This is the default.
prior = 1: conditional i.i.d. prior (Zhu et al. 2015).
I believe these two are similar especially if the number of loci
(partitions) is large, but no serious comparisons between the two
priors have been published.
Thanks to Adnan Moussalli for pointing out the errors.
(*) codeml. It was discovered that the mechanistic amino acid
substitution model implemented in Yang et al. (1998; see table 3),
specified by seqtype = 2 model = 6, has been broken for a long time,
since version 3.0 (2000) at least. Version 2.0 (1999) seems to be
correct. This means that the model become broken soon since it was
published. I have now fixed this.
This model of amino acid substitution starts from a Markov chain for
codons and then aggregate the states and merge the synonymous codons
into one state (the coded amino acid). This is an approximate
formulation since the process after state aggregation is not Markovian
anymore.
I have now added another codon-based amino acid substitution model
that treats amino acids as ambiguities codons. The model is specified
by seqtype = 2 model = 5. This is an exact formulation.
(*) codeml. The number of categories in the BEB calculation under M2
and M8 is unintentionally set to 4 rather than 10. I have changed
this back to 10. The details of this calculation are in Yang et
al. 2005 MBE.
Version 4.9g, December 2017
(*) codeml. A bug caused the BEB calculation under the site model M8
(NSsites = 8) to be incorrect, with the program printing out warming
messages like "strange: f[ 5] = -0.0587063 very small." This bug was
introduced in version 4.9b and affects versions 4.9b-f. A different
bug was introduced in version 4.9f that causes the log likelihood
function under the site model M8 (NSsites = 8) to be calculated
incorrectly. These are now fixed.
Version 4.9f, October 2017
(*) baseml, nonhomogeneous models (nhomo & fix_kappa). Those models
allow different branches on the tree to have different Q matrices.
Roughly nhomo controls the base frequency parameters while gix_kappa
controls kappa or the exchangeability parameters (a b c d e in
GTR/REV, for example). I added the option (nhomo = 5, fix_kappa = 2),
which lets the user to define branch types, so that branches of the
same type have the same exchangeability parameters (a b c d e for GTR)
and base composition parameters, while branches of different types
have different parameters. Branch types are labeled (using # and $),
0, 1, 2, .... The labels should be consecutive positive integers.
The old options nhomo = 3 or 4 work for some models like GTR, but not
some other models which also have base composition parameters. In
this update, I think those options should work with all those models.
I have also edited the documentation (look for option variable nhomo
for baseml).
(*) baseml & codeml. i added an option fix_blength = 3
(proportional), which means that branch lengths will be proporational
to those given in the tree file, and the proportionality factor is
estimated by ML.
(*) codeml. The program does not count the parameters correctly for
model M0 when fix_kappa = 1. The bug was introduced in version 4.9c
and affects versions 4.9c-e. This is now fixed.
(*) codeml (seqtype = 2 model = 2). If you are analyzing multiple
protein data sets (ndata > 1) under the empirical models such as wag,
jtt, dayhoff. The results for the first data set are correct, but all
later data sets are analyzed incorrectly under the corresponding +F
models, that is, wag+F, jtt+F, dayhoff+F, etc. A bug in the program
means that for the second and later data sets, the equilibrium amino
acid frequencies are taken from the real data and not correctly set to
those specified by the empirical models. I note that this bug was
recorded in the update Version 3.14b, April 2005, but it was somehow
not fixed, even in that version. This is now fixed. Thanks to Nick
Goldman for reporting this again.
(*) evolver (options 5, 6, 7 for simulating nucleotide, codon and
amino acid alignments). If you choose the option of printing out the
site pattern counts instead of the sequences (specified at the
beginning of the control file such as MCbase.dat), and if you are
simulating two or more alignments, the program crashes after finishing
the first alignment. This is now fixed.
(*) mcmctree. The program crashes if you have a mixture of
morphological loci and molecular loci, if not all the morphological
loci are before the molecular loci. I have now fixed this.
I think this was never described anyway.
Version 4.9e, March 2017
(*) Edited the readme files to change the license to GPL.
(*) mcmctree. A bug was introduced in version 4.9b which causes the
program to read the fossil calibration information in the tree file
incorrectly, if joint (minimum and maximum) bounds are specified using
the symbol '<' and '>'. If you use the notation "B()", "L()", and
'U()', the information is read correctly. This bug was introduced in
version 4.9b and exists in 4.9c and 4.9d. Versions 4.9a and earlier
were correct.
Version 4.9d, February 2017
(*) mcmctree. Changed the default prior for rates for loci to
gamma-Dirichlet (dos Reis 2014), and updated the documentation as
well. It was set to the conditional i.i.d. prior (Zhu et al. 2015).
(*) mcmctree. Added Bayes factor calculation. A program called
BFdriver is included in the release, as well as a pdf document in the
folder examples/DatingSoftBound/BFdriverDOC.pdf. We suggest that you
use the exact likelihood calculation when you use this option, since the
normal approximation is unreliable when the power posterior is close to
the prior (when beta is small).
Version 4.9c, September 2016
(*) Added GPL license statement in various places.
(*) codeml. When two or more trees are in the tree file and the
specified model is the branch model, the program counts the number of
parameters correctly only for the first tree. For the second or later
trees, it overcounts the parameters in the model. The lnL and
parameter estimates are still correct, but the printed values for the
extra parameters are more or less random numbers and vary among
different runs of the same analysis. The bug was introduced in
version 4.7a and affects the versions since then until 4.9b. It is
now fixed. Thanks to Casey Bergman.
Version 4.9b, March 2016
(*) codeml. I have added the BEB calculation for clade model D. I
deleted the option of having two site classes under model D so that
both clade models C and D now assume three site classes. The
difference between models C and D is that in C, w1 = 1 is fixed, while
in D, w1 is a free parameter, just like w0. Both the ML iteration and
the BEB calculation under models C and D allow two or more clades
(branch types), but note that an increase in the number of clades by 1
roughly increases the computation by 10 times. Also the BEB
calculation under model D is about 10 times more expensive than that
under model C. I updated the codeml section of the documentation to
remove the description of the old models (like M1 and M2) and focus on
the new models that are in the program (like M1a and M2a).
baseml & codeml. If the tree has labels using the symbol #,
the program will crash. This is caused by a bug in the routine for
reading trees from the tree file. This is now fixed.
Version 4.9a, September 2015
codeml (seqtype = 2 for amino acid sequences) crashes when the dataset
is large, with more than 2M sites, say. This is due to a stupid bug
in an irrelevant part of the code. This is fixed in 4.9a.
Version 4.9, January 2015
(*) mcmctree. There are now two priors for locus rates (mu_i or rates
for loci), as summarized in Zhu et al. (2014 Sys Biol). They are
specified as follows
rgene_gamma = alpha_u beta_u alpha prior * gamma prior for overall rates for genes
where prior = 0 (default) is the conditional i.i.d. prior and prior =
1 is the gamma-dirichlet prior.
For example the following specifies the conditional i.i.d. prior with
alpha_mu = 100, beta_mu = 1000, and alpha = 0.5. Here the average
overall rate is 0100/1000 = .1 changes per site per time unit. alpha
= 0.5 describes the extent of rate variation among loci.
rgene_gamma = 100 1000 0.5 0 * conditional i.i.d. prior for locus rates
sigma2_gamma = 4 100 0.5 * prior for sigma^2 (for clock=2 or 3)
The sigma2_i prior has parameters alpha_u, beta_u, alpha, specified in
the same format.
(*) mcmctree. I have changed the option of setting the finetune step
lengths for the MCMC algorithms so that each parameter in the model
has its own step lengths. As a result, the step lengths are all
automatically adjusted during the burn-in, and cannot be fixed or
specified by the user. The line
finetune = 1: .0040 0.030 0.015 0.03 .55 * times, rates, mixing, paras, RateParas
in the control file, if present, is read but ignored.
(*) mcmctree. The calibration density for S2N (skew 2 normals) is
incorrectly implemented. While the correct formula should be
p0 * f1(t) + (1 - p0) f2(t)
the incorrect formula used is
p0 * f1(t) + f2(t).
Basically the two skew-normal densities was not given the correct weights.
If you run the program without data to make sure that the effective prior
is sensible, then this mistake should have little impact. The bug is now
fixed. Thanks to Mario dos Reis.
Version 4.8a, August 2014
(*) baseml (nhomo = 3 or 4). I have added the nonhomogeneous GTR
(REV) model with a set of abcde parameters for every branch. The
specifications are as follows.
model=7 nhomo=4 fix_kappa=0: one set of rate parameters abcde in GTR for the whole tree.
model=7 nhomo=4 fix_kappa=1: one set of rate parameters abcde in GTR for every branch.
In version 4.8, only the second option (one set of abcde parameters
for the whole tree) is implemented, no matter whether you choose 0 or
1 for fix_kappa. After this change, the use of fix_kappa in the nhomo
models is consistent between HKY and GTR. For HKY, the specifications
have always been as follows.
model=4 nhomo=4 fix_kappa=0: one kappa in HKY for the whole tree.
model=4 nhomo=4 fix_kappa=1: one kappa in HKY for every branch.
Note that nhomo=4 means one set of frequency parameters for every
branch on the tree.
I have also added a method for calculating the expected number of
changes from any nucleotide i to nucleotide j along every branch under
the nonhomogeneous models (nhomo=3 or 4). This is called EMC, and
accounts for the fact that the base frequencies are changing over
time, and so does the rate of substitution. It is invoked by using
RateAncestor = 1 and verbose = 1.
I tested only the nhomo=4 option carefully and the options nhomo=3 and
5 are not well tested.
Version 4.8, March 2014
(*) mcmctree. The prior for rates for loci (with ndata = 2 or more)
has changed in this version, and the gamma-Dirichlet prior is used.
The old i.i.d. prior is replaced and unavailable anymore. The new
gamma-Dirichlet prior is decribed in a paper
dos Reis, M., T. Zhu, and Z. Yang. 2014. The impact of the rate prior
on Bayesian estimation of divergence times with multiple
loci. Syst. Biol. 63:555-565.
(*) codeml. Added a Bayesian method for estimating distance t and dN/dS
ratio w in pairwise comparisons. The option is
runmode = -3 * -3: pairwise Bayesian
runmode = -3 1.1 1.1 1.1 2.2 * -3: pairwise Bayesian
The four numbers are the parameters in the gamma priors for t and w (a_t b_t a_w b_w).
The method is described in the following paper:
Angelis, K., M. dos Reis, and Z. Yang. 2014. Bayesian estimation of
nonsynonymous/synonymous rate ratios for pairwise sequence
comparisons. Mol. Biol. Evol. 31:1902-1913.
(*) baseml, codeml, mcmctree). In version 4.7a, I disabled the option
of reading species indexes in the tree file (for example, the number
19, instead of the sequence name, may be used in the tree file to mean
the 19th sequence in the sequence alignment). I have added this back.
Version 4.7 does not have this problem.
(*) codeml in paml 4.6 and 4.7 has a bug if your sequences have stop
codons. Stop codons have always been disallowed in the codon models
implemented in codeml. In versions 4.6-4.7, if there is a stop codon
in any sequence, the codons in all sequences at that site (column in
alignment) are changed into ??? and then the sequences are processed
in the usual way. The likelihood should then be the same as the
likelihood if that column is removed from the alignment. However, the
implementation of the idea was incorrect and gives wrong results if
the sequences do not contain any ambiguity characters and if you
choose cleandata = 0. I have now fixed this. The idea is not really
useful, so the good advice is that you delete the column of stop
codons before the data are analyzed using codeml.
Version 4.7a, October 2013
(*) codeml. The mutation-selection model of Yang and Nielsen (2008)
is noted to work for the site and branch-site models but not for the
branch models. This is now fixed.
(*) mcmctree. I fixed and modified the infinitesites program, which
generates the limiting distribution when the amount of sequence data
approaches infinity. This works for both the clock (clock=1) and
relaxed clock models (clock=2 or 3). The mcmctree tutorial explains
how to run this program.
Version 4.7, January 2013
(*) mcmctree. This version includes a method for dating divergences
using sampled tips, as in viral datasets. The prior for node ages in
the given rooted tree is generated using a
birth-death-sequential-sampling model, described in Stadler and Yang
(2012 Syst Biol, submitted). The SIV/HIV-2 example dataset analyzed
in the paper is in the examples/TipDate/ folder. Look at the readme
file there.
(*) codeml. Version 4.6 crashes if you have a lot of ambiguity codons
in the alignment, or precisely if the total number of fully determined
codons (61 in the standard code) and the ambiguous codons exceed 127.
I was using "unsigned char" to represent the codons, which go from 0
to 255 and can represent 256 distinct codons. Somehow I changed this
to "char", which on some systems goes from -128 to 127. Can't
remember why I made the change. Anyway this is now fixed. Versions
4.5 or earlier do not have this problem. Thanks to David Yu for
pointing out the problem.
Version 4.6, September 2012
(*) baseml. The model of autocorrelated-rates among sites (Yang 1995
Genetics 139:993-1005) is broken in paml versions 4.3 to 4.5
inclusive. Earlier versions were apparently fine. When you select
the auto-discrete-gamma model (alpha > 0, rho > 0) or the
nonparametric autocorrelated-rates model (fix_alpha = 0, ncatG = 3,
nparK = 4), the program aborts with the following error message:
"Error: use 10, 20, 32, 64, 128, 512, 1024 for npoints for legendre.."
This is now fixed. The option of nparK = 3 seems to have problems in
all versions, and the program prints out the following warning
message.
"nparK==3, double stochastic, may not work. Use nparK=4?"
This option is supposed to fit a Markov-dependent model with equal
proportions in the rate categies. I have not got time to investigate
this option, but a glance at table 3 in that paper indicates that this
model was not used in the table and probably never properly
implemented, so please follow the advice and use nparK = 4 instead.
(*) codeml. I have now added the site model M2a_rel of Weadick & Chang
(2012 Mol. Biol. Evol.), in which w2 > 0. This is specified using
NSsites = 22 while model M2a is specified as NSsites = 2 (with w2 >
1). M2a_rel is the null model for the likelihood ratio test based on
clade model C, suggested by Weadick & Chang (2012 Mol. Biol. Evol.).
Note that the very old site model M2 from Nielsen & Yang (1998
Genetics) assumes w0 = 0. Yang, Wong, Nielsen (2005 MBE) changed the
model so that 0 < w0 < 1 and renamed the model M2a. My tests using
the versions around that time reveals the following:
3.13d (M2): w0 = 0, w2 > 0
3.14a (M2a): w0 > 0, w2 >= 1
3.14b (M2a): w0 > 0, w2 >= 1
If seems that when we changed M2 into M2a, we also changed the
constraint on w2. I don't have a copy of version 3.14 to check.
(*) basseml & codeml. To following option
bootstrap = 1000
should produce 1000 bootstrap samples in the file boot.txt. A bug is
introduced in versions 4.4 and 4.5, so that garbiage is created
instead of bootstrap datasets. Versions 4.3 and earlier were correct.
This bug is now fixed.
(*) evolver. The simulation of codon sequences (evolver 6
MCcodon.dat) is incorrect when a genetic code different from the
universal code is used. This is a conceptual error I made and it has
been in all previous versions when simulation of codon sequences is
allowed. I have assumed incorrectly that setting the frequencies of
stop codons to 0 should be sufficient to get correct results. The
zero frequencies allow the program to identify the stop codons
correctly, but one still needs the code table to know which changes
are synonymous and which nonsynonymous. evolver has been using the
universal code to simulate codon sequences. Simulation using the
universal code has been correct, and simulation using the
mitochondrial code (for example) is affcted and the results should be
incorrect although the impact may be slight. This bug is now fixed.
A variable is added in the control file MCcodon.dat right below the
codon frequencies, which tells the program which genetic code to use.
1 * genetic code (0:universal; 1:mammalian mt; 2-10:see below)
The program also checks that the stop codons have frequency 0 and that
the frequencies of the sense codons sum to 1. Thanks to Mario dos
Reis for pointing this out.
(*) mcmctree. A bug is found in the calculation of the prior on rates
under the correlated-rates model (clock=3). This bug has been in
version 4 (July 2007) to version 4.5 (December 2011), inclusive. The
prior is calculated by multiplying the conditional densities of
equation (7) in Rannala & Yand (2007) over all interior nodes on the
master tree (see also Figure 1 in that paper). However, the rA used
in the program was not for the current branch: it was instead for the
ancestral branch. The correct formula used was correct for the root
of the master tree, but incorrect for other interior nodes. The
impact of the bug on posterior time estimates may be expected to be
small, but if the clock is seriously violated and rates change over
neighboring branches, this may be a concern.
Version 4.5, December 2011
baseml. I have added the joint ancestral reconstruction under the
nonhomogeneous models (nhomo = 3 or 4). This works for the model of
one rate for all sites, and does not work for the model of gamma rates
for sites or the partition models. Also I implemented the joint
reconstruction only and not the marginal reconstruction.
mcmctree. The program now prints out a file called FigTree.tre in the
current working directory that is readable by FigTree. The branch
lengths show the posterior means of times while the bars generated by
FigTree will indicate the 95% CIs. I have also added an option to
deal with TipDate data, such as viral sequences with dates. See
examples/TipDate/README.txt for more notes.
Version 4.4e, May 2011
mcmctree. I have added an option of automatically adjusting the step
lengths for proposals in the MCMC algorithm. The program will use the
burn in to automatically adjust the step lengths, using those values
specified in the control file as initial values.
codeml: The iteration method that updates one branch length at a time
(method = 1) is broken for the free-ratios model (model = 1) from
versions 4.3 to 4.4d inclusive. When codeml runs, you will see
messages like the following on the screen:
Warning rounding error? b=3 cycle=0 lnL=6036.9413757 != 6070.1069284
Version 4.2 and earlier are correct. This bug is now fixed.
Thanks to Tae-Kun Seo for reporting the bug.
Version 4.4d, March 2011
yn00. The option of using weighting (weighting = 1) in the
calculation of dN and dS by the YN00 method (Yang and Nielsen 2000)
has been broken since Version 4.4 (February 2010). A bug was
introduced when I changed the character coding in Version 4.4, so that
this bug affects Versions 4.4, 4.4a, 4.4b, and 4.4c. The bug
typically causes a crash. The result for weighting = 0 is not
affected. Versions before 4.4 are correct. This bug is now fixed.
Thanks for Charlie
(http://www.ucl.ac.uk/discussions/viewtopic.php?t=8726&highlight=yn00).
Version 4.4c, August 2010
mcmctree. A bug causes the exotic calibrations (gamma, skew normal
and skew t) to be sometimes ignored, if you have multiple loci
(partitions) and if some species/sequences are missing at some loci.
If all calibrations are bounds (minimum, maximum, and joint), or if
the partitions have the same number of species/sequences, the
calculation is still correct. This bug is in versions 4.1-4.4b. This
is now fixed. Thanks to Mike Steiper.
Version 4.4b, July 2010
mcmctree. A serious bug has been discovered which affects the option
of approximate likelihood calculation (usedata = 2). If the first
child of the root in the tree for any locus is a tip, the MLEs of
branch lengths in the in.BV file are assigned to wrong branches, so
that the results are entirely incorrect, and also very different from
those obtained using the exact likelihood calculation (usedata = 1).
Suppose the tree for a locus is (A, B), with a bifurcation at the
root, and with A and B representing species or clades. If A is a tip,
the program is incorrect, while if A is an internal node, the program
is correct. This bug is in versions 4.2b - 4.4a. Hopefully this is
now corrected. Thanks to Mario dos Reis.
mcmctree. I also changed the specification on L, U, and B bounds to
allow the user to specify the tail probabilities (which used to be
fixed at 2.5%) for each fossil calibration. See table 8 in
pamlDOC.pdf for details. You will still have to run the program
without data to get the effective prior of times used by the program.
Version 4.4a, June 2010
codeml (RateAncestor=1). For ancestral reconstruction under codon
models, the program prints out the translated amino acids when it
prints the codons. A bug caused the translation to be incorrect,
so instead of the amino acid, the program prints something like h or
H. This bug was introduced in Version 4.4, February 2010 when I
changed the way that the characters are coded. This is fixed now.
baseml (clock = 1, 2, or 3 Mgene option). If you assign different
rates for different genes (or site parations) and the model assumes
global clock or local clock, the program fails to print out the
relative rates for genes. This bug seems to be introduced between
3.13 and 3.14 and has since. The log likelihood value and estimates
of parameters are all correct, and the problem is only that MLEs of
rates for genes are not printed in the output file properly.
Version 4.4a, April 2010
In versions 4.2, 4.3, 4.4, Clade model D does not print the w
estimates for the site classes correctly. For the example run in
examples/datasets2/, the incorrect output looks like the following
site class 0 1 2
proportion 0.48590 0.10718 0.40692
branch type 0: 0.10557 4.17186 4.17186
branch type 1: 0.10557 4.17186 3.05340
while the correct results (from version 4.1) are
site class 0 1 2
proportion 0.48590 0.10718 0.40692
background w 0.10557 4.17186 3.05340
foreground w 0.10557 4.17186 0.95081
The ratios w2 and w3 for site class 2 are incorrectly printed, with w2
equal to w1. The branch lengths and lnL values etc. are all correct.
This bug affects versions 4.2, 4.3, and 4.4, and version 4.1 and
earlier versions appear to be correct. Also the bug affects clade
model D, and clade model C appears to be fine. This is fixed now.
Thanks to cajawe for reporting the bug (see
http://www.ucl.ac.uk/discussions/viewtopic.php?p=28240#28240).
Version 4.4, February 2010
I changed the way that the characters are coded in the programs, which
means that essentially every program in the package is modified. It
is likely that some bugs are introduced during the process.
codeml. The iteration algorithm specified by method = 1 when applied
to amino acid model of gamma rates among sites (seqtype = 2, alpha >
0, method = 1) is broken in versions 4.2-4.3b. The bug causes the
iteration to fail to converge, with messages like the following
printed on the monitor:
"Warning rounding error? b=5 cycle=0 lnL=1718.5667685 != 1749.4724545".
mcmctree. If the gamma prior for rates has alpha <1/3, the initial
values are generated incorrectly. This is now fixed.
Version 4.3b, November 2009
mcmctree. A bug was introduced in version 4.2, which causes the
program to ignore the minimum age constraint on the root age and uses
the maximum bound specified by RootAge in the control file, if there
is a calibration on the root and it is a minimum-age constraint. If
you have no calibration on the root, or if the calibration is an
maximum bound, both minimum and maximum bound or gamma distribution,
there is no problem. The error affects versions 4.2 and 4.3, and
version 4.1 is correct. This is now fixed. Thanks to Mathieu
Groussin.
evolver. I have now added back the option of printing out the site
pattern counts instead of the sequences (specified at the beginning of
the control file such as MCbase.dat). Use of pattern counts takes
less disk space. The option was removed in version 4.2b after the
algorithm for collaspsing sites into patterns was rewritten.
Version 4.3a, October 2009
mcmctree. A bug was introduced in version 4.3, which causes the root
age to be proposed incorrectly. The effect of the bug appears to be
subtle and is hard to assess. It means that the results from all
three clock models (clock = 0, 1, 2) are incorrect. Here is a bit more
detail about the bug. The internal maximum node age was set too high,
and caused loss of significant digits in a smart reflection algorithm
which reflects proposed values into the feasible interval. This
update fixes the bug. Please use this version to reanalyze your data
if you have used version 4.3. Version 4.2 does not have this bug.
mcmctree. I now supplies compiling options so that you can use hard
minimum hard maximum, or hard minimum soft maximum. The executables
are called mcmctreeHH and mcmctreeHS. See the commands for compiling
at the beginning of the file mcmctree.c.
Version 4.3, August 2009
mcmctree. The approximate likelihood calculation was unreliable,
because the Hessian matrix was not calculated reliably, especially if
the maximum likelihood estimates of some branch lengths are 0. I have
now implemented the approach of Seo, Kishino and Thorne (2004), which
appears to be more stable.
baseml & codeml: I changed these two programs to read fasta alignments
directly, working out the number of sequences and the sequence length
automatically. However, this does not work when you have multiple
alignments in one file, that is, if ndata = 2 or higher.
Version 4.2b, April 2009
mcmctree. There is a bug that leads to crash when usedata = 3 and when
the data at some loci have ambiguity characters while other loci do
not. usedata = 3 means that the program invokes baseml or codeml to
calculate the maximum likelihood estimates (MLEs) of branch lengths
and the variances of those MLEs. The temporary sequence alignment
files for the loci (tmp?.txt) generated by mcmctree have garbiage and
are unreadable by baseml (or codeml). This bug is now corrected.
(You can replace the wrong alignment file for the locus by your
original alignment and run baseml outside mcmctree.)
All programs. I rewrote the code for collapsing sites into patterns.
This should have no impact on user files, except that evolver does not
produce data files in the "pattern-count" format anymore. Now two
formats (paml/phylip format and PAUP nexus format) are used, specified
by the first line of the control data file (e.g., MCbase.dat,
MCcodon.dat, MCaa.dat). This pattern-count format is still accepted
as input by baseml, codeml, etc. (See the section on "Site pattern
counts" in the documentation).
baseml & codeml. A bug was introduced to the local clock model
(clock=6), so that the program generates the following error message:
"need fossils for this locus". For example, running codeml on the
control file /examples/MouseLemurs/codonml2.ctl will do so. This bug
is now fixed.
codeml: The bug introduced in version 4.2 concerning the free-ratios
model (Yang 1998 MBE) was not fixed properly in version 4.2a (see the
notes for 4.2a below). The results are incorrect when there are
exactly 7 branch types (for example when you run the free-ratios model
on a 5-species tree) but are correct for other numbers of branch
types. I hope this is now finally fixed.
Version 4.2a, February 2009
codeml: Two bugs were known to have been introduced in version 4.2
when I rewrote the code for the branch-site and clade models. The
first is the null hypothesis in the branch-site test (model = 2
NSsites = 2, fix_omega = 1 omega = 1). The reference for this model
is Yang, Wong & Nielsen (2005), and Zhang et al. (2005). The
alternative model (fix_omega = 0) is fine.
The second bug is in the free-ratios model (Yang 1998 MBE).
The results from those two models are incorrect and should be ignored
if you used version 4.2. Both mistakes are fixed in this update.
Thanks to W.P. Zhou and BYLee.
Version 4.2, December 2008
MCMCtree: The implementation of the lower bound and of the upper bound
is changed. The improper density of equation (15) of Yang and Rannala
(2006) for the lower bound is noted to push up divergence times, a
feature we consider undesirable. Now a truncated Cauchy distribution
is used instead. This will be described in more detail somewhere.
The default file name for the MCMC samples is mcmc.out. To change
this, add the following line in the control file mcmctree.ctl:
mcmcfile = mcmc.out
codeml: Clade models C and D are extended to accommodate more than two
branch types. The old models accept two branch types only, referred
to as foreground and background branches. The structure of the new
models are below. You use branch/node numbers to specify the branch
types (with #0 to be the default, followed by #1, #2, #3, ...). As
before, the BEB calculation is implemented for clade model C only, and
is not available for clade model D. The BEB calculation is expensive,
so I expect clade model C will work if you have only a few branch
types. For each additional branch type, the BEB computation is 10
times more expensive.
Site class Proportion w ratio
0 p0 0 < w0 < 1
1 p1 w1 (w1 = 1 is fixed in model C)
2 p2 w2, w3, w4, ...
I also added an option of fixing the last w to 1 in clade model C,
specified in the control file as follows. For example, if there are
three branch types in the tree (see the table above), the last w fixed
will be w4.
fix_omega = 1
omega = 1
The two versions of the clade model can be compared to test whether
the last w is > 1.
codeml: branch model with amino acid chemical properties, specified
using model = 2 and aaDist = 7. Under this model, the types of
branches are specified just like the branch model (by labelling
branches). For each branch type, the program estimates a few w's,
depending on specifications in the file OmegaAA.dat. This model seems
to be broken for a long time since its implementation. Look at the
readme.txt file in examples/mtCDNAape/.
Version 4.1, August 2008
MCMCtree: A few changes have been introduced to this program. The
format for specifying fossil calibrations in the tree file has been
changed, so that the gamma distribution is specified as 'G(alpha,
beta)' and the old format '>0.6=0.7<0.8' does not work anymore. The
bounds can be specified using either the old form '<0.8' or the new
form 'U(0.8)'. Two exotic distributions, skew normal and skew t, are
added, using the format 'SN(location, scale, shape)' and 'ST(location,
scale, shape, df)'. I edited the documentation pamlDOC.pdf and
rewrote the part for mcmctree. The format of the MCMC output file
mcmc.out has been modified so that it is readable by Andrew Rambaut's
Tracer program. Note that MCMCtree starts taking samples and writing
into the file only after burnin.
codeml. The mutation-selection models of Yang and Nielsen (2008
Mol. Biol. Evol. 25, 568-579) are specified using the options
CodonFreq = 6 or 7 * 6:FMutSel0, 7:FMutSel
estFreq = 1
Use examples/mtCDNAape/codeml.HC.ctl to duplicate results in table 1
in the paper. It seems that some results are in the mlc file while
some other results are in the rst and rst1 files.
codeml (branch model): When codon sequences are analyzed under the
branch model (model = 1 or 2, NSsites = 0), the main output file
includes a tree under the line "w ratios as labels for TreeView:".
This line was introduced in version 3.15. However, the branch lengths
in the tree had the dN values, while I intended them to be the
original branch lengths (which are t, the number of nucleotide
substitutions per codon). This bug may also affect the results of
ancestral reconstruction under the same branch model. This bug is
fixed now.
baseml and codeml (discrete-gamma model): My intention has been to use
the mean and not the median to represent all rates in each category
when the discrete-gamma model is implemented. See Yang (1994
J. Mol. Evol. 39:306-314) for the distinction. However, a bug caused
a few recent versions to use the median instead. This affects baseml
versions 3.14c and 3.15, while version 3.14b or earlier and version 4
or later are correct. codeml in version 4a & 4b were incorrect while
other versions are correct. I hope both baseml and codeml are correct
from version 4c.
baseml and codeml (branch and clade labels): The local clock models
(clock = 2) in the two programs and also the branch or branch-site
models in codeml use labels to identify branches. When nested clade
label ($) are used, the program may get confused so that the branches
are not labeled correctly. I think this is not fixed in version 4c.
Version 4b, January 2008
mcmctree. Changed a variable name in the control file from MaxRootAge
to RootAge. This should be specified in any of the following formats:
RootAge = <1.0 * constraint on root age, used if no fossil for root.
RootAge = '<1.0' * constraint on root age, used if no fossil for root.
RootAge = <1.0>0.8 * constraint on root age, used if no fossil for root.
RootAge = '<1.0>0.8' * constraint on root age, used if no fossil for root.
Added some text around the output, to make the results more comprehensible.
evolver (option 8). option 8 for calculating bipartition distances
between trees is broken. I changed the program to prepare for a
figure and forgot to remove the changes afterwards. This is now
fixed.
Version 4, July 2007
mcmctree. The MCMC algorithm of Rannala and Yang (2007) dealing with
rate drift is added. The section of manual on mcmctree is now
rewritten. An example data set is included in the
examples/DatingSoftBound/ folder.
codeml. This now implements the mutation-selection models of codon
substitution of Yang and Nielsen (2008). I should include more
details after the models are better tested, for example, after the
paper is written.
baseml. The log likelihood under the gamma-rates model (that is, when
alpha > 0 in the control file) is not calculated correctly, or not as
intended. Version 3.14a is correct, but versions 3.14c and 3.15 are
incorrect. I am not sure about version 3.14b, as I have not kept a
copy. The bug causes the median to be used to represent all rates in
each category of the discre-gamma model whereas the mean was intended.
See the discussion around equation (10) and the paragraph below it in
Yang (1994. J. Mol. Evol. 39:306-314). The bug is not expected to
have much effect on tree topologies or branch lengths. Thanks to
Simon Whelan for reporting the bug.
Version 3.15, March 2006
evolverNSbranches (simulation program to generate data under the
branch models of codon substitution). A bug was introduced after
version 3.14a so that versions 3.14b (?), 3.14c, and 3.15 have a bug
that makes the program abort prematurely. This is now corrected.
Thanks to N. Clark for reporting the bug.
Version 3.15, January 2006
evolver. The simulation option of the evolver program in version
3.14b (?), 3.14c, and 3.15 has a serious bug that makes the program
generate nonsensible sequences when the user tree has 0 internal
branch lengths. The bug was introduced between versions 3.14a (which
is correct) and 3.14c (which is wrong). I don't seem to have kept
3.14b to check whether it is in error. In this case, the smart idea
that led to the bug was the thought that no evolution occurs if the
branch length is 0 (or less than 1e-20, to be more precise). If the
branch length is 0, there is no need to evolve the sequence along the
branch, so I decided to copy the sequence at the start of the branch
to the node at the end of the branch. This would be fine, except that
I introduced a bug at the same time that caused the program to skip
generating sequences at all nodes that are descendents of the