Once we have summarised the posterior estimates, we can fit skew-t (ST) distributions to each of the
mean posterior time estimates for the 71 internal nodes of the mammal phylogeny. To do this, we have
written the R script 00_Fit_skewT.R
,
which uses the R function sn::st.mple
(Azzalini, 2020)
for that purpose. In order to reproduce the results, a seed number (i.e., 12345
) has been set.
Note that the method that optimises the finding of the best fitted ST distributions to the posterior
estimates for each node is the BFGS
(Nash, 1990).
We use a for
loop limited to 50 iterations where, during each iteration, the
BFGS
optimization approach tries to find the best ST distribution fitted to the corresponding node.
If this approach was to fail after 50 iterations, then the method by default in that R function (nlminb
)
is used.
NOTE: We first used the
nlminb
method (Azzalini, 2020) as a first choice of optimization search. Nevertheless, it was difficult to converge for some of the deeper nodes in the mammals phylogeny, while theBFGS
approach would find it either in the first iteration or after 2 or 3 searches. Therefore, we decided to use first theBFGS
optimization algorithm and, if it failed, thennlminb
would be used.
The file architecture is organised as follows:
03_Fit_ST_to_posteriors/
|- 00_fitST*/ <-- Directories "00_fitST", "00_fitST_new", and "00_fitST_newchr"
| |- logs/
| |- plots/
| |- Rdata/ <-- Only 2 out objects are provided in the
| | repository (lack of space). The other
| | objects can be generated if the script
| | is re-run or downloading this file: https://www.dropbox.com/s/fiyunh6puaro3kr/SeqBayesS1_FitST_old_Rdata_priorandpost.zip?dl=0
| |- Rout/
|
|- 01_trees_with_ST_calibs/
| |- Rinp/
| |- Rout/
|
|- 02_MCMCtree_prior_ST*/ <-- Directories "02_MCMCtree_prior_ST", "02_MCMCtree_prior_ST_new", and "02_MCMCtree_prior_ST_newchr"
| |- MCMCtree/ <-- only under "02_MCMCtree_prior_ST"
| |- plots/
|
|- 00_Fit_skewT.R
|- 01_Add_STcalibs_to_tree.R
|- 02_Eval_skewT.R
|- README.md
When running the code to fit ST distributions to each posterior node age, the following takes place
(note that the links provided refer to directory 00_fitST
):
logs
: thelog_file_convergence_BFGS.txt
file is generated. It contains a record of the trials needed to fit a ST distribution to a given node. If only one iteration has been needed, as it has happened in this analysis, only the headerWorking with node t_nXXX...
is printed out.plots
: we generate one plot per node in which the fitted ST distribution is drawn on top of the posterior distribution. The resulting images are used to evaluate how good the ST fits the posterior distribution.Rdata
: we generate two R objects,ST.fitted.dists.RData
andST.fitted.objects.RData
with the resulting ST distributions. This will allow us to later load them in other R scripts without the need to re-run the search. Objectsprior.divtimes.RData
andpost.divtimes.RData
can be generated if the script is re-run or downloading this file.Rout
: this directory contains a tab separated file with the values of the parameters of each ST distribution,ST.fitted.dists.G2.40.tsv
. The output fileST.fitted.dists.G2.40.tsv
remains in that main directory as it is subsequently used by the script01_Add_STcalibs_to_tree.R
to add the corresponding ST calibrations to the newick tree.
We then use the R script 01_Add_STcalibs_to_tree.R
to add the ST calibrations in MCMCtree
format to the tree file. This script will be using the following input files:
FigTree_72sp_nodelabels.tree
: the tree file with numerical labels that indiciate the node positions. This newick tree was obtained from theout.txt
file output byMCMCtree
during the estimation of posterior times.Node_calibrations_mammalia_72sp.txt
: mapping file with three columns: (i) the name of the clades for which calibrations had been used to in the first Bayesian analysis, (ii) the soft bound calibrations used in this analysis, and (iii) the numbers corresponding to the node position that match the newick tree described above.ST.fitted.dists.G2.40.tsv
: fitted ST distributions using thesn::st.mple
R function as described in step 1 above. It contains several columns: (i) the number corresponding to the node position inMCMCtree
format, i.e.,t_nXX
; (ii) four columns with the four parameters of the ST distribution, and (iii) two columns with the ST distribution written inMCMCtree
format to be appended in the tree with all the decimals or with just three decimals.
Last, two output files are created:
72sp_MAMMALS_atlantogenata_tarver2016_71STcalib.tree
: this file has all the nodes
that were calibrated with soft bound distributions in the first Bayesian analysis with the corresponding fitted ST distributions.72sp_MAMMALS_atlantogenata_tarver2016_71STcalib_rounded.tree
: same content that the file described above has, but the values for the parameters have been rounded to three decimals.
The tree with the ST distributions generated in the previous step (71 ST distributions) is used
again by MCMCtree
to collect samples from the prior. This is done to check that the ST distributions fitted
in R are sensible and correspond to our expectations (i.e., there is not a mismatch between the "user-specified" priors and the "effective" priors that MCMCtree
will be using as node calibrations).
All analyses regarding this step can be found in the directory 02_MCMCtree_prior_ST
. Its structure is the following:
MCMCtree
This directory contains the input files used to runMCMCtree
here as well as (i) a file with the inferred dated tree in newick format, (ii) the seed number for each MCMC run, and (iii) another file with the corresponding samples collected for the parameters of estimated. The resulting phylogeny and the samples collected for the two chains ran can be found here too. Note that there are other files output byMCMCtree
but, due to the limited size a file can have in a repository, they have not been included here.plots
The R script02_Eval_skewT.R
is then used to generate several plots in which we add the fitted ST distributions previously calculated (i.e., "analytical ST", which were saved inRData
files here and are now loaded in this step) against the estimated distributions that result from averaging across all the collected samples gathered byMCMCtree
when sampling from the prior (i.e., the estimates on prior times when calibrating the tree with the analytical ST distributions) and from the posterior (i.e., the posterior estimates inferred during the Bayesian dating analysis described here).
Unfortunately, we did not keep track of the R version with which we ran the scripts here, so you
might not get exactly the same results that you see here if you re-run the scripts with the R
function sn::st.mple
.
If you re-run the scripts with the results obtained with MCMCtree
(see here
to find the link to download the data) with the same R version we did (we are unsure about which one this was,
but it was one of the released R versions between 2018 and 2020), you will get the results we provide in
this directory when you use the data in 00_main_tree_T2/01_MCMCtree_posterior_old
.
Note that removing 11 genes out of the final filtered genes in the alignment did not have an impact on the
estimated posterior divergence times as we show
here
and here
(see also further comments here).
Consequently, we did not repeat all the steps described above to fit the ST distributions again to the alignment without
the 11 genes because we would not get any significant differences. You can also find in directory
00_fitST_new
the results with the generated fitted ST distributions to the posterior densities obtained when using the alignment
without these 11 genes -- you can see that there are no significant differences here
and in individual plots here
(found within 02_MCMCtree_prior_ST_new/plots
directory)
between the ST distributions.
We also re-run the analyses with calibrations updated following the updated geochronology as of September 2021.
Note that the estimated posterior divergence times do not considerably change when using this set of updated calibrations
if compared with the ones we had used before (see plots here).
You can find in directory 00_fitST_newchr
the results with the generated fitted ST distribution to the posterior densities using this set of updated calibartions. In addition,
you can see that there are no significant differences between this fitted ST distributions to the ones fitted to the
previous analysis with old calibrations (see here and in individual plots
here).
Therefore, considering that fitting ST distributions is done using a heuristics approach and that there are also
biases during the MCMC sampling, we decide to keep using the same ST distributions we have used in the previous
analysis.
Note: Calibration "Dipodidae-Muroidea" was placed in node 110 when using the old set of calibrations. This calibration is place in node 111 when using the updated set of calibrations.
The plots show that the ST distributions are sensible (see image below), and so they can be used to continue with the sequential
Bayesian dating analysis. The ST distributions estimated with the sn::st.mple
R function are plotted in red, those estimated when
sampling from the posterior with MCMCtree
when using the old calibrations in light green, and when sampling from the posterior
with MCMCtree
when using the calibrations that follow the updated geochronology as of September 2021 in black:
Fig S7. Assessment of the skew-t (ST) distributions fitted to the internal nodes of the 72-species mammal tree. The ST distributions estimated with the sn::st.mple R function are plotted in red, those estimated when sampling from the posterior with MCMCtree when using the old set of calibrations in light green, and when sampling from the posterior with MCMCtree when using the calibrations with geochronology updates (as of September 2021) in black.