diff --git a/README.md b/README.md index 1985217..c3a1344 100644 --- a/README.md +++ b/README.md @@ -1,62 +1,107 @@ # bs_inBV -A tool to help generate the file in.BV when using "complex" substitution models for MCMCTree. Particularly, the hessian is approximated by a bootstrap approach. -# Idea # -People have increasingly recognized the importance of using more complex substitution models in modeling sequence evolution, particularly for deep-time evolution and in case the sequence divergence is large. MCMCTree supports only a number of substitution models. These models, however, may be available in other software specifically used for tree construction, e.g., IQ-Tree and RAxML. +A tool with which the branch lengths, gradient, and Hessian used by `MCMCtree` to approximate the likelihood calculation are estimated with complex amino acid substitution models. Particularly, `bs_inBV` will return the `in.BV` file with the inferred branch lengths, gradient, and Hessian; the latter is approximated using a bootstrap approach. + +## Idea + +People have increasingly acknowledged the importance of using more complex amino acid substitution models in modeling sequence evolution, particularly when analysing deep phylogenies. `MCMCtree` supports various amino acid substitution models, being LG+F the most complex one. Nevertheless, more complex models are available in other phylogenetic software such as `IQ-TREE` or `RAxML`. + +When enabling the approximate likelihood calculation with `MCMCtree` (i.e., `usedata = 2` followed by `usedata = 3`; please consult the [PAML documentation for details](https://github.com/abacus-gene/paml/blob/master/doc/pamlDOC.pdf)), `CODEML` (`BASEML` if nucleotide data) will be first "called" by `MCMCtree` when using `usedata = 2` to estimate the branch lengths, the gradient, and the Hessian under a maximum-likelihood approach; vectors and matrix that will be output in the so-called `in.BV` file. Subsequently, `MCMCtree` will use the branch lengths, gradient, and Hessian in this `in.BV` file to approximate the likelihood calculation during the MCMC when enabling `usedata = 3` in the control file. Particularly, `bs_inBV` has been written to use complex amino acid substitution models available in `IQ-TREE` to estimate the branch lengths and **approximate the Hessian matrix by calculating the negative inverse of the bootstrap covariance matrix of branch length estimates**. In addition, `bs_inBV` will then set all the elements of the gradient vector to zero. -To work with the approximate likelihood method of MCMCTree (usedata=2), one needs to get the MLE (maximum likelihood estimate) of the branch length, as well as the gradient and hessian evaluated at the MLE. This tool takes the advantages of the abundant substitution models of IQ-Tree. Briefly, the tool uses IQ-Tree's estimate of the branch length, set the gradient to all zeros, and importantly, **approximate the hessian by calculating the negative inverse of the bootstrap covariance matrix of branch length estimates**. ![image](https://github.com/evolbeginner/bs_inBV/assets/8715751/6b7ae95a-f018-4331-8812-720601f637ed) +## Installation + +### Ruby gems -# Installation -Make sure [RUBY](https://www.ruby-lang.org/en/) is installed. +First and foremost, please make sure that [Ruby](https://www.ruby-lang.org/en/) is installed. In addition, you will need to install the following gems before running `bs_inBV` using the command `gem install `. The gems to be installed are the following: -Many scripts included in this product require the following RUBY packages. If any is not installed, please install it by `gem install package_name`. * parallel * colorize +* find +* getoptlong +* time +* fileutils +* tmpdir +* bio +* bio-nwk -This product also employs several other computational tools. Please ensure that you have them installed. -* [PAML](https://github.com/abacus-gene/paml) -* [IQ-Tree](http://www.iqtree.org/) -* [newick_utilities](https://github.com/tjunier/newick_utils) - -# Usage # -1. Perform a regular MCMCTree analysis. To save time, you can run with only the prior (e.g., set usedata = 0 in mcmctree.ctl) and very few iterations (set burnin=1, sampfreq=1, nsample=1 in mcmctree.ctl). - The reason for this step is to make sure the order of the tips in the file in.BV generated by MCMCTree is the same as that of the species.tree specified by the user. - -2. Extract a "reference" tree from the file in.BV generated by MCMCTree in the previous step. - - `sed '4!d' in.BV > ref.tre` - -3. Run the following - - `ruby create_hessian_by_bootstrapping.rb --ali alignment --calibrated_tree species_tree --outdir C20 --ref ref.tre --force -b 1000 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf` - - Arguments: - * `--ali`: alignment file - * `--calibrated_tree`: the species tree used in regular MCMCTree - * `--outdir`: output directory - * `--force`: tells the program to overwrite the output if it exists - * `-b`: the no. of bootstraps (note that only traditional bootstrapping is allowed as IQ-TRee's UFB cannot be used for a species tree with a fixed topology) - * `--cpu`: no. of cores used in IQ-Tree - * `-m`: the model passed to IQ-Tree - * `--mcmctree_ctl`: the control file for MCMCTree - * `--pmsf`: use the PMSF approximation - -4. Output - Go to the folder "Sample". Run `ruby create_hessian_by_bootstrapping.rb --ali combined.phy --calibrated_tree species.trees --outdir C20 --ref ref.tre --force -b 100 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf`. Note that as detailed above, the files "ref.tre" and "mcmctree.ctl" are obtained by running a regular MCMCTree analysis prior to using create_hessian_by_bootstrapping. The example data are simulated using LG+C20+G assuming a root age of 4.0 Ga. For the details of simulation, see [link](https://sishuowang2022.wordpress.com/2023/06/20/substitution-models-lgcxx-and-cxxlg-differ-in-iq-tree/). - * C20/split: IQ-Tree output - * C20/mcmctree: MCMCTree results. The branch length estimated by the specified model (in the example "-m LG+G+C20"), gradients, and the approximated hessian by bootstrap in the in.BV - -# Notes # -1. With `--run_mcmctree`, MCMCTree will be run directly after generating in.BV. In case you want only the file in.BV based on your specified model say LG+G+C60, please do not use `--run_mcmctree`. -2. Without `--pmsf` IQ-Tree is much slower because it is the traditional Cxx model that will be applied in IQ-Tree. -3. In case of problems due to invertible variance-covariance matrix of the branch lengths, try either increasing bootstrap no. or avoiding too closely related species. -4. For more details, please see **Note S4 and Figs. S9-S10** in the original refenrence (see below). -5. As suggested by Sandra Alvarez [link](https://github.com/evolbeginner/bs_inBV/issues/1), **installation of some necessary packages are not clearly stated in the current version of README.md**. So users should be very careful when using v1.0 of the present tool. - -# How to cite -Dating the bacterial tree of life based on ancient symbiosis Sishuo Wang, Haiwei Luo bioRxiv 2023.06.18.545440; doi: https://doi.org/10.1101/2023.06.18.545440. +### Third-party tools + +In order to run `bs_inBV`, you will need to install the following programs: + +* [`PAML`](https://github.com/abacus-gene/paml). Please refer to the [PAML Wiki](https://github.com/abacus-gene/paml/wiki#installation) for more details regarding the installation procedure. +* [`IQ-TREE`](http://www.iqtree.org/). Please refer to the [IQ-TREE website](http://www.iqtree.org/#download) for more details regarding the installation procedure. +* [`newick-utils`](https://github.com/tjunier/newick_utils). If you had any issues when installing `newick-utils` from the source code, please download [the pre-compiled binaries of Newick Utilities v1.6](https://web.archive.org/web/20210409163921/http://cegg.unige.ch/pub/newick-utils-1.6-Linux-x86_64-disabled-extra.tar.gz). These binaries are available on [this website](https://web.archive.org/web/20210409163921/http://cegg.unige.ch/newick_utils). +* [`R`](https://cran.r-project.org/). Please make sure that you follow the installation instructions on the main website. WSL users will need to install R via the terminal too. In addition, please make sure that the `MASS` R package is installed as it is used by script [`reorder_node.rb`](reorder_node.rb). + +### Changes to ruby scripts prior to running `bs_inBV` + +At the time of writing, `bs_inBV` has pre-defined some paths an aliases to execute third-party tools in some of its ruby scripts. If you are to use this tool under a different environment, please make sure you change the following lines to match your PC settings: + +* Users will need to update the path to PAML programs in line 40 in script [`do_mcmctree.rb`](lib/do_mcmctree.rb). If users have an alias to execute `MCMCtree`, they may also have to change line 43 accordingly. +* If users have not exported `IQ-TREE` to their PATH or have an alias different from the one specified in line 26 in script [`create_hessian_by_bootstrapping.rb`](create_hessian_by_bootstrapping.rb), they will need to change that too. The same applies to lines 27-28 for `newick-utils` programs and line 29 for `MCMCtree` in that same script. + +## Usage + +* If you have already run `MCMCtree` and have the `in.BV` file, please run the following command to obtain a "reference" tree that will be subsequently used by `bs_inBV`: + + ```sh + sed '4!d' in.BV > ref.tre + ``` + +* If you do not have an `in.BV` file, then you can run a quick dummy analysis with `MCMCtree` to obtain one. Note that the main reason for running this program now is just to make sure that the order of the tips that will be output in the `in.BV` is the same as that of the input tree file specified by the user. To save time, you can ask `MCMCtree` to sample from the prior (i.e., no data will be used, set `usedata = 0` in [the control file](sample/mcmctree.ctl)) and specify few iterations (e.g., set `burnin = 1`, `sampfreq = 1`, `nsample = 1` in [the control file](sample/mcmctree.ctl)). If you are using the control file given in the [`sample`](sample/mcmctree.ctl) directory, please also make sure that you set `usedata = 2`. If you have your own control file, please make the changes accordingly. Once you have a "dummy" `in.BV` file, you can extract the "reference" tree: -You may also need to cite corresponding papers for the use of PAML, IQ-Tree, and Newick_Utilities. + ```sh + sed '4!d' in.BV > ref.tre + ``` + +* Now, you are ready to run `bs_inBV`! Please execute this tool by running the following command (make sure that the paths to the third-party tools have been modified to match your settings as [aforementioned](README.md#third-party-tools)): + + ```sh + # Run from the same directory where you have this script + # You can use relative/absolute paths to your input files + ruby create_hessian_by_bootstrapping.rb --ali --calibrated_tree --outdir --ref /ref.tre --force -b 1000 --cpu -m LG+G+C20 --mcmctree_ctl /mcmctree.ctl --run_mcmctree --pmsf + ``` + + > Options: + > + > * `--ali`: path to alignment file. + > * `--calibrated_tree`: path to input calibrated tree file that will be used by `MCMCtree`. + > * `--outdir`: path to the output directory. + > * `--force`: tells the program to overwrite the output files, if any. + > * `-b`: number of bootstraps that `IQ-TREE` will run. Please note that only traditional bootstrapping is allowed as the UFB approach implemented in `IQ-TREE` cannot be used when using a fixed tree topology (which is required by `MCMCtree`). + > * `--cpu`: number of cores that will be used by `IQ-TREE`. + > * `-m`: amino acid substitution model that `IQ-TREE` will use for branch lengths inference. + > * `--mcmctree_ctl`: path to the control file that will execute `MCMCtree`. Please make sure that this control file has `usedata = 2 in.BV` if you want to run `MCMCtree` after `bs_inBV` outputs the `in.BV` file (enable this feature by adding option `--run_mcmctree`). + > * `--pmsf`: use the PMSF approximation + +### Example + +There is a dataset that you can use in the [`sample`](sample) directory to test the usage of `bs_inBV`. First, please clone this repository and follow all the steps detailed above with regards to [gems, third-party tools, and code changes during the installation procedure](README.md#installation). Once everything is ready, then go to directory [`sample`](sample) and execute the following command: + +```sh +# Run this command from directory `sample` +ruby ../create_hessian_by_bootstrapping.rb --ali combined.phy --calibrated_tree species.trees --outdir C20 --ref ref.tre --force -b 100 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf +``` + +Once the command above finishes, you will see the following output files: + +* `C20/split`: this directory will contain all the output files generated by `IQ-TREE`. The name you choose for option `--outdir` will be given to the first directory (e.g., `C20` in this example). +* `C20/mcmctree`: if option `--run_mcmctree` is not enabled, this directory will contain your input files and the `in.BV` file. If you decide to run `MCMCtree` by enabling option `--run_mcmctree`, then you will also find the output files generated by `MCMCtree` in this directory. Please note that the `in.BV` file will contain the vector of branch length estimated using the amino acid substitution model defined by the user (e.g.,`-m LG+G+C20` in this example), the gradient vector, and the Hessian matrix approximated using a bootstrap approach. + +> **NOTE 1**: Please check above to learn how to generate file `ref.tre`. For more details about how to configure the control file for `MCMCtree`, please check the [PAML documentation](https://github.com/abacus-gene/paml/blob/master/doc/pamlDOC.pdf). +> **NOTE 2**: The example data were simulated under model LG+C20+G and a root age of 4.0 Ga. For more details about the simulation, please [read this blog post](https://sishuowang2022.wordpress.com/2023/06/20/substitution-models-lgcxx-and-cxxlg-differ-in-iq-tree/). + +## Notes + +1. When using `--run_mcmctree`, `MCMCtree` will be execute right after generating the `in.BV` file. If you only want to obtain the `in.BV` file based on your specified amino model (e.g., LG+G+C60), please do not use `--run_mcmctree`. +2. Without `--pmsf`, `IQ-TREE` is much slower because the traditional Cxx model is used by `IQ-TREE`. +3. In case you experience any issues related to inverting the variance-covariance matrix of the branch lengths, please try to (i) increase the number of bootstrap or (ii) avoid including too closely-related species in your phylogeny. +4. For more details, please see **Note S4 and Figs. S9-S10** in the original reference (see below). + +## How to cite + +Dating the bacterial tree of life based on ancient symbiosis Sishuo Wang, Haiwei Luo bioRxiv 2023.06.18.545440; doi: https://doi.org/10.1101/2023.06.18.545440. +You may also need to cite corresponding papers for the use of `PAML`, `IQ-TREE`, and `Newick Utilities`. diff --git a/create_hessian_by_bootstrapping.rb b/create_hessian_by_bootstrapping.rb index ff1ba75..9addce7 100644 --- a/create_hessian_by_bootstrapping.rb +++ b/create_hessian_by_bootstrapping.rb @@ -9,9 +9,11 @@ require 'find' require 'colorize' -require_relative 'lib/Dir' -require_relative 'lib/processbar' -require_relative 'lib/do_mcmctree.rb' +$LOAD_PATH << './lib' + +require 'Dir.rb' +require 'processbar.rb' +require 'do_mcmctree.rb' ############################################################ @@ -21,7 +23,7 @@ DIR = File.dirname(__FILE__) LIB_DIR = File.join(DIR, 'lib') -IQTREE = 'iqtree' +IQTREE = 'iqtree2' NW_STATS = 'nw_stats' NW_TOPOLOGY = 'nw_topology' MCMCTREE = 'mcmctree' @@ -277,6 +279,4 @@ def split_ali_file(ali_file) `bash #{FIGTREE2NWK} -i FigTree.tre > figtree.nwk` puts "Done!" if $? == 0 end -end - - +end \ No newline at end of file diff --git a/lib/Dir.rb b/lib/Dir.rb index 8259686..08b7395 100644 --- a/lib/Dir.rb +++ b/lib/Dir.rb @@ -1,3 +1,5 @@ +#! /usr/bin/env ruby + require 'find' @@ -76,6 +78,4 @@ def getFilesBySuffices(indir, suffices) def get_file_path(file) path = File.symlink?(file) ? File.readlink(file) : file return(path) -end - - +end \ No newline at end of file diff --git a/lib/do_mcmctree.rb b/lib/do_mcmctree.rb index 348a614..5169847 100644 --- a/lib/do_mcmctree.rb +++ b/lib/do_mcmctree.rb @@ -28,7 +28,7 @@ require 'time' require 'colorize' -require 'Dir' +require 'Dir.rb' ################################################################# @@ -37,7 +37,7 @@ DIR = File.dirname(__FILE__) FIGTREE2NWK = File.join(DIR, "figtree2tree.sh") -PAML_DIR = File.expand_path("~/software/phylo/paml/") +PAML_DIR = File.expand_path("~/PAML/bin/") MCMCTREE_CTL = File.join(PAML_DIR, 'mcmctree.ctl') CODEML_CTL = File.join(PAML_DIR, 'codeml.ctl') MCMCTREE = "mcmctree" @@ -433,4 +433,4 @@ def run_codeml(outdir, seqtype, ndata, alpha, ncatG, cpu) puts puts -end +end \ No newline at end of file diff --git a/sample/mcmctree.ctl b/sample/mcmctree.ctl index 8513a23..f1d5774 100644 --- a/sample/mcmctree.ctl +++ b/sample/mcmctree.ctl @@ -1,33 +1,29 @@ seed = -1 -* seqfile = tmp/seqs.phy -* treefile = tmp/calibration_tree.txt seqfile = combined.phy treefile = species.trees mcmcfile = mcmc.txt outfile = out.txt - ndata = 1 - seqtype = 2 - usedata = 2 in.BV 1 - clock = 2 -* RootAge = 'B(3.20,4.51,0.025,0.01)' * safe constraint on root age, used if no fossil for root. + ndata = 1 * Number of partitions + seqtype = 2 * 0: nucleotides; 1:codons; 2:AAs + usedata = 2 in.BV * 0: no data (prior); 1:exact likelihood; + * 2:approximate likelihood; 3:out.BV (in.BV) + clock = 3 * 1: global clock; 2: independent rates; 3: correlated rates - model = 0 * 0 : JC69, 1: K80, 2: F81, 3: F84, 4: HKY85 - alpha = 0.5 - ncatG = 4 - duplication = 0 - cleandata = 0 * remove sites With ambiguity data (1:yes, 0:no)? + model = 3 * models for AAs or codon-translated AAs: + * 0:poisson, 1:proportional,2:Empirical,3:Empirical+F + * 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189) alpha = 0.5 + ncatG = 4 * No. categories in discrete gamma + alpha = 0.5 * alpha for gamma rates at sites - BDparas = 1 1 0 + BDparas = 1 1 0.1 * birth, death, sampling kappa_gamma = 6 2 * gamma prior for kappa alpha_gamma = 1 1 * gamma prior for alpha - rgene_gamma = 1 50 1 - sigma2_gamma = 1 10 1 + rgene_gamma = 1 50 * gammaDir prior for rate for genes + sigma2_gamma = 1 10 * gammaDir prior for sigma^2 (for clock=2 or 3) - finetune = 1: .1 .1 .1 .1 .1 .1 * auto (0 or 1): times, musigma2, rates, mixing, paras, FossilErr - - print = 1 - burnin = 1000 - sampfreq = 2 - nsample = 5000 + print = 1 * 0: no mcmc sample; 1: everything except branch rates 2: everything + burnin = 100000 + sampfreq = 1000 + nsample = 20000