Skip to content

Commit

Permalink
Merge pull request #2 from sabifo4/main
Browse files Browse the repository at this point in the history
Update scripts and README.md file to deal with Issue #1
  • Loading branch information
evolbeginner authored Apr 6, 2024
2 parents 57fc543 + e2bd262 commit 80cea17
Show file tree
Hide file tree
Showing 5 changed files with 125 additions and 84 deletions.
145 changes: 95 additions & 50 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,62 +1,107 @@
# bs_inBV
A tool to help generate the file in.BV when using "complex" substitution models for MCMCTree. Particularly, the hessian is approximated by a bootstrap approach.

# Idea #
People have increasingly recognized the importance of using more complex substitution models in modeling sequence evolution, particularly for deep-time evolution and in case the sequence divergence is large. MCMCTree supports only a number of substitution models. These models, however, may be available in other software specifically used for tree construction, e.g., IQ-Tree and RAxML.
A tool with which the branch lengths, gradient, and Hessian used by `MCMCtree` to approximate the likelihood calculation are estimated with complex amino acid substitution models. Particularly, `bs_inBV` will return the `in.BV` file with the inferred branch lengths, gradient, and Hessian; the latter is approximated using a bootstrap approach.

## Idea

People have increasingly acknowledged the importance of using more complex amino acid substitution models in modeling sequence evolution, particularly when analysing deep phylogenies. `MCMCtree` supports various amino acid substitution models, being LG+F the most complex one. Nevertheless, more complex models are available in other phylogenetic software such as `IQ-TREE` or `RAxML`.

When enabling the approximate likelihood calculation with `MCMCtree` (i.e., `usedata = 2` followed by `usedata = 3`; please consult the [PAML documentation for details](https://github.com/abacus-gene/paml/blob/master/doc/pamlDOC.pdf)), `CODEML` (`BASEML` if nucleotide data) will be first "called" by `MCMCtree` when using `usedata = 2` to estimate the branch lengths, the gradient, and the Hessian under a maximum-likelihood approach; vectors and matrix that will be output in the so-called `in.BV` file. Subsequently, `MCMCtree` will use the branch lengths, gradient, and Hessian in this `in.BV` file to approximate the likelihood calculation during the MCMC when enabling `usedata = 3` in the control file. Particularly, `bs_inBV` has been written to use complex amino acid substitution models available in `IQ-TREE` to estimate the branch lengths and **approximate the Hessian matrix by calculating the negative inverse of the bootstrap covariance matrix of branch length estimates**. In addition, `bs_inBV` will then set all the elements of the gradient vector to zero.

To work with the approximate likelihood method of MCMCTree (usedata=2), one needs to get the MLE (maximum likelihood estimate) of the branch length, as well as the gradient and hessian evaluated at the MLE. This tool takes the advantages of the abundant substitution models of IQ-Tree. Briefly, the tool uses IQ-Tree's estimate of the branch length, set the gradient to all zeros, and importantly, **approximate the hessian by calculating the negative inverse of the bootstrap covariance matrix of branch length estimates**.
![image](https://github.com/evolbeginner/bs_inBV/assets/8715751/6b7ae95a-f018-4331-8812-720601f637ed)

## Installation

### Ruby gems

# Installation
Make sure [RUBY](https://www.ruby-lang.org/en/) is installed.
First and foremost, please make sure that [Ruby](https://www.ruby-lang.org/en/) is installed. In addition, you will need to install the following gems before running `bs_inBV` using the command `gem install <gem_name>`. The gems to be installed are the following:

Many scripts included in this product require the following RUBY packages. If any is not installed, please install it by `gem install package_name`.
* parallel
* colorize
* find
* getoptlong
* time
* fileutils
* tmpdir
* bio
* bio-nwk

This product also employs several other computational tools. Please ensure that you have them installed.
* [PAML](https://github.com/abacus-gene/paml)
* [IQ-Tree](http://www.iqtree.org/)
* [newick_utilities](https://github.com/tjunier/newick_utils)

# Usage #
1. Perform a regular MCMCTree analysis. To save time, you can run with only the prior (e.g., set usedata = 0 in mcmctree.ctl) and very few iterations (set burnin=1, sampfreq=1, nsample=1 in mcmctree.ctl).
The reason for this step is to make sure the order of the tips in the file in.BV generated by MCMCTree is the same as that of the species.tree specified by the user.

2. Extract a "reference" tree from the file in.BV generated by MCMCTree in the previous step.

`sed '4!d' in.BV > ref.tre`

3. Run the following

`ruby create_hessian_by_bootstrapping.rb --ali alignment --calibrated_tree species_tree --outdir C20 --ref ref.tre --force -b 1000 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf`

Arguments:
* `--ali`: alignment file
* `--calibrated_tree`: the species tree used in regular MCMCTree
* `--outdir`: output directory
* `--force`: tells the program to overwrite the output if it exists
* `-b`: the no. of bootstraps (note that only traditional bootstrapping is allowed as IQ-TRee's UFB cannot be used for a species tree with a fixed topology)
* `--cpu`: no. of cores used in IQ-Tree
* `-m`: the model passed to IQ-Tree
* `--mcmctree_ctl`: the control file for MCMCTree
* `--pmsf`: use the PMSF approximation

4. Output
Go to the folder "Sample". Run `ruby create_hessian_by_bootstrapping.rb --ali combined.phy --calibrated_tree species.trees --outdir C20 --ref ref.tre --force -b 100 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf`. Note that as detailed above, the files "ref.tre" and "mcmctree.ctl" are obtained by running a regular MCMCTree analysis prior to using create_hessian_by_bootstrapping. The example data are simulated using LG+C20+G assuming a root age of 4.0 Ga. For the details of simulation, see [link](https://sishuowang2022.wordpress.com/2023/06/20/substitution-models-lgcxx-and-cxxlg-differ-in-iq-tree/).
* C20/split: IQ-Tree output
* C20/mcmctree: MCMCTree results. The branch length estimated by the specified model (in the example "-m LG+G+C20"), gradients, and the approximated hessian by bootstrap in the in.BV

# Notes #
1. With `--run_mcmctree`, MCMCTree will be run directly after generating in.BV. In case you want only the file in.BV based on your specified model say LG+G+C60, please do not use `--run_mcmctree`.
2. Without `--pmsf` IQ-Tree is much slower because it is the traditional Cxx model that will be applied in IQ-Tree.
3. In case of problems due to invertible variance-covariance matrix of the branch lengths, try either increasing bootstrap no. or avoiding too closely related species.
4. For more details, please see **Note S4 and Figs. S9-S10** in the original refenrence (see below).
5. As suggested by Sandra Alvarez [link](https://github.com/evolbeginner/bs_inBV/issues/1), **installation of some necessary packages are not clearly stated in the current version of README.md**. So users should be very careful when using v1.0 of the present tool.

# How to cite
Dating the bacterial tree of life based on ancient symbiosis Sishuo Wang, Haiwei Luo bioRxiv 2023.06.18.545440; doi: https://doi.org/10.1101/2023.06.18.545440.
### Third-party tools

In order to run `bs_inBV`, you will need to install the following programs:

* [`PAML`](https://github.com/abacus-gene/paml). Please refer to the [PAML Wiki](https://github.com/abacus-gene/paml/wiki#installation) for more details regarding the installation procedure.
* [`IQ-TREE`](http://www.iqtree.org/). Please refer to the [IQ-TREE website](http://www.iqtree.org/#download) for more details regarding the installation procedure.
* [`newick-utils`](https://github.com/tjunier/newick_utils). If you had any issues when installing `newick-utils` from the source code, please download [the pre-compiled binaries of Newick Utilities v1.6](https://web.archive.org/web/20210409163921/http://cegg.unige.ch/pub/newick-utils-1.6-Linux-x86_64-disabled-extra.tar.gz). These binaries are available on [this website](https://web.archive.org/web/20210409163921/http://cegg.unige.ch/newick_utils).
* [`R`](https://cran.r-project.org/). Please make sure that you follow the installation instructions on the main website. WSL users will need to install R via the terminal too. In addition, please make sure that the `MASS` R package is installed as it is used by script [`reorder_node.rb`](reorder_node.rb).

### Changes to ruby scripts prior to running `bs_inBV`

At the time of writing, `bs_inBV` has pre-defined some paths an aliases to execute third-party tools in some of its ruby scripts. If you are to use this tool under a different environment, please make sure you change the following lines to match your PC settings:

* Users will need to update the path to PAML programs in line 40 in script [`do_mcmctree.rb`](lib/do_mcmctree.rb). If users have an alias to execute `MCMCtree`, they may also have to change line 43 accordingly.
* If users have not exported `IQ-TREE` to their PATH or have an alias different from the one specified in line 26 in script [`create_hessian_by_bootstrapping.rb`](create_hessian_by_bootstrapping.rb), they will need to change that too. The same applies to lines 27-28 for `newick-utils` programs and line 29 for `MCMCtree` in that same script.

## Usage

* If you have already run `MCMCtree` and have the `in.BV` file, please run the following command to obtain a "reference" tree that will be subsequently used by `bs_inBV`:

```sh
sed '4!d' in.BV > ref.tre
```

* If you do not have an `in.BV` file, then you can run a quick dummy analysis with `MCMCtree` to obtain one. Note that the main reason for running this program now is just to make sure that the order of the tips that will be output in the `in.BV` is the same as that of the input tree file specified by the user. To save time, you can ask `MCMCtree` to sample from the prior (i.e., no data will be used, set `usedata = 0` in [the control file](sample/mcmctree.ctl)) and specify few iterations (e.g., set `burnin = 1`, `sampfreq = 1`, `nsample = 1` in [the control file](sample/mcmctree.ctl)). If you are using the control file given in the [`sample`](sample/mcmctree.ctl) directory, please also make sure that you set `usedata = 2`. If you have your own control file, please make the changes accordingly. Once you have a "dummy" `in.BV` file, you can extract the "reference" tree:

You may also need to cite corresponding papers for the use of PAML, IQ-Tree, and Newick_Utilities.
```sh
sed '4!d' in.BV > ref.tre
```

* Now, you are ready to run `bs_inBV`! Please execute this tool by running the following command (make sure that the paths to the third-party tools have been modified to match your settings as [aforementioned](README.md#third-party-tools)):

```sh
# Run from the same directory where you have this script
# You can use relative/absolute paths to your input files
ruby create_hessian_by_bootstrapping.rb --ali <path_to_alignment_file> --calibrated_tree <path_to_calibrated_tree_file> --outdir <path_to_output_directory> --ref <path_to_reference_tree>/ref.tre --force -b 1000 --cpu <number_CPUs_for_IQTREE_to_use> -m LG+G+C20 --mcmctree_ctl <path_to_MCMCtree_control_file>/mcmctree.ctl --run_mcmctree --pmsf
```

> Options:
>
> * `--ali`: path to alignment file.
> * `--calibrated_tree`: path to input calibrated tree file that will be used by `MCMCtree`.
> * `--outdir`: path to the output directory.
> * `--force`: tells the program to overwrite the output files, if any.
> * `-b`: number of bootstraps that `IQ-TREE` will run. Please note that only traditional bootstrapping is allowed as the UFB approach implemented in `IQ-TREE` cannot be used when using a fixed tree topology (which is required by `MCMCtree`).
> * `--cpu`: number of cores that will be used by `IQ-TREE`.
> * `-m`: amino acid substitution model that `IQ-TREE` will use for branch lengths inference.
> * `--mcmctree_ctl`: path to the control file that will execute `MCMCtree`. Please make sure that this control file has `usedata = 2 in.BV` if you want to run `MCMCtree` after `bs_inBV` outputs the `in.BV` file (enable this feature by adding option `--run_mcmctree`).
> * `--pmsf`: use the PMSF approximation
### Example

There is a dataset that you can use in the [`sample`](sample) directory to test the usage of `bs_inBV`. First, please clone this repository and follow all the steps detailed above with regards to [gems, third-party tools, and code changes during the installation procedure](README.md#installation). Once everything is ready, then go to directory [`sample`](sample) and execute the following command:

```sh
# Run this command from directory `sample`
ruby ../create_hessian_by_bootstrapping.rb --ali combined.phy --calibrated_tree species.trees --outdir C20 --ref ref.tre --force -b 100 --cpu 8 -m LG+G+C20 --mcmctree_ctl mcmctree.ctl --run_mcmctree --pmsf
```

Once the command above finishes, you will see the following output files:

* `C20/split`: this directory will contain all the output files generated by `IQ-TREE`. The name you choose for option `--outdir` will be given to the first directory (e.g., `C20` in this example).
* `C20/mcmctree`: if option `--run_mcmctree` is not enabled, this directory will contain your input files and the `in.BV` file. If you decide to run `MCMCtree` by enabling option `--run_mcmctree`, then you will also find the output files generated by `MCMCtree` in this directory. Please note that the `in.BV` file will contain the vector of branch length estimated using the amino acid substitution model defined by the user (e.g.,`-m LG+G+C20` in this example), the gradient vector, and the Hessian matrix approximated using a bootstrap approach.

> **NOTE 1**: Please check above to learn how to generate file `ref.tre`. For more details about how to configure the control file for `MCMCtree`, please check the [PAML documentation](https://github.com/abacus-gene/paml/blob/master/doc/pamlDOC.pdf).
> **NOTE 2**: The example data were simulated under model LG+C20+G and a root age of 4.0 Ga. For more details about the simulation, please [read this blog post](https://sishuowang2022.wordpress.com/2023/06/20/substitution-models-lgcxx-and-cxxlg-differ-in-iq-tree/).
## Notes

1. When using `--run_mcmctree`, `MCMCtree` will be execute right after generating the `in.BV` file. If you only want to obtain the `in.BV` file based on your specified amino model (e.g., LG+G+C60), please do not use `--run_mcmctree`.
2. Without `--pmsf`, `IQ-TREE` is much slower because the traditional Cxx model is used by `IQ-TREE`.
3. In case you experience any issues related to inverting the variance-covariance matrix of the branch lengths, please try to (i) increase the number of bootstrap or (ii) avoid including too closely-related species in your phylogeny.
4. For more details, please see **Note S4 and Figs. S9-S10** in the original reference (see below).

## How to cite

Dating the bacterial tree of life based on ancient symbiosis Sishuo Wang, Haiwei Luo bioRxiv 2023.06.18.545440; doi: https://doi.org/10.1101/2023.06.18.545440.

You may also need to cite corresponding papers for the use of `PAML`, `IQ-TREE`, and `Newick Utilities`.
14 changes: 7 additions & 7 deletions create_hessian_by_bootstrapping.rb
Original file line number Diff line number Diff line change
Expand Up @@ -9,9 +9,11 @@
require 'find'
require 'colorize'

require_relative 'lib/Dir'
require_relative 'lib/processbar'
require_relative 'lib/do_mcmctree.rb'
$LOAD_PATH << './lib'

require 'Dir.rb'
require 'processbar.rb'
require 'do_mcmctree.rb'


############################################################
Expand All @@ -21,7 +23,7 @@
DIR = File.dirname(__FILE__)
LIB_DIR = File.join(DIR, 'lib')

IQTREE = 'iqtree'
IQTREE = 'iqtree2'
NW_STATS = 'nw_stats'
NW_TOPOLOGY = 'nw_topology'
MCMCTREE = 'mcmctree'
Expand Down Expand Up @@ -277,6 +279,4 @@ def split_ali_file(ali_file)
`bash #{FIGTREE2NWK} -i FigTree.tre > figtree.nwk`
puts "Done!" if $? == 0
end
end


end
6 changes: 3 additions & 3 deletions lib/Dir.rb
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
#! /usr/bin/env ruby

require 'find'


Expand Down Expand Up @@ -76,6 +78,4 @@ def getFilesBySuffices(indir, suffices)
def get_file_path(file)
path = File.symlink?(file) ? File.readlink(file) : file
return(path)
end


end
6 changes: 3 additions & 3 deletions lib/do_mcmctree.rb
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
require 'time'
require 'colorize'

require 'Dir'
require 'Dir.rb'


#################################################################
Expand All @@ -37,7 +37,7 @@
DIR = File.dirname(__FILE__)
FIGTREE2NWK = File.join(DIR, "figtree2tree.sh")

PAML_DIR = File.expand_path("~/software/phylo/paml/")
PAML_DIR = File.expand_path("~/PAML/bin/")
MCMCTREE_CTL = File.join(PAML_DIR, 'mcmctree.ctl')
CODEML_CTL = File.join(PAML_DIR, 'codeml.ctl')
MCMCTREE = "mcmctree"
Expand Down Expand Up @@ -433,4 +433,4 @@ def run_codeml(outdir, seqtype, ndata, alpha, ncatG, cpu)
puts
puts

end
end
38 changes: 17 additions & 21 deletions sample/mcmctree.ctl
Original file line number Diff line number Diff line change
@@ -1,33 +1,29 @@
seed = -1
* seqfile = tmp/seqs.phy
* treefile = tmp/calibration_tree.txt
seqfile = combined.phy
treefile = species.trees
mcmcfile = mcmc.txt
outfile = out.txt

ndata = 1
seqtype = 2
usedata = 2 in.BV 1
clock = 2
* RootAge = 'B(3.20,4.51,0.025,0.01)' * safe constraint on root age, used if no fossil for root.
ndata = 1 * Number of partitions
seqtype = 2 * 0: nucleotides; 1:codons; 2:AAs
usedata = 2 in.BV * 0: no data (prior); 1:exact likelihood;
* 2:approximate likelihood; 3:out.BV (in.BV)
clock = 3 * 1: global clock; 2: independent rates; 3: correlated rates

model = 0 * 0 : JC69, 1: K80, 2: F81, 3: F84, 4: HKY85
alpha = 0.5
ncatG = 4
duplication = 0
cleandata = 0 * remove sites With ambiguity data (1:yes, 0:no)?
model = 3 * models for AAs or codon-translated AAs:
* 0:poisson, 1:proportional,2:Empirical,3:Empirical+F
* 6:FromCodon, 8:REVaa_0, 9:REVaa(nr=189) alpha = 0.5
ncatG = 4 * No. categories in discrete gamma
alpha = 0.5 * alpha for gamma rates at sites

BDparas = 1 1 0
BDparas = 1 1 0.1 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha

rgene_gamma = 1 50 1
sigma2_gamma = 1 10 1
rgene_gamma = 1 50 * gammaDir prior for rate for genes
sigma2_gamma = 1 10 * gammaDir prior for sigma^2 (for clock=2 or 3)

finetune = 1: .1 .1 .1 .1 .1 .1 * auto (0 or 1): times, musigma2, rates, mixing, paras, FossilErr

print = 1
burnin = 1000
sampfreq = 2
nsample = 5000
print = 1 * 0: no mcmc sample; 1: everything except branch rates 2: everything
burnin = 100000
sampfreq = 1000
nsample = 20000

0 comments on commit 80cea17

Please sign in to comment.