Epievo includes tools for simulation and inference of epigenome evolution. The epigenome evolution is modeled by a context-dependent continuous-time Markov process, and estimated based on a Monte Carlo Expectation-Maximization algorithm.
epievo
requires a C++ compiler that supports the C++11 standard.
First, clone smithlab_cpp
source code repo in the directory you want,
and set the environment variable:
git clone git@github.com:smithlabcode/smithlab_cpp.git
export SMITHLAB_CPP=$PWD/smithlab_cpp
Then, download epievo
source code and compile:
git clone https://github.com/smithlabcode/epievo.git
cd epievo
make
make install
epievo_sim
takes model parameters to
simulate epigenomic states of each sepecies
and internal jumps, based on a context-dependent continuous-time
Markov model. It will create two outputs:
epigenomic states at each genomic position and each
node,
and global jumps ordered by node label, genomic
position and evolutionary time (file name could be provided after argument -p
).
epievo_sim [options] <parameter file> <outfile>
Users can use epievo_sim
to simulate multiple species (using argument -t
to
provide a phylogenetic tree), or one single branch by using
argument -T
to set total evolution time.
Epigenomic states of root species can be fixed by providing a states file
after -r
. By default, epievo_sim
will normalize input parameters following
"one-mutation-per-unit-time-per-site" rule. To use un-normalized
parameters, users can use flat -S
.
The global jumps data structure allows fast forward simulation,
and (local paths data structure is more efficient and used in
backward history inference.
Global-jump files can be converted to local paths through program global_jumps_to_paths
:
global_jumps_to_paths [options] <statefile> <jumpfile> <outfile>
Users can pass Phylogenetic tree in newick format after argument -t
,
or set the total evolution time of a single branch after argument -T
.
epievo_est_complete
is used to estimate model parameters
when the complete information of evolution process is given
(evolution paths and tree) are known.
The maximum-likelihood estimates are obtained based on gradient-ascent approach.
Initial values of model parameters are required.
By default, epievo_est_complete
will not update tree branches. To
learn model parameters and branches together, flag -b
needs to be specified.
If only one branch is considered, the evolutionary time should be specified after the
argument -T
.
epievo_est_complete [options] <parameter file> (<tree file>) <local paths>
In practice, epigenomic states are only observed in leaf species.
Programs epievo_initialization
and epievo_est_params_histories
allow
users to estimate evolution paths and
model parameters simultaneously, given
the leaf data and a starting tree (e.g. setting all branches to ones).
epievo_initialization
is used to generate initial evolution histories
and model parameters through heuristics and site-independent-model-based
methods.
If only one branch is included in the data, users should pass
the evolutionary time after -T
flag.
epievo_initialization [options] (<tree file>) <states file>
Program epievo_est_params_histories
runs a MCEM algorithm to estimate model
parameters and sample evolution histories iteratively, which requires
initial parameters, local paths to be provided.
By default, only model parameters will be estimated and printed to output
file (specified by -p
).
To estimate branch lengths simultanesously,
users need to pass the -b
flag.
If only one branch is included in the data, users should pass
the evolutionary time after -T
flag.
Other training parameters include MCEM total iterations (-i
),
MCMC sample size (-B
) and MCMC burn-in length (-L
).
epievo_est_params_histories [options] <parameter file> (<tree file>) <local paths>
Program epievo_sim_pairwise
runs a MCMC algorithm to infer epigenomic evolution
between two given state-sequences. The output will be local paths
between ending sequences.
epievo_sim_pairwise [OPTIONS] <parameter file> <states file>
If only one branch is included in the data, users should pass the -T
flag.
MCMC burn-in length can be specified after argument -L
.
Given a directory of multiple local paths (with .local_paths
suffix),
the average historical states can be calculated in equally spaced time windows using
program average_paths
:
average_paths [OPTIONS] <input directory>
Number of time windows can be specified after the argument -n
.
In the output file,
average epigenomic states along each branch are organzed in a matrix
(time windows X sites
).
The command below will generate the complete evolution information from
a phylogenetic tree in tree.nwk
, and model parameters in test.param
.
cd test
../bin/epievo_sim -v -n 10000 -p test.global_jumps -t tree.nwk test.param test.states
Two output files will be generated from epievo_sim
.
test.global_jumps
contains mutations ordered by position
and time, and test.states
contains epigenomic states at each position and each node.
To run inference programs, the global jumps should be converted to local paths first, by running:
../bin/global_jumps_to_paths -v -t tree.nwk test.states test.global_jumps test.local_paths
The command below will estimate model parameters (saved in test.param.updated
) from
tree file tree.nwk
, local paths test.local_paths
, given starting parameters
test.param
.
../bin/epievo_est_complete -v -o test.param.updated test.param tree.nwk test.local_paths
Now, we can try to initialize the inference procedure from a tree tree.nwk
and
leaf data observed.states
. Initial estimates of parameters and evolution histories
will be saved in test.param.init
and test.local_paths.init
respectively.
../bin/epievo_initialization -v -p test.param.init -o test.local_paths.init tree.nwk observed.states
Then, the command below will run a MCEM procedure to estimate model parameters and
evolution histories, which will be printed in test.local_path.est
and test.local_path.est
respectively.
../bin/epievo_est_params_histories -v -o test.local_path.est -p test.local_paths.est \
test.param.init tree.nwk test.local_paths.init
Our model parameters are organized in below format:
stationary 0.85 0.9
baseline -0.5 -1.5
The stationary line includes stationary distribution of horizontal Markov transition probabilities T00 and T11. Baseline parameters control the symmetric-context mutation rates r0_0 and r1_1.
EpiEvo
supports Newick
format for tree representation.
EpiEvo
use below format to present epigenomic state at each (aligned) position in each species.
Species labels are consistent to node labels in the tree file, and get sorted in preorder:
#NODE1 NODE2 NODE3 ...
site1 state_site1_node1 state_site1_node2 state_site1_node3
site2 state_site2_node1 state_site2_node2 state_site2_node3
...
The header line has 1 fewer fields than rest lines. After the header line, the first column shows genomic positions. The rest columns show binary states in corresponding species and genomic site.
EpiEvo
use global jumps to retrieve positions and times of mutations.
Global jumps are sorted by time then position:
ROOT:NODE1
[Sequence of binary states]
NODE:NODE2
mut1_time mut1_site
mut2_time mut2_site
...
NODE:NODE3
...
The block of root node will only include a sequence of binary states. Then, each node block contains a list of time and position of mutation events.
Local path is another way to organize mutation events. Different from global jumps, the local path is a list of mutation times at each position. The format is below:
NODE:NODE1
NODE:NODE2
site1 site1_initial_state site1_total_time site1_mut1_time site1_mut2_time ...
site2 site2_initial_state site2_total_time site2_mut1_time site2_mut2_time ...
...
NODE:NODE3
...
Again, the root node block has no mutation information.
Andrew D. Smith andrewds@usc.edu
Xiaojing Ji xiaojinj@usc.edu
Copyright (C) 2018-2020 University of Southern California, Andrew D. Smith
Current Authors: Andrew D. Smith, Xiaojing Ji and Jenny Qu
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.