Isolator analyzes RNA-Seq experiments.
There are many methods to analyze RNA-Seq data. Isolator differs in few important ways.
- Isolator has a particular focus on producing stable, consistent estimates. Maximum likelihood approaches produce unstable point estimates: small changes in the data can result in drastically different results, conflating downstream analysis like clustering or PCA. Isolator produces estimates that are in general, simultaneously more stable and more accurate other methods.
- It implements a full hierarchical Bayesian model of an entire RNA-Seq experiment. Pooling information leads more accurate estimates of effect size, particularily at low-levels. Conditions can be compared or clustered without throwing anything out.
- It saves all the samples generated by the sampler, which can be process to compute posterior probabilities for arbitrarily complex questions, far beyond the confines of pairwise tests.
- It aggressively corrects for technical effects, such as random priming bias, GC-bias, 3' bias, and fragmentation effects.
- Compared to other MCMC approaches, it is exceedingly efficient. Though, generally slower that modern maximum likelihood approaches.
Note: I've run this tool on a lot of data, evaluated it with multiple benchmarks, and am writing a paper about it, but it's still alpha software. Contact me if you'd like help using it to analyze your data.
Currently, Isolator must be installed by cloning the git repository and building the source code. It has been tested on OS X and Linux.
To build Isolator from source, you will need to first install Boost, HDF5, and cmake.
On Debian, Ubuntu, or similar.
apt-get install libboost-dev libboost-thread-dev libboost-system-dev libboost-timer-dev libhdf5-dev cmake
The easiest way to install dependencies is with Homebrew.
brew install boost hdf5 cmake
git clone https://github.com/dcjones/isolator.git
cd isolator/build
cmake ..
make
make install
This will install the isolator
program to the default destination (usually
/usr/local/bin
).
Isolator works in two steps: analysis and summarization. The analysis step will run the sampler for a number of iterations collecting estimates of the model parameters, which it will output to an HDF5 file. This output can then mu summarized and turned into regular tab-delimined tables in a variety of ways.
Input to isolator is a GTF giving transcript/gene annotations along with one or more BAM/SAM files giving the aligned reads for each sample or replicate in the experiment.
Optionally, the genome sequence can also be provided in FASTA format, allowing isolator to correct for various forms of sequence bias, typically resulting in more accurate quantification.
Experiments consist of one or more BAM (or SAM) files grouped into conditions. Isolator puts no restriction on the number of conditions or the number of samples within conditions, though if the goal is to compare two or more biological conditions, using multiple replicates is strongy recommended.
The analyze command is invoked like:
isolator analyze gene_annotations.gtf \
a1.bam,a2.bam,a3.bam b1.bam,b2.bam,b3.bam
BAM files within the same condition are grouped with commas, and conditions are separated with spaced. Analyze is a general purpose command. It can also be invoked on a single bam file.
isolator analyze gene_annotations.gtf a.bam
The analyze command will run for a while and output a file named (by default)
isolator-output.h5
. It is an HDF5 file, which is
a standardized data format than can be accessed using wide variety of tools.
However, data stored in this file is a raw form, and typically far more
information than is needed. To quickly generate to-the-point results, there is
the isolator summarize
command.
The isolator analyze
estimates parameters for a full probabilistic model of
the experiment. It is not a simple statistical test that attempts to answer a
single question (e.g. whether a gene is differentialy expressed). As a result, the
results generated from isolator analyze
can be used to examine the data in
many ways.
This is where the isolator summarize
comes in. It provides a number of
sub-commands to generate simple tables of results from the sampler output.
What follows are several examples. To see a list of the summarize sub-commands,
run isolator summarize --list
.
The simplest sort of summarization one might perform is to generate estimates of transcript expression.
isolator summarize transcript-expression isolator-output.h5
This will generate a tab-delimited file (named transcript-expression.tsv
) that looks like:
gene_name gene_id transcript_id sample1_tpm sample2_tpm sample3_tpm sample4_tpm sample5_tpm sample6_tpm
mt-Tf ENSMUSG00000064336 ENSMUST00000082387 3.064260e+01 3.542358e+01 3.068786e+01 3.283976e+01 3.785287e+01 3.647695e+01
mt-Rnr1 ENSMUSG00000064337 ENSMUST00000082388 1.812662e+03 2.566914e+03 2.470344e+03 2.728776e+03 2.445997e+03 2.241069e+03
mt-Tv ENSMUSG00000064338 ENSMUST00000082389 3.002667e+01 3.085850e+01 3.081490e+01 3.373214e+01 3.507645e+01 3.934946e+01
mt-Rnr2 ENSMUSG00000064339 ENSMUST00000082390 1.337085e+03 1.855240e+03 2.015555e+03 1.996198e+03 1.914501e+03 1.927667e+03
Each column gives a point estimate of the transcripts per million (TPM).
Credible intervals (i.e. Bayesian confidence interval) for each point estimate
can also be generated by passing the --credible
argument. The following
command will generate a table that include lower and upper bounds of 95%
credible interval for each point estimate.
isolator summarize transcript-expression --credible=0.95 isolator-output.h5
Expression can also be summarized on the gene level, according to the gene_id
field in the input GTF file.
isolator summarize gene-expression isolator-output.h5
Since Isolator follows a Bayesian methodology, it does not assign p-values to differential expression, but rather estimates a probability that the condition-wise tendendency has changed by more than some minimum fold change. The researcher may choose the minimum fold change (it is set to a log2 fold change of 1.5 by default).
The command to generate a table of differential expression results:
isolator summarize differential-transcript-expression isolator-output.h5
# OR
isolator summarize differential-gene-expression isolator-output.h5
The minmum effect size can be set with the --effect-size
argument.
E.g. passing --effect-size 2
will cause the summarize command to compute the
probability of a 2-fold or greater change in expression between conditions.
Expression is a general term for the relative number of copies of an mRNA. A change in expression can be driven by a change in transcription or in post-transcriptional processing (e.g. splicing). Transcriptional and post-transcriptional effects are explicitly modeled in Isolator and splicing can be separately tested.
to summarize condition pairwise changes in splicing:
isolator summarize differential-splicing isolator-output.h5
There are also methods built in to test for splicing changes at the feature (i.e. cassette exons and retained introns, currently) level, which is often more robust due to varying annotation quality.
isolator summarize differential-feature-splicing isolator-output.h5
These commands similarly accept the -a
, -b
, and --effect-size
arguments to
set the conditions being tested and the minimum effect size, respectively.
Note: for simplicitly the term "splicing" is in used here, when in fact it will capture any change in expression dynamics occouring after transcription initiation, including changes in splicing and transcription termination.
In the previous examples we invoked isolator analyze
by passing some number of
BAM files as arguments. This can be a pain with complex experiments involving
many conditions. Furthermore, to keept track of all these conditions, we may
wish to provide names. The analyze command can also be invoked with a
YAML file describing the experiment and the location of the
BAM files.
This file is structured like the following.
condition1-name:
condition1-replicate1-name: cond1-repl1-reads.bam
condition1-replicate2-name: cond1-repl2-reads.bam
condition2-name:
condition2-replicate1-name: cond2-repl1-reads.bam
condition2-replicate2-name: cond2-repl2-reads.bam
If we same this to experiment.yml
, we can then invoke the analyze command more
conveniently:
isolator analyze genes.gtf experiment.yml
For a more concrete example, here's an example of a specification of an experiment involving six conditions, each with either three or two replicates.
h7-1yr:
1-h7-1yr: 1-h7-1yr.bam
2-h7-1yr: 2-h7-1yr.bam
3-h7-1yr: 3-h7-1yr.bam
h7-day20:
4-h7-day20: 4-h7-day20.bam
5-h7-day20: 5-h7-day20.bam
6-h7-day20: 6-h7-day20.bam
empty-vector:
8-empty-vector: 8-empty-vector.bam
9-empty-vector: 9-empty-vector.bam
let-7g-oe:
10-let-7g-oe: 10-let-7g-oe.bam
11-let-7g-oe: 11-let-7g-oe.bam
12-let-7g-oe: 12-let-7g-oe.bam
fetal-ventricle:
13-fetal-ventricle: 13-fetal-ventricle.bam
14-fetal-ventricle: 14-fetal-ventricle.bam
fetal-atrium:
15-fetal-atrium: 15-fetal-atrium.bam
16-fetal-atrium: 16-fetal-atrium.bam