This README covers the details of carrying out an analysis of nucleosome positioning from high-throughput sequencing data using the methods of Blocker and Airoldi (2016). Analyzing a single experiment separates into three broad phases:

- Data management: Aligning, parsing, and reducing the raw sequencing data (typically FASTQ files) into the form required for statistical analysis.
- Estimation: Estimating the segmentation of the genome and digestion-error templates from the reduced sequencing data. Running Bayesian deconvolution (via distributed HMC). Processing MCMC draws into posterior summaries.
- Analysis: Subsequent biological analyses, using the estimates from deconvolution as inputs (features). Clustering, selecting regions of interest, assessing reproducibility, and so on.
Two examples are provided, one toy based on a single chromosome of fake data in `examples/toy` and one based on the gene-by-gene analysis of H. sapiens chromosome 21 from the Gaffney et al. (2012) data in `examples/human`. The former provides examples of running on an LSF-managed cluster, whereas the latter provides examples of running on a SLURM-managed cluster. `cplate` can be run on MPI-compatible cloud clusters such as StarCluster.
Every script in `cplate` uses YAML/JSON configuration files. Each file describes all of the data, parameters, and outputs for a single experiment's dataset. These files must be created during the data management phase of the analysis. Each configuration has many fields, paths, and patterns to set, but the files are entirely machine- and human-readable. The `config` folder contains `example.yml`, a fully commented example of this configuration file for a small dataset. This file is the canonical reference for the YAML configuration structure and requirements.
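As a quick, optional sanity check (not part of the documented workflow), you can confirm that a configuration file parses as valid YAML before launching any jobs. This sketch assumes a Python environment with PyYAML available:

# Parse config/example.yml and print its top-level keys (assuming a mapping at the
# top level); any YAML syntax error will surface here rather than mid-run.
python -c "import yaml; cfg = yaml.safe_load(open('config/example.yml')); print(sorted(cfg))"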
Each phase uses a distinct set of tools. The data management phase uses:
- bowtie: Aligns fragments obtained from high-throughput sequencing. Takes raw FASTQ files and a reference genome as input and outputs a SAM file containing alignments. This would be better replaced by bwa, a more modern aligner that can handle the ALT contigs of hg38.
- samtools and pysam: Tools for manipulating SAM and BAM files, which are standards for the storage of alignments from high-throughput sequencing. The entire SAM specification is available at http://samtools.sourceforge.net/SAM1.pdf.
- `pipeline`: A custom Python package that (with `samtools` and `pysam`) does the bulk of the heavy lifting for parsing the SAM files and extracting the relevant information.
- `bash` scripts: Assorted `bash` scripts to coordinate everything. Simple, easily maintained glue.
All of the above components can be substituted with any workflow that provides fragment center counts and length distributions in the formats required by `cplate`.
The estimation phase requires fewer tools, but they are more specialized:
- `cplate`: The grand kahuna. The big one. Where all of the deconvolution action happens. This is a custom Python package that handles all of the segmentation, template estimation, deconvolution, and posterior summaries. It's big, it's complex, but it's also modular.
- `bash` scripts: More glue. These are primarily `bsub` scripts that interface the distributed HMC with the Harvard Odyssey cluster.
- R and Python scripts: Specialized R and Python scripts for the particular phases. The most important of these is `R/analyze_fdr.R`, which runs the FDR calibration specified in Blocker and Airoldi (2016). Another very useful pair is `detections_to_bed.py` and `clusters_to_bed.py`, which convert detection and cluster output from `cplate` into the BED format, which is a standard in the field and can be viewed in IGV and similar tools. BED files can also be used with the Galaxy platform, bedtools, and other standard bioinformatics packages, as in the example after this list.
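For example (with a hypothetical output file name), a BED file produced by `detections_to_bed.py` can be coordinate-sorted with bedtools before loading it into IGV or Galaxy:

# Coordinate-sort a hypothetical detections BED file; bedtools sort orders
# records by chromosome, then start position.
bedtools sort -i detections_experiment.bed > detections_experiment.sorted.bed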
The analysis phase takes a more diverse set of tools, and their selection is almost entirely up to the analyst.
For the example data management phase, `bowtie`, `samtools`, `pysam`, and `pipeline` must be installed. `pipeline` requires Python 2.7 or newer and `numpy` in addition to `pysam`. Using the GUI version of `pipeline` requires the `wx` Python package and a working `wx` installation.
For estimation, `cplate` requires the Python packages `mpi4py`, `yaml`, `numpy`, and `scipy`.
For `analyze_fdr.R`, the `plyr`, `stringr`, `reshape2`, `yaml`, and `ascii` packages are required from CRAN. The `qvalue` package from Bioconductor is also required.
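A minimal sketch of installing these dependencies, assuming `pip` for the Python packages and a current R installation for the CRAN and Bioconductor packages (your cluster may instead use environment modules or virtual environments):

# Python dependencies (the pyyaml distribution provides the `yaml` module)
pip install numpy scipy mpi4py pyyaml pysam
# R dependencies from CRAN
Rscript -e 'install.packages(c("plyr", "stringr", "reshape2", "yaml", "ascii"))'
# qvalue from Bioconductor (via BiocManager on current R versions)
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("qvalue")'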
To illustrate how the entire analysis works, we're going to go through an example from the raw FASTQ files to the final posterior summaries. The estimation portion corresponds to the `example.yml` file. However, the data management portion does not have a corresponding data set for practical reasons.
Throughout this example, we will be working with the following directory structure:
[user@system work]$ ls .
config data results logs
Suppose we receive a set of FASTQ files from our collaborators as `experiment.fastq.tar.gz`. We extract this into the `data` directory and find two files:
[user@system data]$ ls .
experiment.R1.fastq experiment.R2.fastq
These correspond to the forward and reverse ends from paired-end sequencing.

The first step in our data processing is to align these reads to a reference genome. Assuming that we're working with S. cerevisiae data, we can use the files provided at the `bowtie` website. Their reference genomes consist of a compressed set of `.ebwt` files, all of which should be extracted into the `data` directory.

As an aside, the `.ebwt` file extension refers to the Burrows-Wheeler transform used by `bowtie` for efficient alignment.
After this extraction, we have:
[user@system data]$ ls .
experiment.R1.fastq experiment.R2.fastq s_cerevisiae.1.ebwt
s_cerevisiae.2.ebwt s_cerevisiae.3.ebwt s_cerevisiae.4.ebwt
s_cerevisiae.rev.1.ebwt s_cerevisiae.rev.2.ebwt
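If a prebuilt index is not available for your organism, one can be built from a reference FASTA file with `bowtie-build` (the FASTA file name below is hypothetical):

# Build a bowtie index named s_cerevisiae from a reference FASTA (hypothetical
# file name); this produces the six .ebwt files shown above.
bowtie-build s_cerevisiae.fa s_cerevisiae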
We're now ready to run `bowtie` and align our reads to the reference genome. A typical command for this (run in the `data` directory) will be:
[user@system data]$ N_THREADS=$LSB_DJOB_NUMPROC
[user@system data]$ time bowtie --phred33-quals -q \
-n 1 --best -M 1 \
-I 100 -X 300 \
--threads $N_THREADS \
-S \
s_cerevisiae \
-1 experiment.R1.fastq -2 experiment.R2.fastq \
1>experiment.sam \
2>align_experiment.log
This takes a bit of explanation and some quality time with the `bowtie` manual to parse. There are only a few classes of options and arguments to `bowtie` used above, though:
- `--phred33-quals -q` tells `bowtie` that the inputs are FASTQ files and how to interpret the quality strings from these files.
- `-n 1 --best -M 1 -I 100 -X 300` sets the alignment policy. `-n 1` tells `bowtie` to discard any alignment with more than 1 nucleotide mismatched. `--best -M 1` tells `bowtie` to report a single randomly-sampled alignment among all valid alignments with the best quality. `-I 100 -X 300` tells `bowtie` to consider only those alignments with lengths between 100 and 300 base pairs; this only makes sense because we are looking for nucleosomal DNA, which is about 150bp in length.
- `--threads $N_THREADS` tells `bowtie` to use multiple threads for its alignment. This can speed up alignment immensely; I've found 12 cores on a single Odyssey node to be very effective for S. cerevisiae datasets. If this is running inside of an LSF job, the `$LSB_DJOB_NUMPROC` environment variable will contain the number of processors allocated to the job.
- `-S` tells `bowtie` to provide SAM-formatted output. Without this, it's in a `bowtie`-specific format.
- `s_cerevisiae` is the name of the reference genome (note the lack of file extension). It's also the only positional argument in this whole command.
- `-1 experiment.R1.fastq -2 experiment.R2.fastq` specifies that the given files each contain one end from paired-end sequencing.
- `1>experiment.sam` sends the output from `bowtie`, which is printed to `stdout` by default, to `experiment.sam`.
- `2>align_experiment.log` sends the `stderr` output of `bowtie` to the given log file. This contains useful statistics about the proportions of reads that aligned at different levels of specificity.
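Before moving on, it can be worth a quick check that the alignment produced what we expect. This sketch uses a standard `samtools` command on the output file from above:

# Quick sanity check on the alignment (recent samtools versions read SAM directly;
# older versions may require converting to BAM with `samtools view -bS` first).
samtools flagstat experiment.sam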
Once this alignment is complete, there are three steps left:
# Parse SAM output into read list
[user@system data]$ time parseSAMOutput.py \
experiment.sam \
1>reads_experiment.txt \
2>parseSAM_experiment.log
# Extract length distribution from read list
[user@system data]$ time buildReadLengthDist.py \
reads_experiment.txt \
1>lengthDist_experiment.txt \
2>buildLengthDist_experiment.log
# Convert reads to counts of centers per base pair
[user@system data]$ time readsToChromCounts.py \
--randomize --seed=20130530 \
reads_experiment.txt \
1>y_experiment.txt \
2>count_experiment.log
All three scripts here are part of the `pipeline` package. The first, `parseSAMOutput.py`, converts the SAM output from `bowtie` to a condensed list of aligned reads. This simplifies subsequent processing and any read-level analyses that you choose to do later. In the process of doing this conversion, `parseSAMOutput.py` converts the SAM input to BAM (binary SAM), then to a sorted BAM file. These conversions make later processing much more efficient. The BAM file is also much smaller than the SAM file. The latter can be removed after this step, as `parseSAMOutput.py` can also (intelligently) use BAM and sorted BAM inputs if you need to rerun it.
The second script, `buildReadLengthDist.py`, extracts the distribution of aligned read lengths from the condensed list of reads. Its output consists of a space-separated file with two columns. The first column is the read length in base pairs, and the second is the number of aligned reads with that length. This is needed to estimate the distribution of digestion errors.
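A quick way to sanity-check this output (a sketch, assuming the two-column format just described) is to total the second column, which should roughly match the number of aligned fragments reported in the alignment and parsing logs:

# Sum the counts column of the length distribution.
awk '{total += $2} END {print total}' lengthDist_experiment.txt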
The third script, `readsToChromCounts.py`, reduces the list of aligned reads to the number of aligned reads centered at each base pair of the genome. It outputs a ragged, comma-separated array to `stdout`. Each row of this array contains the counts for each base pair of a single chromosome. The `--randomize --seed=20130530` options tell the script to randomly round fragment centers to integers, using the given RNG seed. I recommend setting the seed explicitly to ensure that your processing is reproducible.
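To get a feel for the resulting file (a sketch, not part of the original pipeline), you can print the number of comma-separated counts in each row, which should correspond to the chromosome lengths in base pairs:

# One line per chromosome: its row number and the number of per-base-pair counts.
awk -F',' '{print "chromosome " NR ": " NF " positions"}' y_experiment.txt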
With this data (and our `config/example.yml` file) in hand, we can move on to estimation. This needs to be done in a particular sequence, but everything uses the `cplate` package. The sequence is:

- `cplate_estimate_template`: Estimate the template.
- `cplate_segment_genome`: Segment the genome.
- `cplate_simulate_null`: Simulate from the permutation null.
- `cplate_deconvolve_mcmc`: Run Bayesian deconvolution via distributed HMC on the observed and null data.
- `cplate_summarise_mcmc`, `cplate_summarise_params_mcmc`, `cplate_summarise_clusters_mcmc`: Extract posterior summaries from the MCMC draws.
Each of these scripts takes one (or more, for some of them) YAML configuration files as inputs. So long as these configuration files are properly set up, everything gets pulled from and put to the right place without further tweaking. Each script also has `-h` and `--help` options that provide descriptions of all options that may be needed.
To start, we estimate the template using
[user@system work]$ cplate_estimate_template config/example.yml
Then, we turn to the segmentation. This requires an additional file specifying the location of each ORF in the genome, as described in the script's `--help`:
[user@system work]$ cplate_segment_genome --help
Usage: cplate_segment_genome [options] GENEINDEX CONFIG [CONFIG ...]
Options:
-l/--minlength= Minimum length of segments. Defaults to 800.
-s/--sep= Separator for GENEINDEX input. Defaults to \t.
-v/--verbose= Set verbosity of output. Defaults to 0.
-h, --help Show this help message and exit
Segments a genome using the hierarchical merging algorithm of Blocker and
Airoldi 2016.
GENEINDEX must be a path to a SEP-delimited file containing at least the
following fields with appropriate column labels:
- chromosome : Integer (starting from 1) chromosome number
- start : Beginning of each ORF (TSS, preferably)
- stop : End of each TSS
Can have start > stop or stop > start depending on the orientation of each
gene along the chromosome.
Details of the required format for the YAML CONFIG files can be found it
further documentation.
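Based on the field requirements above, a minimal GENEINDEX file might look like the following sketch; all coordinates are made up for illustration:

# Create a toy tab-delimited gene index with the required columns
# (chromosome, start, stop); all values here are hypothetical. The second
# gene has start > stop, which is allowed for reverse-orientation genes.
printf 'chromosome\tstart\tstop\n' > data/gene_index.txt
printf '1\t1450\t2380\n' >> data/gene_index.txt
printf '1\t5210\t4120\n' >> data/gene_index.txt
printf '1\t9875\t11020\n' >> data/gene_index.txt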
So, to segment the genome for our example, we run:
[user@system work]$ cplate_segment_genome --minlength=800 \
data/gene_index.txt config/example.yml
We can then simulate reads according to our permutation null with:
[user@system work]$ cplate_simulate_null config/example.yml
With those steps complete, we can run the MCMC-based deconvolution algorithm on the observed and simulated data. This uses the `cplate_deconvolve_mcmc` script, which has the following `--help`:
Usage: cplate_deconvolve_mcmc [options] CONFIG [CONFIG ...]
Options:
-h, --help Show this help message and exit
-c CHROM, --chrom=CHROM
Comma-separated indices of chromosomes to analyze;
defaults to 1
--null Run using null input from CONFIG
--both Run using both actual and null input from CONFIG
--all Run all chromosomes
Details of the required format for the YAML CONFIG files can be found it
further documentation.
These options are used for `cplate_deconvolve_mcmc` and all of the `cplate_summarise*` scripts. By default, each script will run on chromosome 1 of a single file. `cplate_deconvolve_mcmc` is unique in that it must be called via `mpirun` or its specialized equivalent on a given cluster. So, to run MCMC-based deconvolution on our example, we could run the following from the command line:
[user@system work]$ mpirun -np 4 cplate_deconvolve_mcmc --all --both \
config/example.yml
This would run the distributed MCMC-based deconvolution on `all` chromosomes in our example (there's only 1) for `both` the observed and null datasets. The `-np 4` option to `mpirun` tells MPI to use 4 processes.
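If you are unsure whether MPI and `mpi4py` are set up correctly on your system, a quick check (not part of the `cplate` workflow itself) is:

# Each of the 4 launched processes should print a distinct rank from 0 to 3.
mpirun -np 4 python -c "from mpi4py import MPI; print('rank', MPI.COMM_WORLD.Get_rank(), 'of', MPI.COMM_WORLD.Get_size())"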
For actual datasets, we often want far more than 4 processors. An example of doing so through an LSF cluster can be found in the `scripts` folder as `mcmc_example.bsub`:
#BSUB -J mcmcExample
#BSUB -q airoldi
#BSUB -n 120
#BSUB -a openmpi
#BSUB -oo logs/mcmc_example.log
#BSUB -eo logs/mcmc_example.err
# Run MCMC on observed data
mpirun.lsf -np $LSB_DJOB_NUMPROC cplate_deconvolve_mcmc \
--all config/example.yml \
1>logs/mcmc_example_obs.log \
2>logs/mcmc_example_obs.err
# Run MCMC on simulated null data
mpirun.lsf -np $LSB_DJOB_NUMPROC cplate_deconvolve_mcmc \
--all --null config/example.yml \
1>logs/mcmc_example_null.log \
2>logs/mcmc_example_null.err
This script requests 120 cores (not necessarily contiguous) from the airoldi queue, then uses all of these to run the distributed Bayesian deconvolution (HMC) algorithm. The `stdout` and `stderr` output from each MCMC run and the overall job are piped to appropriate files in the `logs/` directory.
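For SLURM-managed clusters (as in the human example), a rough analogue of the LSF script above might look like the following sketch; the partition name and MPI launcher are assumptions that will differ by site:

#!/bin/bash
#SBATCH --job-name=mcmcExample
#SBATCH --partition=general        # hypothetical partition name
#SBATCH --ntasks=120
#SBATCH --output=logs/mcmc_example.log
#SBATCH --error=logs/mcmc_example.err

# Run MCMC on observed data; srun launches one MPI process per allocated task
srun cplate_deconvolve_mcmc --all config/example.yml \
    1>logs/mcmc_example_obs.log 2>logs/mcmc_example_obs.err

# Run MCMC on simulated null data
srun cplate_deconvolve_mcmc --all --null config/example.yml \
    1>logs/mcmc_example_null.log 2>logs/mcmc_example_null.err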
In some cases, it can be useful to call `cplate_deconvolve_mcmc` separately for each chromosome via a `bash` loop. This looks odd, but it may be needed due to inefficient or incomplete garbage collection in Python between chromosomes when multiple chromosomes are used in a single call to `cplate_deconvolve_mcmc`. An example for 16 chromosomes would be:
NCHROM=16
for (( CHROM=1; CHROM <= $NCHROM; CHROM++))
do
# Run MCMC on observed data
mpirun.lsf -np $LSB_DJOB_NUMPROC cplate_deconvolve_mcmc \
--chrom=$CHROM config/example.yml \
1>logs/mcmc_example_obs_chrom`printf '%02d' $CHROM`.log \
2>logs/mcmc_example_obs_chrom`printf '%02d' $CHROM`.err
# Run MCMC on simulated null data
mpirun.lsf -np $LSB_DJOB_NUMPROC cplate_deconvolve_mcmc \
--chrom=$CHROM --null config/example.yml \
1>logs/mcmc_example_null_chrom`printf '%02d' $CHROM`.log \
2>logs/mcmc_example_null_chrom`printf '%02d' $CHROM`.err
done
Once these MCMC runs are complete, the posterior summaries are straightforward to run. However, they do require a substantial amount of memory. An example `bsub` script for running the summaries is provided as `scripts/summaries_example.bsub` and reproduced below:
#BSUB -J summariesExample
#BSUB -q airoldi
#BSUB -n 12
#BSUB -R "span[ptile=12]"
#BSUB -a openmpi
#BSUB -oo logs/summaries_example.log
#BSUB -eo logs/summaries_example.err
# Iterate over null and nonnull cases
for NULL in "" --null
do
# Run base pair level summaries
cplate_summarise_mcmc --all --mmap $NULL config/example.yml
# Run hyperparameter summaries
cplate_summarise_params_mcmc --all $NULL config/example.yml
# Run cluster level summaries
cplate_summarise_clusters_mcmc --all $NULL config/example.yml
done
The `--mmap` option for `cplate_summarise_mcmc` is very important. It tells the script to access the MCMC draws iteratively without loading everything into memory at once. It's not very fast, but it has a huge effect on memory usage.
Once all of these summaries are complete, we need to calibrate our detections to
a given FDR. This requires some manual input, but it's quite straightforward.
First, we run the `analyze_fdr.R` script to obtain the thresholds corresponding to various FDRs:
[user@system work]$ Rscript analyze_fdr.R \
config/example.yml results/fdr_example.txt
For further details on `analyze_fdr.R`, it has a thorough `-h/--help` message.
Once this has been run, we inspect the `results/fdr_example.txt` file and select the appropriate threshold for the `pm` level and FDR of interest. We then edit the `config/example.yml` file to reflect this selection and rerun the detection algorithm with:
[user@system work]$ cplate_detect_mcmc --all config/example.yml
This is not the most elegant part of the workflow, and it can certainly be refined. In particular, it would be better for the configuration files to specify an FDR, with the resulting detection threshold extracted and stored automatically.
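As a sketch of that refinement (all column names, YAML keys, and file formats here are hypothetical and would need to be adapted to the actual output of `analyze_fdr.R` and the `cplate` configuration schema), the threshold lookup and config update could be scripted along these lines:

# Hypothetical automation of the FDR-to-threshold step. Assumes fdr_example.txt
# has a header row with an FDR in column 1 and a threshold in column 2, and that
# the config stores the detection threshold under a key named
# "detection_threshold"; neither assumption is guaranteed by the real outputs.
TARGET_FDR=0.05
THRESHOLD=$(awk -v fdr="$TARGET_FDR" 'NR > 1 && $1 == fdr {print $2}' results/fdr_example.txt)
# Rewrite the config with the selected threshold (note: this drops YAML comments).
python -c "
import yaml
with open('config/example.yml') as f:
    cfg = yaml.safe_load(f)
cfg['detection_threshold'] = float('$THRESHOLD')  # hypothetical key name
with open('config/example.yml', 'w') as f:
    yaml.safe_dump(cfg, f)
"
cplate_detect_mcmc --all config/example.yml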