The rlsim
package is a collection of tools for simulating RNA-seq
library construction, aiming to reproduce the most important factors
which are known to introduce significant biases in the currently used
protocols: hexamer priming,
PCR amplification and size selection.
It allows for a systematic exploration of the effects of the individual biasing
factors and their interactions on downstream applications by simulating
data under a variety of parameter sets.
The implicit simulation model implemented in the main tool (rlsim
) is
inspired by the actual library preparation protocols and it is more
general than the models used by the bias correction methods hence it allows for
a fair assessment of their performance.
Although the simulation model was kept as simple as possible in order to
aid usability, it still has too many parameters to be inferred from data
produced by standard RNA-seq experiments. However, simulating datasets
with properties similar to specific datasets is often useful. To address
this, the package provides a tool (effest
) implementing simple
approaches for estimating the parameters which can be recovered from
standard RNA-seq data (GC-dependent amplification efficiencies, fragment
size distribution, relative expression levels).
An associated manuscript is in preparation, meanwhile the package should be cited as:
- Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman (2013) Realistic simulations reveal extensive sample-specificity of RNA-seq biases arXiv:1308.3172
The analysis pipeline used to generate the results is available at github.com/sbotond/paper-rlsim.
rlsim
was brought to you by the Goldman group from EMBL-EBI.
-
Simulation of priming biases loosely based on a nearest-neighbor thermodynamic model.
-
Exact simulation of PCR amplification on the level of individual fragments (consistent across expression levels, no approximations).
-
Fragment-specific amplification efficiencies determined by GC-content and length.
-
Possibility to simulate PCR and sampling pseudo-replicates.
-
Simulation of size selection and polyadenylation with flexible target distributions.
-
Estimation of GC-dependent amplification efficiencies from real data, relying on assumptions about locality of biases and the mean efficiency of the fragment pool.
-
Estimation of relative expression levels.
-
Estimation of empirical fragment size distribution, model selection between normal vs. skew normal distributions.
-
Able to simulate experiments on the human transcriptome over a wide range of expression levels on a desktop machine.
The following basic examples can be run from the src/
directory in the
package source tree:
- Re-sample expression levels from a mixture of gamma distributions with two components with mean 5,000 and 10,000:
$ ../tools/sel -d "0.5:g:(5000, 0.1) + 0.5:g:(10000, 100)" \
test/basic/test_transcripts.fas > my_transcripts.fas
The output my\_transcripts.fas is a Fasta file annotated with the
expression levels:
>ENST00000371588$9430
GCTTCCGGCATCTGGCTCAGTTCCGCCATGGCCTCCTTGGAAGTCAGTCGTAGTCCTCGCAGGTCTCGGCGGGAGCTG ...
...
- Simulate 10,000 fragments with default parameters, plot rlsim report:
$ ./rlsim -n 10000 my_transcripts.fas > frags.fas
$ ../tools/plot_rlsim_report
The output file frags.fas contains the simulated fragments:
>Frag_0 ENST00000374005 (Strand - Offset 1883 -- 2298)
AGAGAATAGAGGGTAGAAGGGAAATTCTTGGCACCTGGACTAGAGTGAGATAAAAGGAGAGTAGGAAAGCAGTGA ...
...
- Simulate paired-end sequencing using simNGS and the runfile shipped with the package source:
$ cat frags.fas | simNGS -p paired -o fastq -O reads test/cov/s_4_0066.runfile
The files `reads_end1.fq` and `reads_end1.fq` contain the simulated
paired-end reads:
@Frag_24929 refB (Strand - Offset 11583 -- 12067) 151M
GCCCCGAGTAGTTCTGGGCGGGGCCCCCGCGGCCAGCGCCGCCCACTATATATTATTTATTCTAACTATT ...
+
GGGEFG(EGFGGDGBFEGGG;FD7GEGGDGGA=GGGDG7FFGGGGGFGCAGGGGGFF?GGGFG@GGAGGG ...
...
- Simulate 500,000 fragments, skew normal fragment size distribution
with a spike,
after_prim_double
fragmentation method, 15 PCR cycles with the specified efficiency parameters, using 4 cores, verbose mode:
$ ./rlsim -n 50000 -d "0.9:sn:(600,50,4,300,2000) + 0.1:n:(700,1,600,2000)" \
-f after_prim_double -c 15 -eg "(1.0,0.5,0.8)" -el "(0.1,0.7,1.0)" \
-t 4 -v my_transcripts.fas > frags.fas
$ ../tools/plot_rlsim_report
- Estimate parameters from SAM file sorted by read name using (verbose mode):
$ ../tools/effest -v -f ../tools/test/ref.fas ../tools/test/aln1.sam
- Estimate parameters from SAM file sorted by read name using – assume 15 PCR cycles and a pool efficiency of 0.9:
$ ../tools/effest -v -c 15 -m 0.9 -f ../tools/test/ref.fas ../tools/test/aln1.sam
- Simulate 20,000 fragments using the raw parameters estimated by
effest
, set minimum GC-dependent efficiency to 0.5 (verbose mode):
$ ./rlsim -v -n 20000 -j raw_params.json -jm 0.5 my_transcripts.fas > frags.fas
$ ../tools/plot_rlsim_report
Please consult the package documentation for more help on the tools and the technical background.
The BioStar Q&A forum (http://www.biostars.org) is an excellent place to get additional help.
The author of the package will monitor the posts having the rlsim
tag.
The package runs on 64-bit GNU/Linux operating systems. The rlsim tool
is written in golang and shipped as a statically linked
executable for the amd64
Linux platforms, hence it has no
dependencies.
The rlsim tool can be built for other architectures supported by the
compiler, however only the amd64
architecture is supported and the
32-bit binaries might not work properly.
The parameter estimation tool (effest
) and the additional tools are
written in Python 2.x
and depend on a couple of packages:
-
numpy >= 1.6.2
-
matplotlib >= 1.1.0
-
scipy >= 0.10.1
-
biopython >= 1.60
-
A modified version of the HTSeq package available from the GitHub repository under https://github.com/sbotond/rlsim/tree/master/misc.
The official releases of the HTSeq package contain a bug causing segmenation fault when parsing certain paired-end datasets. Please use the modified version from the rlsim repository!
These packages (with the exception of the modified HTSeq) are readily installable from the Python Package Index using the pip tool by issuing the following command:
pip install numpy matplotlib scipy biopython
The additional tool tools depend on samtools samtools. Simulating Illumina sequencing of the fragments can be done by using the simNGS package (recommended) or any other sequencing simulator.
The release tarballs can be obtained from the releases directory:
wget https://github.com/sbotond/rlsim/blob/master/releases/rlsim-latest_amd64.tar.gz?raw=true t.tgz -O rlsim-latest_amd64.tar.gz
The unpacked release directory contains the following files:
-
bin/
– directory containing the executables:-
rlsim
– the main tool simulating library construction -
effest
– tool for estimating selected parameters from real datasets -
sel
– tool for sampling expression levels -
plot_rlsim_report
– tool for plotting therlsim
report -
pb_plot
– tool for visualising sequence biases -
cov_cmp
– tool for comparing coverage trends across datasets -
plot_cov
– tool for plotting transcript read coverage colored by reference base
-
-
COPYING – GPL v3 licence
-
README.md – short instructions in markdown format
-
rlsim_manual.pdf
– package manual
The executables under bin/
can be installed by copying them to a
directory listed in the $PATH
environmental variable.
Build dependencies:
-
The package is built using
make
in a standardLinux
environment. -
Building the rlsim tool requires the
Go
compiler which can be installed as described on the projects website. -
A standard
LaTeX
installation withpdflatex
has to be present in order to produce the package documentation. -
Regenerating the test datasets needs the bwa bwa and samtools commands to be installed.
The package source can be obtained by cloning the GitHub
repository
and built by issuing make
in the top level directory:
git clone https://github.com/sbotond/rlsim.git
cd rlsim
make
A tarball can be built by issuing:
make release
rlsim
can be compiled with the recent version (>=4.7.2) of the gccgo
compiler:
-
Install the gc
Go
compiler suite on the projects website, as the build process uses thego
tool. -
Install gccgo as described here, or through your package manager.
-
Issue
make gccbuild
under `src/'.
Benchmarks indicate that the gccgo
build is faster on average, however the difference in runtime is not substantial.
Please note that the experimental gccgo
builds are not supported. Feel free to use them, but please do not submit bug reports if anything goes wrong.
Simulate RNA-seq library preparation with priming biases, PCR biases and size selection (version: 1.3).
Usage:
rlsim [arguments] [transcriptome fasta files (optional)]
Optional arguments:
argument type default
-n requested fragments int
-d fragment size distribution string "1.0:sn:(189, 24, -1.09975, 76, 294)"
-f fragmentation method string "after_prim_double"
-b strand bias float 0.5
-c PCR cycles int 11
-p priming bias parameter float 5.0
-k primer length int 6
-a poly(A) tail size dist. string [check source]
-flg fragment loss probability float 0.0
-m expression level multiplier float 1.0
-e fixed PCR efficiency float 0.0
-eg GC efficiency parameters
as "(shape, min, max)": raw from SRR521457
-el length efficiency parameters
as "(shape, min, max)":
shape float 0.0
min float 1.0
max float 1.0
-j raw parameter file string
superseeds -d, -c, -eg
-jm minimum raw gc efficiency float 0.0
-r report file string "rlsim_report.json"
-t number of cores to use int 4
-g keep fragments in memory bool false
-si initial random seed int from UTC time
-sp pcr random seed int auto
-ss sampling random seed int auto
-gobdir fragment directory string "rlsim_gob_$PID"
-v toggle verbose mode bool false
-h print usage and exit bool false
-V print version and exit bool false
-prof write CPU profiling info string ""
-gcfreq trigger garbage collection int 100
after this many transcripts
-randt generate RNG test files bool false
Examples:
rlsim -n 2000000 transcripts.fa
cat transcripts.fa | rlsim -n 2000000
For more details consult the package manual at:
https://github.com/sbotond/rlsim/tree/master/doc/rlsim_manual.pdf
usage: effest [-h] [-f ref_fasta] [-i iso_list] [-c nr_cycles] [-m mean_eff]
[-M max_eff] [-d dist_fam] [-g out_fasta] [-j out_json]
[-e expr_mul] [-a] [-t] [-w step_size] [-s out_count_file]
[-k in_count_file] [-p out_prior_file] [-o in_prior_file]
[-q min_qual] [-r report_file] [-l log_file] [-v]
[input file]
Estimate GC-dependent fragment amplification efficiencies and fragment size
distribution from paired-end RNA-seq data mapped to transcriptome (version
1.1).
positional arguments:
input file Aligned *paired end* reads in SAM format sorted by
*name*.
optional arguments:
-h, --help show this help message and exit
-f ref_fasta Reference fasta.
-i iso_list List of single isoform transcripts.
-c nr_cycles Number of PCR cycles (11).
-m mean_eff Assumed pool efficiency (0.87).
-M max_eff Assumed maximum efficiency (None).
-d dist_fam Distribution to model fragment size distribution
(n|sn|*auto*).
-g out_fasta Output fasta.
-j out_json File to store estimated raw parameters (raw_params.json).
-e expr_mul Expression level multiplier (10000.0).
-a Do not use GC efficiency correction on expression levels
(False).
-t Trim off old expression values (True).
-w step_size Sliding window size / step size ratio (5).
-s out_count_file Pickle counts to the specified file (effest_counts.pk).
-k in_count_file Load counts from specifies file.
-p out_prior_file Pickle fragment prior to the specified file
(effest_pr.pk).
-o in_prior_file Load fragment prior from the specified pickle file.
-q min_qual Minimum mapping quality (0).
-r report_file Report PDF (effest_report.pdf).
-l log_file Log file.
-v Toggle verbose mode (False).
usage: plot_rlsim_report [-h] [input file]
Plot rlsim report (version 1.0).
positional arguments:
input file rlsim report file.
optional arguments:
-h, --help show this help message and exit
usage: sel [-h] [-t] [-d dist_param] [-b nr_bins] [-r report_fil]
[input fasta file]
Sample expression levels from a mixture of gamma distributions (version 1.0).
positional arguments:
input fasta file Transcripts and expression levels in Fasta format.
optional arguments:
-h, --help show this help message and exit
-t Trim sequence names.
-d dist_param Expression level distribution.
-b nr_bins Number of bins in histogram.
-r report_fil Report PDF file.
usage: pb_plot [-h] [-f ref_fasta] [-r report_file] [-w winsize] [-i tr_list]
[-q min_qual] [-p pickle_file] [-s]
[input file]
Visualise sequence biases around fragment start/end (version 1.1).
positional arguments:
input file Aligned *paired end* reads in SAM format.
optional arguments:
-h, --help show this help message and exit
-f ref_fasta Reference sequences in fasta format.
-r report_file Name of PDF report file.
-w winsize Window size.
-i tr_list List of single isoform transcripts.
-q min_qual Minimum mapping quality.
-p pickle_file Results pickle file.
-s Assume single ended dataset.
usage: cov_cmp [-h] -f ref_fasta [-g] [-t nr_top] [-c min_cov] [-i iso_list]
[-l min_length] [-x] [-y] [-r report_file] [-q min_qual]
[-p pickle_file] [-v] [-s]
input file input file
Compare relative coverage trends between the *expressed* transcripts of two
datasets (version 1.1).
positional arguments:
input file Two sets of aligned *paired end* reads in SAM format.
optional arguments:
-h, --help show this help message and exit
-f ref_fasta Reference sequences in fasta format.
-g Do not color by AT/GC.
-t nr_top Plot at least this many top matches (30).
-c min_cov Minimum number of fragments per transcript (20).
-i iso_list List of single isoform genes.
-l min_length Minimum transcript length.
-x Sort by correlation coefficients.
-y Plot pairwise cumulative coverage.
-r report_file Name of PDF report file.
-q min_qual Minimum mapping quality (0).
-p pickle_file Results pickle file.
-v Toggle verbose mode.
-s Assume single ended dataset.
usage: plot_cov [-h] -r ref_fasta -b bam [-o outfile]
Plot read coverage colored by the reference base (AT - blue, GC - red). This
tools requires samtools to be installed in path.
optional arguments:
-h, --help show this help message and exit
-r ref_fasta Reference transcriptome.
-b bam Position sorted and indexed BAM file.
-o outfile Output PDF (plot_cov.pdf).