Outdated
Note that this pipeline is no longer under active development. You can find the latest version at https://github.com/LieberInstitute/SPEAQeasy which is described in the following pre-print: https://www.biorxiv.org/content/10.1101/2020.12.11.386789v1.
This pipeline requires several R packages. You can install them by running:
qsub pipeline_R_setup.sh
- Make a directory to deposit subfolders and processed files:
$DIR
.
Often, $DIR/FASTQ
contains the fastq files for this experiment but the fastq files don't necessarily have to be in this location.
-
In
$DIR
, make text file calledsamples.manifest
which has to be a manifest file just like the ones used in Rail-RNA and Myrna. The exception is that the paths have to be local as no URLs are supported. It has the following structure:- (for a set of unpaired input reads)
<FASTQ FILE>(tab)<optional MD5>(tab)<sample label>
- (for a set of paired input reads)
<FASTQ FILE 1>(tab)<optional MD5 1>(tab)<FASTQ FILE 2>(tab)<optional MD5 2>(tab)<sample label>
- (for a set of unpaired input reads)
The following extensions are supported: fastq.gz
, fq.gz
, fastq
, fq
.
For example, a samples.manifest
file can look like this (these are paired-end samples):
/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R1_001.fastq.gz 0 /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R2_001.fastq.gz 0 sample1
/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R1_001.fastq.gz 0 /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R2_001.fastq.gz 0 sample2
If you reads are split in multiple files and you want to merge them, simply repeat a sample id. The merged files will be saved at in a directory called merged_fastq
.
/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R2_001.fastq.gz 0 /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L005_R2_001.fastq.gz 0 sample1
/dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R2_001.fastq.gz 0 /dcl01/lieber/ajaffe/Nina/GSK_PhaseII/data/Sample_R10126_C1BP4ACXX/R10126_C1BP4ACXX_GAGATTCC_L006_R2_001.fastq.gz 0 sample1
- Run pipeline with following call in
$DIR
:
## You need a compute node to run this script!
qrsh
## Change the path if you cloned this repo elsewhere!
bash /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/sh/rnaseq-run-all.sh --help
Note that you have to use bash and not sh, otherwise you'll get an error like.
-
experiment: main identifier, experiment name.
-
prefix: additional identifier (could be date or reference genome used).
-
reference: supported genomes are
hg19, hg38, mm10, rn6
-
stranded:
FALSE
by default. The other valid options areforward
andreverse
. -
ercc:
FALSE
by default and will be turned toTRUE
if specified. Specify if ERCC mix 1 was added. -
cores: defaults to 8. Specifies how many cores to use per job for the jobs that are parallelized.
-
large:
TRUE
if you want to use double the default memory settings. Useful for large projects (many samples and/or many reads).FALSE
by default and doesn't need to be specified. -
fullcov:
TRUE
if you want to create the fullCoverage object. Set toFALSE
by default. Note that the fullCoverage object is not needed (by default) since we create the normalized mean BigWig already and can use it to define the ERs withderfinder::findRegions()
, then use the resulting GRanges object and the paths to the BigWig files inderfinder::getRegionCoverage(fullCov = NULL, files = bigWigs, regions = outputFrom_findRegions)
or alternatively write the regions to a BED file with rtracklayer, create the counts with bwtool and then read them into R manually (similar to what we did in recount-website). -
bashfolder: defaults to
/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/sh
. It's the directory where the shell files are located at. You only need to specify it if you cloned this repository somewhere else. -
annofolder: defaults to
/dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/Annotation
. We currently don't have scripts for you to automatically reproduce that directory, but you can copy it to another location if you want to have more control on it and change some files. -
Hidden run files will be created with all calls needed to run the pipeline. Each step is submitted to SGE cluster and queued to run sequentially. Steps that can be parallelized are submitted as array jobs. If you with to manually run a given step, check the default options by using the
--help
option. For example:
## Change the path if you cloned this repo elsewhere!
bash /dcl01/lieber/ajaffe/Emily/RNAseq-pipeline/sh/step3-hisat2.sh --help
- Completion emails: by default you will only get an email if a job failed. If you want to get completion emails add the empty file
${DIR}/.send_emails
. You can create it with:
touch ${DIR}/.send_emails
- Cluster queue: by default the shared queue is used, but if you want to specify one create the hidden file
${DIR}/.queue
with the name of the queue inside of it (no spaces, no new lines). For example:
bluejay
Do not save shared
into ${DIR}/.queue
. Otherwise your jobs won't run.
For reproducibility purposes, the information about the version of the pipeline and contents of ANNO_FOLDER
will be stored in logs/pipeline_information.txt
.
Note that you will need the jaffelab package installed among other packages. Most of them can be installed automatically by scripts in the pipeline, but jaffelab
requires you to create the GITHUB_PAT
environment variable following the instructions by devtools::install_github()
.
FastQC Flags (pass/warn/fail)
Flag | Description |
---|---|
FQCbasicStats | Composite statistic: overall pass/warn/fail |
perBaseQual | Quality values across all bases at each position in the FastQ file |
perTileQual | Quality scores from each tile across all bases to see if there was loss in quality associated with only one part of the flowcell |
perSeqQual | Report if a subset of sequences have universally low quality values |
perBaseContent | G, A, T, and C should have roughly equal proportions at each base position |
GCcontent | GC content across the length of each sequence |
Ncontent | Percentage of base calls at each position for which an N was called |
SeqLengthDist | Distribution of fragment sizes (should all be uniform length) |
SeqDuplication | Degree of duplication for every sequence in the library |
OverrepSeqs | Identical sequences that appear at a higher rate than expected |
AdapterContent | Proportion of your library which has adapter sequences at each position |
KmerContent | Measures the number of each kmer at each position in your library and then looks for significant deviations from an even coverage at all positions |
FastQC metrics (There will be two sets of metrics for each sample if library is paired-end.)
Metric | Description |
---|---|
SeqLength | Length of the shortest and longest sequence in the set |
percentGC | Overall %GC of all bases in all sequences |
phredXX-XX | Median Phred score at read positions XX (taken at different positions depending on sequence length) |
phredGT30 | Proportion of reads above Phred score of 30 |
phredGT35 | Proportion of reads above Phred score of 35 |
AdapterXX-XX | Adapter content for Illumina standard adapter at positions XX (taken at different positions depending on sequence length) |
HISAT2 metrics
Metric | Description |
---|---|
numReads | Total number of reads in fastq file |
numMapped | Number of reads that were aligned by HISAT2 to the reference genome |
numUnmapped | Number of reads that did not align to the reference genome |
overallMapRate | Proportion of total reads mapped (numMapped / numReads) |
concordMapRate | If library is paired-end, percentage of read pairs that mapped concordantly |