This page contains the code to run the Rare Disease Blood RNA-seq pipeline in order to highlight candidate genes from genetic data, tarnscriptome and phenotype information.
The following software and data are necessary (versions are the ones that were used in the paper):
- cutadapt 1.11
- STAR 2.4.0j
- Picard tools
- Samtools 0.1.19-96b5f2294a
- BCFtools 1.8
- RSEM v1.2.21
- Vcfanno 0.2.7
- ASEReadCounter 3.8-0-ge9d806836
- Python 2.7
- R 3.3.1
- EXAC data
These scripts rely on the existence of a metadata file containing information on samples to analyze.
- 1.1 Variant filtering
- 1.2 VCF Annotation
- 2.1 Adapter trimming
- 2.2 Alignment
- 2.3 Filters
- 2.4 Quantification
- 2.5 Gene expression data processing
- 2.6 Splicing data processing
- 3.1 Get phenotype to gene link from Human Phenotype Ontology
- 3.2 Parse HPO database
- 4.1 Expression outlier filtering
- 4.2 Splicing outlier filtering
The variant QC generally follows the ExAC guideline as in ref1 and ref2.
- keep only "PASS" variants, determined by the GATK variant quality score recalibration (VQSR)
adhoc filter for non-GATK processed vcfs: keep "." or "num prev seen samples > 30"
- as in "hail" api
.filter_genotypes(
"""
(g.isHomRef && (g.ad[0] / g.dp < 0.8 || g.gq < 20 || g.dp < 20)) ||
(g.isHomVar && (g.ad[1] / g.dp < 0.8 || g.pl[0] < 20 || g.dp < 20)) ||
(g.isHet && ( (g.ad[0] + g.ad[1]) / g.dp < 0.8 || g.ad[1] / g.dp < 0.20 || g.pl[0] < 20 || g.dp < 20))
""", keep = False)
- as in VCF filters:
Filters | FORMAT/GT | FORMAT/DP | FORMAT/GQ | FORMAT/PL | FORMAT/AD |
---|---|---|---|---|---|
keep | =='0/0' | >20 | >20 | PL[0]<20 | AD[0]/DP>0.8 |
keep | =='0/1' | >20 | >20 | PL[1]<20 | AD[0]/DP>0.2 & AD[1]/DP>0.2 & (AD[0]+AD[1])/DP>0.2 |
keep | =='1/1' | >20 | >20 | PL[2]<20 | AD[1]/DP>0.8 |
note 1: PL = -10*log10(likelihood), most likely being PL=0;
note 2: FORMAT information is different across sits/samples/institutions, cannot apply uniform filters
- CGS GT:AD:DP:GQ:PL
- CGS GT:GQ:PL
- CHEO GT
- CHEO GT:AD:DP:GQ:PL
- CHEO GT:GQ:PL
- CHEO GT:PL
- CHEO GT:PL:DP:SP:GQ
- UDN GT:AD
- UDN GT:AD:DP
- UDN GT:AD:DP:GQ:PGT:PID:PL
- UDN GT:AD:DP:GQ:PL
- UDN GT:AD:DP:PGT:PID
- UDN GT:AD:GQ:PGT:PID:PL
- UDN GT:AD:GQ:PL
- UDN GT:AD:PGT:PID
- UDN GT:VR:RR:DP:GQ
Script: generate_variantfilters.sh
Input: VCF files from each individual
Output: Text file containing list of sites passing filters across individuals
bash generate_variantfilters.sh | parallel --jobs 20
- Filter vcf files
Input: VCF files and list of sites passing or not filters
Output: Filtered VCF
Script: filter_allvcf.sh
bash filter_allvcf.sh | parallel --jobs 20
Annotation with gnomAD allele frequency and CADD score
Input: Filtered VCF
Output: Filtered VCF in a unified format across samples
Script: homogenize_vcf_single_sample.sh
find $filtered_variant_dir/*.filtered.vcf.gz | parallel "basename {} .filtered.vcf.gz" | parallel -j 8 "bash homogenize_vcf_single_sample.sh $filtered_variant_dir/{}.filtered.vcf.gz {} $homogenized_vcf_dir" > log_file.txt 2>&1 &
This script will run on 1 chromosome at a time and concatenate results at the end.
Input: Homogenized filtered VCF files from previsous step.
Output: VCF annotated for CADD score and gnomAD allele frequency
Script: launch_vcf_anno_chr.sh This script uses vcf_anno_chr.sh and concatenate_gnomadres.sh
bash launch_vcf_anno_chr.sh > log_file_name.txt 2>&1 &
- Filter for variants with Minor Allele frequency <= 0.01 (keeps singletons)
- Bedtools intersect with gene bed file to get gene name associated with rare variant
Script: launch_RD_vcf_to_RVbed.sh
bash launch_RD_vcf_to_RVbed.sh $annotated_vcf_dir >log_file_name.txt 2>&1 &
This data is used in figure 3c to estimate the number of rare variants that are within 20bp of splicing junctions Script: RV_near_junction_window.sh
bash RV_near_junction_window.sh > log_file_name.txt 2>&1
ref1 Ganna, A., Satterstrom, F. K., Zekavat, S. M., Das, I., Kurki, M. I., Churchhouse, C., … Neale, B. M. (2018). Quantifying the Impact of Rare and Ultra-rare Coding Variation across the Phenotypic Spectrum. American Journal of Human Genetics, 102(6), 1204–1211. https://github.com/andgan/ultra_rare_pheno_spectrum
ref2 Lek, M., Karczewski, K. J., Minikel, E. V, Samocha, K. E., Banks, E., Fennell, T., … Exome Aggregation Consortium. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285–91.
The following steps will allow you to go from adapter trimming to quantification. If you want to run 2.1 to 2.4 with one script, you can use RD_analysis.sh
bash RD_analysis.sh <batch_number> > log_file_name.txt 2>&1 &
For step by step analysis, follow this:
Input: Fastq files
Output: Trimmed Fastq files
Parameters:
fastq_dir
: Input directory that contains FASTQ files. The following code assumes that there are gzipped, paired R1 and R2 files for each sample, and it excludes "Undetermined" FASTQ files generated bybcl2fastq
.cutadapt_script
: trim_adapters_150bpreads.sh #include path to script
ls *merge_R1.fastq.gz | sed 's/_/\t/'| awk '{print $1}' |awk -v fastq_dir=$FASTQ_DIR 'BEGIN{OFS="\t"}{print fastq_dir"/"$1"_merge_R1.fastq.gz",fastq_dir"/"$1"_merge_R1.trimmed.fastq.gz", fastq_dir"/"$1"_merge_R2.fastq.gz",fastq_dir"/"$1"_merge_R2.trimmed.fastq.gz", fastq_dir"/"$1"_merge_short_reads.fastq.gz",fastq_dir"/"$1"_fastq_trimadapters.log"}' | \
parallel --jobs 15 --col-sep "\t" "${cutadapt_script} {1} {2} {3} {4} {5} {6}"
Alignment of trimmed reads using STAR
Input: Trimmed fastq files
Output: Genome and transcriptome bam files
Parameters:
index_dir
: Directory containing STAR INDEXbam_dir
: Directory for BAM file outputSTAR_script
:STAR_alignment_rare_disease.sh #include path to scriptReference genome
: hg19.faGene annotation File
: Gencode v19 GTF
ls *merge_R1.trimmed.fastq.gz | sed 's/_/\t/'| awk '{print $1}' |awk -v fastq_dir=$FASTQ_DIR -v bam_dir=$BAM_DIR 'BEGIN{OFS="\t"}{print fastq_dir"/"$1"_merge_R1.trimmed.fastq.gz", fastq_dir"/"$1"_merge_R2.trimmed.fastq.gz",bam_dir"/"$1}' |\
parallel --jobs 5 --col-sep "\t" "${STAR_script} {1} {2} {3} $<index_dir>/STAR_INDEX_OVERHANG_150/ 150"
Input: Transcriptome bam
Output: Filtered Transcriptome bam
Parameters:
FILTER_script
: Filters_bam_uniq_mq30_duplicates_rare_disease_transcriptome.sh #include path to script
cd $BAM_DIR
ls ${BAM_DIR}/*.Aligned.toTranscriptome.out.bam | parallel --jobs 10 --col-sep "\t" "${FILTER_script} {1}"
Filter junction file for junctions detected with at least 10 uniquely spanning reads.
Input: SJ.out.tab files for each sample. This file is an output of STAR alignement
Output: Filtered junction files (.SJ.out_filtered_uniq_$threshold.tab)
Parameters:
Junc_file
: Junction file generated by STARthreshold
: Number of unique reads spanning the junction (Set to 10 in the analysis)prefix
: prefix used for output (i.e sample name)
cat $Junc_file | awk -v T=$threshold '$7>=T ' > $prefix.SJ.out_filtered_uniq_$threshold.tab
Use RSEM v.1.2.21 for quantification to obtain gene-level and transcript-level quantification
Parameters:
RSEM_script
: rsem_calculate_expression_raredisease.sh #include path to script
ls *.Aligned.toTranscriptome.out_mapq30_sorted_dedup_byread.bam | sed 's/\./\t/'|awk '{print $1}' |awk -v bam_dir=$BAM_DIR 'BEGIN{OFS="\t"}{print bam_dir"/"$1".Aligned.toTranscriptome.out_mapq30_sorted_dedup_byread.bam",$1}' | \
parallel --jobs 10 --col-sep "\t" "bash ${RSEM_script} {1} {2}"
Input: gene level expression levels as output by RSEM
Output: Matrix of normalized counts
Script: sva_pipeline.r
Steps:
- Read in raw count data output by RSEM
- Filter genes for minimum expression threshold (in the paper - TPM>0.5 in >50% of samples from each batch)
- Log transform
- Scale and center to generate expression Z-scores
- Run SVA
- Regress out SVs and SV+regression splines for SVs significantly associated with sample batch
- Run QC checks
Input: Corrected counts
Output: Gene Sample Zscores file
Script: count_to_zscores.R
- Read in corrected gene expression count data (corrected_counts/gene/blood/[gene_count].txt)
- Center and scale gene expression counts matrix to generate Z-scores
- Write output
This set of scripts is generating junction ratios and calculating splicing Z-scores for every sample
To get started you will need
- Metadata file
- Tissue to perform the analysis on (i.e. Blood)
- Junction files generated during STAR alignment (filtered for junctions with at least 10 reads uniquely spanning)
Parameters:
metadatafile.tsv
: file containing metadata information regarding the samples to analyze.tissue
: What tissue (from the metadata file) to analyze (i.e Blood)junc_dir
: Directory containing filtered junctions obtained after 2.3.2output_dir
: Directory in which output will be writtenfreeze_sample_only
: TRUE/FALSE statement whether to include only samples for which in_freeze column set to yes in metadata file.DGN_samples
: TRUE/FALSE statement wether to include or not DGN samples in the analysisPIVUS_samples
: TRUE/FALSE statement wether to include or not PIVUS samples in the analysis
Script: splicing_outlier_analysis.sh Relies on make_list_junction_files.R and splicing_outlier.py
Example command line:
bash splicing_outlier_analysis.sh \
metadatafile.tsv \ # metadata file to use
Blood \ # tissue of analysis
$junc_dir \ # folder containing filtered junction files
$output_dir \ # output folder
TRUE \ # wether or not to perform on samples from the freeze only
FALSE \ # wether or not to include DGN samples
TRUE > log_file.txt 2>&1 & # wether or not to include PIVUS cohort in the analysis
During this step, missing splicing ratios are imputed, data is corrected for batch effects and Zscores are calulated.
Input: _junc_ratios_filtered.txt from 2.6.2
Output: file with zscores for each evaluated junction.
Script: splicing_ratio_to_zscores.R
Rscript splicing_ratio_to_zscores.R
Download latest version of Human Phenotype Ontology (https://hpo.jax.org)
phenotype_to_gene.txt gene_to_pehnotype.txt
- Download latest version of [HPO obo] (https://hpo.jax.org/app/download/ontology)
- Parse database to get parent and child terms parse_hpoterms.py
- build_outlier_list.r This script outputs the z-score for each gene in the corrected data. A separate file is output for each sample. The working directory is /srv/scratch/restricted/rare_diseases/analysis/outlier_analysis/expression_level/. Arguments:
- absolute path to latest corrected expression data matrix
- absolute path to latest metadata file
- get_rare_var_gene.sh This script takes the outlier files from build_outlier_list.r and maps rare variants (MAF < 0.01) in genes or +/- 10kb around genes. This is done on a per-sample level. The output is a file containing gene z-score, position of rare variant, allele frequency, and cadd for each sample-gene pair. The filename is "outliers_rare_var_combined_10kb.txt".
This step is filtering expression outlier data according to different criteria. Candidate genes at each filtering step are printed in output
Parameters:
metadata file
: Metadata file containing sample informationexac file
: containing pLI score across genesHPO gene to phenotype file
HPO phenotype to gene file
HPO parent - child - alternative ID
: obtained with parse_hpoterms.pyoutliers_rare_var_combined_10kb.txt
: expression Z-score results with RV information obtained in 4.1.1
Script: filter_expression_candidates.R
Filters:
EXP_OUTLIER
: under-expression outlier (Zscore<=-2)EXP_OUTLIER_PLI
: under-expression outliers in genes with pLI score >=0.9EXP_OUTLIER_HPO
: under-expression outliers in genes linked to the phenotype (HPO terms)EXP_OUTLIER_RV
: under-expression outliers in genes with a rare variant within 10kbEXP_OUTLIER_RV_CADD
: under-expression outliers in genes with a deleterious rare variant within 10kb (CADD score >=10)EXP_OUTLIER_RV_CADD_HPO
: under-expression outliers in genes linked to the phenotype (HPO terms) with a deleterious rare variant within 10kb (CADD score >=10).
- distance=20bp
Script:splicing_outlier_genes_with_RV_window.sh
Input: splicing outlier file and sample rare variant files
Output: file containing rare variant information around splicing junctions
# sort splicing outlier file
sort -k1,1 -k2,2n splicing_outlier_file.txt> splicing_outlier_file_sorted.txt
# combine splicing outlier file with variant data within a 20bp window of junction files
bash splicing_outlier_genes_with_RV_window.sh splicing_outlier_file_sorted.txt > <log_file> 2>&1 &
Splicing outlier data is filtered accroding to criteria presented in the manuscript. Candidate genes at each filtering step are printed in a seperate text file.
Script: filter_splicing_candidates.R
Filters:
SPLI_OUTLIER
: Splicing outlier gene (at least one junction with |Z-score|>=2)SPLI_OUTLIER_PLI
: Splicing outlier in genes with pLI score >=0.9SPLI_OUTLIER_HPO
: Splicing outlier in genes linked to the phenotype (HPO terms)SPLI_OUTLIER_RV
: Splicing outlier in genes with a rare variant within 20bp of the outlier junctionSPLI_OUTLIER_RV_CADD
: Splicing outlier in genes with a deleterious rare variant within 20bp of the outlier junction(CADD score >=10)SPLI_OUTLIER_RV_CADD_HPO
: Splicing outlier in genes linked to the phenotype (HPO terms) with a deleterious rare variant within 20bp of the outlier junction(CADD score >=10)
We recommend using the most stringent filter for final results (SPLI_OUTLIER_RV_CADD_HPO
).
The last step consists in combining candidate genes obtained using both expression outlier and splicing outlier information. We recommend to use our most stringent filters as they show the best results at highlighting the causal gene in our analyses.
The most stringent filter consist in filtering the expression/splicing outliers for genes:
- with deleterious rare variants nearby
- that are phenotypically relevant
Those results correspond to files with prefix EXP_OUTLIER_RV_CADD_HPO
and SPLI_OUTLIER_RV_CADD_HPO
If no candidate is prioritized using those lists, the user can then look at other filter steps.