pipeline is a computational engine for genetic variant detection in a single sample, or in a batch of samples. Almost every step in the pipeline is done via a Makefile (GNU make). These makefiles can be used on their own to accomplish common bioinformatics operations, or they can be used together in a shell script to compose a pipeline.
pipeline is well suited for processing large number of familiar cohorts, and has been deployed on 462 families from SPARK collection at Simons Foundation.
- Input: BAM file(s)
- Optionally process BAM files according to GATK best practices
- Compute callable regions, and subdivide genome into bins of approximately equal size for parallelization
- Call variants with a choice of GATK HaplotypeCaller in GVCF mode, Freebayes, Platypus
- Apply GATK variant recalibration
- Apply hard variant filters
Validation was done against CEUTrio, NA12878
Besides Python 2.7, JDK, GNU make, the following packages are required
- GATK
- freebayes
- platypus
- piccard
- samtools
- sambamba
- bedtools
- bedops
- bgzip, tabix
- bcftools
- vcflib
- SnpEff
All (except GATK) can be installed with an install of bcbio-nextgen.
cd ~
git clone https://github.com/simonsfoundation/pipeline.git
To setup the configuration for the pipeline, you need to edit include_example.mk file that defines Makefile variables.
The pipeline is run as follows:
~/pipeline/ppln/pipe03.sh \
/path/to/input/bams/ \ #dir with bam file(s)
/path/to/output/dir \ #will be created for final output, metrics, log.
familycode \ # common prefix for a group of BAM files (familycode*.bam): family, batch etc;If your BAM files do not have a common prefix, create one via symbolic links
WG \ #binning method EX, WG(recommended)
0 \ #set to 0; if set to 1, pipeline will use existing regions (testing purposes)
tmp \ #if "tmp", work in /tmp, else work in output dir
~/pipeline/ppln/include_example.mk \ #makefile with variable definition
YES \ #if YES/NO - delete/don't delete intermediate files
,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,Freebayes,Platypus,HaplotypeCallerGVCF, \
1 \#if 1, remove working dir on exit
/path/to/pipeline/ppln \
20 \#max number of physical cpu cores to utilize
all \ # 1-12 if process only region defined in /ppln/data, all - work on full file
NO # YES/NO delete/not delete input bam files
- Grid Engine
- Slurm
Submitting via sbatch
sbatch -N1 --exclusive -J batch2 -e batch2.err -o batch2.out --wrap="/mnt/xfs1/home/asalomatov/projects/pipeline/ppln/pipe03.sh /mnt/xfs1/scratch/asalomatov/data/SPARK/bam/batch_2 /mnt/xfs1/scratch/asalomatov/data/SPARK/vars/b2/all batch2 WG 0 work /mnt/xfs1/home/asalomatov/projects/pipeline/ppln/include_example.mk YES ,FixGroups,HaplotypeCallerGVCF,Platypus,Freebayes, 0 /mnt/xfs1/home/asalomatov/projects/pipeline/ppln 25 all NO"
(freebayes tutorial, bcbio blog post)
- Download chromosome 20 high coverage bam file, Broad Institute's truth set, and NIST Genome in a Bottle target regions:
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130103_high_cov_trio_bams/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130103_high_cov_trio_bams/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi
wget -O NA12878-callable.bed.gz ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed.gz
- Run the pipeline:
sbatch -J NA12878 -N 1 --exclusive ~/pipeline/ppln/pipe03.sh ./ ./NA12878 NA12878 WG 0 tmp ~/pipeline/ppln/include_example.mk 0 ,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,RecalibVariants, 1 ~/pipeline/ppln/ 20 all
- Restrict our consideration to chromosome 20, and to the confidently callable regions:
mkdir chr20
tabix -h NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz 20 | vcfintersect -b NA12878-callable.bed | bgzip -c > chr20/NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz
tabix -p vcf chr20/NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz
Do the same to our variant calls.
- Use
vcf-compare
to gauge the concordance between our calls and the true positives from the truth set:
zcat NA12878.wgs.broad_truth_set.20131119-chr20.vcf.gz | grep "^#\|TruthStatus=TRUE_POSITIVE" | bgzip -c > NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz
tabix -p vcf NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz
For Haplotype Caller:
vcf-compare NA12878-HC-vars-flr-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN 1298 NA12878-HC-vars-flr-call.vcf.gz (1.7%)
VN 1740 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (2.3%)
VN 73287 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (97.7%) NA12878-HC-vars-flr-call.vcf.gz (98.3%)
For Haplotype Caller in GVCF mode:
vcf-compare NA12878-JHC-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN 1400 NA12878-JHC-vars-call.vcf.gz (1.9%)
VN 1729 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (2.3%)
VN 73298 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (97.7%) NA12878-JHC-vars-call.vcf.gz (98.1%)
For Freebayes:
vcf-compare NA12878-FB-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN 445 NA12878-FB-vars-call.vcf.gz (0.6%)
VN 3131 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (4.2%)
VN 71896 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (95.8%) NA12878-FB-vars-call.vcf.gz (99.4%)
For Platypus:
vcf-compare NA12878-PL-vars-call.vcf.gz ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz | grep ^VN
VN 2486 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (3.3%)
VN 3071 NA12878-PL-vars-call.vcf.gz (4.1%)
VN 72541 ../NA12878.wgs.broad_truth_set.20131119-chr20-TRUE_POS.vcf.gz (96.7%) NA12878-PL-vars-call.vcf.gz (95.9%)
- Download chromosome 20 alignments for NA1278, NA12891, NA12892, and create symbolic links:
ln -s NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12878.chr20.20120522.bam
ln -s NA12891.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12891.chr20.20120522.bam
ln -s NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam CEUTrio.NA12892.chr20.20120522.bam
ln -s NA12878.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12878.chr20.20120522.bam.bai
ln -s NA12891.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12891.chr20.20120522.bam.bai
ln -s NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.20120522.bam.bai CEUTrio.NA12892.chr20.20120522.bam.bai
- Download a set of high confidence calls for the trio:
wget -O GiaB_NIST_RTG_v0_2.vcf.gz ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/GIAB_integration/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz
tabix -f -p vcf GiaB_NIST_RTG_v0_2.vcf.gz
- Run the pipeline
sbatch -J CEUTrio -N 1 --exclusive ~/pipeline/ppln/pipe03.sh ./ ./CEUTrio CEUTrio WG 0 tmp ~/pipeline/ppln/include_example.mk 0 ,Reorder,FixGroups,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,RecalibVariants, 1 ~/pipeline/ppln/ 20 all
- Filter out loci where the proband is HomRef, one can use
SnpSift
to accomplish this task.
vt normalize -r $GENOMEREF CEUTrio-FB-vars.vcf.gz | java -jar SnpSift.jar filter " GEN[0].GT != '0/0' & GEN[0].GT != '0|0' " | vcfintersect -b ../../NA12878-callable.bed | bgzip -c > CEUTrio-FB-vars-NoHomRef-call.vcf.gz
- Compare with vcf-compare.
Haplotype Caller:
vcf-compare CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz | grep ^VN
VN 456 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (0.7%)
VN 5978 CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz (7.9%)
VN 69434 CEUTrio/CEUTrio-HC-vars-NoHomRef-call.vcf.gz (92.1%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (99.3%)
Haplotype Caller in GVCF mode:
vcf-compare CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz | grep ^VN
VN 470 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (0.7%)
VN 5653 CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz (7.5%)
VN 69420 CEUTrio/CEUTrio-JHC-vars-NoHomRef-call.vcf.gz (92.5%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (99.3%)
Freebayes:
vcf-compare CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz | grep ^VN
VN 791 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (1.1%)
VN 3784 CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz (5.2%)
VN 69099 CEUTrio/CEUTrio-FB-vars-NoHomRef-call.vcf.gz (94.8%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (98.9%)
Platypus:
vcf-compare CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz | grep ^VN
VN 857 GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (1.2%)
VN 7769 CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz (10.1%)
VN 69033 CEUTrio/CEUTrio-PL-vars-NoHomRef-call.vcf.gz (89.9%) GiaB_NIST_RTG_v0_2-chr20-norm.vcf.gz (98.8%)
`,FilterBam,DedupBam,Metrics,IndelRealign,BQRecalibrate,HaplotypeCaller,Freebayes,Platypus,HaplotypeCallerGVCF,``