Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do VQSR for HaplotypeCaller calls #89

Closed
maxulysse opened this issue Jan 27, 2020 · 6 comments
Closed

Do VQSR for HaplotypeCaller calls #89

maxulysse opened this issue Jan 27, 2020 · 6 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@maxulysse
Copy link
Member

Issue by @malinlarsson, moved from SciLifeLab#513

Filter the calls from HaplotypeCaller with Variant Quality Score Recalibration according to GATK best practise (Tools VariantRecalibrator, ApplyRecalibration, see https://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr, cf https://gatk.broadinstitute.org/hc/en-us/articles/360037594511-VariantRecalibrator or a more recent version)

Useful comment by @apeltzer

Keep in mind, that you'd need a "bit" of data for doing VQSR properly. The recommendation was to use at least 30 WES or 1WGS sample for performing VQSR.

I started working on a solution (in NGI-ExoSeq) to automatically download 35 of the 1000G Exome Datasets, run HaplotypeCaller on them and use them for analysis procedures with less than the minimum required 30 WES samples.

@maxulysse maxulysse added the enhancement New feature or request label Jan 27, 2020
@maxulysse maxulysse self-assigned this Jan 27, 2020
@maxulysse maxulysse assigned szilvajuhos and unassigned maxulysse Mar 10, 2020
@szilvajuhos
Copy link
Contributor

I am playing with VQSR (Variant Quality Score Recalibration https://gatk.broadinstitute.org/hc/en-us/articles/360036734411-VariantRecalibrator ) that is available for HaplotypeCaller: this procedure adds PASS filter for the VCF, classifying germline variants for high or low confidence. What I can see is that we are starting from ~5M raw germline calls:

szilva@munin ~/d1/dev/VQSR $ awk '!/^#/{print}' /data1/P2233/P2233_101T_P2233_120N/VariantCalling/HaplotypeCaller/haplotypecaller_P2233_120N.AF.vcf | wc -l
4912476

and snpEff leaves them intact:

szilva@munin ~/d1/dev/VQSR $ awk '!/^#/{print}' /data1/P2233/P2233_101T_P2233_120N/Annotation/SnpEff/haplotypecaller_P2233_120N.AF.snpEff.ann.vcf | wc -l
4912476

on the other hand, VEP simply ignores those lines where there is no annotation available:

szilva@munin ~/d1/dev/VQSR $ awk '!/^#/{print}' /data1/P2233/P2233_101T_P2233_120N/Annotation/VEP.CADD/haplotypecaller_P2233_120N.AF.snpEff.ann.vep.ann.vcf| wc -l
1648613

If I do a recalibration on the VEP results, more than half of those will be thrown away:

szilva@munin ~/d1/dev/VQSR $ zcat haplotypecaller_P2233_120N.AF.snpEff.ann.vep.ann.recalibrated.vcf.gz| awk '/PASS/{print}'| wc -l
726759

without recalibration there are ~600 "high-impact" germline variants:

szilva@munin ~/d1/dev/VQSR $ zcat haplotypecaller_P2233_120N.AF.snpEff.ann.vep.ann.recalibrated.vcf.gz| awk '!/^#/&&/HIGH/{print}'| wc -l
575

Recalibration throws ~75% of these (supposedly false positive) high impact away:

szilva@munin ~/d1/dev/VQSR $ zcat haplotypecaller_P2233_120N.AF.snpEff.ann.vep.ann.recalibrated.vcf.gz| awk '!/^#/&&/HIGH/&&/PASS/{print}'| wc -l
139

@szilvajuhos
Copy link
Contributor

Actual script used

#!/bin/bash -x                                                                                                                                                                                                                                
GRCH38=/data1/references/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta                                                                                                                            
REFDIR=/data1/references/annotations/GATK_bundle                                                                                                                                                                                              
HAPMAP=${REFDIR}/hapmap_3.3.hg38.vcf.gz                             # training with high-confidence TPs                                                                                                                                       
OMNI1000G=${REFDIR}/1000G_omni2.5.hg38.vcf.gz                       # training with some FPs                                                                                                                                                  
PHASE11000G=${REFDIR}/1000G_phase1.snps.high_confidence.hg38.vcf.gz                                                                                                                                                                           
DBSNP146=${REFDIR}/dbsnp_146.hg38.vcf.gz                            # validation                                                                                                                                                              
LOCALBASE=`basename $1`                                                                                                                                                                                                                       
VQSROUT=${LOCALBASE%.vcf*}                                                                                                                                                                                                                    
                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                              
# SNP recalibration                                                                                                                                                                                                                           
                                                                                                                                                                                                                                              
singularity exec /data1/containers/sarek-latest.simg gatk VariantRecalibrator \                                                                                                                                                               
  -R ${GRCH38} \                                                                                                                                                                                                                              
  -V $1 \                                                                                                                                                                                                                                     
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${HAPMAP} \                                                                                                                                                               
  --resource:omni,known=false,training=true,truth=false,prior=12.0 ${OMNI1000G} \                                                                                                                                                             
  --resource:1000G,known=false,training=true,truth=false,prior=10.0 ${PHASE11000G} \                                                                                                                                                          
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${DBSNP146} \                                                                                                                                                              
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \                                                                                                                                                                            
   -mode SNP \                                                                                                                                                                                                                                
   -O ${VQSROUT}.VQSR.recal.vcf \                                                                                                                                                                                                             
   --tranches-file ${VQSROUT}.VQSR.tranches \                                                                                                                                                                                                 
   --rscript-file ${VQSROUT}.plots.R      # needs r-ggplot2 conda package                                                                                                                                                                     
                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                              
singularity exec /data1/containers/sarek-latest.simg gatk ApplyVQSR \                                                                                                                                                                         
  -R ${GRCH38} \                                                                                                                                                                                                                              
  -V $1 \                                                                                                                                                                                                                                     
  -O ${VQSROUT}.recalibrated.vcf.gz \                                                                                                                                                                                                         
  --truth-sensitivity-filter-level 99.0 \                                                                                                                                                                                                     
  --tranches-file  ${VQSROUT}.VQSR.tranches \                                                                                                                                                                                                 
  --recal-file ${VQSROUT}.VQSR.recal.vcf \                                                                                                                                                                                                    
  -mode SNP                                                                                                                                                                                                                                   

@apeltzer
Copy link
Member

Super cool work @szilvajuhos ! :-)

@szilvajuhos
Copy link
Contributor

szilvajuhos commented Mar 26, 2020

I have to correct myself - colleague (Teresita) pointed out that we (as in Sarek) are using a VEP settings that is not printing out common variants: so VEP works just fine as asked, but not printing out the common ones. This implies we should do VQSR before annotation (as planned). Only have to find time to do the actual implementation/testing.

@szilvajuhos
Copy link
Contributor

Right now VSQR is working for WGS samples, but I will need help in adding some GATK bundle files to iGenomes and to config in general. The script I can use for VSQR is below. Comparison is at dev...szilvajuhos:vsqr

#!/bin/bash -x
nextflow \
  -log HC.log run main.nf \
  -profile munin \
  --step variantcalling \
  --tools haplotypecaller \
  --no_gvcf \
  --input VSQR/HC_test_single.tsv \
  --vsqr \
  --skip_qc all \
  --hapmap /data1/references/annotations/GATK_bundle/hapmap_3.3.hg38.vcf.gz \
  --hapmap_index /data1/references/annotations/GATK_bundle/hapmap_3.3.hg38.vcf.gz.tbi \
  --known_indels /data1/references/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known_indels_index /data1/references/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi $1

@FriederikeHanssen
Copy link
Contributor

VQSR for joint germline calling is added here #595 , filtering for single samples was added in #557 using CNNScoreVariants and FilterVariantTranches

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants