Variant_Recalibrator

Chaochih Liu edited this page Mar 4, 2021 · 6 revisions

Basic Usage

The Variant_Recalibrator handler uses the Genome Analysis Toolkit (GATK) and user-provided prior sets of "truth" variants to build a model that attempts to separate true variants from false positives. It outputs a VCF file in which the FILTER field is annotated with either "PASS" or a label that denotes a false positive. Because the model requires a large training set, Variant_Recalibrator will not function properly on datasets with fewer than 30 exome samples. GATK has found empirically in human data that at least 1 whole genome or 30 exome samples are needed to provide enough variant sites for decent modeling. What matters is the number of variant sites, so these minimum sample counts are only estimates. See the documentation on Variant Quality Score Recalibration for details.
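Since the number of variant sites matters more than the number of samples, it can help to count the sites in the input VCF files before running the handler. The following is a minimal sketch, not part of sequence_handling; the function name `count_variant_sites` is hypothetical, and it simply counts non-header lines in a plain or gzip-compressed VCF.

```shell
# Hypothetical helper: print the number of variant records in a VCF file.
# Header lines (starting with '#') are skipped. Files ending in .gz are
# decompressed on the fly; anything else is read as plain text.
count_variant_sites() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '!/^#/ { n++ } END { print n + 0 }'
}
```

For example, `count_variant_sites /path/to/chromosome.vcf.gz` prints a single integer. The path here is a placeholder.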

To run Variant_Recalibrator, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Variant_Recalibrator can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling Variant_Recalibrator Config

Where Config is the full file path to the configuration file.
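Because Config must be given as a full file path, a quick check before submission can catch a relative or mistyped path. This is only a sketch; `check_config_path` is a hypothetical helper and is not provided by sequence_handling.

```shell
# Hypothetical helper: succeed only if the argument is an absolute path
# to a readable file. Error messages go to stderr.
check_config_path() {
    case "$1" in
        /*) ;;  # absolute path, continue to the readability check
        *)  echo "Config path must be a full (absolute) path: $1" >&2
            return 1 ;;
    esac
    if [ ! -r "$1" ]; then
        echo "Config file not found or not readable: $1" >&2
        return 1
    fi
    return 0
}

# Example usage (the path is a placeholder):
# CONFIG=/path/to/Config
# check_config_path "$CONFIG" && ./sequence_handling Variant_Recalibrator "$CONFIG"
```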

Handler-Specific Variables

The following is a list of variables that must be defined within Config. In addition to these handler-specific variables, all common variables must be defined.

| Variable | Function |
| --- | --- |
| VR_QSUB | QSub settings for batch submission. Recommended settings are "mem=250gb,nodes=1:ppn=16,walltime=24:00:00". |
| VR_QUEUE | The specific queue where the job will be submitted. Attempting to run sequence_handling on a different server than the one specified will produce an error. Choose from: "lab", "mesabi", "ram256g", or other queues shown here. Recommended queue is "ram256g". |
| VR_REF | The full file path to the reference. For barley, use the full pseudomolecular reference here, not the parts reference. |
| VR_VCF_LIST | A list of full file paths to chromosomal VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh. |
| HC_PRIOR | The prior for the high-confidence subset. Recommended value: 5 |
| RESOURCE_# | The resource VCF files used to train the model. These should be from the same organism and reference version as your samples. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors. |
| PRIOR_# | The prior for each resource VCF file (above). A higher prior indicates a greater degree of confidence that the resource variants are true. At least one resource and prior pair is required, but up to four are allowed. Put "NA" for missing resource files and priors. |
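As an illustration, the handler-specific part of Config might look like the fragment below. This is a sketch under assumptions: it assumes the "#" in RESOURCE_# and PRIOR_# stands for the numbers 1 through 4, every file path is a placeholder, and the value of PRIOR_1 is an arbitrary example rather than a recommendation. Only the VR_QSUB, VR_QUEUE, and HC_PRIOR values come from the table above.

```shell
# Handler-specific variables for Variant_Recalibrator (illustrative values only)

# Scheduler settings and queue, using the recommended values from the table
VR_QSUB="mem=250gb,nodes=1:ppn=16,walltime=24:00:00"
VR_QUEUE="ram256g"

# Placeholder paths: replace with the full paths for your project
VR_REF="/path/to/reference.fasta"
VR_VCF_LIST="/path/to/chromosomal_vcf_list.txt"

# Prior for the high-confidence subset (recommended value)
HC_PRIOR="5"

# One resource/prior pair is defined; the remaining three are unused.
# The prior of 15 is an assumed example value.
RESOURCE_1="/path/to/truth_variants.vcf"
PRIOR_1="15"
RESOURCE_2="NA"
PRIOR_2="NA"
RESOURCE_3="NA"
PRIOR_3="NA"
RESOURCE_4="NA"
PRIOR_4="NA"
```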

Output

Variant_Recalibrator generates a recalibrated VCF file and a second VCF file containing only the sites that passed. The files can be found at:

# Raw output from ApplyVQSR
${OUT_DIR}/Variant_Recalibrator/${PROJECT}_snps.recalibrated.vcf.gz
# If INDEL mode was run
${OUT_DIR}/Variant_Recalibrator/${PROJECT}_indel.recalibrated.vcf.gz

# Selected pass sites only
${OUT_DIR}/Variant_Recalibrator/${PROJECT}_snps.recalibrated.pass_sites.vcf.gz
# If INDEL mode was run
${OUT_DIR}/Variant_Recalibrator/${PROJECT}_indel.recalibrated.pass_sites.vcf.gz
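To see how many sites passed and how many were flagged in the raw recalibrated output, the FILTER column (the seventh tab-separated column of a VCF record) can be tallied. The sketch below is not part of sequence_handling, and the function name `tally_filter_values` is hypothetical.

```shell
# Hypothetical helper: print each distinct FILTER value in a VCF file
# together with the number of records carrying it, one per line,
# as "value<TAB>count". Header lines (starting with '#') are skipped.
tally_filter_values() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk -F '\t' '
        !/^#/ { count[$7]++ }
        END   { for (f in count) print f "\t" count[f] }
    ' | sort
}

# Example usage (OUT_DIR and PROJECT as defined in Config):
# tally_filter_values "${OUT_DIR}/Variant_Recalibrator/${PROJECT}_snps.recalibrated.vcf.gz"
```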

Dependencies

Variant_Recalibrator depends on GATK for recalibrating the VCF and VCFtools for combining the chromosomal VCF parts. If the reference dictionary needs to be generated, Variant_Recalibrator also depends on Picard. Python3, PBS, and GNU Parallel are also required for operation. Please check the dependencies page to ensure that you are using the required version of each dependency.
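A quick way to confirm that the command-line dependencies are on the PATH is sketched below. The helper `missing_tools` is hypothetical, and the executable names in the usage example are assumptions about how the tools are installed on your system. Note that this only checks presence, not versions, so the dependencies page should still be consulted.

```shell
# Hypothetical helper: print the name of every given tool that cannot be
# found on the PATH, one per line. Prints nothing if all are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example usage (executable names are assumptions):
# missing_tools gatk vcftools python3 parallel qsub
```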