Skip to content

Variant_Filtering

Chaochih Liu edited this page Jun 2, 2022 · 6 revisions

The Variant Filtering documentation is split into two main sections: 1) Filtering for GATK 4 calls and 2) Filtering for GATK3 calls. The reason for this is there are many differences between how GATK 3 and GATK 4 calls should be handled.

Important Caveat: Variant filtering is specific to each study's goals, so while this handler can do some variant filtering we suggest thinking about filtering criteria that makes sense for your study and using your own scripts to handle this step. Here are some collections of filtering scripts (available upon publication) for inspiration:

Navigation

  1. Variant Filtering GATK 4 calls
  2. Variant Filtering GATK 3 calls

Variant Filtering GATK 4 calls

Basic Usage

The Variant_Filtering handler creates a single variant call format (VCF) file that contains only high-quality sites and genotypes for your samples. This filtering is performed in multiple steps using several different user-defined parameters with after-filtering plots (similar to the Pre_Variant_Filtering plots) generated. The steps are summarized as follows:

  1. Filter out low-quality sites.
  2. Filter out low depth and extremely high depth sites.
  3. Filter out low GQ sites.
  4. Filter on additional annotations: GQ, SOR, FS, and MQ.
  5. Filter sites with high proportions of missingness.
  6. a) Filter out highly heterozygous sites. b) Optional: if conditions are met, also filter on Excess Heterozygosity.
  7. Remove sites that aren't polymorphic.
  8. Filter to only biallelic sites.
  9. Generate plots following filtering (annotation graphs, missingness plots, and heterozygosity/excess heterozygosity/inbreeding coefficients plots).

To run Variant_Filtering, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Variant_Filtering can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling Variant_Filtering Config

Where Config is the full file path to the configuration file.

Handler-Specific Variables

The following are a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.

Variable Function
VF_QUEUE Queue we are using for batch submission. If using Slurm, users can specify multiple partitions that are comma-separated "small,ram256g,ram1t". If using PBS, specify one queue "small".
VF_SBATCH If using Slurm, this is the Slurm settings for batch submission. Delete if not using Slurm. Recommended settings are "--nodes=1 --ntasks-per-node=16 --mem=110gb --tmp=22gb -t 72:00:00 --mail-type=ALL --mail-user=${EMAIL} -p ${VF_QUEUE}"
CHS_QSUB If using PBS, this is QSub settings for batch submission. Delete if not using PBS. Recommended settings are "mem=110gb,nodes=1:ppn=16,walltime=72:00:00".
VF_VCF The full filepath to the VCF file we want to filter. If you used the Variant_Recalibrator and Pre_Variant_Filtering handlers, this will be the file that ends in .recalibrated.pass_sites.vcf.gz in the Pre_Variant_Filtering directory.
VF_QUAL_CUTOFF The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40
VF_MIN_DP The minimum number of reads needed to support a genotype. Genotypes under this threshold will be set to missing. Recommended value: 5
VF_MAX_DP The maximum number of reads needed to support a genotype. Genotypes over this threshold will be filtered out. Recommended value: 95th percentile of the raw DP per sample percentile table. To determine a cutoff, please utilize the outputs from running Pre_Variant_Filtering.
VF_GQ_CUTOFF The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times.
VF_QD_CUTOFF Sites below this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation.
VF_SOR_CUTOFF Sites above this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation.
VF_FS_CUTOFF Sites above this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation.
VF_MQ_CUTOFF Sites below this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation.
VF_MISS The proportion of missingness we are willing to tolerate. Sites with a missingness greater than this threshold will be filtered out.
VF_MAX_HET The maximum proportion of heterozygous genotypes allowed per site. Sites with a proportion above this threshold will be filtered out.
VF_GEN_NUM The number of genomic regions to subset for graphing annotation purposes.
VF_GEN_LEN The length of genomic regions to subset for graphing annotation purposes.
Optional: The following variables for filtering have caveats and will only be used when filtering if the conditions are met. However, they must be filled out.
VF_DIPLOID Are the samples diploid? Valid options are 1) true and 2) false
VF_EXCESS_HET If we have diploid samples (i.e., VF_DIPLOID="true"), what excess heterozygosity cutoff should we use? Refer to the Pre_Variant_Filtering plots to determine an appropriate cutoff. See ExcessHet documentation for details. Important: If not using this filter, put "NA" (i.e., VF_EXCESS_HET="NA"). This variable cannot be left blank, doing so may cause other unexpected errors to occur.

Output

Variant_Filtering generates a finished VCF file. The VCF file can be found at

${OUT_DIR}/Variant_Filtering/${PROJECT}_final.vcf

Files in the subdirectory ${OUT_DIR}/Variant_Filtering/Intermediates are kept in case the user wants to do additional troubleshooting, but these can eventually be deleted once the user is happy with the ${OUT_DIR}/Variant_Filtering/${PROJECT}_final.vcf file.

Dependencies

Variant_Filtering depends on the following software:

Dependency Use
bcftools 1.10.2 Most filtering steps are done using bcftools.
GATK 4.1.2 Used for excess heterozygosity filtering and generating variant tables for plots.
R 3.6.3 (or greater) Used for 1) plotting annotation distributions and 2) plotting individual and site missingness. Requires R packages: ggplot and gridExtra
vcftools 0.1.14 Used to generate files with individual and site missingness info for visualizing

Variant Filtering GATK 3 calls

Basic Usage

The Variant_Filtering handler creates a single variant call format (VCF) file that contains only high-quality sites and genotypes for your samples. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. The steps are as follows:

  1. Sites that have been tagged as false positives by the Variant_Recalibrator handler are filtered out.
  2. If the data is exome capture, the sites outside the exome capture region are filtered out using vcflib. If not, then nothing happens.
  3. Insertions and deletions (indels) are filtered out using VCFtools.
  4. Percentile tables of DP per sample and GQ are generated for the "raw" VCF file.
  5. A custom python3 script is run to filter out genotypes with unbalanced heterozygotes or read counts based on the user parameters defined in the configuration file.
  6. Monomorphic sites generated by the last step are filtered out using VCFtools.
  7. Percentile tables of DP per sample and GQ are generated for the heterozygote-balanced VCF file.
  8. A custom python3 script is run to filter out low-quality sites based on the user parameters defined in the configuration file.
  9. Percentile tables of DP per sample and GQ are generated for the finished VCF file.

In step five, the filtering script does the following, based on the user parameters defined in the configuration file:

  • Genotypes that are under the minimum number of reads cutoff or over the maximum number of reads cutoff are set to missing.
  • Samples with a unbalanced heterozygote call are set to missing, where unbalanced is defined by the user as above or below a certain deviation from 50/50 reference/alternative reads.
  • Genotypes below the GQ or DP thresholds are set to missing.

In step eight, the filtering script does the following, based on the user parameters defined in the configuration file:

  • Filters out indels and sites with more than two alleles
  • If the quality score is missing or the site quality score is too low, filters out the site
  • If too many samples are heterozygous, filters out the site
  • If too many samples are "bad" (missing, low quality, or low depth), filters out the site

To run Variant_Filtering, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Variant_Filtering can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling Variant_Filtering Config

Where Config is the full file path to the configuration file.

Handler-Specific Variables

The following are a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.

Variable Function
CHS_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
VF_VCF The full file path to the recalibrated VCF file from Variant_Recalibrator. If you used Variant_Recalibrator, leave as the default VF_VCF=${OUT_DIR}/Variant_Recalibrator/${PROJECT}_recalibrated.vcf.
VF_CAPTURE_REGIONS The full file path to the capture regions file in BED format. For barley, use the full pseudomolecular BED file here, not the parts BED file. If not exome capture, put "NA".
MIN_DP The minimum number of reads needed to support a genotype. Genotypes under this threshold will be set to missing. Recommended value: 5
MAX_DP The maximum number of reads needed to support a genotype. Genotypes over this threshold will be set to missing. Recommended value: 95th percentile of the raw DP per sample percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times.
MAX_DEV The maximum percent deviation from 50/50 reference/alternative reads allowed in heterozygotes. For example, MAX_DEV=0.1 allows 60/40 ref/alt and also 40/60 ref/alt but not 70/30 or 30/70 ref/alt reads. Recommended value: 0.1
DP_PER_SAMPLE_CUTOFF The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5
GQ_CUTOFF The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times.
MAX_BAD The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number)
MAX_HET The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number)
QUAL_CUTOFF The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40

Output

Variant_Filtering generates a finished VCF file. The VCF file can be found at

${OUT_DIR}/Variant_Filtering/${PROJECT}_final.vcf

Dependencies

Variant_Filtering depends on VCFtools and vcflib for manipulating the VCF file. Variant_Filtering also calls on helper scripts that require R and Python3. In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.