-
Notifications
You must be signed in to change notification settings - Fork 8
Variant_Filtering
The Variant Filtering documentation is split into two main sections: 1) Filtering for GATK 4 calls and 2) Filtering for GATK3 calls. The reason for this is there are many differences between how GATK 3 and GATK 4 calls should be handled.
Important Caveat: Variant filtering is specific to each study's goals, so while this handler can do some variant filtering we suggest thinking about filtering criteria that makes sense for your study and using your own scripts to handle this step. Here are some collections of filtering scripts (available upon publication) for inspiration:
- https://github.com/MorrellLAB/Barley_Inversions/tree/master/00_sequence_processing/Post_GATK_Filtering
- https://github.com/MorrellLAB/Selective_Sweeps/tree/master/00_sequence_processing/Post_GATK_Filtering
The Variant_Filtering
handler creates a single variant call format (VCF) file that contains only high-quality sites and genotypes for your samples. This filtering is performed in multiple steps using several different user-defined parameters with after-filtering plots (similar to the Pre_Variant_Filtering plots) generated. The steps are summarized as follows:
- Filter out low-quality sites.
- Filter out low depth and extremely high depth sites.
- Filter out low GQ sites.
- Filter on additional annotations: GQ, SOR, FS, and MQ.
- Filter sites with high proportions of missingness.
- a) Filter out highly heterozygous sites. b) Optional: if conditions are met, also filter on Excess Heterozygosity.
- Remove sites that aren't polymorphic.
- Filter to only biallelic sites.
- Generate plots following filtering (annotation graphs, missingness plots, and heterozygosity/excess heterozygosity/inbreeding coefficients plots).
To run Variant_Filtering
, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Variant_Filtering can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling
):
./sequence_handling Variant_Filtering Config
Where Config is the full file path to the configuration file.
The following are a list of variables that need to be defined within Config
. In addition to the handler-specific variables, all common variables must be defined.
Variable | Function |
---|---|
VF_QUEUE |
Queue we are using for batch submission. If using Slurm, users can specify multiple partitions that are comma-separated "small,ram256g,ram1t" . If using PBS, specify one queue "small" . |
VF_SBATCH |
If using Slurm, this is the Slurm settings for batch submission. Delete if not using Slurm. Recommended settings are "--nodes=1 --ntasks-per-node=16 --mem=110gb --tmp=22gb -t 72:00:00 --mail-type=ALL --mail-user=${EMAIL} -p ${VF_QUEUE}"
|
CHS_QSUB |
If using PBS, this is QSub settings for batch submission. Delete if not using PBS. Recommended settings are "mem=110gb,nodes=1:ppn=16,walltime=72:00:00" . |
VF_VCF |
The full filepath to the VCF file we want to filter. If you used the Variant_Recalibrator and Pre_Variant_Filtering handlers, this will be the file that ends in .recalibrated.pass_sites.vcf.gz in the Pre_Variant_Filtering directory. |
VF_QUAL_CUTOFF |
The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
VF_MIN_DP |
The minimum number of reads needed to support a genotype. Genotypes under this threshold will be set to missing. Recommended value: 5 |
VF_MAX_DP |
The maximum number of reads needed to support a genotype. Genotypes over this threshold will be filtered out. Recommended value: 95th percentile of the raw DP per sample percentile table. To determine a cutoff, please utilize the outputs from running Pre_Variant_Filtering. |
VF_GQ_CUTOFF |
The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times. |
VF_QD_CUTOFF |
Sites below this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation. |
VF_SOR_CUTOFF |
Sites above this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation. |
VF_FS_CUTOFF |
Sites above this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation. |
VF_MQ_CUTOFF |
Sites below this cutoff will be filtered out. Please see GATK's documentation under the VariantsToTable function for details on this annotation. |
VF_MISS |
The proportion of missingness we are willing to tolerate. Sites with a missingness greater than this threshold will be filtered out. |
VF_MAX_HET |
The maximum proportion of heterozygous genotypes allowed per site. Sites with a proportion above this threshold will be filtered out. |
VF_GEN_NUM |
The number of genomic regions to subset for graphing annotation purposes. |
VF_GEN_LEN |
The length of genomic regions to subset for graphing annotation purposes. |
Optional: | The following variables for filtering have caveats and will only be used when filtering if the conditions are met. However, they must be filled out. |
VF_DIPLOID |
Are the samples diploid? Valid options are 1) true and 2) false
|
VF_EXCESS_HET |
If we have diploid samples (i.e., VF_DIPLOID="true" ), what excess heterozygosity cutoff should we use? Refer to the Pre_Variant_Filtering plots to determine an appropriate cutoff. See ExcessHet documentation for details. Important: If not using this filter, put "NA" (i.e., VF_EXCESS_HET="NA" ). This variable cannot be left blank, doing so may cause other unexpected errors to occur. |
Variant_Filtering generates a finished VCF file. The VCF file can be found at
${OUT_DIR}/Variant_Filtering/${PROJECT}_final.vcf
Files in the subdirectory ${OUT_DIR}/Variant_Filtering/Intermediates
are kept in case the user wants to do additional troubleshooting, but these can eventually be deleted once the user is happy with the ${OUT_DIR}/Variant_Filtering/${PROJECT}_final.vcf
file.
Variant_Filtering depends on the following software:
Dependency | Use |
---|---|
bcftools 1.10.2 | Most filtering steps are done using bcftools. |
GATK 4.1.2 | Used for excess heterozygosity filtering and generating variant tables for plots. |
R 3.6.3 (or greater) | Used for 1) plotting annotation distributions and 2) plotting individual and site missingness. Requires R packages: ggplot and gridExtra
|
vcftools 0.1.14 | Used to generate files with individual and site missingness info for visualizing |
Next: Variant_Analysis
The Variant_Filtering handler creates a single variant call format (VCF) file that contains only high-quality sites and genotypes for your samples. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. The steps are as follows:
- Sites that have been tagged as false positives by the Variant_Recalibrator handler are filtered out.
- If the data is exome capture, the sites outside the exome capture region are filtered out using vcflib. If not, then nothing happens.
- Insertions and deletions (indels) are filtered out using VCFtools.
- Percentile tables of DP per sample and GQ are generated for the "raw" VCF file.
- A custom python3 script is run to filter out genotypes with unbalanced heterozygotes or read counts based on the user parameters defined in the configuration file.
- Monomorphic sites generated by the last step are filtered out using VCFtools.
- Percentile tables of DP per sample and GQ are generated for the heterozygote-balanced VCF file.
- A custom python3 script is run to filter out low-quality sites based on the user parameters defined in the configuration file.
- Percentile tables of DP per sample and GQ are generated for the finished VCF file.
In step five, the filtering script does the following, based on the user parameters defined in the configuration file:
- Genotypes that are under the minimum number of reads cutoff or over the maximum number of reads cutoff are set to missing.
- Samples with a unbalanced heterozygote call are set to missing, where unbalanced is defined by the user as above or below a certain deviation from 50/50 reference/alternative reads.
- Genotypes below the GQ or DP thresholds are set to missing.
In step eight, the filtering script does the following, based on the user parameters defined in the configuration file:
- Filters out indels and sites with more than two alleles
- If the quality score is missing or the site quality score is too low, filters out the site
- If too many samples are heterozygous, filters out the site
- If too many samples are "bad" (missing, low quality, or low depth), filters out the site
To run Variant_Filtering, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Variant_Filtering can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling
):
./sequence_handling Variant_Filtering Config
Where Config is the full file path to the configuration file.
The following are a list of variables that need to be defined within Config
. In addition to the handler-specific variables, all common variables must be defined.
Variable | Function |
---|---|
CHS_QSUB |
QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00" . |
VF_VCF |
The full file path to the recalibrated VCF file from Variant_Recalibrator. If you used Variant_Recalibrator, leave as the default VF_VCF=${OUT_DIR}/Variant_Recalibrator/${PROJECT}_recalibrated.vcf . |
VF_CAPTURE_REGIONS |
The full file path to the capture regions file in BED format. For barley, use the full pseudomolecular BED file here, not the parts BED file. If not exome capture, put "NA". |
MIN_DP |
The minimum number of reads needed to support a genotype. Genotypes under this threshold will be set to missing. Recommended value: 5 |
MAX_DP |
The maximum number of reads needed to support a genotype. Genotypes over this threshold will be set to missing. Recommended value: 95th percentile of the raw DP per sample percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times. |
MAX_DEV |
The maximum percent deviation from 50/50 reference/alternative reads allowed in heterozygotes. For example, MAX_DEV=0.1 allows 60/40 ref/alt and also 40/60 ref/alt but not 70/30 or 30/70 ref/alt reads. Recommended value: 0.1 |
DP_PER_SAMPLE_CUTOFF |
The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5 |
GQ_CUTOFF |
The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Variant_Filtering multiple times. |
MAX_BAD |
The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number) |
MAX_HET |
The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number) |
QUAL_CUTOFF |
The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40 |
Variant_Filtering generates a finished VCF file. The VCF file can be found at
${OUT_DIR}/Variant_Filtering/${PROJECT}_final.vcf
Variant_Filtering depends on VCFtools and vcflib for manipulating the VCF file. Variant_Filtering also calls on helper scripts that require R and Python3. In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.
Next: Variant_Analysis
- Getting Started
- Recommended Workflow
- Configuration
- Dependencies
- sample_list_generator.sh
- Slurm specific options
- Common Problems and Errors