Skip to content

Create_HC_Subset

Chaochih Liu edited this page Aug 18, 2020 · 16 revisions

Basic Usage

The Create_HC_Subset handler creates a single variant call format (VCF) file that contains only the high-confidence (HC) sites for your samples. This set of high-confidence sites will be used in Variant Recalibrator. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. The process below can be run as many times as necessary to identify a set of suitable high confidence variants, just re-run this handler. This handler is set up so that any (time-consuming) step that only needs to be run one time (e.g., we only need to concatenate our raw split vcf files one time and do not need to re-do this step even upon re-running this handler) has a check to see if the file exists. If the file exists, that step will be skipped and the handler will proceed to the next step (steps with this check are noted below). For each step, if there are differences between handling VCFs called using GATK 3.8 vs. GATK 4, those will be noted accordingly. The steps are as follows:

Step 1: File preparation

GATK 4: VCF files output from Genotype_GVCFs will likely be split into multiple parts (regions). First, concatenate these split VCFs into a single raw vcf file using bcftools. Since this step only needs to be run once and is time-consuming for large datasets, the handler checks if this file already exists and proceeds to the next step if it does exist. Note: the handler only checks if the filename exists, so if you have a truncated file please delete it before re-running the handler.

GATK 3:

  • The genomic part VCF files from Genotype_GVCFs are gzipped, while preserving the original files.
  • The gzipped files are concatenated using bcftools into a single VCF file.
  • If the data is exome capture, the sites outside the exome capture region are filtered out using vcflib. If not, then nothing happens.

Step 2: Filter indels (only applies to GATK 3, in GATK 4 user will decide if they want to filter indels in the Variant_Recalibrator handler)

Insertions and deletions (indels) are filtered out using VCFtools. This step also only needs to be run once and is time-consuming for large datasets. The handler checks if the concatenated VCF file output from step 1 is larger than 200GB. If the concatenated VCF file is <200GB, the handler uses the concatenated raw VCF file output from step 1. If the concatenated VCF is >200GB, the handler works with the list of split vcf files specified in the config and runs the indel filtering in parallel. Afterward, the handler checks that all split vcf files have been processed and concatenates the *_no_indels.recode.vcf files into one file. Upon re-running the handler, the handler will check if this file already exists and proceed to the next step if it exists. If the file does not exist, the handler will generate this file.

Step 3: Percentile tables for raw VCF file

Percentile tables of DP per sample and GQ are generated for the "raw" VCF file. Again, this only needs to be run once and the handler checks if the file exists already.

Step 4: Filter low-confidence sites

A custom python3 script is run to filter out low-confidence sites based on the user parameters defined in the configuration file.

The filtering script does the following, based on the user parameters defined in the configuration file:

  • Filters out indels and sites with more than two alleles
  • If the quality score is missing or the site quality score is too low, filters out the site
  • If too many samples are heterozygous, filters out the site
  • If too many samples are "bad" (missing, low quality, or low depth), filters out the site

Step 5: Percentile tables for filtered VCF file.

Percentile tables of DP per sample and GQ are generated for the filtered VCF file.

GATK 3 specific step: Barley specific parts to pseudomolecular positions conversion

If the organism is barley, the parts positions are converted into pseudomolecular positions using a custom python3 script. If not, then nothing happens.

Note: This step is not part of the GATK 4 Create_HC_Subset because it is not necessary.

Step 6: Remove sites that aren't polymorphic

Use vcftools to remove sites that aren't polymorphic (minor allele count of 0).

Additional VCF exploration tools (optional)

If you would like to explore your VCF files to help decide which filtering criteria to use, we have a Jupyter Notebook template that can be modified for this purpose. It is located in the HelperScripts subdirectory and is called VCF_Exploration_and_Filtering_Notebook.ipynb. If you decide it is appropriate for your application, you can also use the notebook to filter your VCF in place of Create_HC_Subset.

The VCF_Exploration_and_Filtering_Notebook.ipynb requires your VCF file be in hdf5 format. In HelperScripts there is a set of scripts you can use to do this VCF to hdf5 format conversion. If you have a small VCF file, you can directly run the vcf_to_h5.py script interactively. If you need longer runtimes, the run_vcf_to_h5.sh script calls on the vcf_to_h5.py script and can be submitted as a job to run on MSI.

Run Handler

To run Create_HC_Subset, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Create_HC_Subset can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling Create_HC_Subset Config

Where Config is the full file path to the configuration file.

Handler-Specific Variables

The following are a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.

Variable Function
CHS_QSUB QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00".
CHS_VCF_LIST A list of full file paths to the chromosome part VCF files from Genotype_GVCFs. This can be generated with sample_list_generator.sh.
CAPTURE_REGIONS The full file path to the capture regions file in BED format. This should be the same file as the REGIONS_FILE in Coverage_Mapping. If not exome capture, put "NA".
CHS_DP_PER_SAMPLE_CUTOFF The depth per sample (DP) cutoff. If a sample's DP is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 5
CHS_GQ_CUTOFF The genotyping quality (GQ) cutoff. If a sample's GQ is below this threshold, it will count as a "bad" sample for that site, meaning that it is more likely that the site will be filtered out. Recommended value: 10th percentile of the raw GQ percentile table. This may involve a "guess and check" strategy and running Create_HC_Subset multiple times.
CHS_MAX_BAD The maximum number of "bad" (low GQ, low DP, or missing genotype data) samples allowed at a site. Sites with more "bad" samples than this threshold will be filtered out. Recommended value: total number of samples * 0.2 (rounded to the nearest whole number)
CHS_MAX_HET The maximum number of samples at a site that can be heterozygous. Sites with more heterozygous samples than this threshold will be filtered out. Recommended value: total number of samples * 0.9 (rounded to the nearest whole number)
CHS_QUAL_CUTOFF The site quality score (QUAL) cutoff. Sites with a QUAL below this cutoff will be excluded. Recommended value: 40

Output

Create_HC_Subset generates a VCF file containing only the high-confidence variants. The VCF file can be found at

${OUT_DIR}/Create_HC_Subset/${PROJECT}_high_confidence_subset.vcf

Dependencies

Create_HC_Subset depends on the tools listed below. This handler also calls on helper scripts that require some additional dependencies below:

Dependency Version Use
VCFtools 0.1.14 Manipulating the VCF file
vcflib 1.0.0_rc2 Manipulating the VCF file
R and package bigmemory 3.6.3 Helper script percentiles.R
GNU Parallel 20190122 Parallel processing of parts of handler
Python3 3.7.1 Manipulating intermediate files

In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.

Clone this wiki locally