PLINK QC pipeline

Scott.Hazelhurst edited this page May 28, 2017 · 7 revisions

The main pipeline is the PLINK QC pipeline. It takes PLINK bed, bim and fam files as input and performs quality control on the data according to the parameters specified in the config file.

The Nextflow script file is called plink-qc.nf. This could be called, for example, by running nextflow run plink-qc.nf. More details of running options can be found in Pipeline options.

The output of the QC is a set of PLINK files that can be used for GWAS, as well as a PDF report that describes the QC steps.

Input

There are two possible inputs to this pipeline: a set of PLINK files, or the calls in Illumina TOP/BOTTOM format.

PLINK format

We expect that most users will run the pipeline giving as input PLINK 1.9 bed, bim and fam files. In this mode, the pipeline is capable of doing QC on any number of input file sets. The key Nextflow parameters to set are:

  • work_dir : the directory in which you will run the workflow. This will typically be the h3agwas directory which you cloned;
  • input, output and script directories: the default is that these are subdirectories of the work_dir and there'll seldom be reason to change these;
  • input_pat : this will typically be the base name of the PLINK files you want to process (i.e., do not include the file suffix), but you could put any Unix-style glob here. The workflow will match files in the relevant input_dir directory;
  • high_ld_regions_fname: this is optional. It is a list of regions that are in very high LD and are excluded when checking for relationships (https://www.cog-genomics.org/plink/1.9/filter#mrange_id).
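
To make the input_pat behaviour concrete, here is a minimal Python sketch of how a base name or Unix-style glob could select complete bed/bim/fam sets. It is an illustration only and not the workflow's own matching code; the function name match_plink_sets and the file names are hypothetical.

```python
import fnmatch

def match_plink_sets(filenames, input_pat):
    """Return base names matching input_pat for which .bed, .bim and .fam all exist."""
    suffixes = (".bed", ".bim", ".fam")
    found = {}
    for name in filenames:
        for suffix in suffixes:
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                # input_pat is compared against the base name, without the suffix
                if fnmatch.fnmatchcase(base, input_pat):
                    found.setdefault(base, set()).add(suffix)
    # keep only complete file sets
    return sorted(base for base, seen in found.items() if seen == set(suffixes))

# Hypothetical contents of the input directory
files = [
    "sampleA.bed", "sampleA.bim", "sampleA.fam",
    "sampleB.bed", "sampleB.bim", "sampleB.fam",
    "other.bed", "notes.txt",
]

print(match_plink_sets(files, "sampleA"))  # a plain base name selects one file set
print(match_plink_sets(files, "sample*"))  # a glob selects several file sets
```

A plain base name selects a single file set, while a glob such as sample* selects every complete set whose base name matches.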

Illumina TOP/BOTTOM format

If your data is given in Illumina TOP/BOTTOM format, then this option can be used. You need to run the workflow with the --topbot option, for example

nextflow run plink-qc.nf --topbot

The details of the other options that must be set can be found in Converting-from-Illumina-Top-Bottom.

The conversion from TOP/BOTTOM format is very time-consuming. Since the plink-qc.nf workflow is likely to be run several times, we recommend running the topbottom.nf workflow separately to do the conversion and then running the plink-qc.nf workflow.

Overview of the workflow

The QC process consists of:

  • removing duplicate markers (typically if there are tri-allelic SNPs);
  • identifying individuals for whom there is discordant sex information;
  • removing individuals with too high missingness or excessive heterozygosity;
  • detecting whether there are any related individuals and removing enough of them to ensure that no related pairs remain;
  • removing SNPs with too low MAF, or too high missingness, or anomalous HWE, or SNPs where there is a high differential missingness between cases and controls;
  • computing a PCA of the resultant data;
  • producing a detailed report of the QC process.
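
The relatedness step above ("removing enough to ensure that no related pairs remain") can be sketched as a greedy procedure. This is one plausible approach and not necessarily what the pipeline or PLINK does; the function name, the sample IDs, the relatedness values and the 0.11 threshold are all hypothetical.

```python
from collections import defaultdict

def individuals_to_remove(pairs, pi_hat):
    """Greedily choose individuals to drop so no pair exceeds pi_hat.

    pairs: iterable of (id1, id2, relatedness) tuples.
    """
    neighbours = defaultdict(set)
    for a, b, value in pairs:
        if value > pi_hat:  # only pairs above the threshold count as related
            neighbours[a].add(b)
            neighbours[b].add(a)

    removed = set()
    while any(neighbours.values()):
        # Drop the individual with the most related partners;
        # sorting first makes ties resolve to the smallest ID.
        worst = max(sorted(neighbours), key=lambda i: len(neighbours[i]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return removed

# Hypothetical pairwise relatedness estimates
pairs = [
    ("A", "B", 0.50),
    ("A", "C", 0.30),
    ("B", "C", 0.05),
    ("D", "E", 0.25),
]
print(sorted(individuals_to_remove(pairs, pi_hat=0.11)))
```

Here A is related to both B and C, so removing A alone resolves two pairs; one of D and E must also go.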

Parameters

The following parameters can be set

QC Parameters

Parameters for describing input and output -- these are explained fully above

  • sexinfo_available: TRUE or FALSE. If we don't have sex information then we cannot do the check for discordant sex information;
  • cut_het_high: the maximum allowable heterozygosity for individuals;
  • cut_het_low: the minimum allowable heterozygosity for individuals;
  • cut_maf : the minimum minor allele frequency a SNP must have to be included
  • cut_diff_miss : allowable differential missingness between cases and controls;
  • cut_geno: maximum allowable per-SNP missingness
  • cut_mind: maximum allowable per-individual missingness
  • cut_hwe: minimum allowable per-SNP Hardy-Weinberg Equilibrium p-value
  • pi_hat: maximum allowable relatedness

These can be set in the nextflow.config file.
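
To illustrate the direction of each cut-off (minimum versus maximum), here is a small Python sketch that applies the thresholds to precomputed statistics. It is not the pipeline's code: the threshold values, the SNP names and the choice to treat boundary values as passing are assumptions made for the example.

```python
def snp_passes(maf, missingness, hwe_p, cut_maf, cut_geno, cut_hwe):
    """A SNP is kept if MAF is high enough, missingness low enough, HWE p-value high enough."""
    return maf >= cut_maf and missingness <= cut_geno and hwe_p >= cut_hwe

def individual_passes(missingness, het, cut_mind, cut_het_low, cut_het_high):
    """An individual is kept if missingness is low enough and heterozygosity is within range."""
    return missingness <= cut_mind and cut_het_low <= het <= cut_het_high

# Hypothetical threshold values, chosen only for illustration
thresholds = dict(cut_maf=0.01, cut_geno=0.05, cut_hwe=0.0001)

# Hypothetical per-SNP statistics: (MAF, missingness, HWE p-value)
snps = {
    "snp1": (0.20, 0.01, 0.5),    # passes all three checks
    "snp2": (0.005, 0.01, 0.5),   # MAF too low
    "snp3": (0.20, 0.10, 0.5),    # missingness too high
    "snp4": (0.20, 0.01, 1e-8),   # HWE p-value too low
}

kept = [name for name, stats in snps.items() if snp_passes(*stats, **thresholds)]
print(kept)
```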

Performance parameters

There are three parameters that are important for controlling performance:

  • plink_process_memory : specify in MB or GB how much memory your processes that use PLINK require;
  • other_process_memory : specify how much other processes need;
  • max_plink_cores : specify how many cores your PLINK processes can use. (This is only for those PLINK operations that are parallelisable. Some processes can't be parallelised and our workflow is designed so that for those processes only one core is used).

Output

A PDF report can be found in the output directory. This describes the process as well as what the inputs and outputs were.

Note that one issue that sometimes occurs in analysis is that, over time, there may be multiple copies of the same file, perhaps with minor differences. To help with version control, the PDF report captures the md5 checksums of inputs and outputs.
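
If you want to compare a file on disk against a checksum listed in the report, an md5 checksum can be computed independently. The following is a minimal sketch using Python's standard hashlib module; the function name md5sum and the example file name are hypothetical, and this is not the code the workflow itself uses.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example usage (hypothetical file name):
# print(md5sum("sampleA.bed"))
```

Two copies of a file with identical contents give the same checksum, so a mismatch with the value in the report indicates that the file differs from the one the QC was run on.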