Genotype imputation and quality control workflow used by the eQTL Catalogue.
Performs the following main steps:
Pre-imputation QC:
- Convert raw array genotypes to GRCh38 coordinates with CrossMap.py v0.4.1
- Align array genotypes to the 1000 Genomes 30x on GRCh38 reference panel with Genotype Harmonizer.
- Convert the genotypes to the VCF format with PLINK.
- Use bcftools to exclude variants with Hardy-Weinberg p-value < 1e-6, missingness > 0.05 or minor allele frequency < 0.01.
- Calculate individual-level missingness using vcftools.
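
To make the variant-level thresholds concrete, here is a minimal Python sketch of the filter logic. It is not the workflow's own code (the workflow uses bcftools), and it assumes the three conditions are independent filters, so a variant is dropped if it fails any one of them. The `VariantStats` container and the example variant IDs are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as stated in the pre-imputation QC list.
HWE_P_MIN = 1e-6
MISSINGNESS_MAX = 0.05
MAF_MIN = 0.01


@dataclass
class VariantStats:
    """Per-variant summary statistics (hypothetical container for illustration)."""
    variant_id: str
    hwe_p: float        # Hardy-Weinberg equilibrium p-value
    missingness: float  # fraction of samples with a missing genotype
    maf: float          # minor allele frequency


def passes_pre_imputation_qc(v: VariantStats) -> bool:
    """Keep a variant only if it fails none of the three exclusion filters."""
    return (
        v.hwe_p >= HWE_P_MIN
        and v.missingness <= MISSINGNESS_MAX
        and v.maf >= MAF_MIN
    )


# Toy data: one clean variant and one variant failing each filter.
variants = [
    VariantStats("rs_ok", hwe_p=0.4, missingness=0.01, maf=0.20),
    VariantStats("rs_hwe", hwe_p=1e-9, missingness=0.01, maf=0.20),
    VariantStats("rs_missing", hwe_p=0.4, missingness=0.10, maf=0.20),
    VariantStats("rs_rare", hwe_p=0.4, missingness=0.01, maf=0.001),
]
kept = [v.variant_id for v in variants if passes_pre_imputation_qc(v)]
print(kept)  # ['rs_ok']
```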
Imputation:
- Genotype pre-phasing with Eagle 2.4.1
- Genotype imputation with Minimac4
Post-imputation QC:
- Exclude variants with imputation R2 < 0.4
- Keep variants on chromosomes 1-22 and X
- Keep variants with MAF > 0.01
- Multiply genotype dosage of male samples on the non-PAR region of the X chromosome by two for easier QTL mapping
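
The post-imputation steps can be illustrated with a short Python sketch. It is an illustration only, not the workflow's implementation. It assumes that male dosages on the non-PAR region are reported on a haploid 0-1 scale, so doubling puts them on the same 0-2 scale as female dosages. The PAR boundaries are the commonly quoted GRCh38 coordinates and should be checked against the reference in use. All function names are hypothetical.

```python
# GRCh38 pseudoautosomal regions on chrX (1-based, inclusive).
# Assumption of this sketch: verify against the reference build in use.
PAR_REGIONS = ((10_001, 2_781_479), (155_701_383, 156_030_895))
KEPT_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X"}


def normalise_chrom(chrom):
    """Strip an optional 'chr' prefix so 'chrX' and 'X' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def passes_post_imputation_qc(chrom, r2, maf):
    """Keep variants on chromosomes 1-22 and X with R2 >= 0.4 and MAF > 0.01."""
    return normalise_chrom(chrom) in KEPT_CHROMOSOMES and r2 >= 0.4 and maf > 0.01


def in_par(pos):
    """Return True if a chrX position lies in PAR1 or PAR2."""
    return any(start <= pos <= end for start, end in PAR_REGIONS)


def harmonised_dosage(chrom, pos, dosage, is_male):
    """Double male dosage on non-PAR chrX; leave every other dosage unchanged."""
    if normalise_chrom(chrom) == "X" and is_male and not in_par(pos):
        return 2.0 * dosage
    return dosage


print(harmonised_dosage("X", 50_000_000, 0.5, is_male=True))   # 1.0 (non-PAR, male)
print(harmonised_dosage("X", 50_000_000, 0.5, is_male=False))  # 0.5 (female)
print(harmonised_dosage("X", 1_000_000, 0.5, is_male=True))    # 0.5 (inside PAR1)
```

After doubling, a male carrying the alternative allele on non-PAR X has dosage 2 rather than 1, so effect sizes estimated in a QTL model are comparable between sexes.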
--bfile
Raw genotypes in PLINK format (bed, bim, fam). Assumed to be in GRCh37 coordinates. Genotypes in VCF format can be converted to PLINK format with:
plink --vcf <path_to_vcf_file> --make-bed --out <plink_file_prefix>
The PAR and non-PAR regions of the X chromosome should be merged together and the name of the X chromosome should be 'X'. This can be achieved with PLINK:
plink --bfile Young_2019 --merge-x --make-bed --output-chr MT --out Young_2019_mergedX
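
Before starting the workflow it can be useful to confirm that the .bim file really uses 'X' and no longer contains a separate PAR code. The Python sketch below is a hypothetical helper, not part of the workflow. It assumes the standard whitespace-separated .bim layout with the chromosome code in the first column, and treats PLINK's numeric codes 23 (X) and 25 (XY) as well as 'XY' and 'chrX' as unexpected.

```python
import tempfile
from pathlib import Path

# Codes suggesting that X is not named 'X' or that the PAR is still split out.
# 23 and 25 are PLINK's numeric codes for X and XY.
UNEXPECTED_X_CODES = {"23", "25", "XY", "chrX"}


def chromosome_codes(bim_path):
    """Collect the distinct chromosome codes from the first column of a .bim file."""
    codes = set()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                codes.add(fields[0])
    return codes


def x_naming_problems(codes):
    """Return the unexpected X-related codes found, sorted for stable output."""
    return sorted(codes & UNEXPECTED_X_CODES)


# Toy .bim file in which the PAR has not been merged into X.
with tempfile.TemporaryDirectory() as tmp:
    bim = Path(tmp) / "toy.bim"
    bim.write_text(
        "1\trs1\t0\t1000\tA\tG\n"
        "X\trs2\t0\t5000000\tC\tT\n"
        "XY\trs3\t0\t200000\tG\tA\n"
    )
    problems = x_naming_problems(chromosome_codes(bim))

print(problems)  # ['XY']
```

An empty list means the chromosome codes look as expected. Any entry indicates that the PLINK merge step has not yet been applied or that a different chromosome naming scheme is in use.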
Imputing genotypes from the open access CEDAR dataset.
nextflow run main.nf \
-profile eqtl_catalogue -resume \
--bfile plink_genimpute/CEDAR \
--output_name CEDAR \
--outdir CEDAR \
--impute_PAR true \
--impute_non_PAR true
- Ralf Tambets
- Kaur Alasoo
- Liina Anette Pärtel
- Mark-Erik Kodar