Home
The MRC-IEU UK Biobank genome wide association study (GWAS) pipeline has been optimized to perform GWAS on the imputed genetic dataset of the full 500,000 from UK Biobank quickly, efficiently and in a standardized manner. The imputed data has been quality controlled* for the appropriate samples and SNPs to be included in the GWAS as detailed in this document. The pipeline lets you run a GWAS of your trait of interest using either PLINK or BOLT-LMM.
* Full details are here: https://data.bris.ac.uk/data/dataset/1ovaau5sxunp2cv8rcy88688v. Please also see the documentation associated with the pipeline (in /docs) for exact and up-to-date QC procedures.
BOLT-LMM uses a linear mixed model (LMM) to account for both relatedness and population stratification, therefore allowing a wider range of individuals to be included in terms of relatedness and ancestry at a cost of slightly longer running time.
PLINK allows for an analysis to be performed in a homogeneous and unrelated population.
All that is required to use this pipeline is for the user to supply a phenotype file containing their phenotype of interest, a covariates file and the script used to generate these files (see below for detailed instructions).
Please note: you will need to supply phenotypic data from your application to UK Biobank. Your application will also need permission to use the UK Biobank genetic data in order to use this pipeline. Therefore, any output from this pipeline must state your application number and cite this documentation.
1) Check the submission sheet to see if your phenotype of interest has already been run.
2) If not, contact the IEU Data Manager (ieu-datamanagement@bristol.ac.uk) to be added to the GWAS project on RDSF, where they will create a directory for you.
3) Add phenotype files, covariate file and scripts to input directory created above. We are providing standard covariate files. You need to copy these into your input directory adding any additional covariates that you would like in the analysis. See below for more details.
4) Check that permissions on the input files are set to read for all: chmod 744 input/*
5) Email the IEU Data Manager and ask for your data to be copied to BC4 (RDSF is not mounted on BlueCrystal4, so files currently have to be copied over manually).
6) Once the IEU Data Manager has told you that they have moved the data (and only then!), fill in the submission sheet with one row per phenotype, using the file names that are now in your input folder. When filling in the sheet, most fields in green and orange are required; do not edit the final four red fields.
- Name - Your name
- Location - A unique name that matches that in the config file of the pipeline (don't worry too much about this).
- Username - Your University user name
- Email - The email address you want alerts to be sent to
- Method - BOLT-LMM or PLINK
- Model - For PLINK only (linear or logistic)
- Phenotype name - A unique identifier for the phenotype. This will be used to create a folder for the output.
- Phenotype description - A description of the phenotype; how it was created, what is the aim...
- Biobank column ID(s) - The Biobank column IDs used to create the phenotype (optional)
- Phenotype file - The phenotype file containing the phenotype data, no spaces please. (see below for details)
- Phenotype file generation script - The script used to generate the phenotype and covariate files, no spaces please.
- Phenotype column name - The column to be used for the phenotype data
- Covariates file - The file to be used for the covariates data (can be the same as the phenotype file but then you must specify the covariates)
- Categorical covariates - The column(s) to be used for the categorical covariates (comma separated)
- Quantitative covariates - The column(s) to be used for the quantitative covariates (comma separated)
- Job status - Set to 'Hold' to hold the job back from submission, or 'Run' to run it.
7) Check the status of the job on the sheet and wait for details. An email will also be sent to Chris Raistrick and he will copy your results back to your directory on RDSF.
PLEASE NOTE - as soon as you click run and the jobs are submitted to the queue, pressing hold or deleting the fields in red will not cancel the GWAS on BlueCrystal. The 22 jobs (one per chromosome) will still be running. If you need to cancel the GWAS, please contact Ruth Mitchell or Tom Gaunt.
Both the phenotype and covariate files need to have the genetic ids (from the 8786 application). The linker file provided with your original application (previously used to link to the 150,000 genetic release) will link all your phenotype ids to the 500,000 release.
- Space delimited text files (Columns should be separated by spaces).
- The first two columns must be FID and IID (the PLINK identifiers of an individual).
FID IID phenotype1 covariate1 covariate2
123456 123456 100 1 1
234567 234567 20 2 1
345678 345678 50 3 NA
- Any number of columns may follow and values in the column should be numeric.
- Case/control phenotypes are expected to be encoded as 1=unaffected (control), 2=affected (case).
https://data.broadinstitute.org/alkesgroup/BOLT-LMM/#x1-270005.2
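The format rules above can be checked before submission with a short script. The following is a minimal sketch, not part of the pipeline: the function name validate_phenotype_lines is hypothetical, and it assumes that NA marks a missing value (as in the example rows) and that any value Python's float() accepts counts as numeric.

```python
def validate_phenotype_lines(lines):
    """Return a list of problems found in a phenotype (or covariates) file."""
    problems = []
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        if "\t" in line:
            problems.append(f"line {number}: contains a tab; columns must be space separated")
        rows.append((number, line.split()))
    if not rows:
        return ["file is empty"]

    # The header must start with the two PLINK identifier columns.
    header = rows[0][1]
    if header[:2] != ["FID", "IID"]:
        problems.append("first two columns must be FID and IID")

    for number, fields in rows[1:]:
        if len(fields) != len(header):
            problems.append(f"line {number}: expected {len(header)} fields, found {len(fields)}")
            continue
        # Every column after FID and IID should be numeric or NA (assumed missing code).
        for name, value in zip(header[2:], fields[2:]):
            if value == "NA":
                continue
            try:
                float(value)
            except ValueError:
                problems.append(f"line {number}: column {name} is not numeric: {value!r}")
    return problems


example = """FID IID phenotype1 covariate1 covariate2
123456 123456 100 1 1
234567 234567 20 2 1
345678 345678 50 3 NA
"""
print(validate_phenotype_lines(example.splitlines()))  # prints []
```

To check a real file, pass its lines instead, for example validate_phenotype_lines(open(path).read().splitlines()), where path is your own file name.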
The covariates file provided has the genetic ids of the 8786 application. Please be aware of this if adding additional covariates, and supply the phenotype and covariates files with the genetic ids.
This is in the same format as the phenotype file described above. As standard covariates, for BOLT-LMM, we are providing sex and chip; for PLINK, sex, chip and the first 10 principal components (PCs).
We recommend including the genotyping array ('chip') as a covariate, as there is evidence of a differential array effect on markers scattered across the genome. If your outcome is causally associated with lung function and smoking behaviour, we recommend performing the analysis with and without genotyping array as a covariate. Please see UK Biobank supplementary documentation (S2.3.3) for more details.
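Adding an extra covariate to a copy of the standard covariates file amounts to joining two space-delimited tables on FID and IID. The sketch below is one way to do that with the Python standard library. The function name add_covariate, the column names (sex, chip, age) and the values are illustrative assumptions, not the contents of the real files, and using NA for individuals without a value is also an assumption.

```python
def add_covariate(standard_lines, extra_lines, missing="NA"):
    """Append the columns of extra_lines to standard_lines, matching rows on (FID, IID)."""
    standard = [line.split() for line in standard_lines if line.strip()]
    extra = [line.split() for line in extra_lines if line.strip()]

    extra_header = extra[0][2:]  # column names after FID and IID
    extra_by_id = {(row[0], row[1]): row[2:] for row in extra[1:]}

    merged = [" ".join(standard[0] + extra_header)]
    for row in standard[1:]:
        # Individuals absent from the extra file get the missing code in every new column.
        values = extra_by_id.get((row[0], row[1]), [missing] * len(extra_header))
        merged.append(" ".join(row + values))
    return merged


# Illustrative data only; real files use the genetic ids of your application.
standard = ["FID IID sex chip", "123456 123456 1 1", "234567 234567 2 1"]
extra = ["FID IID age", "123456 123456 57"]
for line in add_covariate(standard, extra):
    print(line)
```

Every row of the standard file is kept, so the merged file still covers the same individuals. The new column names then go in the Categorical covariates or Quantitative covariates field of the submission sheet.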
BOLT-LMM performs a linear regression, so for case/control phenotypes the output betas need to be transformed to obtain log odds ratios using the following formula:
Edit 1/12/2017 Previously the formula for mu was not correct. It has been corrected now
logOR = beta_bolt/(mu*(1-mu))
with mu being the prevalence, mu = case/(case+control).
The standard errors are adjusted using the same method: se_bolt/(mu*(1-mu))
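The transformation above can be applied to a BOLT-LMM effect estimate and its standard error as follows. This is a minimal sketch that transcribes the formula directly. The function and argument names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
import math


def bolt_to_log_odds(beta_bolt, se_bolt, n_cases, n_controls):
    """Convert a BOLT-LMM beta and standard error to the log odds ratio scale."""
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("n_cases and n_controls must both be positive")
    mu = n_cases / (n_cases + n_controls)  # mu = case/(case+control)
    scale = mu * (1 - mu)
    return beta_bolt / scale, se_bolt / scale


# Illustrative numbers only: 1,000 cases and 9,000 controls give mu = 0.1.
log_or, log_or_se = bolt_to_log_odds(0.009, 0.0018, n_cases=1000, n_controls=9000)
odds_ratio = math.exp(log_or)  # exponentiate to report an odds ratio
print(log_or, log_or_se, odds_ratio)
```

Because the beta and the standard error are divided by the same factor, the test statistic beta/se is unchanged by the conversion.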