Scripts for the INTERVENE GxE SES and complex diseases project

The analyses are broken up into Part 1, general data preparation for individual-level analyses in each Biobank, Part 2, individual-level analyses with Educational Attainment and Occupation in each Biobank, Part 3, meta-analyses done on summary statistics to draw conclusions across biobank studies, and Part 4, scripts to creates the (supplemental) tables and figures as included in the manuscript describing this project.

Dependencies

These scripts assume you have plink-1.9 and R v4.3.2 or higher installed on your biobank computing system. Required R libraries: data.table, foreach, doParallel, lubridate, tidyverse/dplyr, plyr, forcats, stringr, survival, metafor, googledrive, googlesheets4, ggplot2, viridis, cowplot, grid, gridExtra, and RColorBrewer.

Contact

Please contact Fiona Hagenbeek (fiona.hagenbeek@helsinki.fi) if you have any questions.

Part 1: Biobank-specific general data preparation

Step 1: Define phenotypes

Step 1a: Run UKBPhenotyper.R

The script in this step is identical to the INTERVENE flagship project, but we retain a 19th phenotype for analysis (Alcohol Use Disorder).
Run UKBPhenotyper.R to define the phenotypes required for these analyses. Note, this will require you to also download the file UKBB_definitions_demo_TEST.csv. For more information on running this script see the flagship README.
Phenotypes of interest after running this script are (you can subset the file to just these):

C3_CANCER (All cancers)
K11_APPENDACUT (Appendicitis)
J10_ASTHMA (Asthma)
I9_AF (Atrial fibrillation)
C3_BREAST (Breast cancer
I9_CHD (Coronary heart disease)
C3_COLORECTAL (Colorectal cancer)
G6_EPLEPSY (Epilepsy)
GOUT (Gout)
COX_ARTHROSIS (Hip osteoarthritis)
KNEE_ARTHROSIS (Knee osteoarthritis)
F5_DEPRESSIO (Major depressive disorder)
C3_MELANOMA_SKIN (Malignant skin melanoma)
C3_PROSTATE (Prostate cancer)
RHEUMA_SEROPOS_OTH (Rheumatoid arthritis)
T1D (Type 1 diabetes)
T2D (Type 2 diabetes)
C3_BRONCHUS_LUNG (Lung cancer)
AUD_SWEDISH (Alcohol use disorder) = new

Step 1b: Construct the phenotype file

The phenotype file will include all variables required to run the analyses. The list of required variables can be found here.
Note, you can currently ignore the requirement to define smoking. Education can either be mapped to ISCED 1997 or to ISCED 2011 (the relevant script will recode to 1997), an overview of the mapping of raw educational attainment to ISCED codes and the final dichotomized Educational Attainment variable used in the analyses for all Biobanks can be found here. Mapped Occupation information can either be added to the phenotype file or be read-in from a separate file. Examples of the occupation mapping in FinnGen and UK Biobank can be found here.

Step 2: Download the adjusted summary statistics

Download the pre-adjusted summary statistics created using MegaPRS that correspond to the genome build of your Biobank. All pre-adjusted summary statistics can be found here. Email Fiona Hagenbeek (fiona.hagenbeek@helsinki.fi) for access.

hg19 pre-adjusted summary statistics can be found here and hg38 files here.
- Hg19 contains rsids and hg38 contains variant IDs in the CHR_POS_REF_ALT format.

Step 3: Merge variant IDs to match those within the .bim file

Note, only to be performed if the genome build is hg38 and the variant ID structure is CHR_POS_REF_ALT
The script in this step is identical to the INTERVENE flagship project.
Run hg38_biobankadjustments.R to select the correct CHR_POS_REF_ALT based on the .bim of your biobank. For more information on running this script see the flagship README.

Step 4: Compute polygenic risk scores (PRSs) with Plink

The scripts in this step are identical to the INTERVENE flagship project.

If your Plink files are not split by chromosome

Run GeneratePRS.sh to create the PRSs for your Biobank. For more information on running this script see the flagship README.

If your Plink files are split by chromosome

Step 4a: run GeneratePRS_IndividualChr.sh

Run GeneratePRS_IndividualChr.sh to create the PRSs per chromosome for your Biobank. For more information on running this script see the flagship README.

Step 4b: run PRSSummationOverChr.R

Run PRSSummationOverChr.R to sum over the per chromosome PRSs to create a total PRS for each trait in your Biobank. For more information on running this script see the flagship README.

Step 5: Construct combined phenotype and polygenic risk score (PRS) R objects

Step 5a: Educational Attainment

Run DataPrep_EducationalAttainment.R to create the EA-specific sample for each trait in your Biobank. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 64 - specify phenotype file location + filename
Lines 67 + 76 - specify the location of the folder containing PRS weights
Line 81 - please replace the identifier name with that used in your biobank.
Lines 94-95 - if Educational Attainment has not been converted to ISCED 1997 from ISCED 2011, replace "EDUCATION_11" in the code that creates the factor with the naming convention for ISCED 2011 education in your biobank; if Educational Attainment has already been converted to ISCED 1997 instead of ISCED 2011, replace it (if applicable) with code to make the ISCED 1997 variable a factor:

pheno$ISCED97 <- factor(pheno$ISCED97, levels = c(1,2,3,4,5,6), # remove the ISCED 1997 levels not available in your biobank
                    labels = c("ISCED 1","ISCED 2","ISCED 3","ISCED 4","ISCED 5", "ISCED 6)) # remove the ISCED 1997 not available in your biobank

Lines 98-123 - Run if ISCED 2011 has not yet been recoded to ISCED 1997 (otherwise out-comment); remove the ISCED 1997 levels not available in your biobank
Lines 132-134 - remove the ISCED 1997 levels not available in your biobank
Line 150 - please replace the identifier names with those used in your biobank.
Line 154 - if your biobank contains individuals of non-European ancestry/those that have principal components calculated for NON-EUROPEAN ancestry, i.e. within ancestry principal components, not global genetic principal components, please add code to only retain individuals of European ancestry after this line, for example,:

pheno <- subset(pheno, ANCESTRY=='EUR')

Lines 168-224 - Code assumes you have kept the same shorthand names for the phenotypes as within FinnGen (column B) and you have kept the same naming structure for the PRS files as when you downloaded them. Please adjust the names of the standard covariates before running this code if the current names do not match the naming convention in your biobank and add additional (technical) covariates as required. Remove any of the traits not applicable in your biobank (i.e., if the biobank was included in GWAS the summary statistics were based on, see Supplementary Table 4 of the INTERVENE flagship preprint.
Line 261, 266, 278, 286, 296, 333, 340, 350, 353, 356, 412, 415, 418, 421, 426, 429, 432 + 435 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 300 + 305 - add additional (technical) covariates if required
Line 324-328 - If none of the PRSs have been flipped (e.g., all associations are positive) then you must out-comment these lines.
Lines 366-373 - remove birth decades not present in your biobank and add birth decades not included in the code that are included in your biobank. 16. Before writing the output to file, please check whether each subgroup for each trait has >=5 individuals! Remove traits from the list if <5 individuals in a subgroup, e.g., with the following code:

Listname[c(x,y,z)] <- NULL # where x, y, and z are the shorthand names for the phenotypes as in FinnGen

Lines 443, 445, 447 + 449 - specify the location you want to save the .Rdata files.

Output files are "*_INTERVENE_EducationalAttainment_dat.RData", "*_INTERVENE_PGSgroup1_EducationalAttainment_dat.RData", "*_INTERVENE_PGSgroup2_EducationalAttainment_dat.RData", and "*_INTERVENE_PGSgroup3_EducationalAttainment_dat.RData".

Step 5b: Occupation

Run DataPrep_Occupation.R to create the occupation-specific sample for each trait in your Biobank. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 64 - specify phenotype file location + filename
Line 68 - If not already included in the phenotype file, specify occupation file location + filename. Otherwise, out-comment this line.
Lines 71 + 80 - specify the location of the folder containing PRS weights
Line 85 - please replace the identifier name with that used in your biobank.
Lines 99-125 - Replace the example coding (based on UKB data) with the codes relevant to your biobank and remove lines related to the occupation category not included in your biobank. Note, that the UKB data does not include the "Self-employed" category. If your biobank contains the information to categorize occupation information into a "Self-employed" class, please replace lines 116-117 with the following, where XX, YY, and ZZ are fictional codes in the relevant UKB field:

	} 
	else if(!is.na(dat$`20277-0.0`[i]) & (grepl("^XX", dat$`20277-0.0`[i]) | grepl("^YY", dat$`20277-0.0`[i]) | grepl("^ZZ", dat$`20277-0.0`[i]))) {
	Occupation[i] <- "Self-employed"
	}
}

And adjust lines 120-121 as follows (alternatively, if not all occupation classes are present in your biobank remove those not included):

Occupation <- factor(Occupation, levels = c("Manual worker","Self-employed",
                                            "Lower-level","Upper-level"))

Lines 128-131 - If occupation information has already been categorized, out-comment these lines, if occupation information has not been categorized, please replace the identifier names with those used in your biobank.
Line 134 - If occupation information has already been categorized, please replace "phenos" with "pheno", and replace the identifier names with those used in your biobank.
Line 138 - if your biobank contains individuals of non-European ancestry/those that have principal components calculated for NON-EUROPEAN ancestry, i.e. within ancestry principal components, not global genetic principal components, please add code to only retain individuals of European ancestry after this line, for example,:

pheno <- subset(pheno, ANCESTRY=='EUR')

Lines 152-208 - Code assumes you have kept the same shorthand names for the phenotypes as within FinnGen (column B) and you have kept the same naming structure for the PRS files as when you downloaded them. Please adjust the names of the standard covariates before running this code if the current names do not match the naming convention in your biobank and add additional (technical) covariates as required. Remove any of the traits not applicable in your biobank (i.e., if the biobank was included in GWAS the summary statistics were based on, see Supplementary Table 4 of the INTERVENE flagship preprint.
Lines 245, 250, 265, 273, 281, 320, 326, 336, 339 + 342 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 283 + 288 - add additional (technical) covariates if required
Line 307-311 - If none of the PRSs have been flipped (e.g., all associations are positive) then you must out-comment these lines.

15. Before writing the output to file, please check whether each subgroup for each trait has >=5 individuals! Remove traits from the list if <5 individuals in a subgroup, e.g., with the following code:

Listname[c(x,y,z)] <- NULL # where x, y, and z are the shorthand names for the phenotypes as in FinnGen

Lines 350-357 - specify the location you want to save the .Rdata files.

Output files are "*_INTERVENE_Occupation_dat.RData", "*_INTERVENE_PGSgroup1_Occupation_dat.RData", "*_INTERVENE_PGSgroup2_Occupation_dat.RData", and "*_INTERVENE_PGSgroup3_Occupation_dat.RData".

Step 6: Calculate descriptives for each of the socioeconomic indices and trait combinations

Step 6a: Educational Attainment

Run Descriptives_EducationalAttainment.R to calculate the summary statistics on the phenotype files including Educational Attainment. Please make the following adjustments:

Line 50 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 53 - replace with the name of your biobank (don't include spaces in the biobank name)
Lines 62-67 - specify file locations + filenames
Line 149, 179, 207 + 235 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 158-163, 188-193, 216-221 + 224-249 - if your Biobank was included in the prostate cancer GWASs or the number of individuals in any subgroup was <5, and you cannot investigate this trait, out-comment or remove these lines
Lines 165-170, 195-200, 223-228 + 251-256 - if your Biobank was included in the breast cancer GWASs or the number of individuals in any subgroup was <5, and you cannot investigate this trait, out-comment or remove these lines
Lines 173 + 265 - specify the location you want to save the descriptive files.

Output files are "*_INTERVENE_EducationalAttainment_SampleDescriptives.txt" and "*_INTERVENE_EducationalAttainment_SampleDescriptives_byPGS3group.txt"

Step 6b: Occupation

Run Descriptives_Occupation.R to calculate the summary statistics on the phenotype files including Occupation. Please make the following adjustments:

Line 50 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 53 - replace with the name of your biobank (don't include spaces in the biobank name)
Lines 62-67 - specify file locations + filenames
Lines 107-112 - remove lines related to the occupation categories not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please add the following two lines of code if your biobank does include the "Self-employed" occupation class:

Ncontrols_Selfemployed <- sum(filelist$Occupation[which(filelist[,15]==0)]=="Self-employed")
Ncases_Selfemployed <- sum(filelist$Occupation[which(filelist[,15]==1)]=="Self-employed")

Lines 115-117 - remove lines related to occupation categories not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please add the following line of code if your biobank does include the "Self-employed" occupation class:

prevalence_Selfemployed <- round(Ncases_Selfemployed/(Ncases_Selfemployed+Ncontrols_Selfemployed)*100,2)

Lines 120-143 - remove lines related to occupation categories not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please add the following two lines of code if your biobank does include the "Self-employed" occupation class:

Ncontrols_females_Selfemployed <- c(paste0(sum(filelist$SEX[which(filelist[,15]==0 & filelist$Occupation=="Self-employed")]==1), 
                                             " (", 
                                             round(prop.table(table(filelist$SEX[which(filelist[,15]==0 & filelist$Occupation=="Self-employed")]))[2]*100,1), 
                                             "%)"))
Ncases_females_Selfemployed <- c(paste0(sum(filelist$SEX[which(filelist[,15]==1 & filelist$Occupation=="Self-employed")]==1), 
                                          " (", 
                                          round(prop.table(table(filelist$SEX[which(filelist[,15]==1 & filelist$Occupation=="Self-employed")]))[2]*100,1), 
                                          "%)"))

Lines 146-156 - remove lines related to occupation categories not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please adjust the code if your biobank does include the "Self-employed" occupation class (example code now assumes all four occupation classes are present in your biobank):

dat <- as.data.frame(cbind(trait,Ncontrols,Ncases,Ncontrols_females,
                             Ncases_females,prevalence,AgeOnset_q25,
                             AgeOnstet_q50,AgeOnset_q75,AgeOnset_IQR,
                             Followup_IQR,Followup_Median,Ncontrols_Manualworker,Ncases_Manualworker,
                             Ncontrols_Selfemployed,Ncases_Selfemployed,Ncontrols_Lowerlevel,
                             Ncases_Lowerlevel,Ncontrols_Upperlevel,Ncases_Upperlevel,
                             prevalence_Manualworker,prevalence_Selfemployed,
                             prevalence_Lowerlevel,prevalence_Upperlevel,
                             Ncontrols_females_Manualworker,Ncases_females_Manualworker,
                             Ncontrols_females_Selfemployed,Ncases_females_Selfemployed,
                             Ncontrols_females_Lowerlevel,Ncases_females_Lowerlevel,
                             Ncontrols_females_Upperlevel,Ncases_females_Upperlevel))

Line 163, 198, 231 + 264 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar& with %do%
Lines 172-179, 207-214, 240-247 + 273-280 - if your Biobank was included in the prostate cancer GWASs or the number of individuals in any subgroup was <5, and you cannot investigate this trait, out-comment or remove these lines
Lines 181-188, 216-223, 249-256 + 282-289 - if your Biobank was included in the breast cancer GWASs or the number of individuals in any subgroup was <5, and you cannot investigate this trait, out-comment or remove these lines
Lines 191 and 298 - specify the location you want to save the descriptive files.

Output files are "*_INTERVENE_Occupation_SampleDescriptives.txt" and "*_INTERVENE_Occupation_SampleDescriptives_byPGS3group.txt"

Part 2: Biobank-specific analyses

Model 1: Determine the individual effect of the socioeconomic indices or trait-specific polygenic risk score (PRS) on disease risk

Model 1a SES effect, with Educational Attainment as the socioeconomic index

Run CoxPHmodel1a_EducationalAttainment.R to run the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and include sex (except for prostate and breast cancer), birth decade and the first 5 genetic principal components (PCs) as covariates. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 63 - specify file location + filename
Lines 89, 115, 125, 135, 145,155 + 165 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 181 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.sex" to "modcoeffs.cox.model1a"
Lines 184-187 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 190 - specify the location you want to save the model 1a output.

Output file is "*_INTERVENE_EducationalAttainment_CoxPH_model1a_Coeffs.txt"

Model 1a SES effect, with Occupation as the socioeconomic index

Run CoxPHmodel1a_Occupation.R to run the Cox proportional hazard models with age at disease onset as timescale, where occupation is classified into "Manual worker", "Lower-level", "Upper-level" and (optional) "Self-employed" (reference = Manual worker), and include sex (except for prostate and breast cancer), birth decade and the first 5 genetic PCs as covariates. Please make the following adjustments:

Line 52 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 55 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 64 - specify file location + filename
Lines 88, 114, 124, 134, 144, 154 + 164 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 180 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.sex" to "modcoeffs.cox.model1a"
Lines 183-186 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 189 - specify the location you want to save the model 1a output.

Output file is "*_INTERVENE_Occupation_CoxPH_model1a_Coeffs.txt"

Model 1b PRS effect, without Educational Attainment as the socioeconomic index

Run CoxPHmodel1b_EducationalAttainment.R to run the Cox proportional hazard models with age at disease onset as timescale, with the trait-specific PRS, sex (except for prostate and breast cancer), the first 10 genetic PCs, and birth decade as covariates. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 63 - specify file location + filename
Lines 85-86 - add biobank-specific technical covariates if required
Lines 90, 116, 126, 136, 146, 156 + 166 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 182 + 188 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1b.sex" to "modcoeffs.cox.model1b"
Lines 185-186 + 189-190 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 193 - specify the location you want to save the model 1b output.

Output file is "*_INTERVENE_EducationalAttainment_CoxPH_model1b_Coeffs.txt"

Model 1b PRS effect, without Occupation as the socioeconomic index

Run CoxPHmodel1b_Occupation.R to run the Cox proportional hazard models with age at disease onset as timescale, with the trait-specific PRS, sex (except for prostate and breast cancer), the first 10 genetic PCs, and birth decade as covariates. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 63 - specify file location + filename
Lines 85-86 - add biobank-specific technical covariates if required
Lines 90, 116, 126, 136, 146, 156 + 166 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 182 + 188 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1b.sex" to "modcoeffs.cox.model1b"
Lines 185-186 + 189-190 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 193 - specify the location you want to save the model 1b output.

Output file is "*_INTERVENE_Occupation_CoxPH_model1b_Coeffs.txt"

Model 2: Determine the effect of the socioeconomic indices and the trait-specific polygenic risk score (PRS) together on disease risk

Educational Attainment

Run CoxPHmodel2_EducationalAttainment.R to run the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and include the trait-specific PRS, sex (except for prostate and breast cancer), birth decade and the first 10 genetic PCs as covariates. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 63 - specify file location + filename
Lines 86-87 - add biobank-specific technical covariates if required
Lines 91, 117, 127, 137, 147, 157 + 167 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 174 + 180 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model2.sex" to "modcoeffs.cox.model2"
Lines 186-187 + 190-191 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 194 - specify the location you want to save the model 2 output.

Output file is "*_INTERVENE_EducationalAttainment_CoxPH_model2_Coeffs.txt"

Occupation

Run CoxPHmodel2_Occupation.R to run the Cox proportional hazard models with age at disease onset as timescale, where occupation is classified into "Manual worker", "Lower-level", "Upper-level" and (optional) "Self-employed" (reference = Manual worker), and include the trait-specific PRS, sex (except for prostate and breast cancer), birth decade and the first 10 genetic PCs as covariates. Please make the following adjustments:

Line 52 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 55 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 64 - specify file location + filename
Lines 89-90 - add biobank-specific technical covariates if required
Lines 94, 120, 130, 140, 150, 160 + 170 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 186 + 192 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model2.sex" to "modcoeffs.cox.model2"
Lines 189-190 + 193-194 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 197 - specify the location you want to save the model 2 output.

Output file is "*_INTERVENE_Occupation_CoxPH_model2_Coeffs.txt"

Model 3: Determine the effect of the trait-specific polygenic risk score (PRS) per level of the socioeconomic index on disease risk

Educational Attainment

Run CoxPHmodel3_EducationalAttainment.R to run the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and include the trait-specific PRS by EA level (created in this script), sex (except for breast and prostate cancer), bith decade, and the first 10 genetic PCS as covariates. Please make the following adjustments:

Line 50 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 53 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 62 - specify file location + filename
Lines 91, 96, 130, 156, 166, 176, 186, 196 + 206 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Line 103 - adjust "23" to the total number of columns in the dataset before adding the new variables
Lines 126-127 - add biobank-specific technical covariates if required
Lines 222 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model3.sex" to "modcoeffs.cox.model3"
Lines 225-228 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 231 - specify the location you want to save the model 3 output.

Output file is "*_INTERVENE_EducationalAttainment_CoxPH_model3_Coeffs.txt"

Occupation

Run CoxPHmodel3_Occupation.R to run the Cox proportional hazard models with age at disease onset as timescale, where occupation is classified into "Manual worker", "Lower-level", "Upper-level" and (optional) "Self-employed" (reference = Manual worker), and include the trait-specific PRS by occupation level (created in this script), sex (except for breast and prostate cancer), bith decade, and the first 10 genetic PCS as covariates. Please make the following adjustments:

Line 50 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 53 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 62 - specify file location + filename
Lines 75-77 - remove the lines related to the occupation categories not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please adjust the code if your biobank does include the "Self-employed" occupation class:

Selfemployed <- ifelse(filelist$Occupation=="Self-employed", 1, 0)

Lines 80-82 - remove the lines related to the occupation categories not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please adjust the code if your biobank does include the "Self-employed" occupation class:

GxSelfemployed <- filelist[,19] * Selfemployed

Line 85 - remove the occupation classes not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please adjust the code if your biobank does include the "Self-employed" occupation class (the following example assumes all four occupation classes are present in your biobank):

out = cbind(GxManualworker,GxSelfemployed,GxLowerlevel,GxUpperlevel)

Lines 93, 98, 137, 163, 173, 183, 193, 203 + 213 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Line 104-107 - adjust "21" to the total number of columns in the dataset before adding the new variables and remove occupation classes not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please adjust the code if your biobank does include the "Self-employed" occupation class (the following example assumes all four occupation classes are present in your biobank):

for (i in 1:length(INTERVENE.list)) {
  names(INTERVENE.list[[i]]) <- c(names(INTERVENE.list[[i]][1:21]),
                                  "GxManualworker","GxSelfemployed","GxLowerlevel","GxUpperlevel")
}

Lines 129-134 - add biobank-specific technical covariates if required and remove the occupation classes not included in your biobank. As this example is based on UKB, which does not include the "Self-employed" occupation class, please adjust the code if your biobank does include the "Self-employed" occupation class (the following example assumes all four occupation classes are present in your biobank):

mod3sex.formula <- paste0("GxManualworker + GxSelfemployed + GxLowerlevel + GxUpperlevel + 
                          SEX + PC1 + PC2 + PC3 + PC4 + PC5 + 
                          PC6 + PC7 + PC8 + PC9 + PC10")
mod3nosex.formula <- paste0("GxManualworker + GxSelfemployed + GxLowerlevel + GxUpperlevel + 
                          PC1 + PC2 + PC3 + PC4 + PC5 + 
                          PC6 + PC7 + PC8 + PC9 + PC10")

Lines 229 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model3.sex" to "modcoeffs.cox.model3"
Lines 232-235 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 238 - specify the location you want to save the model 3 output.

Output file is "*_INTERVENE_Occupation_CoxPH_model3_Coeffs.txt"

Model 4: Determine the effect of the socioeconomic index, the trait-specific polygenic risk score (PRS), and their interaction on disease risk

Educational Attainment

Run CoxPHmodel4_EducationalAttainment.R to run the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and include EA, the trait-specific PRS, the EA * trait-specific PRS interaction, sex (except for prostate and breast cancer), the first 10 genetics PCs, and birth decade as covariates. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 63 - specify file location + filename
Lines 88-89 - add biobank-specific technical covariates if required
Lines 93, 119, 129, 139, 149, 159 + 169 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 185 + 191 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model4.sex" to "modcoeffs.cox.model4"
Lines 188-189 + 192-193 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 196 - specify the location you want to save the model 4 output.

Output file is "*_INTERVENE_EducationalAttainment_CoxPH_model4_Coeffs.txt"

Occupation

Run CoxPHmodel4_Occupation.R to run the Cox proportional hazard models with age at disease onset as timescale, where occupation is classified into "Manual worker", "Lower-level", "Upper-level" and (optional) "Self-employed" (reference = Manual worker), and include occupation, the trait-specific PRS, the occupation * trait-specific PRS interaction, sex (except for prostate and breast cancer), the first 10 genetics PCs, and birth decade as covariates. Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Line 63 - specify file location + filename
Lines 89-90 - add biobank-specific technical covariates if required
Lines 94, 120, 130, 140, 150, 160 + 170 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 186 + 192 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model4.sex" to "modcoeffs.cox.model4"
Lines 189-190 + 193-194 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Line 197 - specify the location you want to save the model 4 output.

Output file is "*_INTERVENE_Occupation_CoxPH_model4_Coeffs.txt"

Model 5: Determine the effect of the socioeconomic indices on disease risk in each of the three trait-specific polygenic risk score (PRS) groups

Educational Attainment

Run CoxPHmodel5_EducationalAttainment.R to run the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and include sex (except for prostate and breast cancer), birth decade, and the first 5 genetic PCs as covariates in each of the groups stratified by PRS ("<25%", "25-75%", and ">75%"). Please make the following adjustments:

Line 51 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Lines 63-65 - specify file location + filename
Lines 91, 117, 127, 137, 147, 157, 167, 198 + 225 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 183 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.Group1.sex" to "modcoeffs.cox.model1a.Group1"
Lines 186-189, 213-216 + 240-243 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Lines 210 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.Group2.sex" to "modcoeffs.cox.model1a.Group2"
Lines 237 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.Group3.sex" to "modcoeffs.cox.model1a.Group3"
Line 192, 219 + 246 - specify the location you want to save the model 5 output.

Output files are "*_INTERVENE_EducationalAttainment_CoxPH_model5_Coeffs_PGSGroup1.txt", "*_INTERVENE_EducationalAttainment_CoxPH_model5_Coeffs_PGSGroup2.txt", and "*_INTERVENE_EducationalAttainment_CoxPH_model5_Coeffs_PGSGroup3.txt"

Occupation

Run CoxPHmodel5_Occupation.R to run the Cox proportional hazard models with age at disease onset as timescale, where occupation is classified into "Manual worker", "Lower-level", "Upper-level" and (optional) "Self-employed" (reference = Manual worker), and include sex (except for prostate and breast cancer), birth decade, and the first 5 genetic PCs as covariates in each of the groups stratified by PRS ("<25%", "25-75%", and ">75%"). Please make the following adjustments:

Line 52 - if you're running this on a single core or a Rstudio session with automatic multi-threading, you can choose to out-command this line
Line 54 - replace with the name of your biobank (don't include spaces in the biobank name)
Lines 64-66 - specify file location + filename
Lines 92, 118, 128, 138, 148, 158, 168, 199 + 226 - if running on a single core or a Rstudio session with automatic multi-threading, replace %dopar% with %do%
Lines 184 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.Group1.sex" to "modcoeffs.cox.model1a.Group1"
Lines 187-190, 214-217 + 241-244 - if you cannot run the analyses for prostate and breast cancer, out-comment or remove these lines
Lines 211 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.Group2.sex" to "modcoeffs.cox.model1a.Group2"
Lines 238 - if you cannot run the analyses for prostate and breast cancer, rename "modcoefffs.cox.model1a.Group3.sex" to "modcoeffs.cox.model1a.Group3"
Line 193, 220 + 247 - specify the location you want to save the model 5 output.

Output files are "*_INTERVENE_Occupation_CoxPH_model5_Coeffs_PGSGroup1.txt", "*_INTERVENE_Occupation_CoxPH_model5_Coeffs_PGSGroup2.txt", and "*_INTERVENE_Occupation_CoxPH_model5_Coeffs_PGSGroup3.txt"

Part 3: Meta-analyses across biobank studies

Please note that the meta-analysis scripts are only provided for Educational Attainment.

Model 1: Meta-analyze the individual effect of the socioeconomic indices or trait-specific polygenic risk score (PRS) on disease risk

Model 1a SES effect, with Educational Attainment as the socioeconomic index

Run MetaAnalysismodel1a_EducationalAttainment.R to run the fixed-effect meta-analysis across biobank studies using the beta coefficients from the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and including sex (except for prostate and breast cancer), birth decade and the first 5 genetic principal components (PCs) as covariates. This script downloads the summary statistics per biobank study from Google Drive and also uploads the resulting meta-analysis to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model1a_anyN.csv"

Model 1b PRS effect, without Educational Attainment as the socioeconomic index

Run MetaAnalysismodel1b_EducationalAttainment.R to run the fixed-effect meta-analysis across biobank studies using the beta coefficients from the Cox proportional hazard models with age at disease onset as timescale, with the trait-specific PRS, sex (except for prostate and breast cancer), the first 10 genetic PCs, and birth decade as covariates. This script downloads the summary statistics per biobank study from Google Drive and also uploads the resulting meta-analysis to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model1b_anyN.csv"

Model 2: Meta-analyze the effect of the socioeconomic indices and the trait-specific polygenic risk score (PRS) together on disease risk

Run MetaAnalysismodel2_EducationalAttainment.R to run the fixed-effect meta-analysis across biobank studies using the beta coefficients from the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and including the trait-specific PRS, sex (except for prostate and breast cancer), birth decade and the first 10 genetic PCs as covariates. This script downloads the summary statistics per biobank study from Google Drive and also uploads the resulting meta-analysis to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model2_anyN.csv"

Compare the meta-analyzed effects of the socioeconomic indices and the trait-specific polygenic risk scores (PRS) as obtained from model 1 (unadjusted) with those obtained in model 2 (adjusted).

Comparisons are done with two-sided Wald tests after Bonferroni correction for multiple testing of 19 phenotypes (p < 2.63x10-03). Run MetaAnalysismodel1vs2differences_EducationalAttainment.R to compare the meta-analyzed estimates from model 1 and model 2 to determine whether analyzing the socioeconomic indices and PRSs jointly significantly differ from analyzing them separately. This script downloads the summary statistics per biobank study and across biobank studies (meta-analysis) from Google Drive and also uploads the result of the comparisons to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model1vs2_Differences.csv"

Model 3: Meta-analyze the effect of the trait-specific polygenic risk score (PRS) per level of the socioeconomic index on disease risk

Run MetaAnalysismodel3_EducationalAttainment.R to run the fixed-effect meta-analysis across biobank studies using the beta coefficients from the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and including the trait-specific PRS by EA level (created in this script), sex (except for breast and prostate cancer), bith decade, and the first 10 genetic PCS as covariates. This script downloads the summary statistics per biobank study from Google Drive and also uploads the resulting meta-analysis to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model3_anyN.csv"

Compare the meta-analyzed effects the trait-specific polygenic risk score (PRS) per level of the socioeconomic index on disease risk between the levels of the socioeconomic index

Comparisons are done with two-sided Wald tests after Bonferroni correction for multiple testing of 19 phenotypes (p < 2.63x10-03). Run MetaAnalysismodel3differences_EducationalAttainment.R to compare the meta-analyzed estimates effects the trait-specific polygenic risk score (PRS) per level of the socioeconomic index on disease risk between the levels of the socioeconomic index to determine whether the effect of the PRS differs significantly between the levels of the socioeconomic index. This script downloads the summary statistics per biobank study and across biobank studies (meta-analysis) from Google Drive and also uploads the result of the comparisons to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model3_Differences.csv"

Model 4: Meta-analyze the effect of the socioeconomic index, the trait-specific polygenic risk score (PRS), and their interaction on disease risk

Run MetaAnalysismodel4_EducationalAttainment.R to run the fixed-effect meta-analysis across biobank studies using the beta coefficients from the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and including EA, the trait-specific PRS, the EA * trait-specific PRS interaction, sex (except for prostate and breast cancer), the first 10 genetics PCs, and birth decade as covariates. This script downloads the summary statistics per biobank study from Google Drive and also uploads the resulting meta-analysis to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model4_anyN.csv"

Model 5: Meta-analyze the effect of the socioeconomic indices on disease risk in each of the three trait-specific polygenic risk score (PRS) groups

Run MetaAnalysismodel5_EducationalAttainment.R to run the fixed-effect meta-analysis across biobank studies using the beta coefficients from the Cox proportional hazard models with age at disease onset as timescale, where EA is dichotomized into low vs high EA (reference = low EA), and including sex (except for prostate and breast cancer), birth decade, and the first 5 genetic PCs as covariates in each of the groups stratified by PRS ("<25%", "25-75%", and ">75%"). This script downloads the summary statistics per biobank study from Google Drive and also uploads the resulting meta-analysis to Google Drive. In this project, we meta-analysed across the FinnGen study, the UK Biobank, and Generation Scotland.

Output file is "*_INTERVENE_EducationalAttainment_FEMetaAnalysis_FinnGenR11_UKB_GenScot_model5_anyN.csv"

Part 4: Create (supplemental) tables and figures as included in the manuscript for this project.

Please not that the manuscript only includes the results for Educational Attainment.

(Supplemental) Tables: Scripts that create the Main Table 1 and the supplemental tables including the results from this project.

Run MainTable1.R to create the Main Table 1 in the manuscript, which contains the general descriptive statistics across biobank studies.

Output file is "*_INTERVENE_EducationalAttainment_MainTable1.txt"

Run SupplementaryTables.R to create the eight supplemental tables for the manuscript, namely: Table S4: descriptive statistics per cohohort, Table S5: results model 1: per cohort + meta-analysis; Table S6: results model 2: per cohort + meta-analysis; Table S7: significance test difference effect education and PGS model 1vs2; Table S8: results model 3 (per cohort + meta-analysis); Table S9 significance test difference effect PGS high vs low EA (model3); Table S10: results model 5: per cohort + meta-analysis; and Table S11: descriptive statistics per cohort and polygenic score strata.

Output files are "*_INTERVENE_EducationalAttainment_TableS4.txt", "*_INTERVENE_EducationalAttainment_TableS5.txt", "*_INTERVENE_EducationalAttainment_TableS6.txt", "*_INTERVENE_EducationalAttainment_TableS7.txt", "*_INTERVENE_EducationalAttainment_TableS8.txt", "*_INTERVENE_EducationalAttainment_TableS9.txt", "*_INTERVENE_EducationalAttainment_TableS10.txt", and "*_INTERVENE_EducationalAttainment_TableS11.txt".

(Supplemental) Figures: Scripts that create the Main Figures 1 and 2 and the supplemental figures depicting the results from this project.

Please note that Main Figures 1 and 2 are completed in InkScape after creating the base figures in R.

Run MainFigure1.R to create the base for Main Figure 1 in the manuscript, which contains 4 panels: A: meta-analyzed results model 1a; B: comparison meta-analyzed education results model 1a vs. model 2; C: meta-analyzed PRS results model 1b; D: comparison meta-analyzed PRS results model 1b vs. model 2.

Output file is "*_INTERVENE_SESDiffDiseases_MainFigure1.png"

Run MainFigure2.R to create the base for Main Figure 2 in the manuscript, which contains the meta-analyzed results of model 3, including the comparison of the PRS estimates between the education groups.

Output file is "*_INTERVENE_SESDiffDiseases_MainFigure2.png"

Run SupplementaryFigures.R to create the four supplemental figures for the manuscript, namely: Figure S1: comparison of the hazard ratios per biobank study and the fixed-effect meta-analysis of model 1a and 1b; Figure S2: comparison of the hazard ratios per biobank study and the fixed-effect meta-analysis of model 2; Figure S3: comparison of the hazard ratios per biobank study and the fixed-effect meta-analysis of model 3; and Figure S4: comparison of the hazard ratios per biobank study and the fixed-effect meta-analysis of model 5.

Output files are "*_INTERVENE_EducationalAttainment_FigureS1.tiff", "*_INTERVENE_EducationalAttainment_FigureS2.tiff", "*_INTERVENE_EducationalAttainment_FigureS3.tiff", and "*_INTERVENE_EducationalAttainment_FigureS4.tiff".

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
CoxModels		CoxModels
DataPrep		DataPrep
Manuscript		Manuscript
Meta-analyses		Meta-analyses
LICENSE		LICENSE
README.md		README.md

License

intervene-EU-H2020/GxE_SESDisease

Folders and files

Latest commit

History

Repository files navigation

Scripts for the INTERVENE GxE SES and complex diseases project

Dependencies

Contact

Part 1: Biobank-specific general data preparation

Step 1: Define phenotypes

Step 1a: Run UKBPhenotyper.R

Step 1b: Construct the phenotype file

Step 2: Download the adjusted summary statistics

Step 3: Merge variant IDs to match those within the .bim file

Step 4: Compute polygenic risk scores (PRSs) with Plink

If your Plink files are not split by chromosome

If your Plink files are split by chromosome

Step 4a: run GeneratePRS_IndividualChr.sh

Step 4b: run PRSSummationOverChr.R

Step 5: Construct combined phenotype and polygenic risk score (PRS) R objects

Step 5a: Educational Attainment

Step 5b: Occupation

Step 6: Calculate descriptives for each of the socioeconomic indices and trait combinations

Step 6a: Educational Attainment

Step 6b: Occupation

Part 2: Biobank-specific analyses

Model 1: Determine the individual effect of the socioeconomic indices or trait-specific polygenic risk score (PRS) on disease risk

Model 1a SES effect, with Educational Attainment as the socioeconomic index

Model 1a SES effect, with Occupation as the socioeconomic index

Model 1b PRS effect, without Educational Attainment as the socioeconomic index

Model 1b PRS effect, without Occupation as the socioeconomic index

Model 2: Determine the effect of the socioeconomic indices and the trait-specific polygenic risk score (PRS) together on disease risk

Educational Attainment

Occupation

Model 3: Determine the effect of the trait-specific polygenic risk score (PRS) per level of the socioeconomic index on disease risk

Educational Attainment

Occupation

Model 4: Determine the effect of the socioeconomic index, the trait-specific polygenic risk score (PRS), and their interaction on disease risk

Educational Attainment

Occupation

Model 5: Determine the effect of the socioeconomic indices on disease risk in each of the three trait-specific polygenic risk score (PRS) groups

Educational Attainment

Occupation

Part 3: Meta-analyses across biobank studies

Model 1: Meta-analyze the individual effect of the socioeconomic indices or trait-specific polygenic risk score (PRS) on disease risk

Model 1a SES effect, with Educational Attainment as the socioeconomic index

Model 1b PRS effect, without Educational Attainment as the socioeconomic index

Model 2: Meta-analyze the effect of the socioeconomic indices and the trait-specific polygenic risk score (PRS) together on disease risk

Compare the meta-analyzed effects of the socioeconomic indices and the trait-specific polygenic risk scores (PRS) as obtained from model 1 (unadjusted) with those obtained in model 2 (adjusted).

Model 3: Meta-analyze the effect of the trait-specific polygenic risk score (PRS) per level of the socioeconomic index on disease risk

Compare the meta-analyzed effects the trait-specific polygenic risk score (PRS) per level of the socioeconomic index on disease risk between the levels of the socioeconomic index

Model 4: Meta-analyze the effect of the socioeconomic index, the trait-specific polygenic risk score (PRS), and their interaction on disease risk

Model 5: Meta-analyze the effect of the socioeconomic indices on disease risk in each of the three trait-specific polygenic risk score (PRS) groups

Part 4: Create (supplemental) tables and figures as included in the manuscript for this project.

(Supplemental) Tables: Scripts that create the Main Table 1 and the supplemental tables including the results from this project.

(Supplemental) Figures: Scripts that create the Main Figures 1 and 2 and the supplemental figures depicting the results from this project.

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages