This repository contains the code for reproducing the results in the paper "Multivariate Bayesian structured variable selection for pharmacogenomic studies" by Zhao Z., Banterle M., Lewin A. and Zucknick M. (2022), arXiv:2101.05899v3.
The main data come from the Genomics of Drug Sensitivity in Cancer (GDSC) project (Garnett et al., 2012, Nature) and are publicly available. The MAPK gene set is also publicly available from the Kyoto Encyclopedia of Genes and Genomes (KEGG).
No restrictions.
The pharmacogenomic datasets in Garnett et al. (2012, Nature) can be downloaded from
ftp://ftp.sanger.ac.uk/pub4/cancerrxgene/releases/release-5.0/
Three data files are needed: gdsc_en_input_w5.csv, gdsc_drug_sensitivity_fitted_data_w5.csv and gdsc_tissue_output_w5.csv.
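The download can also be scripted in R. The following is a minimal sketch, not part of the original code: the helper names `gdsc_urls` and `download_gdsc` and the destination folder `data` are assumptions for illustration, and the preprocessing scripts may expect the files in a different location.

```r
# Base URL and file names are taken from the text above.
base_url <- "ftp://ftp.sanger.ac.uk/pub4/cancerrxgene/releases/release-5.0/"
gdsc_files <- c(
  "gdsc_en_input_w5.csv",
  "gdsc_drug_sensitivity_fitted_data_w5.csv",
  "gdsc_tissue_output_w5.csv"
)

# Hypothetical helper: build the full URL of each data file.
gdsc_urls <- function(base = base_url, files = gdsc_files) {
  paste0(base, files)
}

# Hypothetical helper: download every file that is not yet in `dest_dir`.
# The default folder name "data" is an assumption.
download_gdsc <- function(dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  urls <- gdsc_urls()
  dest <- file.path(dest_dir, gdsc_files)
  for (i in seq_along(urls)) {
    if (!file.exists(dest[i])) {
      download.file(urls[i], destfile = dest[i], mode = "wb")
    }
  }
  invisible(dest)
}
```

Calling `download_gdsc()` once fetches the three files; repeated calls skip files that already exist.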
The gene set of MAPK can be downloaded from KEGG gene set enrichment analysis
http://software.broadinstitute.org/gsea/msigdb/cards/KEGG_MAPK_SIGNALING_PATHWAY
Download the .txt version and delete the header content, i.e. the first two lines.
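Removing the two header lines can be done by hand or with a few lines of R. The sketch below is one possible way; the helper name `strip_header` and the file name in the usage comment are assumptions, not part of the original code.

```r
# Hypothetical helper: drop the first `n_header` lines of a gene set file
# and write the remaining lines (one gene symbol per line) to `outfile`.
strip_header <- function(infile, outfile, n_header = 2) {
  lines <- readLines(infile, warn = FALSE)
  genes <- if (length(lines) > n_header) {
    lines[-seq_len(n_header)]
  } else {
    character(0)
  }
  writeLines(genes, outfile)
  invisible(genes)
}

# Usage (file name assumed; adjust to the name of the downloaded file):
# strip_header("KEGG_MAPK_SIGNALING_PATHWAY.txt", "KEGG_MAPK_SIGNALING_PATHWAY.txt")
```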
We include all the code needed to reproduce our results.
We developed an R package, BayesSUR, which is available on CRAN and implements our approach. All of the code is also available on the first author’s GitHub.
The real data analysis is computationally intensive because of the high-dimensional genomic predictors. Running on a single CPU thread takes a few hours but produces the same results as in the article. The analysis runs faster on a cluster with parallelization. However, because the core code of our approach is written in C++ (via Rcpp) for computational efficiency and uses OpenMP for parallelization, it is difficult to reproduce the results exactly on different types of machines when running in parallel.
All tables, Figures 3-5, and Figures 7-10 can be reproduced through the provided code. The general steps are:
- Load the simulation function from the file simulation_function.R.
- Run the script simulation_results.R line by line to reproduce all simulation results in the article.
- Download the real data by following the Data section above.
- Run the script GDSC_preprocess1.R to get the real data ready for modelling (some lines need to be uncommented to obtain the datasets corresponding to 'Feature sets II and III' in the article); run the script GDSC_preprocess2.R to get the independent validation data.
- Run the script GDSC_results.R line by line to reproduce all results of the real data analysis in the article.
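Before starting the steps above, it can help to confirm that all scripts and data files are in place. The following is a minimal sketch under the assumption that everything sits in one working directory; the helper name `missing_files` is hypothetical, and the actual scripts may expect a different layout.

```r
# Script and data file names as listed in the text above.
required_scripts <- c(
  "simulation_function.R",
  "simulation_results.R",
  "GDSC_preprocess1.R",
  "GDSC_preprocess2.R",
  "GDSC_results.R"
)
required_data <- c(
  "gdsc_en_input_w5.csv",
  "gdsc_drug_sensitivity_fitted_data_w5.csv",
  "gdsc_tissue_output_w5.csv"
)

# Hypothetical helper: return the required files that are missing from `dir`.
# Assumes scripts and data share one directory, which may not match the repository.
missing_files <- function(dir = ".", files = c(required_scripts, required_data)) {
  files[!file.exists(file.path(dir, files))]
}

# Usage: an empty result means every listed file was found.
# missing_files(".")
```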