Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, The Netherlands.
Operating system: tested on Ubuntu 18.04 LTS
R: tested on R version 3.6.1 (2019-07-05) -- "Action of the Toes"
RAM requirements: data preprocessing and data normalization in particular need at least 120 GB of RAM.
Datasets: the scripts were used to preprocess 450K data whose raw IDAT files are available in the Gene Expression Omnibus (GEO) database under accessions GSE128235, GSE100386, GSE125105, GSE61496, GSE87571, and GSE115278.
- quality_control.R: quality control assessment of probes/CpG sites and samples, plus sex prediction
- normalization.R: normalization pipeline for all raw IDATs. Warning: using all 1057 samples requires approximately 160 GB of RAM to store the matrix transformation; this was tested with 40 CPUs.
- train.R: model training using Support Vector Machines with a radial kernel and the eps-regression technique
- predict.R: script for predicting age from pre-normalized beta values
- plots.R: plots as shown in the paper, including violin plots, histograms and scatter plots
- annotation.R: functional annotation of evCpGs
- probes_correlation.R: age correlation among all Y-CpG probes, as in the paper
- data/qc/: list of probes used for data preprocessing/normalization of the train + validation and test sets
- data/annotation/: annotation and correlation files to be used in the Integrative Genomics Viewer (IGV)
- data/feature_selection/: list of CpG sites selected by IQR (>= 0.1) and stepwise forward feature selection
- data/normalized/: normalized methylation beta values (BMIQ + ENmix) for Horvath and Y-chromosome
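
As a rough illustration of how the IQR filter (>= 0.1) and the radial-kernel eps-regression step fit together, the sketch below runs both on simulated beta values. It is not the code from train.R: the CpG names, the simulated data and the choice of the e1071 package are assumptions made for this example, and the SVM step is skipped if e1071 is not installed.

```r
# Simulated stand-in for a normalized beta-value matrix (samples x CpGs).
# All names and numbers here are hypothetical, not taken from the repository.
set.seed(1)
n_samples <- 60
age <- runif(n_samples, min = 20, max = 80)

betas <- cbind(
  cg_age_up   = 0.15 + 0.008 * age + rnorm(n_samples, sd = 0.02),
  cg_age_down = 0.85 - 0.007 * age + rnorm(n_samples, sd = 0.02),
  cg_flat     = 0.50 + rnorm(n_samples, sd = 0.01)
)

# Keep CpGs whose interquartile range of beta values is at least 0.1.
iqr_per_cpg <- apply(betas, 2, IQR)
selected <- names(iqr_per_cpg)[iqr_per_cpg >= 0.1]
print(selected)

# Assumption: e1071 is used here only because it offers type = "eps-regression";
# the package actually used by train.R is not stated in this README.
if (requireNamespace("e1071", quietly = TRUE)) {
  x <- betas[, selected, drop = FALSE]
  fit <- e1071::svm(x = x, y = age,
                    type = "eps-regression", kernel = "radial")
  predicted_age <- predict(fit, x)
  # Error on the training data only; not an estimate of real accuracy.
  cat("Training MAE (years):", round(mean(abs(predicted_age - age)), 2), "\n")
}
```

The low-variability probe (cg_flat) is dropped by the filter, so only the two age-associated probes reach the model. With real data, the error should be measured on held-out samples rather than on the training set.
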
Please contact me at d.montielgonzalez@erasmusmc.nl for any questions or issues concerning the scripts.
A. Vidaki et al. (2020). Male-specific age estimation based on Y-chromosomal DNA methylation. Aging.