A report describing the work is available in report.pdf.
To reproduce our results:
- Download the dataset from here and place it in the "data" directory at the repo root.
- Create the conda environment from environment.yml and activate it by running:
    conda env create -f environment.yml
    conda activate ntds_2019
- Execute the Jupyter notebooks in the following order:
  - baseline_phenotype_predictor.ipynb
  - coexpression_graph_construction.ipynb
  - coexpression_graph_exploration.ipynb
  - SNP_expression_imputation.ipynb
1. Exploratory data analysis
   - Obtain the PhenoID for the diseases of interest (malaria and influenza).
   - Get the phenotype value associated with each PhenoID of interest for each mouse.
   - Check the distribution of the phenotype values obtained in the previous step.
   - Select a single phenotype for simplicity: malaria susceptibility.
   - Normalize the data: subtract the mean and divide by the standard deviation.
   - Build a dataframe of all (SNP expression, tissue) pairs for all mice, used to construct the co-expression graph.
2. Setting a baseline
   - Build regression models (ridge regression and random forests) to predict phenotype values from gene expression. We use all expression values and fill missing values with the mean.
3. Building the co-expression graph
   - Each (gene, tissue) pair is a node; the expression values of each mouse are used to compute the distance metric between nodes.
4. Imputation of SNP expression on the co-expression graph using a Tikhonov filter
   - Smooth the signal on the graph to fill in missing SNP expression values.
5. Prediction of phenotype values using the additional expression data from step 4
   - Use the best-performing method and compare it with the baseline.
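
The graph construction and Tikhonov imputation steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code from the notebooks. The function names, the Gaussian kernel with its default width, the dense (unsparsified) graph, and the value of `alpha` are all assumptions made for the example. It also assumes the low-pass Tikhonov filter h(λ) = 1 / (1 + αλ), applied by solving (I + αL) x = y, with missing entries set to the column mean before filtering. The notebooks may use a different kernel, sparsification, or formulation.

```python
import numpy as np


def zscore_columns(X):
    """Standardize each column (subtract mean, divide by std), ignoring NaNs."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X - mean) / std


def gaussian_adjacency(features, sigma=None):
    """Dense Gaussian-kernel adjacency from pairwise Euclidean distances.

    Each row of `features` describes one node. The kernel and the default
    sigma are illustrative assumptions, not the notebooks' settings.
    """
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives from round-off
    if sigma is None:
        sigma = np.sqrt(np.mean(sq_dist)) or 1.0
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def tikhonov_impute(X, alpha=1.0):
    """Fill missing entries of X (mice x nodes) by Tikhonov smoothing on a graph.

    Returns the z-scored matrix with observed entries kept as they are and
    missing entries replaced by their smoothed values.
    """
    missing = np.isnan(X)
    Z = zscore_columns(X)
    Z0 = np.where(missing, 0.0, Z)  # 0 is the column mean after z-scoring

    # Nodes are columns, so each node's feature vector is its values across mice.
    W = gaussian_adjacency(Z0.T)
    L = np.diag(W.sum(axis=1)) - W  # combinatorial graph Laplacian

    # Tikhonov filter: solve (I + alpha * L) x = y for every mouse's signal.
    n_nodes = L.shape[0]
    smoothed = np.linalg.solve(np.eye(n_nodes) + alpha * L, Z0.T).T

    return np.where(missing, smoothed, Z0)


# Synthetic example: 20 mice, 5 nodes, the first 3 nodes strongly co-expressed.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(20, 3)), rng.normal(size=(20, 2))])
X[0, 1] = np.nan
X[5, 3] = np.nan

X_imputed = tikhonov_impute(X, alpha=0.5)
```

Larger values of `alpha` smooth more aggressively, pulling each imputed value closer to the values at strongly connected nodes. Only the missing entries are overwritten, so the observed measurements pass through unchanged apart from the normalization.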