report.Rmd

---
title: "Windows of Susceptibility Analysis for Brain Diseases"
author: "Tianrui Qi"
date: "22 Feb 2022"
output:
  html_document: default
  word_document: default
  pdf_document: default
urlcolor: blue
editor_options:
  markdown:
    wrap: sentence
---

```{r setup, include=FALSE}
# Required R package installation:
# These will install packages if they are not already installed

# Set the correct default repository
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)

if (!require("gplots")) {
  install.packages("gplots")
  library(gplots)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("ggbiplot")) {
   devtools::install_git("https://github.com/vqv/ggbiplot.git")
   library(ggbiplot)
}
if (!require("ggbiplot")) {
   install.packages("ggbiplot")
   library(ggbiplot)
}
if (!require("ComplexHeatmap")) {
  library(devtools)
  install_github("jokergoo/ComplexHeatmap")
  library(ComplexHeatmap)
}
if (!require("fields")) {
  install.packages("fields")
  library(fields)
}
if (!require("hablar")) {
  install.packages("hablar")
  library(hablar)
}
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}
if (!require("matlab")) {
   install.packages("matlab")
   library(matlab)
}
if (!require("tibble")) {
  install.packages("tibble")
  library(tibble)
}
if (!require("tidyr")) {
  install.packages("tidyr")
  library(tidyr)
}
knitr::opts_chunk$set(echo = TRUE)
```

# 1. Introduction

From [WHO](https://www.who.int/news-room/fact-sheets/detail/zika-virus), Zika 
virus disease is caused by a virus transmitted primarily by Aedes mosquitoes, 
which bite during the day. 
Symptoms are generally mild and include fever, rash, conjunctivitis, muscle and
joint pain, malaise or headache. 
Symptoms typically last for 2–7 days. 
Most people with Zika virus infection do not develop symptoms. 
Zika virus infection during pregnancy can cause infants to be born with 
Microcephaly and other congenital malformations, known as congenital Zika 
syndrome. 
Infection with Zika virus is also associated with other complications of 
pregnancy including preterm birth and miscarriage. An increased risk of 
neurologic complications is associated with Zika virus infection in adults and 
children, including Guillain-Barré syndrome, neuropathy and myelitis.

Researchers led by Dr. Bennett used existing data from in-vitro models of human 
brain development based on embryonic stem cells to understand when embryos are 
particularly susceptible to Zika infection resulting in Microcephaly. 
These models produce the layers of the cerebral cortex in a petri dish. 
The study is uses data from the [Cortecon](http://cortecon.neuralsci.org/) that
consists of RNA transcription data measured as the layers of the cerebral cortex
develop at nine different time points between 0 and 77 days. 
An analysis of this data was published in a 2019 paper in the journal 
[Nature](https://www.nature.com/articles/s41598-019-39318-8). 
The prior study found that in humans that genes associated with microcephaly are
enriched for the Neuroectoderm Stage (Cluster 2). 
Genes associated with microcephaly and changed by Zika infection have an even 
stronger enrichment. 
This suggests that human embryos are particularly susceptible to Zika-induced 
microcephaly in the first trimester, possibly before the mother even know she is
pregnant.

Clustering was used to identify the stages of development in humans. 
Principal component analysis (PCA), data visualization, and log odds ratio 
analysis were used to show when these stages occur. 
An advanced visualization, called a Susceptibility Window Ontological 
Transcription (SWOT) Clock, is available at 
[here](https://github.com/mpoegel/SemNExT-Visualizations). 
Given a set of genes associated with a disease, the SWOT clock can be used to 
help identify time periods of cortical development that may be susceptible to 
changes in those genes. 
Recall each cluster of gene represents a distinct stage of development, if a set
of genes occurs more than expected (a.k.a enriched) for a given stages then it 
is likely that brain development may be impacted by changes in the activities of
those genes during that stage. 
The SWOT Clock provides a visualization and a statistical analysis of the log 
odds ratio of each cluster along with its p-value. 
Since the set of disease genes are significantly enriched in a cluster, then the
time period associated with that cluster is a window of susceptibility. 
A cluster is significantly enriched if it has a positive log odds ratio and a 
p-value < 0.1.

Perform the windows of susceptibility (WOS) analysis based on mouse data from a 
similar brain-in-a-dish model for mice instead of human data used by 
Dr. Bennett. 
Analyze the same sets of microcephaly-associated genes and Zika-associated genes
to see if we can detect a similar WOS for Microcephaly and Zika-induced 
microcephaly in mice as in humans. 
Same technique also applied to cognitive disorders, anther disease that impacts 
mental function, to see if we get a similar result.

# 2. Stages of Development Analysis

## 1.1 Preparation of the Mouse Homologs Data

Scott the Scientist did an analysis of RNA-Seq data from the development of the mouse cortex at days $-8$, $-4$, $0$ , $1$, $7$, $16$, $21$, and $26$ taken from [Allen Brain Map Developing Mouse Brain Atlas](http://developingmouse.brain-map.org).
This type of data is know as time series data since the features are taken through time.

We begins by reading in the data and preparing the data frame.
The columns are days at which the samples are collected where `DayNegN`, `Day0`, and `DayPosN` means $N$ days before birth, day of birth, and $N$ days after bith. 
The entries in the columns are the amount of RNA for each gene detected on that day in the mouse embryo cerebral cortex.
We then change the columns' name to their actual meaning and create a matrix for future analysis.

```{r}
# read data as data frame
Mouse.df <- read.csv("data/MouseHomologData.csv", row.names = 1) 
head(Mouse.df)  # preview of the data frame

# change column name
colnames(Mouse.df) <- c("-8", "-4", "0", "1", "7", "16", "21", "28")
head(Mouse.df)  # preview of the data frame

# create matrix
Mouse.matrix <- as.matrix(Mouse.df)
```

Note that the data already been scaled so that the analysis can focus on the shape of the time series rather than specific magnitudes.
We can confirm the scaling is successful by calculating the row means and making sure their norm was near 0.

```{r}
norm(rowMeans(Mouse.matrix))
```

## 1.2 K-means Clustering

We used [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) to create five clusters based on domain knowledge: biologists believe there are five stages of brain development, so we select five clusters.

```{r}
# K-means clustering
set.seed(300)
km <- kmeans(Mouse.matrix, 5)
```

Then, examining the heat map of the K-means cluster centers, we can see that each cluster corresponds to different average peaks of gene expressions.

```{r}
# plot the heat map
heatmap.2(
  x = km$centers, Colv = FALSE,
  dendrogram = "none", trace ="none",
  main = "K-means Cluster Centers",
)
```

Note that the order of the cluster is `2`, `4`, `1`, `5`, `3` from the heat map.
Thus, we rearrange the order of cluster to `2`, `4`, `1`, `5`, `3` and rename them as `A`, `B`, `C`, `D`, `E`.

```{r}
# reorder and rename the "km$centers"
km$centers  # before reorder and rename
km$centers <- km$centers[c('2', '4', '1', '5', '3'), ]
rownames(km$centers) <- c('A', 'B', 'C', 'D', 'E')
km$centers  # after reorder and rename 

# rename the "km$cluster"
head(km$cluster)  # before rename
km$cluster <- as.factor(km$cluster)
levels(km$cluster) <- c('C', 'A', 'E', 'B', 'D')
head(km$cluster)  # after rename
```

## 1.3 Cluster Visualization

We can see the time trends in the clusters means by plotting each cluster mean as a line.
The cluster means have to be reformatted into a data frame with columns `Cluster`, `Day` and `Mean`.
This is done using the `dplyr` package `gather()` command.
Then, we use the reformatted data frame and `ggplot` with `geom_line()` to make the plots.
Note our use of `facet_wrap()` to generate individual plots by `Cluster`.

```{r}
# reformatted
head(as.data.frame(km$centers))  # before reformatted
reformatted.df <- as.data.frame(km$centers) %>% 
  rownames_to_column("Cluster") %>% #  Make a new column called Cluster
  gather(key="Day",value="Mean", -Cluster) %>%
  convert(int(Day))  # convert Day to an integer
head(reformatted.df)  # after reformatted

# plotting each cluster mean
ggplot(reformatted.df,aes(x=Day, y=Mean, col=Cluster)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks=c(-8,-4,0,1,7,16,21,28)) +
  labs(title="Cluster Centers") +
  facet_wrap(Cluster ~.)  # Use facet_wrap to make a separate plot for each cluster

rm(reformatted.df)  # clean environment
```

We can also make separate heat map for mouse genes in each of the five clusters and put them vertically from `A` to `E` (top to bottom) by [Hearmap](https://github.com/jokergoo/ComplexHeatmap).
It's easy to see a shift of higher value from cluster `A` to `E`.

```{r}
heatmapList <- NULL
for (i in c('A', 'B', 'C', 'D', 'E')){
  heatmapList = heatmapList %v% Heatmap(
    Mouse.matrix[km$cluster==i,], name = i, 
    show_row_dend= FALSE, show_column_dend= FALSE, 
    show_row_names=FALSE, cluster_columns = FALSE,
  )
}
draw(heatmapList, row_title = "Mouse genes in Cluster A (top) to E (bottom)", column_title = "Day")

rm(heatmapList)  # clean environment
```

## 1.4 PCA Analysis

We visualize the cluster using a [biplot](https://en.wikipedia.org/wiki/Biplot) of two components generated by PCA which explain 74% of the variance.

```{r}
pca <- prcomp(Mouse.matrix, retx=TRUE, center=TRUE, scale=TRUE)
summary(pca)
```

We display the points in a `biplot`, colored according to the k-means cluster result. 
The projection of the points makes an interesting disc-type shape which is less dense in the middle, like a donut. 
We can see that the clusters are arranged in time order, or stage, around the disc, from `A` to `E`.

```{r}
t <- 1.2*max(abs(pca$x[,1:2]))  # x and y scale limits for the biplot
ggbiplot(
  pca,
  choices=c(1,2),  # Use PC1, PC2
  alpha=.1,        # Make dots transparent
  varname.adjust=1.5,  # Move variables names out a bit
  scale =0,       # Don't rescale data
  groups=as.factor(km$cluster)
) + ggtitle('Biplot for PC1 and PC2') + xlim(-t, t) + ylim(-t, t)
```

# 3. WOS Analysis of Zika

We read Zika genes and show them on the cluster biplot of the mouse model.

```{r}
# get sample data
disease.df <- read.csv("data/Zikamicrocephaly_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))

label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]

# biplot
ggplot() + 
  geom_point(data = myplot.df, 
             aes(x=PC1, y=PC2, color=cluster) ) +
  geom_segment(x=0, y=0, 
               aes(xend=3*pca$rotation[,1], yend=3*pca$rotation[,2]),
               arrow=arrow(length=unit(1/2,'picas')) ) + 
  geom_text(aes(x=3.3*pca$rotation[,1], 
                y=3.3*pca$rotation[,2], 
                label=c("-8","-4","0","1","7","16","21","28")), 
            size=4)
```

Def `cluster_pvals` to calculate pvalue and logodds of each cluster.
The code come from the Lab5 handout from Dr. Bennett.

```{r}
# Define cluster_pvals; DO NOT CHANGE!
cluster_pvals <- function(k, km, myplot.df) {
  # Inputs: k, km, myplot.df 
  # Returns: results (dataframe with clusters, pvalues, logodds)
  # Set the p-value and logodds to 0
  pvalue <- zeros(k,1)
  logodds <- zeros(k,1)
  results <- cbind.data.frame(cluster=1:k, pvalue, logodds)
  classdisease <- zeros(k,1)
  classall <- as.vector(table(km$cluster))
  # use dplyr to calculate counts for each cluster 
  temp <- myplot.df %>% 
    dplyr::group_by(cluster) %>% 
    dplyr::count(name="freq")  # Creates 'freq' column!
  classdisease[temp$cluster] <- temp$freq
  classlogodds <- zeros(k,2)
  totaldisease <- sum(classdisease)
  totalall <- sum(classall)
  # Calculate the log odds ratio for the disease
  for (i in 1:k) {
    n11 <- classdisease[i] +1                # genes in disease in cluster i 
    n21 <- totaldisease- classdisease[i] +1  # genes in disease not in cluster i
    n12 <- classall[i]-n11+1                 # genes not in disease and in cluster i
    n22 <- totalall- n11-n21 -n12+1;         # genes not in disease and not in cluster 
    res <- fisher.test(matrix(c(n11,n21,n12,n22), 2, 2))
    results[i,]$pvalue <- res$p.value
    results[i,]$logodds<- log((n11*n22)/(n12*n21))
  }
  return(results)
}
```

Next we use the help function `cluster_pvals` to calculte the log odds ratio of Zika disease for each cluster.

```{r}
# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)

# Helper function to determine enrichment
threshold <- 0.1  # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
  if ((p.value <= p.threshold) && (logodds > 0)) {
    return(TRUE)
  } 
  else {
    return(FALSE)
  }
}

# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)

# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"

# View results
kable(clusters)
```

From the result, we see the stage B is enriched since it has a positive log odds ratio and a p-value < 0.1.
Thus, for Zika disease, stage B is the most susceptible period base on the mouse model, which is same to the WOS in human model. 

# 4. WOS Analysis of Cognitive Disorder

Same process as above, we read cognitive disorder genes and show them on the cluster biplot of the mouse model.

```{r}
# get sample data
disease.df <- read.csv("data/Cognitive disorder_heat_map_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))

label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]

# biplot
ggplot()+
  geom_point(data = myplot.df, 
             aes(x=PC1, y=PC2, color=cluster) )+
  geom_segment(x=0, y=0, 
               aes(xend=3*pca$rotation[,1], yend=3*pca$rotation[,2]),
               arrow=arrow(length=unit(1/2,'picas')) )+ 
  geom_text(aes(x=3.3*pca$rotation[,1], 
                y=3.3*pca$rotation[,2], 
                label=c("-8","-4","0","1","7","16","21","28")), 
            size=4)
```

Next we use the help function "cluster_pvals" provides in lab 5 handout and defines above to calculte the log odds ratio of Rett syndrome disease for each cluster.

```{r}
# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)

# Helper function to determine enrichment
threshold <- 0.1  # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
  if ((p.value <= p.threshold) && (logodds > 0)) {
    return(TRUE)
  } 
  else {
    return(FALSE)
  }
}

# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)

# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"

# View results
kable(clusters)
```

The result shows that stages `D` and `E` are enriched, which is the most 
susceptible period for Cognitive Disorder. 
In human data, the last stage, upper layers, is enriched. 
Thus, the mice and human results are match: they both enriched at the end stage.

# 5. Conclusions

We got similar WOS result in human and mouse model for both Zika and cognitive 
disorder. 
For Zika virus, analysis in both models show the second stage is most 
susceptible. 
The last stage, the upper layers stage, is most susceptible for the cognitive 
disorders in human model where the stages `D` and `E` are most susceptible in 
mouse model, which is also end stage.

# 6. Acknowledgements

This is a mini project for the *MATP 4400 Data Introduction to Data Mathematics*
by Dr. [Kristin Bennett](https://www.linkedin.com/in/kristin-bennett-b337637/) 
and Dr. [John Erickson](https://www.linkedin.com/in/olyerickson/), Spring 2022, 
at [Rensselaer Polytechnic Institute](https://www.rpi.edu), Troy, NY. 
The background and goal of the project is given in *IDM Lab 5: Mini Project*. 
The structure of this report is revise on the basis of *PreLab 5* and *Lab 5* 
templates given by instructors. 
Dr. Bennett and Dr. Erickson gave perfect instruction about all the methods 
including K-meaning clustering, PCA, and log odds ratio analysis and R 
visualization techniques in MATP 4400.