-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathreport.Rmd
455 lines (387 loc) · 16.8 KB
/
report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
---
title: "Windows of Susceptibility Analysis for Brain Diseases"
author: "Tianrui Qi"
date: "22 Feb 2022"
output:
html_document: default
word_document: default
pdf_document: default
urlcolor: blue
editor_options:
markdown:
wrap: sentence
---
```{r setup, include=FALSE}
# Required R package installation:
# These will install packages if they are not already installed
# Set the correct default repository
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)
if (!require("gplots")) {
install.packages("gplots")
library(gplots)
}
if (!require("ggplot2")) {
install.packages("ggplot2")
library(ggplot2)
}
if (!require("ggbiplot")) {
devtools::install_git("https://github.com/vqv/ggbiplot.git")
library(ggbiplot)
}
if (!require("ggbiplot")) {
install.packages("ggbiplot")
library(ggbiplot)
}
if (!require("ComplexHeatmap")) {
library(devtools)
install_github("jokergoo/ComplexHeatmap")
library(ComplexHeatmap)
}
if (!require("fields")) {
install.packages("fields")
library(fields)
}
if (!require("hablar")) {
install.packages("hablar")
library(hablar)
}
if (!require("knitr")) {
install.packages("knitr")
library(knitr)
}
if (!require("matlab")) {
install.packages("matlab")
library(matlab)
}
if (!require("tibble")) {
install.packages("tibble")
library(tibble)
}
if (!require("tidyr")) {
install.packages("tidyr")
library(tidyr)
}
knitr::opts_chunk$set(echo = TRUE)
```
# 1. Introduction
From [WHO](https://www.who.int/news-room/fact-sheets/detail/zika-virus), Zika
virus disease is caused by a virus transmitted primarily by Aedes mosquitoes,
which bite during the day.
Symptoms are generally mild and include fever, rash, conjunctivitis, muscle and
joint pain, malaise or headache.
Symptoms typically last for 2–7 days.
Most people with Zika virus infection do not develop symptoms.
Zika virus infection during pregnancy can cause infants to be born with
Microcephaly and other congenital malformations, known as congenital Zika
syndrome.
Infection with Zika virus is also associated with other complications of
pregnancy including preterm birth and miscarriage. An increased risk of
neurologic complications is associated with Zika virus infection in adults and
children, including Guillain-Barré syndrome, neuropathy and myelitis.
Researchers led by Dr. Bennett used existing data from in-vitro models of human
brain development based on embryonic stem cells to understand when embryos are
particularly susceptible to Zika infection resulting in Microcephaly.
These models produce the layers of the cerebral cortex in a petri dish.
The study is uses data from the [Cortecon](http://cortecon.neuralsci.org/) that
consists of RNA transcription data measured as the layers of the cerebral cortex
develop at nine different time points between 0 and 77 days.
An analysis of this data was published in a 2019 paper in the journal
[Nature](https://www.nature.com/articles/s41598-019-39318-8).
The prior study found that in humans that genes associated with microcephaly are
enriched for the Neuroectoderm Stage (Cluster 2).
Genes associated with microcephaly and changed by Zika infection have an even
stronger enrichment.
This suggests that human embryos are particularly susceptible to Zika-induced
microcephaly in the first trimester, possibly before the mother even know she is
pregnant.
Clustering was used to identify the stages of development in humans.
Principal component analysis (PCA), data visualization, and log odds ratio
analysis were used to show when these stages occur.
An advanced visualization, called a Susceptibility Window Ontological
Transcription (SWOT) Clock, is available at
[here](https://github.com/mpoegel/SemNExT-Visualizations).
Given a set of genes associated with a disease, the SWOT clock can be used to
help identify time periods of cortical development that may be susceptible to
changes in those genes.
Recall each cluster of gene represents a distinct stage of development, if a set
of genes occurs more than expected (a.k.a enriched) for a given stages then it
is likely that brain development may be impacted by changes in the activities of
those genes during that stage.
The SWOT Clock provides a visualization and a statistical analysis of the log
odds ratio of each cluster along with its p-value.
Since the set of disease genes are significantly enriched in a cluster, then the
time period associated with that cluster is a window of susceptibility.
A cluster is significantly enriched if it has a positive log odds ratio and a
p-value < 0.1.
Perform the windows of susceptibility (WOS) analysis based on mouse data from a
similar brain-in-a-dish model for mice instead of human data used by
Dr. Bennett.
Analyze the same sets of microcephaly-associated genes and Zika-associated genes
to see if we can detect a similar WOS for Microcephaly and Zika-induced
microcephaly in mice as in humans.
Same technique also applied to cognitive disorders, anther disease that impacts
mental function, to see if we get a similar result.
# 2. Stages of Development Analysis
## 1.1 Preparation of the Mouse Homologs Data
Scott the Scientist did an analysis of RNA-Seq data from the development of the mouse cortex at days $-8$, $-4$, $0$ , $1$, $7$, $16$, $21$, and $26$ taken from [Allen Brain Map Developing Mouse Brain Atlas](http://developingmouse.brain-map.org).
This type of data is know as time series data since the features are taken through time.
We begins by reading in the data and preparing the data frame.
The columns are days at which the samples are collected where `DayNegN`, `Day0`, and `DayPosN` means $N$ days before birth, day of birth, and $N$ days after bith.
The entries in the columns are the amount of RNA for each gene detected on that day in the mouse embryo cerebral cortex.
We then change the columns' name to their actual meaning and create a matrix for future analysis.
```{r}
# read data as data frame
Mouse.df <- read.csv("data/MouseHomologData.csv", row.names = 1)
head(Mouse.df) # preview of the data frame
# change column name
colnames(Mouse.df) <- c("-8", "-4", "0", "1", "7", "16", "21", "28")
head(Mouse.df) # preview of the data frame
# create matrix
Mouse.matrix <- as.matrix(Mouse.df)
```
Note that the data already been scaled so that the analysis can focus on the shape of the time series rather than specific magnitudes.
We can confirm the scaling is successful by calculating the row means and making sure their norm was near 0.
```{r}
norm(rowMeans(Mouse.matrix))
```
## 1.2 K-means Clustering
We used [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) to create five clusters based on domain knowledge: biologists believe there are five stages of brain development, so we select five clusters.
```{r}
# K-means clustering
set.seed(300)
km <- kmeans(Mouse.matrix, 5)
```
Then, examining the heat map of the K-means cluster centers, we can see that each cluster corresponds to different average peaks of gene expressions.
```{r}
# plot the heat map
heatmap.2(
x = km$centers, Colv = FALSE,
dendrogram = "none", trace ="none",
main = "K-means Cluster Centers",
)
```
Note that the order of the cluster is `2`, `4`, `1`, `5`, `3` from the heat map.
Thus, we rearrange the order of cluster to `2`, `4`, `1`, `5`, `3` and rename them as `A`, `B`, `C`, `D`, `E`.
```{r}
# reorder and rename the "km$centers"
km$centers # before reorder and rename
km$centers <- km$centers[c('2', '4', '1', '5', '3'), ]
rownames(km$centers) <- c('A', 'B', 'C', 'D', 'E')
km$centers # after reorder and rename
# rename the "km$cluster"
head(km$cluster) # before rename
km$cluster <- as.factor(km$cluster)
levels(km$cluster) <- c('C', 'A', 'E', 'B', 'D')
head(km$cluster) # after rename
```
## 1.3 Cluster Visualization
We can see the time trends in the clusters means by plotting each cluster mean as a line.
The cluster means have to be reformatted into a data frame with columns `Cluster`, `Day` and `Mean`.
This is done using the `dplyr` package `gather()` command.
Then, we use the reformatted data frame and `ggplot` with `geom_line()` to make the plots.
Note our use of `facet_wrap()` to generate individual plots by `Cluster`.
```{r}
# reformatted
head(as.data.frame(km$centers)) # before reformatted
reformatted.df <- as.data.frame(km$centers) %>%
rownames_to_column("Cluster") %>% # Make a new column called Cluster
gather(key="Day",value="Mean", -Cluster) %>%
convert(int(Day)) # convert Day to an integer
head(reformatted.df) # after reformatted
# plotting each cluster mean
ggplot(reformatted.df,aes(x=Day, y=Mean, col=Cluster)) +
geom_line() + geom_point() +
scale_x_continuous(breaks=c(-8,-4,0,1,7,16,21,28)) +
labs(title="Cluster Centers") +
facet_wrap(Cluster ~.) # Use facet_wrap to make a separate plot for each cluster
rm(reformatted.df) # clean environment
```
We can also make separate heat map for mouse genes in each of the five clusters and put them vertically from `A` to `E` (top to bottom) by [Hearmap](https://github.com/jokergoo/ComplexHeatmap).
It's easy to see a shift of higher value from cluster `A` to `E`.
```{r}
heatmapList <- NULL
for (i in c('A', 'B', 'C', 'D', 'E')){
heatmapList = heatmapList %v% Heatmap(
Mouse.matrix[km$cluster==i,], name = i,
show_row_dend= FALSE, show_column_dend= FALSE,
show_row_names=FALSE, cluster_columns = FALSE,
)
}
draw(heatmapList, row_title = "Mouse genes in Cluster A (top) to E (bottom)", column_title = "Day")
rm(heatmapList) # clean environment
```
## 1.4 PCA Analysis
We visualize the cluster using a [biplot](https://en.wikipedia.org/wiki/Biplot) of two components generated by PCA which explain 74% of the variance.
```{r}
pca <- prcomp(Mouse.matrix, retx=TRUE, center=TRUE, scale=TRUE)
summary(pca)
```
We display the points in a `biplot`, colored according to the k-means cluster result.
The projection of the points makes an interesting disc-type shape which is less dense in the middle, like a donut.
We can see that the clusters are arranged in time order, or stage, around the disc, from `A` to `E`.
```{r}
t <- 1.2*max(abs(pca$x[,1:2])) # x and y scale limits for the biplot
ggbiplot(
pca,
choices=c(1,2), # Use PC1, PC2
alpha=.1, # Make dots transparent
varname.adjust=1.5, # Move variables names out a bit
scale =0, # Don't rescale data
groups=as.factor(km$cluster)
) + ggtitle('Biplot for PC1 and PC2') + xlim(-t, t) + ylim(-t, t)
```
# 3. WOS Analysis of Zika
We read Zika genes and show them on the cluster biplot of the mouse model.
```{r}
# get sample data
disease.df <- read.csv("data/Zikamicrocephaly_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))
label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]
# biplot
ggplot() +
geom_point(data = myplot.df,
aes(x=PC1, y=PC2, color=cluster) ) +
geom_segment(x=0, y=0,
aes(xend=3*pca$rotation[,1], yend=3*pca$rotation[,2]),
arrow=arrow(length=unit(1/2,'picas')) ) +
geom_text(aes(x=3.3*pca$rotation[,1],
y=3.3*pca$rotation[,2],
label=c("-8","-4","0","1","7","16","21","28")),
size=4)
```
Def `cluster_pvals` to calculate pvalue and logodds of each cluster.
The code come from the Lab5 handout from Dr. Bennett.
```{r}
# Define cluster_pvals; DO NOT CHANGE!
cluster_pvals <- function(k, km, myplot.df) {
# Inputs: k, km, myplot.df
# Returns: results (dataframe with clusters, pvalues, logodds)
# Set the p-value and logodds to 0
pvalue <- zeros(k,1)
logodds <- zeros(k,1)
results <- cbind.data.frame(cluster=1:k, pvalue, logodds)
classdisease <- zeros(k,1)
classall <- as.vector(table(km$cluster))
# use dplyr to calculate counts for each cluster
temp <- myplot.df %>%
dplyr::group_by(cluster) %>%
dplyr::count(name="freq") # Creates 'freq' column!
classdisease[temp$cluster] <- temp$freq
classlogodds <- zeros(k,2)
totaldisease <- sum(classdisease)
totalall <- sum(classall)
# Calculate the log odds ratio for the disease
for (i in 1:k) {
n11 <- classdisease[i] +1 # genes in disease in cluster i
n21 <- totaldisease- classdisease[i] +1 # genes in disease not in cluster i
n12 <- classall[i]-n11+1 # genes not in disease and in cluster i
n22 <- totalall- n11-n21 -n12+1; # genes not in disease and not in cluster
res <- fisher.test(matrix(c(n11,n21,n12,n22), 2, 2))
results[i,]$pvalue <- res$p.value
results[i,]$logodds<- log((n11*n22)/(n12*n21))
}
return(results)
}
```
Next we use the help function `cluster_pvals` to calculte the log odds ratio of Zika disease for each cluster.
```{r}
# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)
# Helper function to determine enrichment
threshold <- 0.1 # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
if ((p.value <= p.threshold) && (logodds > 0)) {
return(TRUE)
}
else {
return(FALSE)
}
}
# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)
# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"
# View results
kable(clusters)
```
From the result, we see the stage B is enriched since it has a positive log odds ratio and a p-value < 0.1.
Thus, for Zika disease, stage B is the most susceptible period base on the mouse model, which is same to the WOS in human model.
# 4. WOS Analysis of Cognitive Disorder
Same process as above, we read cognitive disorder genes and show them on the cluster biplot of the mouse model.
```{r}
# get sample data
disease.df <- read.csv("data/Cognitive disorder_heat_map_data.csv",row.names = 1)
disease_symbols <- intersect(as.character(disease.df$symbol), as.character(rownames(Mouse.df)))
label <- as.factor(km$cluster)
levels(label) <- c('C', 'A', 'E', 'B', 'D')
plot.df <- cbind.data.frame(pca$x, cluster=label)
myplot.df <- plot.df[disease_symbols,]
# biplot
ggplot()+
geom_point(data = myplot.df,
aes(x=PC1, y=PC2, color=cluster) )+
geom_segment(x=0, y=0,
aes(xend=3*pca$rotation[,1], yend=3*pca$rotation[,2]),
arrow=arrow(length=unit(1/2,'picas')) )+
geom_text(aes(x=3.3*pca$rotation[,1],
y=3.3*pca$rotation[,2],
label=c("-8","-4","0","1","7","16","21","28")),
size=4)
```
Next we use the help function "cluster_pvals" provides in lab 5 handout and defines above to calculte the log odds ratio of Rett syndrome disease for each cluster.
```{r}
# Apply cluster_pvals using the parameters just generated
clusters <- cluster_pvals(5, km, myplot.df)
# Helper function to determine enrichment
threshold <- 0.1 # Normally set to 0.1
enriched <- function(p.value,logodds,p.threshold=0.1) {
if ((p.value <= p.threshold) && (logodds > 0)) {
return(TRUE)
}
else {
return(FALSE)
}
}
# Evaluate across our results; create new column
clusters$enriched <- mapply(enriched, clusters$pvalue, clusters$logodds,threshold)
# rename the cluster name of the output matrix
clusters$cluster[clusters$cluster==2]<-"A"
clusters$cluster[clusters$cluster==4]<-"B"
clusters$cluster[clusters$cluster==1]<-"C"
clusters$cluster[clusters$cluster==5]<-"D"
clusters$cluster[clusters$cluster==3]<-"E"
# View results
kable(clusters)
```
The result shows that stages `D` and `E` are enriched, which is the most
susceptible period for Cognitive Disorder.
In human data, the last stage, upper layers, is enriched.
Thus, the mice and human results are match: they both enriched at the end stage.
# 5. Conclusions
We got similar WOS result in human and mouse model for both Zika and cognitive
disorder.
For Zika virus, analysis in both models show the second stage is most
susceptible.
The last stage, the upper layers stage, is most susceptible for the cognitive
disorders in human model where the stages `D` and `E` are most susceptible in
mouse model, which is also end stage.
# 6. Acknowledgements
This is a mini project for the *MATP 4400 Data Introduction to Data Mathematics*
by Dr. [Kristin Bennett](https://www.linkedin.com/in/kristin-bennett-b337637/)
and Dr. [John Erickson](https://www.linkedin.com/in/olyerickson/), Spring 2022,
at [Rensselaer Polytechnic Institute](https://www.rpi.edu), Troy, NY.
The background and goal of the project is given in *IDM Lab 5: Mini Project*.
The structure of this report is revise on the basis of *PreLab 5* and *Lab 5*
templates given by instructors.
Dr. Bennett and Dr. Erickson gave perfect instruction about all the methods
including K-meaning clustering, PCA, and log odds ratio analysis and R
visualization techniques in MATP 4400.