-
Notifications
You must be signed in to change notification settings - Fork 0
/
Inspecting_the_matrix.Rmd
212 lines (166 loc) · 6.88 KB
/
Inspecting_the_matrix.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
---
title: "Inspecting our data: skewness, NA values"
author: "Florian Huber"
date: "`r Sys.Date()`"
output:
html_document:
fig_caption: yes
fig_height: 6
fig_width: 6
number_sections: yes
toc: yes
toc_depth: 2
toc_float: yes
editor_options:
chunk_output_type: console
---
# Setup, library loading
```{r setup, message = FALSE}
source("./setup.R")
outdir <- "./data/programmatic_output"
library(e1071)
```
# Understanding the data
One should probably check a bit the behaviour of our variables: how skewed are
the distributions, are certain drugs or conditions far off, are NA values
clustered in certain genes/drugs?
Take `m_all` from DataLoading_FeaturePreselect.Rmd, which is the matrix after
filtering and quality control steps.
```{r}
indir <- "./data/programmatic_output"
m_all <- readRDS(file.path(indir, "m_all.rds"))
m_signif_all <- readRDS(file.path(indir, "m_signif_all.rds"))
drugs_complete <- readRDS(file.path(indir, "drugs_complete.rds"))
```
## Predictor skewness
Let's check skewness of predictors first.
```{r}
skewness_vals <- sapply(m_all[, 4:ncol(m_all)], e1071::skewness,
na.rm = TRUE)
(skewness_vals <- tibble(gene_synonym = names(skewness_vals),
skewness = skewness_vals))
# indicate outlier boundaries using boxplot.stats
my_boundaries <- get_outlier_boundaries(skewness_vals$skewness)
skewness_vals$is_outlier <-
skewness_vals$skewness <= my_boundaries$lower_bound |
skewness_vals$skewness >= my_boundaries$upper_bound
my_lower <- my_boundaries$lower_bound
my_upper <- my_boundaries$upper_bound
ggplot(skewness_vals, aes(x = skewness)) +
geom_histogram(binwidth = 0.1) +
geom_vline(xintercept = c(my_lower, my_upper), linetype = "dotted") +
labs(title = "Skewness of predictors (mutants) across conditions\n(outliers indicated acc. to boxplot.stats())")
ggsave(filename = "./plots/Predictor_skewness_distribution.pdf")
```
We have a number of featrues that are more skewed than the others: indicated by
their skewness values again being outliers in the skewness distribution of all
predictors.
```{r}
# skewness values are mostly negative, which makes sense, considering the
# nature of our data but there are 141 outliers
filter(skewness_vals, is_outlier) %>%
arrange(skewness)
most_neg <- with(skewness_vals, gene_synonym[which.min(skewness)])
```
`r most_neg` has the most negative skewness value, let's have a look:
```{r}
# a few seem to be quite bad, e.g. "PLDA"
hist(m_all[[most_neg]], breaks = 100, main = paste0("Histogram of s-scores
for ", most_neg))
pdf(paste0("./plots/Sscore_dist_", most_neg, ".pdf"))
hist(m_all[[most_neg]], breaks = 100, main = paste0("Histogram of s-scores
for ", most_neg))
dev.off()
```
It looks like a handful of conditions with very low s-scores might be responsible
for the "skewness" of the distribution:
```{r}
e1071::skewness(m_all[[most_neg]], na.rm = TRUE)
my_boundaries <- get_outlier_boundaries(m_all[[most_neg]])
# skewness metric without outliers
e1071::skewness(my_boundaries$input_data_noout)
```
Skewness may be a way of finding such "selective" genes.
```{r}
dplyr::select(m_all, drugname_typaslab, conc, !!most_neg) %>%
filter(!!sym(most_neg) <= my_boundaries$lower_bound)
# significant interactions:
dplyr::select(m_signif_all, drugname_typaslab, conc, !!most_neg) %>%
filter(!!sym(most_neg))
```
Removing the outliers (according to `boxplot.stats()`) from each predictor
makes the skewness distribution a lot more "normal":
```{r}
# remove for each gene those conditions that are considered outliers, then
# recompute skewness
skewness_vals$skewness_vals_noout <-
sapply(colnames(m_all[, 4:ncol(m_all)]), function(x) {
noout <- get_outlier_boundaries(m_all[[x]])$input_data_noout
return(e1071::skewness(noout, na.rm = TRUE))
})
my_boundaries <- get_outlier_boundaries(skewness_vals$skewness_vals_noout)
my_lower_new <- my_boundaries$lower_bound
my_upper_new <- my_boundaries$upper_bound
ggplot(skewness_vals, aes(x = skewness_vals_noout)) +
geom_histogram(binwidth = 0.1) +
geom_vline(xintercept = c(my_lower, my_upper), linetype = "dotted") +
geom_vline(xintercept = c(my_lower_new, my_upper_new), linetype = "dashed") +
labs(title = "Skewness of predictors (mutants) across conditions:
\n(this time with outliers removed, dashed lines = new boundaries)")
ggsave("./plots/Predictor_skewness_distribution_noout.pdf")
# seems like there are a few skewed conditions left, perhaps worth checking them:
sort(boxplot.stats(skewness_vals$skewness_vals_noout)$out)
# let's have a look at the respective s-score histograms/boxplots
skewness_vals$to_check <-
skewness_vals$skewness_vals_noout <= my_lower_new |
skewness_vals$skewness_vals_noout >= my_upper_new
(genes_to_check <- skewness_vals$gene_synonym[skewness_vals$to_check])
skewness_vals_melted <-
dplyr::select(m_all, drugname_typaslab, conc, genes_to_check) %>%
gather(genes_to_check, key = "gene_synonym", value = "s_score") %>%
left_join(skewness_vals[, c("gene_synonym", "skewness", "skewness_vals_noout")])
# add information about MoA:
moa_info <- drugs_complete[, c("drugname_typaslab", "process_broad")]
skewness_vals_melted <- left_join(skewness_vals_melted, moa_info)
annot_helper <- skewness_vals_melted %>%
group_by(gene_synonym) %>%
summarise(x = 0, y = 40, skewness = mean(skewness),
skewness_vals_noout = mean(skewness_vals_noout))
```
These are the most skewed distributions after outlier removal. Some of them are
really messed up, for example OXYR.
```{r}
p <- ggplot(skewness_vals_melted, aes(x = s_score)) +
geom_histogram() +
facet_wrap( ~ gene_synonym, scales = "free") +
geom_text(data = annot_helper, aes(label = round(skewness, digits = 2),
x = x, y = y), color = "blue", size = 3)
ggsave(plot = p, filename = "./plots/Sscore_dists_skewed_features.pdf",
width = 12, height = 12)
```
Pair plots of sscores of the skewed features.
```{r}
all_by_all <-
full_join(skewness_vals_melted, skewness_vals_melted,
by = c("drugname_typaslab" = "drugname_typaslab", "conc" = "conc",
"process_broad" = "process_broad"))
# plot_pairplots takes the data frame above and plots all by all combinations
# of the genes of interest
plot_pairplots(c("MRCB", "CYSB"))
plot_pairplots(c("MRCB", "CYSB", "CARB_and_11_more"))
```
Perhaps these are features that can be used to isolate certain (sub)groups of
drugs? - would be interesting to test. Another interesting observation: s-scores
can fluctuate a lot such as ACRA and novobiocin. By the way, my impression was
that s-scores _do_ go down with higher concentrations even though in an earlier
analysis I couldn't see a clear trend.
```{r}
filter(skewness_vals_melted, gene_synonym == "ACRA_and_2_more",
drugname_typaslab == "NOVOBIOCIN")
dplyr::select(m_signif_all, drugname_typaslab, conc, ACRA_and_2_more) %>%
filter(drugname_typaslab == "NOVOBIOCIN")
```
```{r, session_info}
R.version
sessionInfo()
```