Inspecting_the_matrix.Rmd

---
title: "Inspecting our data: skewness, NA values"
author: "Florian Huber"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: yes
    fig_height: 6
    fig_width: 6
    number_sections: yes
    toc: yes
    toc_depth: 2
    toc_float: yes
editor_options: 
  chunk_output_type: console
---

# Setup, library loading

```{r setup, message = FALSE}
source("./setup.R")
outdir <- "./data/programmatic_output"

library(e1071)
```


# Understanding the data

One should probably check a bit the behaviour of our variables: how skewed are 
the distributions, are certain drugs or conditions far off, are NA values 
clustered in certain genes/drugs?

Take `m_all` from DataLoading_FeaturePreselect.Rmd, which is the matrix after 
filtering and quality control steps. 

```{r}
indir <- "./data/programmatic_output"
m_all <- readRDS(file.path(indir, "m_all.rds"))
m_signif_all <- readRDS(file.path(indir, "m_signif_all.rds"))
drugs_complete <- readRDS(file.path(indir, "drugs_complete.rds"))
```

## Predictor skewness

Let's check skewness of predictors first.

```{r}
skewness_vals <- sapply(m_all[, 4:ncol(m_all)], e1071::skewness, 
  na.rm = TRUE)
(skewness_vals <- tibble(gene_synonym = names(skewness_vals), 
  skewness = skewness_vals))

# indicate outlier boundaries using boxplot.stats
my_boundaries <- get_outlier_boundaries(skewness_vals$skewness)
skewness_vals$is_outlier <- 
   skewness_vals$skewness <= my_boundaries$lower_bound | 
   skewness_vals$skewness >= my_boundaries$upper_bound

my_lower <- my_boundaries$lower_bound
my_upper <- my_boundaries$upper_bound

ggplot(skewness_vals, aes(x = skewness)) + 
   geom_histogram(binwidth = 0.1) + 
   geom_vline(xintercept = c(my_lower, my_upper), linetype = "dotted") + 
   labs(title = "Skewness of predictors (mutants) across conditions\n(outliers indicated acc. to boxplot.stats())")

ggsave(filename = "./plots/Predictor_skewness_distribution.pdf")
```

We have a number of featrues that are more skewed than the others: indicated by 
their skewness values again being outliers in the skewness distribution of all 
predictors.

```{r}
# skewness values are mostly negative, which makes sense, considering the 
# nature of our data but there are 141 outliers
filter(skewness_vals, is_outlier) %>%
   arrange(skewness)

most_neg <- with(skewness_vals, gene_synonym[which.min(skewness)])
```

`r most_neg` has the most negative skewness value, let's have a look:

```{r}
# a few seem to be quite bad, e.g. "PLDA"
hist(m_all[[most_neg]], breaks = 100, main = paste0("Histogram of s-scores 
  for ", most_neg))

pdf(paste0("./plots/Sscore_dist_", most_neg, ".pdf"))
hist(m_all[[most_neg]], breaks = 100, main = paste0("Histogram of s-scores 
  for ", most_neg))
dev.off()
```

It looks like a handful of conditions with very low s-scores might be responsible 
for the "skewness" of the distribution:

```{r}
e1071::skewness(m_all[[most_neg]], na.rm = TRUE)
my_boundaries <- get_outlier_boundaries(m_all[[most_neg]])
# skewness metric without outliers
e1071::skewness(my_boundaries$input_data_noout)
```

Skewness may be a way of finding such "selective" genes. 

```{r}
dplyr::select(m_all, drugname_typaslab, conc, !!most_neg) %>%
   filter(!!sym(most_neg) <= my_boundaries$lower_bound)

# significant interactions:
dplyr::select(m_signif_all, drugname_typaslab, conc, !!most_neg) %>%
   filter(!!sym(most_neg))
```

Removing the outliers (according to `boxplot.stats()`) from each predictor 
makes the skewness distribution a lot more "normal":

```{r}
# remove for each gene those conditions that are considered outliers, then 
# recompute skewness
skewness_vals$skewness_vals_noout <- 
   sapply(colnames(m_all[, 4:ncol(m_all)]), function(x) {
      noout <- get_outlier_boundaries(m_all[[x]])$input_data_noout
      return(e1071::skewness(noout, na.rm = TRUE))
})

my_boundaries <- get_outlier_boundaries(skewness_vals$skewness_vals_noout)
my_lower_new <- my_boundaries$lower_bound
my_upper_new <- my_boundaries$upper_bound

ggplot(skewness_vals, aes(x = skewness_vals_noout)) + 
   geom_histogram(binwidth = 0.1) + 
   geom_vline(xintercept = c(my_lower, my_upper), linetype = "dotted") + 
   geom_vline(xintercept = c(my_lower_new, my_upper_new), linetype = "dashed") + 
   labs(title = "Skewness of predictors (mutants) across conditions:
        \n(this time with outliers removed, dashed lines = new boundaries)")

ggsave("./plots/Predictor_skewness_distribution_noout.pdf")
# seems like there are a few skewed conditions left, perhaps worth checking them:
sort(boxplot.stats(skewness_vals$skewness_vals_noout)$out)

# let's have a look at the respective s-score histograms/boxplots
skewness_vals$to_check <- 
   skewness_vals$skewness_vals_noout <= my_lower_new | 
   skewness_vals$skewness_vals_noout >= my_upper_new

(genes_to_check <- skewness_vals$gene_synonym[skewness_vals$to_check])
skewness_vals_melted <- 
   dplyr::select(m_all, drugname_typaslab, conc, genes_to_check) %>%
   gather(genes_to_check, key = "gene_synonym", value = "s_score") %>%
   left_join(skewness_vals[, c("gene_synonym", "skewness", "skewness_vals_noout")])

# add information about MoA:
moa_info <- drugs_complete[, c("drugname_typaslab", "process_broad")]
skewness_vals_melted <- left_join(skewness_vals_melted, moa_info)

annot_helper <- skewness_vals_melted %>%
   group_by(gene_synonym) %>%
   summarise(x = 0, y = 40, skewness = mean(skewness), 
     skewness_vals_noout = mean(skewness_vals_noout))
```

These are the most skewed distributions after outlier removal. Some of them are 
really messed up, for example OXYR. 

```{r}
p <- ggplot(skewness_vals_melted, aes(x = s_score)) + 
   geom_histogram() + 
   facet_wrap( ~ gene_synonym, scales = "free") + 
   geom_text(data = annot_helper, aes(label = round(skewness, digits = 2), 
     x = x, y = y), color = "blue", size = 3)

ggsave(plot = p, filename = "./plots/Sscore_dists_skewed_features.pdf", 
  width = 12, height = 12)
```

Pair plots of sscores of the skewed features. 

```{r}
all_by_all <- 
   full_join(skewness_vals_melted, skewness_vals_melted, 
     by = c("drugname_typaslab" = "drugname_typaslab", "conc" = "conc", 
       "process_broad" = "process_broad"))

# plot_pairplots takes the data frame above and plots all by all combinations 
# of the genes of interest

plot_pairplots(c("MRCB", "CYSB"))
plot_pairplots(c("MRCB", "CYSB", "CARB_and_11_more"))
```

Perhaps these are features that can be used to isolate certain (sub)groups of 
drugs? - would be interesting to test. Another interesting observation: s-scores 
can fluctuate a lot such as ACRA and novobiocin. By the way, my impression was 
that s-scores _do_ go down with higher concentrations even though in an earlier 
analysis I couldn't see a clear trend. 

```{r}
filter(skewness_vals_melted, gene_synonym == "ACRA_and_2_more", 
  drugname_typaslab == "NOVOBIOCIN")
dplyr::select(m_signif_all, drugname_typaslab, conc, ACRA_and_2_more) %>%
   filter(drugname_typaslab == "NOVOBIOCIN")
```


```{r, session_info}
R.version
sessionInfo()
```