Update mtp tables qc checks to include expression tpm tables #288

Merged 30 commits on Dec 1, 2022
950f41c
change the name form mutation-frequencies-table-checks to mtp-tables-…
adilahiri Nov 3, 2022
a56acc8
add long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz and long_n_t…
adilahiri Nov 3, 2022
bebe633
Adding the v10 files in the previous folder
adilahiri Nov 3, 2022
21db082
Reading the current and previous files
adilahiri Nov 3, 2022
ccbcd00
Change the directory
adilahiri Nov 4, 2022
5c007c6
adding the function
adilahiri Nov 4, 2022
6b5451e
Adding number of samples
adilahiri Nov 7, 2022
bc2418d
Add all cancer cohort
adilahiri Nov 7, 2022
ca2eebc
column changes
adilahiri Nov 7, 2022
c8daa58
Add mondo changes, ensmbl changes
adilahiri Nov 7, 2022
728f8c8
Remove the temp result files
adilahiri Nov 7, 2022
79d3aac
Adding a RMD file for running bash script
adilahiri Nov 7, 2022
a6aac3a
Cleaned the bash file and added the function rmd file
adilahiri Nov 8, 2022
4e204cb
Fixing the typo of current_tables vs current_table
adilahiri Nov 8, 2022
78720af
Run after correcting the typo
adilahiri Nov 8, 2022
e8e8085
remove some of the intermediate coding files
adilahiri Nov 8, 2022
a65e5d3
updating the readme
adilahiri Nov 9, 2022
f3b408f
Update the readme intro
adilahiri Nov 9, 2022
55fffd8
Merge branch 'dev' into HEAD
adilahiri Nov 9, 2022
65ea7e5
Update the branch
adilahiri Nov 9, 2022
83caa58
delete v10 files, to add the new files provided
adilahiri Nov 9, 2022
078cdb9
adding the v10 files from the link
adilahiri Nov 9, 2022
ecc198e
Run after v10 files
adilahiri Nov 9, 2022
76f7161
Remove the readme html
adilahiri Nov 9, 2022
561ec5d
Remove the script 2 html and rerun
adilahiri Nov 9, 2022
aa9d5a1
Updating the readme
adilahiri Nov 11, 2022
faba961
Update readme header
adilahiri Nov 11, 2022
9e1d8e9
review updates
ewafula Dec 1, 2022
accb255
review updates
ewafula Dec 1, 2022
5af2ad2
Merge branch 'dev' into update-mtp-tables-qc-checks
ewafula Dec 1, 2022
@@ -31,7 +31,7 @@ To run this from the command line, use:
Rscript -e "rmarkdown::render('01-frequencies-tables-checks.Rmd', clean = TRUE)"
```
_This assumes you are in the analysis module directory of the repository,_
-_OpenPedCan-analysis/analyses/mutation-frequencies-tables-checks._
+_OpenPedCan-analysis/analyses/mtp-tables-qc-checks._


### Load packages
@@ -50,7 +50,7 @@ suppressPackageStartupMessages(library(openxlsx))
```{r set output directory}
# directories for input and output files
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
-analyses_dir <- file.path(root_dir, "analyses", "mutation-frequencies-table-checks")
+analyses_dir <- file.path(root_dir, "analyses", "mtp-tables-qc-checks")
results_dir <- file.path(analyses_dir, "results")

# Create results folder if it doesn't exist
@@ -110,7 +110,7 @@ get_num_samples <- function(freq_df) {

# changes in common columns among MTP mutation frequencies between
# current and previous tables
-changes_in_columns <- function(current_tables, previous_table, column_name) {
+changes_in_columns <- function(current_table, previous_table, column_name) {
# values specific to current table
specific_to_current <- current_table %>%
dplyr::select(Dataset, column_name) %>%
@@ -266,7 +266,7 @@ all_cohorts_cancer_groups
#### Check gene symbols
```{r gene symbols}
changes_gene_symbols <-
-  changes_in_columns(current_tables, previous_table, "Gene_symbol")
+  changes_in_columns(current_table, previous_table, "Gene_symbol")

# write to file
changes_gene_symbols %>%
@@ -281,7 +281,7 @@ changes_gene_symbols
#### Check Ensembl IDs
```{r ensembl ids}
changes_ensembl_ids <-
-  changes_in_columns(current_tables, previous_table, "targetFromSourceId")
+  changes_in_columns(current_table, previous_table, "targetFromSourceId")

# write to file
changes_ensembl_ids %>%
@@ -296,7 +296,7 @@ changes_ensembl_ids
#### Check cancer groups
```{r cancer groups}
changes_cancer_groups <-
-  changes_in_columns(current_tables, previous_table, "Disease")
+  changes_in_columns(current_table, previous_table, "Disease")

# write to file
changes_cancer_groups %>%
@@ -311,7 +311,7 @@ changes_cancer_groups
#### Check EFO IDs
```{r efo ids}
changes_efo_ids <-
-  changes_in_columns(current_tables, previous_table, "diseaseFromSourceMappedId")
+  changes_in_columns(current_table, previous_table, "diseaseFromSourceMappedId")

# write to file
changes_efo_ids %>%
@@ -325,7 +325,7 @@ changes_efo_ids
#### Check MONDO IDs
```{r mondo ids}
changes_mondo_ids <-
-  changes_in_columns(current_tables, previous_table, "MONDO")
+  changes_in_columns(current_table, previous_table, "MONDO")

# write to file
changes_mondo_ids %>%
303 changes: 303 additions & 0 deletions analyses/mtp-tables-qc-checks/02-tpm-tables-checks.Rmd
@@ -0,0 +1,303 @@
---
title: "Gene Expression TPM Table Summary and QC Checks"
output:
  html_notebook:
    toc: TRUE
    toc_float: TRUE
    toc_depth: 4
author: Aditya Lahiri and Eric Wafula
date: 2022-11-07
params:
  current_table:
    label: "current expression tpm table"
    value: current/long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz
    input: file
  previous_table:
    label: "previous expression tpm table"
    value: previous/long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz
    input: file
---



### Purpose
Performs summary and QC checks comparing the current and the previous OpenPedCan
tpm tables.

### Usage

To run this from the command line, use:
```
Rscript -e "rmarkdown::render('02-tpm-tables-checks.Rmd', clean = TRUE)"
```
_This assumes you are in the analysis module directory of the repository,_
_OpenPedCan-analysis/analyses/mtp-tables-qc-checks._



### Load packages
```{r load packages}
# R packages
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(openxlsx))

# Magrittr pipe
`%>%` <- dplyr::`%>%`
```


### Set output directory
```{r set output directory}
# directories for input and output files
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
analyses_dir <- file.path(root_dir, "analyses", "mtp-tables-qc-checks")
results_dir <- file.path(analyses_dir, "results")

# Create results folder if it doesn't exist
if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}
```

### Functions
```{r functions}
# get number of samples in cohorts
get_num_samples <- function(tpm_df) {
  samples <- tpm_df %>%
    dplyr::filter(cohort != "All Cohorts") %>%
    dplyr::select(cohort, Disease, n_samples) %>%
    dplyr::distinct() %>%
    dplyr::select(-Disease) %>%
    dplyr::group_by(cohort) %>%
    # na.rm is an argument of sum(), not of as.integer()
    dplyr::summarise(n_samples = sum(as.integer(n_samples), na.rm = TRUE))
  return(samples)
}

# changes in common columns among MTP expression tpm tables between
# current and previous tables
changes_in_columns <- function(current_table, previous_table, column_name) {
  # values specific to current table
  # all_of() selects the column named by the character value in column_name
  specific_to_current <- current_table %>%
    dplyr::select(cohort, dplyr::all_of(column_name)) %>%
    dplyr::distinct() %>%
    dplyr::setdiff(previous_table %>%
                     dplyr::select(cohort, dplyr::all_of(column_name)) %>%
                     dplyr::distinct()) %>%
    dplyr::rename(current_dataset = cohort)
  # values specific to previous table
  specific_to_previous <- previous_table %>%
    dplyr::select(cohort, dplyr::all_of(column_name)) %>%
    dplyr::distinct() %>%
    dplyr::setdiff(current_table %>%
                     dplyr::select(cohort, dplyr::all_of(column_name)) %>%
                     dplyr::distinct()) %>%
    dplyr::rename(previous_dataset = cohort)
  # combine differences
  changes_in_columns <- specific_to_current %>%
    dplyr::full_join(specific_to_previous, by = column_name) %>%
    dplyr::select(dplyr::all_of(column_name), current_dataset, previous_dataset)
  return(changes_in_columns)
}
```
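
The comparison in `changes_in_columns()` can be tried on toy data. The block below is a minimal sketch, not part of the module: the two data frames and their values are invented for illustration, and only the column names `cohort` and `Gene_symbol` come from the tables above. It is fenced as plain `r`, so knitting the notebook does not execute it.

```r
library(dplyr)

# Toy tables (hypothetical values); column names mirror the tpm tables
current_toy <- data.frame(
  cohort = c("GMKF", "GMKF", "TARGET"),
  Gene_symbol = c("MYCN", "ALK", "TP53")
)
previous_toy <- data.frame(
  cohort = c("GMKF", "GMKF", "TARGET"),
  Gene_symbol = c("MYCN", "PHOX2B", "TP53")
)

# Rows present only in the current table
only_current <- dplyr::setdiff(current_toy, previous_toy) %>%
  dplyr::rename(current_dataset = cohort)

# Rows present only in the previous table
only_previous <- dplyr::setdiff(previous_toy, current_toy) %>%
  dplyr::rename(previous_dataset = cohort)

# One row per changed value; NA marks the table that lacks the value
toy_changes <- dplyr::full_join(only_current, only_previous,
                                by = "Gene_symbol") %>%
  dplyr::select(Gene_symbol, current_dataset, previous_dataset)
toy_changes
```

In this toy case `ALK` appears only in the current table and `PHOX2B` only in the previous one, so each gets a row with `NA` in the other table's dataset column.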


### Read in current and previous tpm files
```{r load files}
current_table <- data.table::fread(params$current_table,
                                   data.table = FALSE,
                                   showProgress = FALSE) %>%
  purrr::discard(~all(is.na(.)))

previous_table <- data.table::fread(params$previous_table,
                                    data.table = FALSE,
                                    showProgress = FALSE) %>%
  purrr::discard(~all(is.na(.)))
```
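
The `purrr::discard(~all(is.na(.)))` step drops every column that contains only missing values. A small sketch of that behavior, using a made-up data frame (`toy` and its columns are hypothetical):

```r
library(purrr)

# Column b is entirely NA; column c is only partly NA
toy <- data.frame(
  a = c(1, 2),
  b = c(NA, NA),
  c = c("x", NA)
)

# Keep only the columns that have at least one non-missing value
kept <- purrr::discard(toy, ~ all(is.na(.)))
names(kept)
```

Only `b` is removed; a column with some missing values, such as `c`, is kept.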


### First 50 records from a static cancer group (GMKF Neuroblastoma)
Records from a static cancer group should not change between the current and previous tpm tables. There should be no changes in the computed TPM summary statistics, but other fields in the table will likely change if annotations are updated.

#### Current table
```{r current table}
# first 50 GMKF Neuroblastoma records from current tpm table
# (arrange() without columns keeps the row order of the input file)
gmkf_nbl_current <- current_table %>%
  dplyr::filter(cohort == "GMKF", Disease == "Neuroblastoma") %>%
  dplyr::arrange() %>%
  head(n = 50)

# write to file
gmkf_nbl_current %>%
readr::write_tsv(file.path(results_dir, "current_group_static_cancer.tsv"))

# display
gmkf_nbl_current
```

#### Previous table

```{r previous table}
# first 50 GMKF Neuroblastoma records from previous tpm table
# (arrange() without columns keeps the row order of the input file)
gmkf_nbl_previous <- previous_table %>%
  dplyr::filter(cohort == "GMKF", Disease == "Neuroblastoma") %>%
  dplyr::arrange() %>%
  head(n = 50)

# write to file
gmkf_nbl_previous %>%
readr::write_tsv(file.path(results_dir, "previous_group_static_cancer.tsv"))

# display
gmkf_nbl_previous
```



### Number of samples in each cohort
```{r number of samples}
# get number of samples in each cohort
num_samples <- get_num_samples(current_table) %>%
  dplyr::rename(samples_in_current = n_samples) %>%
  dplyr::full_join(get_num_samples(previous_table) %>%
                     dplyr::rename(samples_in_previous = n_samples),
                   by = "cohort") %>%
  dplyr::select(cohort, samples_in_current, samples_in_previous)

# write to file
num_samples %>%
readr::write_tsv(file.path(results_dir, "number_of_samples.tsv"), na = "NA")

# display
num_samples
```
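
The per-cohort totals rely on each cohort and cancer group pair carrying a single `n_samples` value that is repeated on every gene row. The sketch below shows that counting idea on invented data (`toy_tpm` and all of its values are hypothetical); it collapses the repeated rows first and then sums per cohort.

```r
library(dplyr)

# Toy long-format table: n_samples repeats on every gene row of a group
toy_tpm <- data.frame(
  cohort = c("GMKF", "GMKF", "GMKF", "TARGET", "All Cohorts"),
  Disease = c("Neuroblastoma", "Neuroblastoma", "Ependymoma",
              "Neuroblastoma", "Neuroblastoma"),
  Gene_symbol = c("MYCN", "ALK", "MYCN", "MYCN", "MYCN"),
  n_samples = c(10, 10, 4, 7, 17)
)

toy_counts <- toy_tpm %>%
  # the pooled "All Cohorts" rows would double count samples
  dplyr::filter(cohort != "All Cohorts") %>%
  # one row per cohort and cancer group
  dplyr::distinct(cohort, Disease, n_samples) %>%
  dplyr::group_by(cohort) %>%
  dplyr::summarise(n_samples = sum(as.integer(n_samples), na.rm = TRUE))
toy_counts
```

Keeping `Disease` in the `distinct()` call matters: two cancer groups in one cohort that happen to have the same sample count are then still counted separately.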



### Cancer groups in the "All Cohorts" category
```{r all cohorts cancer groups}
# get All Cohorts cancer groups
all_cohorts_cancer_groups <- current_table %>%
  dplyr::select(cohort, Disease) %>%
  dplyr::filter(cohort == "All Cohorts") %>%
  dplyr::distinct() %>%
  dplyr::rename(current_cohort = cohort) %>%
  dplyr::full_join(previous_table %>%
                     dplyr::select(cohort, Disease) %>%
                     dplyr::filter(cohort == "All Cohorts") %>%
                     dplyr::distinct() %>%
                     dplyr::rename(previous_cohort = cohort),
                   by = "Disease") %>%
  dplyr::select(Disease, current_cohort, previous_cohort)

# write to file
all_cohorts_cancer_groups %>%
readr::write_tsv(file.path(results_dir,
"all_cohorts_cancer_groups.tsv"), na = "NA")

# display
all_cohorts_cancer_groups
```


### Changes in common columns across tpm tables

#### Check gene symbols
```{r gene symbols}
changes_gene_symbols <-
changes_in_columns(current_table, previous_table, "Gene_symbol")

# write to file
changes_gene_symbols %>%
readr::write_tsv(file.path(results_dir,
"changes_in_gene_symbols.tsv"), na = "NA")

# display
changes_gene_symbols
```



#### Check Ensembl IDs
```{r ensembl ids}
changes_ensembl_ids <-
changes_in_columns(current_table, previous_table, "Gene_Ensembl_ID")

# write to file
changes_ensembl_ids %>%
readr::write_tsv(file.path(results_dir,
"changes_in_ensembl_ids.tsv"), na = "NA")

# display
changes_ensembl_ids
```


#### Check cancer groups
```{r cancer groups}
changes_cancer_groups <-
changes_in_columns(current_table, previous_table, "Disease")

# write to file
changes_cancer_groups %>%
readr::write_tsv(file.path(results_dir,
"changes_in_cancer_groups.tsv"), na = "NA")

# display
changes_cancer_groups
```



#### Check EFO IDs
```{r efo ids}
changes_efo_ids <-
changes_in_columns(current_table, previous_table, "EFO")

# write to file
changes_efo_ids %>%
readr::write_tsv(file.path(results_dir, "changes_in_efo_ids.tsv"), na = "NA")

# display
changes_efo_ids
```


#### Check MONDO IDs
```{r mondo ids}
changes_mondo_ids <-
changes_in_columns(current_table, previous_table, "MONDO")

# write to file
changes_mondo_ids %>%
readr::write_tsv(file.path(results_dir, "changes_in_mondo_ids.tsv"), na = "NA")

# display
changes_mondo_ids
```


### Generate excel output for all QC checks and summaries
```{r}
# read all QC checks and summaries files
qc_files <- list.files(results_dir, pattern = "\\.tsv$")
qc_files_list <- lapply(qc_files, function(x) {
  # read each file
  qc_file_df <- readr::read_tsv(file.path(results_dir, x))
  return(qc_file_df)
})
# get file names; escape the dot and anchor the pattern so only the
# trailing extension is removed
qc_files_names <- gsub("\\.tsv$", "", qc_files)
names(qc_files_list) <- qc_files_names
# write results to excel workbook
qc_files_list %>%
  openxlsx::write.xlsx(file.path(results_dir,
                                 paste0(gsub("\\.tsv\\.gz$", "",
                                             basename(params$current_table)),
                                        ".xlsx")),
                       overwrite = TRUE, keepNA = TRUE, na.string = "NA")
```
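
When an extension is stripped with a regular expression, an unescaped `.` matches any character, so a pattern such as `".tsv"` can remove more than the trailing extension. A short base R sketch with made-up file names (the second name is contrived to trigger the problem):

```r
# Hypothetical file names, for illustration only
files <- c("number_of_samples.tsv", "xtsv_report.tsv")

# Unescaped dot: also removes "xtsv" at the start of the second name
loose <- gsub(".tsv", "", files)

# Escaped dot, anchored at the end: removes only the extension
strict <- sub("\\.tsv$", "", files)

loose
strict
```

`tools::file_path_sans_ext()` is a base R alternative for single extensions.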


```{r session info}
sessionInfo()
```