Long-format table utils

This analysis is DEPRECATED and was last run with OpenPedCan data release v12.

Module author: Yuanchao Zhang (@logstar)

Long-format table utils
- Purpose
- Methods
  - Update downloaded data that are used in this module
  - Add gene and cancer_group annotations

Purpose

Create an application programming interface (API) and a command line interface (CLI) for handling long-format tables that are generated by analysis modules. The API provides analysis module developers with functions that can be imported into their own scripts via R source('path/to/the/function/file.R') or Python import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py. The CLI provides analysis module developers with scripts that can be executed in their own run-module shell scripts with either Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv or python3 path/to/the/script.py arg long.tsv long_edited.tsv.

This module was suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, to reduce the burden on analysis module developers of adding annotations and keeping track of which annotations need to be added. This module could also handle large file storage issues at a later point, since GitHub limits file size to 100MB.

Sub-module name	Implemented function	Available interface(s)
`annotator`	Add gene and `cancer_group` annotations	R API, R CLI

API and CLI usages and descriptions are in the Methods section.

Methods

Update downloaded data that are used in this module

Run the following command to update downloaded data that are used in this module.

bash run-update-long-format-table-utils.sh

The run-update-long-format-table-utils.sh script runs the data downloading scripts in sub-modules, e.g. annotator/run-download-annotation-data.sh.

Use git diff --stat to check for data file changes after the update.

The following table lists the data files that need to be updated by the maintainer of this module.

Data file	Date of the last update	Update method	Data version(s)
`annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv`	08/20/2021	`bash run-update-long-format-table-utils.sh`	MyGene version on the last update date
`annotator/annotation-data/oncokb-cancer-gene-list.tsv`	07/16/2021	Manually download from https://www.oncokb.org/cancerGenes	07/16/2021

Note on MyGene version: MyGene releases are built regularly using data from various sources, e.g. Ensembl, NCBI, and UCSC. In each release note, updated data sources are listed, e.g. Ensembl gene is updated from 103 to 104 in Build version 20210510. The annotator/download-annotation-data.R script uses the R mygene package 1.22.0 to query Gene_Ensembl_ID values to retrieve Gene_full_name and Protein_RefSeq_ID values. The R mygene package 1.22.0 uses MyGene v3 API with default_url <- "http://mygene.info/v3" specified at line 8 of mygene.R, even though the R mygene package 1.22.0 documentation says that v2 API is used.

Note on using MyGene instead of biomaRt: MyGene has more Gene_Ensembl_IDs that have Gene_full_name or Protein_RefSeq_ID available than biomaRt, and more details are discussed at #55 (comment). However, biomaRt allows users to specify Ensembl (GENCODE) versions with biomaRt::useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = 90). If biomaRt is preferred at a later point, use the code at #55 (comment) as a starting point for updating annotator/download-annotation-data.R.

Add gene and `cancer_group` annotations

Implementation of long-format table annotator

Check that input long-format tables have the columns required for adding the requested annotation columns.

Column name	Required for adding which annotation column(s)	Description
`Gene_symbol`	`Gene_type`, `OncoKB_cancer_gene`, `OncoKB_oncogene_TSG`	HUGO symbols, e.g. PHLPP1, TM6SF1, and DNAH5.
`Gene_Ensembl_ID`	`PMTL`, `Gene_full_name`, `Protein_RefSeq_ID`	Ensembl ENSG IDs without `.#` versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710
`Disease`	`EFO`, `MONDO`	The `cancer_group` in the `histologies.tsv`, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma
`GTEx_tissue_group`	`GTEx_tissue_group_UBERON`	The `gtex_group` in the `histologies.tsv`, e.g. Adipose, Kidney, and Thyroid
`GTEx_tissue_subgroup`	`GTEx_tissue_subgroup_UBERON`	The `gtex_subgroup` in the `histologies.tsv`, e.g. Adipose - Subcutaneous, Kidney - Cortex, and Thyroid
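
The required-column check can be sketched in a few lines. The following Python snippet is only an illustration of the rule in the table above, not the code in annotator-api.R; the function name missing_required_columns is hypothetical.

```python
# Required input column -> annotation columns that depend on it,
# copied from the table above.
REQUIRED_FOR = {
    "Gene_symbol": ["Gene_type", "OncoKB_cancer_gene", "OncoKB_oncogene_TSG"],
    "Gene_Ensembl_ID": ["PMTL", "Gene_full_name", "Protein_RefSeq_ID"],
    "Disease": ["EFO", "MONDO"],
    "GTEx_tissue_group": ["GTEx_tissue_group_UBERON"],
    "GTEx_tissue_subgroup": ["GTEx_tissue_subgroup_UBERON"],
}


def missing_required_columns(table_columns, columns_to_add):
    """Return required columns that the table lacks (hypothetical helper)."""
    present = set(table_columns)
    missing = []
    for required, annotations in REQUIRED_FOR.items():
        # A required column matters only if one of its annotations is requested.
        needed = any(a in columns_to_add for a in annotations)
        if needed and required not in present:
            missing.append(required)
    return missing


cols = ["gene_symbol", "gene_id", "cancer_group", "cohort"]
print(missing_required_columns(cols, ["MONDO", "PMTL", "EFO"]))
# prints ['Gene_Ensembl_ID', 'Disease']
```

Column names are case sensitive in this sketch, which is why gene_id and cancer_group have to be renamed before annotation.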

Add one or more of the following gene, disease (/cancer_group), and tissue annotations, by specifying the columns_to_add parameter in the annotate_long_format_table function, or by specifying the -c/--columns-to-add option when running annotator-cli.R.

Annotation column name	`join_by` column name	Non-missing value	Annotation data file	Source
`PMTL`	`Gene_Ensembl_ID`	`Relevant Molecular Target (PMTL version 1.1)` or `Non-Relevant Molecular Target (PMTL version 1.1)`	`data/ensg-hugo-pmtl-mapping.tsv`	PediatricOpenTargets/OpenPedCan-analysis data release
`Gene_type`	`Gene_symbol`	A sorted comma separated list of one or more of the following gene types: `CosmicCensus`, `Kinase`, `Oncogene`, `TranscriptionFactor`, and `TumorSuppressorGene`. Example values: `CosmicCensus`, `CosmicCensus,Kinase`, and `CosmicCensus,Kinase,TumorSuppressorGene`.	`analyses/fusion_filtering/references/genelistreference.txt`	Described at https://github.com/d3b-center/annoFuse
`OncoKB_cancer_gene`	`Gene_symbol`	`Y` or `N`	`analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`	Downloaded from https://www.oncokb.org/cancerGenes
`OncoKB_oncogene_TSG`	`Gene_symbol`	`Oncogene`, or `TumorSuppressorGene`, or `Oncogene,TumorSuppressorGene`	`analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`	Downloaded from https://www.oncokb.org/cancerGenes
`Gene_full_name`	`Gene_Ensembl_ID`	A single string of gene full name, e.g. `cytochrome c oxidase subunit III` and `ATP synthase F0 subunit 6`	`analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv`	MyGene.info v3 API
`Protein_RefSeq_ID`	`Gene_Ensembl_ID`	A sorted comma separated list of one or more protein RefSeq IDs, e.g. `NP_004065.1`, `NP_000053.2,NP_001027466.1`, and `NP_000985.1,NP_001007074.1,NP_001007075.1`	`analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv`	MyGene.info v3 API
`EFO`	`Disease`	A single string of EFO code, e.g. `EFO_1000069`, `EFO_1002008`, and `EFO_1000177`	`data/efo-mondo-map.tsv`	PediatricOpenTargets/OpenPedCan-analysis data release
`MONDO`	`Disease`	A single string of MONDO code, e.g. `MONDO_0002787`, `MONDO_0020560`, and `MONDO_0009837`	`data/efo-mondo-map.tsv`	PediatricOpenTargets/OpenPedCan-analysis data release
`GTEx_tissue_group_UBERON`	`GTEx_tissue_group`	A single string of UBERON code, e.g. `UBERON_0000955`, `UBERON_0002107`, and `UBERON_0000007`	`data/uberon-map-gtex-group.tsv`	PediatricOpenTargets/OpenPedCan-analysis data release
`GTEx_tissue_subgroup_UBERON`	`GTEx_tissue_subgroup`	A single string of UBERON code, e.g. `UBERON_0002369`, `UBERON_0001870`, and `UBERON_0002038`	`data/uberon-map-gtex-subgroup.tsv`	PediatricOpenTargets/OpenPedCan-analysis data release
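
Several annotation values above are described as sorted comma separated lists. A minimal Python sketch of that format follows, assuming that duplicates and empty values are dropped before sorting (the table does not state this). The function name collapse_values is hypothetical and is not the helper used by the annotator.

```python
def collapse_values(values):
    """Join distinct non-empty values into a sorted, comma separated string."""
    # Assumption: duplicates and empty strings are removed before sorting.
    distinct = sorted({v for v in values if v})
    return ",".join(distinct)


print(collapse_values(["Kinase", "CosmicCensus", "Kinase"]))
# prints CosmicCensus,Kinase
print(collapse_values(["NP_001027466.1", "NP_000053.2"]))
# prints NP_000053.2,NP_001027466.1
```

Python sorts strings by code point, whereas R sorting depends on the locale, so the ordering of mixed-case values may differ between the two.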

Note: add Gene_type only to gene-level tables. For other tables, leave "Gene_type" out of the columns_to_add parameter of the annotate_long_format_table function in annotator-api.R, or out of the -c/--columns-to-add option when running annotator-cli.R.

Notes on requiring Gene_symbol and Gene_Ensembl_ID:

- Some annotation files use Gene_symbol as the key column, and others use Gene_Ensembl_ID as the key column.
- Some Gene_symbols are mapped to multiple Gene_Ensembl_IDs, so adding Gene_Ensembl_IDs by mapping Gene_symbols with data/ensg-hugo-pmtl-mapping.tsv may implicitly introduce duplicated rows. Therefore, adding Gene_Ensembl_IDs by mapping Gene_symbols is left to users, who should take care not to introduce unwanted duplicates.
- Similarly, some Gene_Ensembl_IDs are mapped to multiple Gene_symbols, so adding Gene_symbols by mapping Gene_Ensembl_IDs with data/ensg-hugo-pmtl-mapping.tsv may implicitly introduce duplicated rows. Therefore, adding Gene_symbols by mapping Gene_Ensembl_IDs is left to users, who should take care not to introduce unwanted duplicates.
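
The duplication issue can be seen with a toy join. The following Python sketch uses made-up gene symbols and IDs (GENE_A, ENSG_HYPOTHETICAL_1, and so on are placeholders, not real identifiers) and a hypothetical left_join helper. It is not how the annotator performs joins.

```python
def left_join(rows, mapping, key):
    """Left-join a list of dict rows with mapping rows on a shared key."""
    index = {}
    for m in mapping:
        index.setdefault(m[key], []).append(m)
    joined = []
    for row in rows:
        # A key with several mapping rows yields several output rows.
        for m in index.get(row[key], [{}]):
            joined.append({**row, **m})
    return joined


table = [
    {"Gene_symbol": "GENE_A", "tpm_mean": "1.5"},
    {"Gene_symbol": "GENE_B", "tpm_mean": "2.0"},
]
mapping = [
    {"Gene_symbol": "GENE_A", "Gene_Ensembl_ID": "ENSG_HYPOTHETICAL_1"},
    {"Gene_symbol": "GENE_A", "Gene_Ensembl_ID": "ENSG_HYPOTHETICAL_2"},
    {"Gene_symbol": "GENE_B", "Gene_Ensembl_ID": "ENSG_HYPOTHETICAL_3"},
]
joined = left_join(table, mapping, "Gene_symbol")
print(len(table), len(joined))
# prints 2 3
```

Comparing row counts before and after a mapping step, as in the last line, is a cheap way to detect such duplicates.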

Notes on annotation data versions:

- The version of PediatricOpenTargets/OpenPedCan-analysis data release is determined by the download-data.sh under the OpenPedCan-analysis directory.
- The version of analyses/fusion_filtering/references/genelistreference.txt is tracked by GitHub commits, and the GitHub permalink to the currently used file is https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/7fb11a020a92d06c8685736546e860bfe23da7e2/analyses/fusion_filtering/references/genelistreference.txt.
- The versions of other sources are listed in the Update downloaded data that are used in this module section.

Notes on GTEx_tissue_subgroup -> EFO code mapping:

The data/uberon-map-gtex-subgroup.tsv file contains two EFO codes:
- EFO_0002009 for Cells - Cultured fibroblast
- EFO_0000572 for Cells - EBV-transformed lymphocytes
It was decided in a discussion that the annotator submodule will not add GTEx_tissue_subgroup -> EFO mapping, because the cell lines may have no biological context for being searched on the PediatricOpenTargets website.
If GTEx_tissue_subgroup -> EFO mapping needs to be added to the annotator submodule at a later point, keep using the EFO column name for Disease -> EFO mapping, and add a GTEx_tissue_subgroup_EFO annotation column. Viewers of annotated tables that have both EFO and GTEx_tissue_subgroup_EFO may be confused, but the API and CLI calls will remain backward compatible, so analysis modules that use the annotator will not need to be updated for a renamed Disease EFO column.

R API usage of long-format table annotator

The long-format-table-utils/annotator/annotator-api.R file provides the annotate_long_format_table function for annotating long-format tables.

Use the long-format table annotator API in an analysis module with the following steps:

1. Change the working directory of the analysis module to OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis. This allows the API function annotate_long_format_table to locate annotation data files.
2. source the long-format-table-utils/annotator/annotator-api.R file.
3. If the class of the table to be annotated is not tibble::tbl_df, convert the table to tibble::tbl_df with tibble::as_tibble. After conversion, carefully check rownames, colnames, column classes (especially factors), and other properties that may affect the correctness of your code.
4. If any required columns are missing from the table to be annotated, add new columns or rename existing ones so that all required columns are present.
5. Check that the annotation columns to be added are not already present in the table to be annotated. If any annotation column to be added already exists in the table, the annotate_long_format_table function raises an error without annotating the table.
6. Call annotate_long_format_table to add one or more of the available annotation columns, by specifying the columns_to_add parameter in the annotate_long_format_table function. Read the documentation comment of the function for usage.
7. Rename, select, and reorder the columns of the annotated table for output in TSV, JSON, or JSONL formats. NOTE that the names of the annotation columns will be standardized at a later point, as suggested by @jharenza at #56 (review), so it is recommended to use the annotation column names in this module for the results.

The following is an example usage in the 01-tpm-summary-stats.R script of the rna-seq-expression-summary-stats module.

> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> source("../long-format-table-utils/annotator/annotator-api.R")
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"
 [3] "cancer_group"                         "cohort"
 [5] "tpm_mean"                             "tpm_sd"
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"
>
> # Gene_Ensembl_ID column is required for adding PMTL column
> # Disease column is required for adding EFO and MONDO columns
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_Ensembl_ID = gene_id, Disease = cancer_group)
>
> annotation_columns_to_add <- c("MONDO", "PMTL", "EFO")
> # Assert all columns to be added are not already present in the
> # colnames(renamed_m_tpm_ss_long_tbl)
> stopifnot(
+   all(!annotation_columns_to_add %in% colnames(renamed_m_tpm_ss_long_tbl)))
>
> annotated_renamed_m_tpm_ss_long_tbl <- annotate_long_format_table(
+   renamed_m_tpm_ss_long_tbl, columns_to_add = annotation_columns_to_add)
>
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_id = Gene_Ensembl_ID, cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, PMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)

R CLI usage of long-format table annotator

The long-format-table-utils/annotator/annotator-cli.R file provides an R CLI for using the API to annotate long-format tables.

Use the long-format table annotator CLI in an analysis module with the following steps:

1. If any required columns are missing from the table to be annotated, add new columns or rename existing ones so that all required columns are present.
2. Output the table that needs to be annotated in TSV format. Notes on the TSV file:
   1. The TSV file should use double quotes for field values that need escaping, e.g. "NA" for string literal "NA" and "\t" for tab.
   2. Only unquoted NA field values are treated as missing values by annotator-cli.R.
   3. Leading and trailing white spaces in field values are NOT trimmed by annotator-cli.R.
3. Change the working directory to OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis. This allows annotator-cli.R to locate annotator-api.R.
4. Run the annotator-cli.R script with Rscript --vanilla path/to/annotator-cli.R and proper options. The Rscript command can be invoked by R system("Rscript --vanilla path/to/annotator-cli.R -h") (if the annotator R API is not preferred) or Python (>= 3.5) import subprocess; subprocess.run("Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R -h".split()). For more information about R system, see https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html. For more information about Python (>= 3.5) subprocess.run, see https://docs.python.org/3/library/subprocess.html#subprocess.run.
5. Read the annotated table TSV file. It is recommended to read all fields as character/string types, so that the format and the number of significant digits of double/float values are preserved.
6. Rename, select, and reorder the columns of the annotated table for output in TSV, JSON, or JSONL formats.
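
For modules that write the input TSV from Python, the quoting notes above can be sketched as follows. This is a minimal illustration, not code from this module: the function names are hypothetical, and doubling embedded double quotes is an assumption about the escaping that the TSV reader expects.

```python
def format_tsv_field(value):
    """Format one field; None becomes an unquoted NA, i.e. a missing value."""
    if value is None:
        return "NA"
    text = str(value)
    # Quote the string literal NA and values with tabs, newlines, or quotes.
    needs_quotes = text == "NA" or any(c in text for c in '\t\n"')
    if needs_quotes:
        # Assumption: embedded double quotes are escaped by doubling them.
        return '"' + text.replace('"', '""') + '"'
    # Leading and trailing white spaces are kept as they are.
    return text


def format_tsv_row(values):
    """Join formatted fields with tabs to make one TSV line."""
    return "\t".join(format_tsv_field(v) for v in values)


print(format_tsv_row(["PHLPP1", None, "NA", "a\tb"]))
```

In the printed line, the second field is a missing value, and the third field is the quoted string literal NA.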

The following is an example usage in the 01-tpm-summary-stats.R script of the rna-seq-expression-summary-stats module.

> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"
 [3] "cancer_group"                         "cohort"
 [5] "tpm_mean"                             "tpm_sd"
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"
>
> # Gene_Ensembl_ID column is required for adding PMTL column
> # Disease column is required for adding EFO and MONDO columns
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_Ensembl_ID = gene_id, Disease = cancer_group)
>
> readr::write_tsv(
+   renamed_m_tpm_ss_long_tbl,
+   "../../scratch/renamed_m_tpm_ss_long_tbl.tsv")
>
> system(paste(
+   "Rscript --vanilla ../long-format-table-utils/annotator/annotator-cli.R",
+   "-r -v -c MONDO,PMTL,EFO",
+   "-i ../../scratch/renamed_m_tpm_ss_long_tbl.tsv",
+   "-o ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv"))
Read ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Annotate ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Output ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv...
Done.
>
> annotated_renamed_m_tpm_ss_long_tbl <- readr::read_tsv(
+   "../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv",
+   na = character(),
+   col_types = readr::cols(.default = readr::col_character()))
|=================================================================================================| 100%  226 MB
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_id = Gene_Ensembl_ID,  cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, PMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)
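
A Python caller could assemble the same command as an argument list for subprocess.run. The sketch below is hedged. The helper names are hypothetical. The meaning of -i and -o as input and output paths is inferred from the R example above. The -r and -v flags from that example are left out because they are not described here; run the script with -h to check the available options.

```python
import subprocess


def build_annotator_command(cli_path, columns_to_add, input_tsv, output_tsv):
    """Build the argument list for annotator-cli.R without running it."""
    return [
        "Rscript", "--vanilla", cli_path,
        "-c", ",".join(columns_to_add),  # -c/--columns-to-add
        "-i", input_tsv,   # assumed to be the input TSV path
        "-o", output_tsv,  # assumed to be the output TSV path
    ]


def run_annotator(command):
    """Run the command; requires R and the OpenPedCan-analysis working tree."""
    return subprocess.run(command, check=True)


command = build_annotator_command(
    "analyses/long-format-table-utils/annotator/annotator-cli.R",
    ["MONDO", "PMTL", "EFO"],
    "scratch/renamed_m_tpm_ss_long_tbl.tsv",
    "scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv",
)
print(" ".join(command))
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces. The paths in this sketch are relative to the OpenPedCan-analysis directory, so run_annotator(command) would need to be called from there.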

Unit testing for long-format table annotator

The unit testing is implemented using the testthat package version 2.1.1, as suggested by @jharenza and @NHJohnson in the reviews of PR #55.

To run all unit tests, run bash annotator/run-tests.sh in the Docker image/container from any working directory. Following is an example run.

$ bash annotator/run-tests.sh
✔ |  OK F W S | Context
✔ |  55       | tests/test_annotate_long_format_table.R [22.4 s]
✔ |  45       | tests/test_annotator_cli.R [49.6 s]
✔ |   8       | tests/test_collapse_name_vec.R
✔ |   7       | tests/test_collapse_rp_lists.R
✔ |  21       | tests/test_helper_import_function.R

══ Results ═════════════════════
Duration: 72.2 s

OK:       136
Failed:   0
Warnings: 0
Skipped:  0
Done running run-tests.sh

To add more tests, create additional test*R files under the annotator/tests directory, with available test*R files as reference.

Notes on the testthat unit testing framework:

- testthat::test_dir("tests") finds all test*R files under the tests directory to run, which is used in annotator/run-tests.sh.
- testthat::test_dir("tests") also finds and runs all helper*R files under the tests directory before running the test*R files.
- The working directory is tests when running the helper*R and test*R files through testthat::test_dir("tests").
- In order to import a function for testing from an R file without running the whole file, a helper function import_function is defined at tests/helper_import_function.R, and the import_function is also tested in the tests/test_helper_import_function.R file.
- Even though the testthat 2.1.1 documentation of the filter parameter of the test_dir function says that "Matching is performed on the file name after it's stripped of "test-" and ".R"", the R code uses the following patterns. Therefore, test files named like test_some_test_file.R can be found by the test_dir function.
  - "^test.*\\.[rR]$" for finding test files in find_test_scripts
  - sub("^test-?", "", test_names), sub("\\.[rR]$", "", test_names), and grepl(filter, test_names, ...) for filtering test files in testthat:::filter_test_scripts.
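
The effect of these patterns on file names can be checked quickly. The Python sketch below applies the same regular expressions with the re module, which behaves the same way as R for these simple patterns. The function names are hypothetical, and test-example.R is a made-up file name.

```python
import re


def is_test_script(file_name):
    """Mirror the "^test.*\\.[rR]$" pattern used to find test files."""
    return re.search(r"^test.*\.[rR]$", file_name) is not None


def filter_name(file_name):
    """Mirror the two sub() calls applied before the filter is matched."""
    name = re.sub(r"^test-?", "", file_name)
    return re.sub(r"\.[rR]$", "", name)


names = ["test_annotator_cli.R", "test-example.R", "helper_import_function.R"]
for name in names:
    print(name, is_test_script(name), filter_name(name))
# prints:
# test_annotator_cli.R True _annotator_cli
# test-example.R True example
# helper_import_function.R False helper_import_function
```

Files named with test_ are found, but the leading underscore remains in the name that the filter is matched against.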
