Long-format table utils

This analysis is DEPRECATED and was last run with OpenPedCan data release v12.

Module author: Yuanchao Zhang (@logstar)

Purpose

This module creates an application programming interface (API) and a command line interface (CLI) for handling long-format tables that are generated by analysis modules. The API provides analysis module developers with functions that can be imported into their own scripts via R source('path/to/the/function/file.R') or Python import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py. The CLI provides analysis module developers with scripts that can be executed in their own run-module shell scripts with either Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv or python3 path/to/the/script.py arg long.tsv long_edited.tsv.
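The Python import pattern named above can be sketched as follows. This is only an illustration: the module name hypothetical_table_utils and its count_columns function are made up for the sketch and are not part of this module, and the helper file is written to a temporary directory so that the example is self-contained.

```python
import os
import sys
import tempfile

# Hypothetical helper module, written to a temporary directory so that this
# sketch is self-contained. In a real analysis module the directory would be
# the one that holds the shared function file.
with tempfile.TemporaryDirectory() as function_dir:
    module_path = os.path.join(function_dir, "hypothetical_table_utils.py")
    with open(module_path, "w") as handle:
        handle.write(
            "def count_columns(header_line):\n"
            "    return len(header_line.split('\\t'))\n"
        )

    # Same pattern as in the paragraph above: make the directory importable,
    # then import the file by its name without the .py suffix.
    sys.path.append(os.path.abspath(function_dir))
    import hypothetical_table_utils

    n_columns = hypothetical_table_utils.count_columns("Gene_symbol\tDisease")
    print(n_columns)
```

Appending to sys.path affects the whole Python process, so a script that imports from several function directories should use distinct file names to avoid one module shadowing another.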

This module was suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, in order to relieve analysis module developers of the burden of adding annotations and keeping track of which annotations need to be added. This module could also potentially handle large file storage issues at a later point, since the file size limit of GitHub is 100MB.

| Sub-module name | Implemented function | Available interface(s) |
|---|---|---|
| annotator | Add gene and cancer_group annotations | R API, R CLI |

API and CLI usages and descriptions are in the Methods section.

Methods

Update downloaded data that are used in this module

Run the following command to update downloaded data that are used in this module.

bash run-update-long-format-table-utils.sh

The run-update-long-format-table-utils.sh script runs the data downloading scripts in sub-modules, e.g. annotator/run-download-annotation-data.sh.

After the update, use git diff --stat to check for data file changes.

Following is the table of data files that need to be updated by the maintainer of this module.

| Data file | Date of the last update | Update method | Data version(s) |
|---|---|---|---|
| annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv | 08/20/2021 | bash run-update-long-format-table-utils.sh | MyGene version on the last update date |
| annotator/annotation-data/oncokb-cancer-gene-list.tsv | 07/16/2021 | Manually download from https://www.oncokb.org/cancerGenes | 07/16/2021 |

Note on MyGene version: MyGene releases are built regularly using data from various sources, e.g. Ensembl, NCBI, and UCSC. In each release note, updated data sources are listed, e.g. Ensembl gene is updated from 103 to 104 in Build version 20210510. The annotator/download-annotation-data.R script uses the R mygene package 1.22.0 to query Gene_Ensembl_ID values to retrieve Gene_full_name and Protein_RefSeq_ID values. The R mygene package 1.22.0 uses MyGene v3 API with default_url <- "http://mygene.info/v3" specified at line 8 of mygene.R, even though the R mygene package 1.22.0 documentation says that v2 API is used.

Note on using MyGene instead of biomaRt: MyGene has more Gene_Ensembl_IDs that have Gene_full_name or Protein_RefSeq_ID available than biomaRt, and more details are discussed at #55 (comment). However, biomaRt allows users to specify Ensembl (GENCODE) versions with biomaRt::useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = 90). If biomaRt is preferred at a later point, use the code at #55 (comment) as a starting point for updating annotator/download-annotation-data.R.

Add gene and cancer_group annotations

Implementation of long-format table annotator

Check that input long-format tables have the columns required for adding annotation columns.

| Column name | Required for adding which annotation column(s) | Description |
|---|---|---|
| Gene_symbol | Gene_type, OncoKB_cancer_gene, OncoKB_oncogene_TSG | HUGO symbols, e.g. PHLPP1, TM6SF1, and DNAH5 |
| Gene_Ensembl_ID | PMTL, Gene_full_name, Protein_RefSeq_ID | Ensembl ENSG IDs without .# versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710 |
| Disease | EFO, MONDO | The cancer_group in the histologies.tsv, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma |
| GTEx_tissue_group | GTEx_tissue_group_UBERON | The gtex_group in the histologies.tsv, e.g. Adipose, Kidney, and Thyroid |
| GTEx_tissue_subgroup | GTEx_tissue_subgroup_UBERON | The gtex_subgroup in the histologies.tsv, e.g. Adipose - Subcutaneous, Kidney - Cortex, and Thyroid |
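The required-column check can be sketched in Python as below. This is a minimal illustration and not the annotator's own R implementation: the function name missing_required_columns is hypothetical, and the mapping is copied from the table above.

```python
# Annotation column -> input column it requires, copied from the table above.
REQUIRED_COLUMN = {
    "Gene_type": "Gene_symbol",
    "OncoKB_cancer_gene": "Gene_symbol",
    "OncoKB_oncogene_TSG": "Gene_symbol",
    "PMTL": "Gene_Ensembl_ID",
    "Gene_full_name": "Gene_Ensembl_ID",
    "Protein_RefSeq_ID": "Gene_Ensembl_ID",
    "EFO": "Disease",
    "MONDO": "Disease",
    "GTEx_tissue_group_UBERON": "GTEx_tissue_group",
    "GTEx_tissue_subgroup_UBERON": "GTEx_tissue_subgroup",
}


def missing_required_columns(table_columns, columns_to_add):
    """Return the sorted required columns that are absent from the table."""
    unknown = [c for c in columns_to_add if c not in REQUIRED_COLUMN]
    if unknown:
        raise ValueError(f"Unknown annotation column(s): {unknown}")
    required = {REQUIRED_COLUMN[c] for c in columns_to_add}
    return sorted(required - set(table_columns))


# The table has Gene_Ensembl_ID but still uses cancer_group instead of Disease.
columns = ["gene_symbol", "Gene_Ensembl_ID", "cancer_group", "tpm_mean"]
missing = missing_required_columns(columns, ["PMTL", "EFO", "MONDO"])
print(missing)  # ['Disease']
```

A non-empty result means that columns need to be added or renamed before annotating, e.g. renaming cancer_group to Disease.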

Add one or more of the following gene, disease (/cancer_group), and tissue annotations, by specifying the columns_to_add parameter in the annotate_long_format_table function, or by specifying the -c/--columns-to-add option when running annotator-cli.R.

| Annotation column name | join_by column name | Non-missing value | Annotation data file | Source |
|---|---|---|---|---|
| PMTL | Gene_Ensembl_ID | Relevant Molecular Target (PMTL version 1.1) or Non-Relevant Molecular Target (PMTL version 1.1) | data/ensg-hugo-pmtl-mapping.tsv | PediatricOpenTargets/OpenPedCan-analysis data release |
| Gene_type | Gene_symbol | A sorted comma separated list of one or more of the following gene types: CosmicCensus, Kinase, Oncogene, TranscriptionFactor, and TumorSuppressorGene. Example values: CosmicCensus, CosmicCensus,Kinase, and CosmicCensus,Kinase,TumorSuppressorGene. | analyses/fusion_filtering/references/genelistreference.txt | Described at https://github.com/d3b-center/annoFuse |
| OncoKB_cancer_gene | Gene_symbol | Y or N | analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv | Downloaded from https://www.oncokb.org/cancerGenes |
| OncoKB_oncogene_TSG | Gene_symbol | Oncogene, or TumorSuppressorGene, or Oncogene,TumorSuppressorGene | analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv | Downloaded from https://www.oncokb.org/cancerGenes |
| Gene_full_name | Gene_Ensembl_ID | A single string of gene full name, e.g. cytochrome c oxidase subunit III and ATP synthase F0 subunit 6 | analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv | MyGene.info v3 API |
| Protein_RefSeq_ID | Gene_Ensembl_ID | A sorted comma separated list of one or more protein RefSeq IDs, e.g. NP_004065.1, NP_000053.2,NP_001027466.1, and NP_000985.1,NP_001007074.1,NP_001007075.1 | analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv | MyGene.info v3 API |
| EFO | Disease | A single string of EFO code, e.g. EFO_1000069, EFO_1002008, and EFO_1000177 | data/efo-mondo-map.tsv | PediatricOpenTargets/OpenPedCan-analysis data release |
| MONDO | Disease | A single string of MONDO code, e.g. MONDO_0002787, MONDO_0020560, and MONDO_0009837 | data/efo-mondo-map.tsv | PediatricOpenTargets/OpenPedCan-analysis data release |
| GTEx_tissue_group_UBERON | GTEx_tissue_group | A single string of UBERON code, e.g. UBERON_0000955, UBERON_0002107, and UBERON_0000007 | data/uberon-map-gtex-group.tsv | PediatricOpenTargets/OpenPedCan-analysis data release |
| GTEx_tissue_subgroup_UBERON | GTEx_tissue_subgroup | A single string of UBERON code, e.g. UBERON_0002369, UBERON_0001870, and UBERON_0002038 | data/uberon-map-gtex-subgroup.tsv | PediatricOpenTargets/OpenPedCan-analysis data release |
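Conceptually, each annotation column is added by a left join on its join_by column, with multi-valued annotations collapsed into a sorted comma separated string. The Python sketch below illustrates that idea under stated assumptions: the gene symbols and records are made up, build_lookup and add_annotation are hypothetical names, and using None for a missing value is a choice made for this sketch rather than the behavior of the R annotator.

```python
from collections import defaultdict

# Made-up annotation records (key, value); real data would come from the
# annotation data files listed in the table above.
gene_type_records = [
    ("GENE_A", "Kinase"),
    ("GENE_A", "CosmicCensus"),
    ("GENE_B", "TumorSuppressorGene"),
    ("GENE_A", "Kinase"),  # duplicated value, collapsed below
]


def build_lookup(records):
    """Collapse the values of each key into a sorted comma separated string."""
    values_by_key = defaultdict(set)
    for key, value in records:
        values_by_key[key].add(value)
    return {key: ",".join(sorted(vals)) for key, vals in values_by_key.items()}


def add_annotation(rows, join_by, new_column, lookup, missing=None):
    """Left join: keep every input row and fill unmatched keys with `missing`."""
    if any(new_column in row for row in rows):
        raise ValueError(f"{new_column} is already present in the table")
    return [{**row, new_column: lookup.get(row[join_by], missing)} for row in rows]


rows = [
    {"Gene_symbol": "GENE_A", "tpm_mean": "1.50"},
    {"Gene_symbol": "GENE_C", "tpm_mean": "0.25"},  # no annotation available
]
annotated = add_annotation(
    rows, "Gene_symbol", "Gene_type", build_lookup(gene_type_records))
for row in annotated:
    print(row)
```

Because the lookup holds exactly one value per key, the join never changes the number of rows, which is why multi-valued annotations are collapsed before joining.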

Note: only add Gene_type to gene-level tables. For tables that are not gene-level, leave "Gene_type" out of the columns_to_add parameter of the annotate_long_format_table function in annotator-api.R, or out of the -c/--columns-to-add option when running annotator-cli.R.

Notes on requiring Gene_symbol and Gene_Ensembl_ID:

  • Certain annotation files use Gene_symbol as key columns, and certain other annotation files use Gene_Ensembl_ID as key columns.
  • Some Gene_symbols are mapped to multiple Gene_Ensembl_IDs, so adding Gene_Ensembl_IDs by mapping Gene_symbols with data/ensg-hugo-pmtl-mapping.tsv may implicitly introduce duplicated rows. Therefore, adding Gene_Ensembl_IDs by mapping Gene_symbols is left to users, who should check for unwanted duplicated rows.
  • Similarly, some Gene_Ensembl_IDs are mapped to multiple Gene_symbols, so adding Gene_symbols by mapping Gene_Ensembl_IDs with data/ensg-hugo-pmtl-mapping.tsv may implicitly introduce duplicated rows. Therefore, adding Gene_symbols by mapping Gene_Ensembl_IDs is also left to users, who should check for unwanted duplicated rows.
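The duplication described in the notes above can be reproduced with a small Python sketch. All gene symbols and Ensembl IDs here are made up for illustration and do not come from data/ensg-hugo-pmtl-mapping.tsv.

```python
# Made-up mapping rows: GENE_A maps to two Ensembl IDs.
symbol_to_ensg = [
    ("GENE_A", "ENSG_HYPOTHETICAL_1"),
    ("GENE_A", "ENSG_HYPOTHETICAL_2"),
    ("GENE_B", "ENSG_HYPOTHETICAL_3"),
]

rows = [
    {"Gene_symbol": "GENE_A", "tpm_mean": "1.50"},
    {"Gene_symbol": "GENE_B", "tpm_mean": "0.25"},
]

# A join on Gene_symbol emits one output row per matching mapping row,
# so the GENE_A row appears twice in the result.
joined = [
    {**row, "Gene_Ensembl_ID": ensg}
    for row in rows
    for symbol, ensg in symbol_to_ensg
    if symbol == row["Gene_symbol"]
]

print(len(rows), len(joined))  # 2 3
```

Comparing the row counts before and after such a join is a cheap way to detect that duplicates were introduced.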

Notes on annotation data versions:

Notes on GTEx_tissue_subgroup -> EFO code mapping:

  • The data/uberon-map-gtex-subgroup.tsv file contains two EFO codes:
    • EFO_0002009 for Cells - Cultured fibroblast
    • EFO_0000572 for Cells - EBV-transformed lymphocytes
  • It was decided in a discussion that the annotator submodule will not add GTEx_tissue_subgroup -> EFO mapping, because the cell lines may have no biological context for being searched on the PediatricOpenTargets website.
  • If GTEx_tissue_subgroup -> EFO mapping needs to be added to the annotator submodule at a later point, keep using the EFO column name for Disease -> EFO mapping, and add a GTEx_tissue_subgroup_EFO annotation column. Even though viewers of annotated tables that have both EFO and GTEx_tissue_subgroup_EFO may be confused, the API and CLI calls will be backward compatible, so analysis modules that use the annotator will not need to be updated for a changed Disease EFO column name.

R API usage of long-format table annotator

The long-format-table-utils/annotator/annotator-api.R file provides the annotate_long_format_table function for annotating long-format tables.

Use the long-format table annotator API in an analysis module with the following steps:

  1. Change the working directory of the analysis module to be OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis. This allows the API function annotate_long_format_table to locate annotation data files.
  2. source the long-format-table-utils/annotator/annotator-api.R file.
  3. If the class of the table to be annotated is not tibble::tbl_df, convert the table to tibble::tbl_df with tibble::as_tibble. After conversion, carefully check rownames, colnames, column classes (especially factors), and other properties that may affect the correctness of your code.
  4. If required columns are not all present in the table to be annotated, add new columns or rename existing ones to have all these required columns.
  5. Check that the annotation columns to be added are not already present in the table to be annotated. If any annotation column to be added already exists in the table, the annotate_long_format_table function raises an error without annotating the table.
  6. Call annotate_long_format_table to add one or more of the available annotation columns, by specifying the columns_to_add parameter in the annotate_long_format_table function. Read the documentation comment of the function for usage.
  7. Rename, select, and reorder the columns of the annotated table for output in TSV, or JSON, or JSONL formats. NOTE that the names of the annotation columns will be standardized at a later point, as suggested by @jharenza at #56 (review), so it is recommended to use the annotation column names in this module for the results.

Following is an example usage in the rna-seq-expression-summary-stats module 01-tpm-summary-stats.R.

> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> source("../long-format-table-utils/annotator/annotator-api.R")
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"
 [3] "cancer_group"                         "cohort"
 [5] "tpm_mean"                             "tpm_sd"
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"
>
> # Gene_Ensembl_ID column is required for adding PMTL column
> # Disease column is required for adding EFO and MONDO columns
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_Ensembl_ID = gene_id, Disease = cancer_group)
>
> annotation_columns_to_add <- c("MONDO", "PMTL", "EFO")
> # Assert all columns to be added are not already present in the
> # colnames(renamed_m_tpm_ss_long_tbl)
> stopifnot(
+   all(!annotation_columns_to_add %in% colnames(renamed_m_tpm_ss_long_tbl)))
>
> annotated_renamed_m_tpm_ss_long_tbl <- annotate_long_format_table(
+   renamed_m_tpm_ss_long_tbl, columns_to_add = annotation_columns_to_add)
>
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_id = Gene_Ensembl_ID, cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, PMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)

R CLI usage of long-format table annotator

The long-format-table-utils/annotator/annotator-cli.R file provides an R CLI for using the API to annotate long-format tables.

Use the long-format table annotator CLI in an analysis module with the following steps:

  1. If required columns are not all present in the table to be annotated, add new columns or rename existing ones to have all these required columns.
  2. Output the table that needs to be annotated in TSV format. NOTEs on the TSV file:
    1. The TSV file should use double quotes for field values that need escaping, e.g. "NA" for string literal "NA" and "\t" for tab
    2. Only unquoted NA field values are treated as missing values by annotator-cli.R
    3. Leading and trailing white spaces in field values are NOT trimmed by annotator-cli.R
  3. Change the working directory to be OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis. This allows the annotator-cli.R to locate the annotator-api.R.
  4. Run the annotator-cli.R script with Rscript --vanilla path/to/annotator-cli.R and proper options. The Rscript command can be invoked by R system("Rscript --vanilla path/to/annotator-cli.R -h") (if the annotator R API is not preferred) or Python (>= 3.5) import subprocess; subprocess.run("Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R -h".split()). For more information about R system, see https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html. For more information about Python (>= 3.5) subprocess.run, see https://docs.python.org/3/library/subprocess.html#subprocess.run.
  5. Read the annotated table TSV file. It is recommended to read all fields as character/string types, so the format and the number of significant digits of the double/float can be preserved.
  6. Rename, select, and reorder the columns of the annotated table for output in TSV, or JSON, or JSONL formats.
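For Python callers, steps 3 to 5 can be sketched as below. This is an illustration under assumptions: build_annotator_command and read_tsv_as_strings are hypothetical helper names, the scratch file paths are placeholders, and the -r and -v flags are copied from the R example in this section without interpretation (run the script with -h to see what each option means).

```python
import csv
import os
import shutil
import subprocess

# Path relative to the OpenPedCan-analysis directory, as used in step 4.
CLI_PATH = "analyses/long-format-table-utils/annotator/annotator-cli.R"


def build_annotator_command(input_tsv, output_tsv, columns_to_add):
    """Build the argument list for annotator-cli.R.

    The -r and -v flags are copied as-is from the R example in this section.
    """
    return [
        "Rscript", "--vanilla", CLI_PATH,
        "-r", "-v",
        "-c", ",".join(columns_to_add),
        "-i", input_tsv,
        "-o", output_tsv,
    ]


def read_tsv_as_strings(path):
    """Read every field as str so that numeric formatting is left untouched."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Placeholder paths for this sketch.
command = build_annotator_command(
    "scratch/long.tsv", "scratch/long_annotated.tsv", ["MONDO", "PMTL", "EFO"])
print(command)

# Only run when Rscript and the CLI script can be found, i.e. when the
# working directory is OpenPedCan-analysis and R is installed.
if shutil.which("Rscript") is not None and os.path.exists(CLI_PATH):
    subprocess.run(command, check=True)
    annotated_rows = read_tsv_as_strings("scratch/long_annotated.tsv")
```

Passing the command as a list avoids shell quoting problems with file paths. Note that Python's csv reader with default settings strips the double quotes, so a quoted "NA" and an unquoted NA both read back as the string NA.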

Following is an example usage in the rna-seq-expression-summary-stats module 01-tpm-summary-stats.R.

> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"
 [3] "cancer_group"                         "cohort"
 [5] "tpm_mean"                             "tpm_sd"
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"
>
> # Gene_Ensembl_ID column is required for adding PMTL column
> # Disease column is required for adding EFO and MONDO columns
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_Ensembl_ID = gene_id, Disease = cancer_group)
>
> readr::write_tsv(
+   renamed_m_tpm_ss_long_tbl,
+   "../../scratch/renamed_m_tpm_ss_long_tbl.tsv")
>
> system(paste(
+   "Rscript --vanilla ../long-format-table-utils/annotator/annotator-cli.R",
+   "-r -v -c MONDO,PMTL,EFO",
+   "-i ../../scratch/renamed_m_tpm_ss_long_tbl.tsv",
+   "-o ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv"))
Read ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Annotate ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Output ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv...
Done.
>
> annotated_renamed_m_tpm_ss_long_tbl <- readr::read_tsv(
+   "../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv",
+   na = character(),
+   col_types = readr::cols(.default = readr::col_character()))
|=================================================================================================| 100%  226 MB
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_id = Gene_Ensembl_ID, cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, PMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)

Unit testing for long-format table annotator

The unit testing is implemented using the testthat package version 2.1.1, as suggested by @jharenza and @NHJohnson in the reviews of PR #55.

To run all unit tests, run bash annotator/run-tests.sh in the Docker image/container from any working directory. Following is an example run.

$ bash annotator/run-tests.sh
✔ |  OK F W S | Context
✔ |  55       | tests/test_annotate_long_format_table.R [22.4 s]
✔ |  45       | tests/test_annotator_cli.R [49.6 s]
✔ |   8       | tests/test_collapse_name_vec.R
✔ |   7       | tests/test_collapse_rp_lists.R
✔ |  21       | tests/test_helper_import_function.R

══ Results ═════════════════════
Duration: 72.2 s

OK:       136
Failed:   0
Warnings: 0
Skipped:  0
Done running run-tests.sh

To add more tests, create additional test*R files under the annotator/tests directory, with available test*R files as reference.

Notes on the testthat unit testing framework:

  • testthat::test_dir("tests") finds all test*R files under the tests directory to run, which is used in annotator/run-tests.sh.
  • testthat::test_dir("tests") also finds and runs all helper*R files under the tests directory before running the test*R files.
  • The working directory is tests when running the helper*R and test*R files through testthat::test_dir("tests").
  • In order to import a function for testing from an R file without running the whole file, a helper function import_function is defined at tests/helper_import_function.R, and import_function is itself tested in the tests/test_helper_import_function.R file.
  • Even though the testthat 2.1.1 documentation of the filter parameter of the test_dir function says that "Matching is performed on the file name after it's stripped of "test-" and ".R"", the R code uses the following patterns. Therefore, test files named like test_some_test_file.R are also found by the test_dir function.
    • "^test.*\\.[rR]$" for finding test files in find_test_scripts
    • sub("^test-?", "", test_names), sub("\\.[rR]$", "", test_names), and grepl(filter, test_names, ...) for filtering test files in testthat:::filter_test_scripts.
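The effect of these patterns on file names can be checked with a short Python sketch. The regular expressions are the ones quoted above, translated to Python's re syntax; filter_key is a hypothetical name, and test-example.R is a made-up file name used only to show the optional hyphen.

```python
import re

# Pattern quoted above for finding test files.
TEST_FILE_PATTERN = re.compile(r"^test.*\.[rR]$")


def filter_key(file_name):
    """Name used for filter matching: strip a leading 'test' plus an
    optional '-', then strip the .R/.r extension."""
    stripped = re.sub(r"^test-?", "", file_name)
    return re.sub(r"\.[rR]$", "", stripped)


file_names = [
    "test_annotator_cli.R",
    "test-example.R",  # made-up file name
    "helper_import_function.R",
]
found = [name for name in file_names if TEST_FILE_PATTERN.search(name)]
keys = [filter_key(name) for name in found]
print(found)  # ['test_annotator_cli.R', 'test-example.R']
print(keys)   # ['_annotator_cli', 'example']
```

Under these patterns a file named with an underscore is still discovered, but the name that a filter is matched against keeps the leading underscore.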