This analysis is DEPRECATED and was last run with OpenPedCan data release v12.
Module author: Yuanchao Zhang (@logstar)
Create an application programming interface (API) and a command line interface (CLI) for handling long-format tables that are generated by analysis modules. The API provides analysis module developers with functions that can be imported into their own scripts via R `source('path/to/the/function/file.R')` or Python `import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py`. The CLI provides analysis module developers with scripts that can be executed in their own run-module shell script with either `Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv` or `python3 path/to/the/script.py arg long.tsv long_edited.tsv`.
This module was suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, to reduce the burden on analysis module developers of adding annotations and keeping track of which annotations need to be added. This module could also potentially handle large file storage issues at a later point, since the file size limit of GitHub is 100MB.
Sub-module name | Implemented function | Available interface(s) |
---|---|---|
`annotator` | Add gene and cancer_group annotations | R API, R CLI |
API and CLI usages and descriptions are in the Methods section.
Run the following command to update downloaded data that are used in this module.
```bash
bash run-update-long-format-table-utils.sh
```
The `run-update-long-format-table-utils.sh` script runs data downloading scripts in sub-modules, e.g. `annotator/run-download-annotation-data.sh`.
Users can use `git diff --stat` to check for data file changes.
Following is the table of data files that need to be updated by the maintainer of this module.
Data file | Date of the last update | Update method | Data version(s) |
---|---|---|---|
`annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | 08/20/2021 | `bash run-update-long-format-table-utils.sh` | MyGene version on the last update date |
`annotator/annotation-data/oncokb-cancer-gene-list.tsv` | 07/16/2021 | Manually download from https://www.oncokb.org/cancerGenes | 07/16/2021 |
Note on MyGene version: MyGene releases are built regularly using data from various sources, e.g. Ensembl, NCBI, and UCSC. In each release note, updated data sources are listed, e.g. Ensembl gene is updated from 103 to 104 in Build version 20210510. The `annotator/download-annotation-data.R` script uses the R mygene package 1.22.0 to query `Gene_Ensembl_ID` values to retrieve `Gene_full_name` and `Protein_RefSeq_ID` values. The R mygene package 1.22.0 uses MyGene v3 API with `default_url <- "http://mygene.info/v3"` specified at line 8 of `mygene.R`, even though the R mygene package 1.22.0 documentation says that v2 API is used.
Note on using MyGene instead of biomaRt: MyGene has more `Gene_Ensembl_ID`s that have `Gene_full_name` or `Protein_RefSeq_ID` available than biomaRt, and more details are discussed at #55 (comment). However, biomaRt allows users to specify Ensembl (GENCODE) versions with `biomaRt::useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = 90)`. If biomaRt is preferred at a later point, use the code at #55 (comment) as a starting point for updating `annotator/download-annotation-data.R`.
Check that input long-format tables have the columns required for adding annotation columns.
Column name | Required for adding which annotation column(s) | Description |
---|---|---|
`Gene_symbol` | `Gene_type`, `OncoKB_cancer_gene`, `OncoKB_oncogene_TSG` | HUGO symbols, e.g. PHLPP1, TM6SF1, and DNAH5 |
`Gene_Ensembl_ID` | `PMTL`, `Gene_full_name`, `Protein_RefSeq_ID` | Ensembl ENSG IDs without `.#` versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710 |
`Disease` | `EFO`, `MONDO` | The `cancer_group` in the `histologies.tsv`, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma |
`GTEx_tissue_group` | `GTEx_tissue_group_UBERON` | The `gtex_group` in the `histologies.tsv`, e.g. Adipose, Kidney, and Thyroid |
`GTEx_tissue_subgroup` | `GTEx_tissue_subgroup_UBERON` | The `gtex_subgroup` in the `histologies.tsv`, e.g. Adipose - Subcutaneous, Kidney - Cortex, and Thyroid |
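For modules that prepare tables in Python before calling the CLI, the required-column check can be sketched as below. This is only an illustration: the mapping is transcribed from the table above, while `REQUIRED_COLUMN` and `missing_required_columns` are hypothetical names that are not part of the annotator.

```python
# Required input column for each annotation column, transcribed from the
# table above. The dict and function names are illustrative only and are
# not provided by the annotator sub-module.
REQUIRED_COLUMN = {
    "Gene_type": "Gene_symbol",
    "OncoKB_cancer_gene": "Gene_symbol",
    "OncoKB_oncogene_TSG": "Gene_symbol",
    "PMTL": "Gene_Ensembl_ID",
    "Gene_full_name": "Gene_Ensembl_ID",
    "Protein_RefSeq_ID": "Gene_Ensembl_ID",
    "EFO": "Disease",
    "MONDO": "Disease",
    "GTEx_tissue_group_UBERON": "GTEx_tissue_group",
    "GTEx_tissue_subgroup_UBERON": "GTEx_tissue_subgroup",
}


def missing_required_columns(table_columns, columns_to_add):
    """Return the sorted required columns that are absent from the table."""
    unknown = [c for c in columns_to_add if c not in REQUIRED_COLUMN]
    if unknown:
        raise ValueError(f"Unknown annotation column(s): {unknown}")
    needed = {REQUIRED_COLUMN[c] for c in columns_to_add}
    return sorted(needed - set(table_columns))


# Column names of a table before renaming gene_id and cancer_group.
header = ["gene_symbol", "gene_id", "cancer_group", "cohort", "tpm_mean"]
print(missing_required_columns(header, ["MONDO", "PMTL", "EFO"]))
# prints ['Disease', 'Gene_Ensembl_ID']
```

An empty result means that the table already has every column needed for the requested annotations.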
Add one or more of the following gene, disease (/`cancer_group`), and tissue annotations, by specifying the `columns_to_add` parameter in the `annotate_long_format_table` function, or by specifying the `-c`/`--columns-to-add` option when running `annotator-cli.R`.
Annotation column name | `join_by` column name | Non-missing value | Annotation data file | Source |
---|---|---|---|---|
`PMTL` | `Gene_Ensembl_ID` | `Relevant Molecular Target (PMTL version 1.1)` or `Non-Relevant Molecular Target (PMTL version 1.1)` | `data/ensg-hugo-pmtl-mapping.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |
`Gene_type` | `Gene_symbol` | A sorted comma separated list of one or more of the following gene types: `CosmicCensus`, `Kinase`, `Oncogene`, `TranscriptionFactor`, and `TumorSuppressorGene`. Example values: `CosmicCensus`, `CosmicCensus,Kinase`, and `CosmicCensus,Kinase,TumorSuppressorGene`. | `analyses/fusion_filtering/references/genelistreference.txt` | Described at https://github.com/d3b-center/annoFuse |
`OncoKB_cancer_gene` | `Gene_symbol` | `Y` or `N` | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv` | Downloaded from https://www.oncokb.org/cancerGenes |
`OncoKB_oncogene_TSG` | `Gene_symbol` | `Oncogene`, or `TumorSuppressorGene`, or `Oncogene,TumorSuppressorGene` | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv` | Downloaded from https://www.oncokb.org/cancerGenes |
`Gene_full_name` | `Gene_Ensembl_ID` | A single string of gene full name, e.g. cytochrome c oxidase subunit III and ATP synthase F0 subunit 6 | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | MyGene.info v3 API |
`Protein_RefSeq_ID` | `Gene_Ensembl_ID` | A sorted comma separated list of one or more protein RefSeq IDs, e.g. `NP_004065.1`, `NP_000053.2,NP_001027466.1`, and `NP_000985.1,NP_001007074.1,NP_001007075.1` | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | MyGene.info v3 API |
`EFO` | `Disease` | A single string of EFO code, e.g. `EFO_1000069`, `EFO_1002008`, and `EFO_1000177` | `data/efo-mondo-map.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |
`MONDO` | `Disease` | A single string of MONDO code, e.g. `MONDO_0002787`, `MONDO_0020560`, and `MONDO_0009837` | `data/efo-mondo-map.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |
`GTEx_tissue_group_UBERON` | `GTEx_tissue_group` | A single string of UBERON code, e.g. `UBERON_0000955`, `UBERON_0002107`, and `UBERON_0000007` | `data/uberon-map-gtex-group.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |
`GTEx_tissue_subgroup_UBERON` | `GTEx_tissue_subgroup` | A single string of UBERON code, e.g. `UBERON_0002369`, `UBERON_0001870`, and `UBERON_0002038` | `data/uberon-map-gtex-subgroup.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |
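The "sorted comma separated list" values of `Gene_type` and `Protein_RefSeq_ID` can be pictured with the following Python sketch. It is not the annotator's R code: `collapse_values` is a hypothetical helper, and dropping duplicates and empty strings, as well as Python's default string ordering, are assumptions made for illustration.

```python
def collapse_values(values):
    """Join unique, non-empty values into a sorted comma separated string.

    Dropping duplicates and empty strings is an assumption of this sketch,
    not a documented behavior of the annotator.
    """
    unique_values = {v for v in values if v}
    return ",".join(sorted(unique_values))


print(collapse_values(["Kinase", "CosmicCensus"]))
# prints CosmicCensus,Kinase
print(collapse_values(["TumorSuppressorGene", "CosmicCensus", "Kinase", "Kinase"]))
# prints CosmicCensus,Kinase,TumorSuppressorGene
```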
Note: only add `Gene_type` to gene-level tables. For other tables, leave `"Gene_type"` out of the `columns_to_add` parameter of the `annotate_long_format_table` function in `annotator-api.R`, or out of the `-c`/`--columns-to-add` option when running `annotator-cli.R`.
Notes on requiring `Gene_symbol` and `Gene_Ensembl_ID`:
- Certain annotation files use `Gene_symbol` as key columns, and certain other annotation files use `Gene_Ensembl_ID` as key columns.
- Some `Gene_symbol`s are mapped to multiple `Gene_Ensembl_ID`s, so adding `Gene_Ensembl_ID`s by mapping `Gene_symbol`s with `data/ensg-hugo-pmtl-mapping.tsv` may implicitly introduce duplicated rows. Therefore, adding `Gene_Ensembl_ID`s by mapping `Gene_symbol`s is left to users, with a caution about potentially introducing unwanted duplicates.
- Similarly, some `Gene_Ensembl_ID`s are mapped to multiple `Gene_symbol`s, so adding `Gene_symbol`s by mapping `Gene_Ensembl_ID`s with `data/ensg-hugo-pmtl-mapping.tsv` may implicitly introduce duplicated rows. Therefore, adding `Gene_symbol`s by mapping `Gene_Ensembl_ID`s is left to users, with a caution about potentially introducing unwanted duplicates.
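The row duplication described in these notes can be reproduced with a small Python sketch. The gene symbols, Ensembl IDs, and the `add_ensembl_ids` helper below are all hypothetical and only illustrate how a one-to-many mapping changes the number of rows.

```python
def add_ensembl_ids(rows, symbol_to_ensembl_ids):
    """Left-join rows to a Gene_symbol -> Gene_Ensembl_ID mapping."""
    joined = []
    for row in rows:
        # A symbol without any mapping keeps one row with a missing ID.
        ensembl_ids = symbol_to_ensembl_ids.get(row["Gene_symbol"], [None])
        # A symbol with several IDs yields one output row per ID.
        for ensembl_id in ensembl_ids:
            joined.append({**row, "Gene_Ensembl_ID": ensembl_id})
    return joined


# Hypothetical mapping in which GENE_A maps to two Ensembl IDs.
mapping = {"GENE_A": ["ENSG_X1", "ENSG_X2"], "GENE_B": ["ENSG_Y1"]}
rows = [
    {"Gene_symbol": "GENE_A", "tpm_mean": "1.5"},
    {"Gene_symbol": "GENE_B", "tpm_mean": "0.2"},
]
joined = add_ensembl_ids(rows, mapping)
print(len(rows), "->", len(joined))
# prints 2 -> 3
```

Comparing the row counts before and after such a mapping is a cheap way to notice unwanted duplicates.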
Notes on annotation data versions:
- The version of PediatricOpenTargets/OpenPedCan-analysis data release is determined by the `download-data.sh` under the `OpenPedCan-analysis` directory.
- The version of `analyses/fusion_filtering/references/genelistreference.txt` is tracked by GitHub commits, and the GitHub permalink to the currently used file is https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/7fb11a020a92d06c8685736546e860bfe23da7e2/analyses/fusion_filtering/references/genelistreference.txt.
- The versions of other sources are listed in the Update downloaded data that are used in this module section.
Notes on `GTEx_tissue_subgroup` -> `EFO` code mapping:
- The `data/uberon-map-gtex-subgroup.tsv` file contains two `EFO` codes:
  - EFO_0002009 for Cells - Cultured fibroblast
  - EFO_0000572 for Cells - EBV-transformed lymphocytes
- It was decided in a discussion that the `annotator` submodule will not add `GTEx_tissue_subgroup` -> `EFO` mapping, because the cell lines may have no biological context for being searched on the PediatricOpenTargets website.
- If `GTEx_tissue_subgroup` -> `EFO` mapping needs to be added to the `annotator` submodule at a later point, keep using the `EFO` column name for `Disease` -> `EFO` mapping, and add a `GTEx_tissue_subgroup_EFO` annotation column. Even though viewers of annotated tables that have both `EFO` and `GTEx_tissue_subgroup_EFO` may be confused, the API and CLI calls will be backward compatible, so analysis modules that use the annotator will not need to be updated for a changed `Disease` -> `EFO` column name.
The `long-format-table-utils/annotator/annotator-api.R` file provides the `annotate_long_format_table` function for annotating long-format tables.
Use the long-format table annotator API in an analysis module with the following steps:
- Change the working directory of the analysis module to be `OpenPedCan-analysis` or a subdirectory of `OpenPedCan-analysis`. This allows the API function `annotate_long_format_table` to locate annotation data files.
- `source` the `long-format-table-utils/annotator/annotator-api.R` file.
- If the class of the table to be annotated is not `tibble::tbl_df`, convert the table to `tibble::tbl_df` with `tibble::as_tibble`. After conversion, carefully check rownames, colnames, column classes (especially factors), and other properties that may affect the correctness of your code.
- If required columns are not all present in the table to be annotated, add new columns or rename existing ones to have all these required columns.
- Check that the annotation columns to be added are not already present in the table that needs to be annotated. If any annotation column to be added already exists in the table, the `annotate_long_format_table` function will raise an error without annotating the table.
- Call `annotate_long_format_table` to add one or more of the available annotation columns, by specifying the `columns_to_add` parameter. Read the documentation comment of the function for usage.
- Rename, select, and reorder the columns of the annotated table for output in TSV, JSON, or JSONL formats. NOTE that the names of the annotation columns will be standardized at a later point, as suggested by @jharenza at #56 (review), so it is recommended to use the annotation column names in this module for the results.
Following is an example usage in the `rna-seq-expression-summary-stats` module `01-tpm-summary-stats.R`.
```r
> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> source("../long-format-table-utils/annotator/annotator-api.R")
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"
 [3] "cancer_group"                         "cohort"
 [5] "tpm_mean"                             "tpm_sd"
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"
>
> # Gene_Ensembl_ID column is required for adding PMTL column
> # Disease column is required for adding EFO and MONDO columns
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_Ensembl_ID = gene_id, Disease = cancer_group)
>
> annotation_columns_to_add <- c("MONDO", "PMTL", "EFO")
> # Assert all columns to be added are not already present in the
> # colnames(renamed_m_tpm_ss_long_tbl)
> stopifnot(
+   all(!annotation_columns_to_add %in% colnames(renamed_m_tpm_ss_long_tbl)))
>
> annotated_renamed_m_tpm_ss_long_tbl <- annotate_long_format_table(
+   renamed_m_tpm_ss_long_tbl, columns_to_add = annotation_columns_to_add)
>
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_id = Gene_Ensembl_ID, cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, PMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)
```
The `long-format-table-utils/annotator/annotator-cli.R` file provides an R CLI for using the API to annotate long-format tables.
Use the long-format table annotator CLI in an analysis module with the following steps:
- If required columns are not all present in the table to be annotated, add new columns or rename existing ones to have all these required columns.
- Output the table that needs to be annotated in TSV format. NOTEs on the TSV file:
  - The TSV file should use double quotes for field values that need escaping, e.g. "NA" for string literal "NA" and "\t" for tab
  - Only unquoted NA field values are treated as missing values by `annotator-cli.R`
  - Leading and trailing white spaces in field values are NOT trimmed by `annotator-cli.R`
- Change the working directory to be `OpenPedCan-analysis` or a subdirectory of `OpenPedCan-analysis`. This allows the `annotator-cli.R` to locate the `annotator-api.R`.
- Run the `annotator-cli.R` script with `Rscript --vanilla path/to/annotator-cli.R` and proper options. The `Rscript` command can be invoked by R `system("Rscript --vanilla path/to/annotator-cli.R -h")` (if the annotator R API is not preferred) or Python (>= 3.5) `import subprocess; subprocess.run("Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R -h".split())`. For more information about R `system`, see https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html. For more information about Python (>= 3.5) `subprocess.run`, see https://docs.python.org/3/library/subprocess.html#subprocess.run.
- Read the annotated table TSV file. It is recommended to read all fields as character/string types, so the format and the number of significant digits of the double/float can be preserved.
- Rename, select, and reorder the columns of the annotated table for output in TSV, JSON, or JSONL formats.
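For Python callers, the TSV notes in the steps above can be sketched as follows. `format_tsv_field` and `format_tsv_row` are hypothetical helpers, not part of the annotator. How embedded double quotes should be escaped is not described above, so the sketch raises an error for them rather than guessing.

```python
def format_tsv_field(value):
    """Format one field following the TSV notes in the steps above."""
    if value is None:
        # A missing value is written as an unquoted NA.
        return "NA"
    text = str(value)
    if '"' in text:
        # Escaping of embedded double quotes is not described in this
        # README, so this sketch refuses to guess.
        raise ValueError("embedded double quotes are not handled")
    if text == "NA" or "\t" in text:
        # Quote the string literal "NA" and any field that contains a tab.
        return '"' + text + '"'
    # Leading and trailing white spaces are kept as they are.
    return text


def format_tsv_row(values):
    """Join formatted fields with tabs into one TSV line (no newline)."""
    return "\t".join(format_tsv_field(v) for v in values)


print(repr(format_tsv_row(["ENSG00000039139", None, "NA", " padded "])))
```

When reading the annotated TSV back in Python, `csv.reader` with `delimiter="\t"` returns every field as a string, which matches the recommendation to keep numeric fields as character/string types.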
Following is an example usage in the `rna-seq-expression-summary-stats` module `01-tpm-summary-stats.R`.
```r
> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"
 [3] "cancer_group"                         "cohort"
 [5] "tpm_mean"                             "tpm_sd"
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"
>
> # Gene_Ensembl_ID column is required for adding PMTL column
> # Disease column is required for adding EFO and MONDO columns
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_Ensembl_ID = gene_id, Disease = cancer_group)
>
> readr::write_tsv(
+   renamed_m_tpm_ss_long_tbl,
+   "../../scratch/renamed_m_tpm_ss_long_tbl.tsv")
>
> system(paste(
+   "Rscript --vanilla ../long-format-table-utils/annotator/annotator-cli.R",
+   "-r -v -c MONDO,PMTL,EFO",
+   "-i ../../scratch/renamed_m_tpm_ss_long_tbl.tsv",
+   "-o ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv"))
Read ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Annotate ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Output ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv...
Done.
>
> annotated_renamed_m_tpm_ss_long_tbl <- readr::read_tsv(
+   "../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv",
+   na = character(),
+   col_types = readr::cols(.default = readr::col_character()))
|=================================================================================================| 100%  226 MB
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_id = Gene_Ensembl_ID, cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, PMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)
```
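A Python (>= 3.5) caller could assemble the same CLI call roughly as sketched below. `build_annotator_command` is a hypothetical helper. The `-c`, `-i`, and `-o` options are copied from the R example above, whereas `-r` and `-v` are left out because their meaning is not described in this README; run the script with `-h` to check them. The paths assume that the working directory is `OpenPedCan-analysis`.

```python
import subprocess


def build_annotator_command(cli_path, columns_to_add, input_tsv, output_tsv):
    """Build the argument list for one annotator-cli.R call."""
    return [
        "Rscript", "--vanilla", cli_path,
        # The CLI takes the annotation columns as one comma separated value.
        "-c", ",".join(columns_to_add),
        "-i", input_tsv,
        "-o", output_tsv,
    ]


command = build_annotator_command(
    "analyses/long-format-table-utils/annotator/annotator-cli.R",
    ["MONDO", "PMTL", "EFO"],
    "scratch/renamed_m_tpm_ss_long_tbl.tsv",
    "scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv",
)
print(" ".join(command))

# Uncomment to run where Rscript and the module are available.
# check=True raises CalledProcessError if the CLI exits with an error.
# subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems when file paths contain spaces.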
The unit testing is implemented using the `testthat` package version 2.1.1, as suggested by @jharenza and @NHJohnson in the reviews of PR #55.
To run all unit tests, run `bash annotator/run-tests.sh` in the Docker image/container from any working directory. Following is an example run.
```
$ bash annotator/run-tests.sh
✔ | OK F W S | Context
✔ | 55       | tests/test_annotate_long_format_table.R [22.4 s]
✔ | 45       | tests/test_annotator_cli.R [49.6 s]
✔ |  8       | tests/test_collapse_name_vec.R
✔ |  7       | tests/test_collapse_rp_lists.R
✔ | 21       | tests/test_helper_import_function.R
══ Results ═════════════════════
Duration: 72.2 s
OK:       136
Failed:   0
Warnings: 0
Skipped:  0
Done running run-tests.sh
```
To add more tests, create additional `test*R` files under the `annotator/tests` directory, with available `test*R` files as reference.
Notes on the `testthat` unit testing framework:
- `testthat::test_dir("tests")` finds all `test*R` files under the `tests` directory to run, which is used in `annotator/run-tests.sh`.
- `testthat::test_dir("tests")` also finds and runs all `helper*R` files under the `tests` directory before running the `test*R` files.
- The working directory is `tests` when running the `helper*R` and `test*R` files through `testthat::test_dir("tests")`.
- In order to import a function for testing from an R file without running the whole file, a helper function `import_function` is defined at `tests/helper_import_function.R`, and `import_function` is also tested in the `tests/test_helper_import_function.R` file.
- Even though the `testthat` 2.1.1 documentation of the `filter` parameter of the `test_dir` function says that "Matching is performed on the file name after it's stripped of "test-" and ".R"", the R code uses the following. Therefore, test files named like `test_some_test_file.R` can be found by the `test_dir` function.
  - `"^test.*\\.[rR]$"` for finding test files in `find_test_scripts`
  - `sub("^test-?", "", test_names)`, `sub("\\.[rR]$", "", test_names)`, and `grepl(filter, test_names, ...)` for filtering test files in `testthat:::filter_test_scripts`.
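The effect of these patterns on file names can be checked with the Python sketch below. It re-expresses the quoted R regular expressions with Python's `re` module, on the assumption that the two engines treat these simple patterns the same way. The two functions only borrow the `testthat` names for readability, and `test-other.R` and `notes.txt` are made-up file names.

```python
import re


def find_test_scripts(file_names):
    """Keep the names that match the pattern quoted above."""
    return [n for n in file_names if re.search(r"^test.*\.[rR]$", n)]


def filter_test_scripts(file_names, filter_pattern):
    """Strip the 'test-' or 'test' prefix and the extension, then filter."""
    kept = []
    for name in find_test_scripts(file_names):
        stripped = re.sub(r"^test-?", "", name)
        stripped = re.sub(r"\.[rR]$", "", stripped)
        # test_annotator_cli.R becomes _annotator_cli, so the leading
        # underscore stays part of the name that the filter is matched on.
        if re.search(filter_pattern, stripped):
            kept.append(name)
    return kept


files = ["test_annotator_cli.R", "test-other.R",
         "helper_import_function.R", "notes.txt"]
print(find_test_scripts(files))
# prints ['test_annotator_cli.R', 'test-other.R']
print(filter_test_scripts(files, "annotator"))
# prints ['test_annotator_cli.R']
```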