This analysis is DEPRECATED and was last run with OpenPedCan data release v12
.
Module author: Yuanchao Zhang (@logstar)
- Annotate SNV table with mutation frequencies
- Annotate each non-synonymous variant in
snv-consensus-plus-hotspots.maf.tsv.gz
with mutation frequencies per(cohort, cancer_group, primary/relapse)
. - Annotate each gene in
snv-consensus-plus-hotspots.maf.tsv.gz
with non-synonymous variant mutation frequencies per(cohort, cancer_group, primary/relapse)
.
Issues addressed:
- Variant-level mutation frequencies: d3b-center/ticket-tracker-OPC#64
- Gene-level mutation frequencies: d3b-center/ticket-tracker-OPC#91
- d3b-center/ticket-tracker-OPC#8. This issue is no longer compatible with the purpose of this module. This module intends to compute mutation frequencies for each variant, but this issue intents to compute the mutation frequencies for each gene. This issue is listed here for future reference.
- Change independent sample lists used in snv-frequencies: d3b-center/ticket-tracker-OPC#137.
- Change biospecimen filtering procedure: d3b-center/ticket-tracker-OPC#287.
- Updated analysis: update snv-frequencies for v11 data release: d3b-center/ticket-tracker-OPC#361 This issue updates the tables after the OpenPedCan-analysis v11 data release and also excludes CHOP DGD samples from inclusion in "All Cohorts"
- Assert compatibility between histology and independent specimen lists: d3b-center/ticket-tracker-OPC#293
- Updated analysis: rerun snv-frequencies with updated v11 pedcbio case IDs d3b-center/ticket-tracker-OPC#386
- Updated analysis:
snv-frequencies
to add HGVSg d3b-center/ticket-tracker-OPC#315 - update snv-frequencies tables for v12 d3b-center/ticket-tracker-OPC#529
Subset snv-consensus-plus-hotspots.maf.tsv.gz
to keep only samples with sample_type == 'Tumor'
and non-NA cancer_group
and cohort
values in histologies.tsv
.
Subset histologies.tsv
, ../../data/independent-specimens.wgswxspanel.primary.prefer.wxs.tsv
and ../../data/independent-specimens.wgswxspanel.relapse.prefer.wxs.tsv
for all cohorts
independent samples, and ../../data/independent-specimens.wgswxspanel.primary.eachcohort.prefer.wxs.tsv
and ../../data/independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wxs.tsv
for each cohort independent samples to keep only samples that are in the snv-consensus-plus-hotspots.maf.tsv.gz
subset.
Subset snv-consensus-plus-hotspots.maf.tsv.gz
to keep only non-synonymous variants with the following code.
Variant_Classification %in% c('Frame_Shift_Del',
'Frame_Shift_Ins',
'Splice_Site',
'Nonsense_Mutation',
'Nonstop_Mutation',
'In_Frame_Del',
'In_Frame_Ins',
'Missense_Mutation',
'Fusion',
'Multi_Hit',
'Multi_Hit_Fusion',
'Hom_Deletion',
'Hem_Deletion',
'Amp',
'Del',
'Translation_Start_Site')
For vairant-level analysis, create a Variant_ID
for each variant by concatenating non-NA Chromosome
, Start_Position
, Reference_Allele
, and Tumor_Seq_Allele2
with '_'
.
For each cancer_group
, get each cohort and all cohorts. Call each cancer_group
and cohort
(s) combination as a cancer_group_cohort
. For example,
cancer_group | cohort | n_samples |
---|---|---|
Neuroblastoma | CBTN | 2 |
Neuroblastoma | GMKF | 541 |
Neuroblastoma | TARGET | 889 |
Neuroblastoma | CBTN&GMKF&TARGET | 1432 |
For each cancer_group_cohort
with n_samples
>= 3, compute Frequency_in_overall_dataset
, Frequency_in_primary_tumors
, and Frequency_in_relapse_tumors
as following:
-
Frequency_in_overall_dataset
:- For each unique variant/gene, count the number of patients (identified by
Kids_First_Participant_ID
) that have mutations at the variant/gene, and call this numberTotal_mutations
. - Count the total number of patients in the
cancer_group_cohort
, and call this numberPatients_in_dataset
. Frequency_in_overall_dataset = Total_mutations / Patients_in_dataset
.
- For each unique variant/gene, count the number of patients (identified by
-
Frequency_in_primary_tumors
:- For each unique variant/gene, count the number of samples (identified by
Kids_First_Biospecimen_ID
) that have mutations at the variant/gene and are in the independent primary sample list, and call this numberTotal_primary_tumors_mutated
. - Count the total number of samples in the
cancer_group_cohort
that are also in the independent primary sample list, and call this numberPrimary_tumors_in_dataset
. Frequency_in_primary_tumors = Total_primary_tumors_mutated / Primary_tumors_in_dataset
.
- For each unique variant/gene, count the number of samples (identified by
-
Frequency_in_relapse_tumors
:- For each unique variant/gene, count the number of samples (identified by
Kids_First_Biospecimen_ID
) that have mutations at the variant/gene and are in the independent relapse sample list, and call this numberTotal_relapse_tumors_mutated
. - Count the total number of samples in the
cancer_group_cohort
that are also in the independent relapse sample list, and call this numberRelapse_tumors_in_dataset
. Frequency_in_relapse_tumors = Total_relapse_tumors_mutated / Relapse_tumors_in_dataset
.
- For each unique variant/gene, count the number of samples (identified by
Format the SNV mutation frequency table according to the latest spreadsheet that is attached in d3b-center/ticket-tracker-OPC#64.
Merge the SNV mutation frequency tables of all cancer_group_cohort
s.
Add the following annotation columns to variant-level and gene-level tables using long-format-table-utils/annotator
. For more details about the annotation columns, see long-format-table-utils
README.md
.
Gene_full_name
Protein_RefSeq_ID
EFO
MONDO
OncoKB_cancer_gene
OncoKB_oncogene_TSG
Add Gene_type
, PedcBio_PedOT_oncoprint_plot_URL
and PedcBio_PedOT_mutations_plot_URL
annotation columns to gene-level tables.
The PedcBioPortal case_set_id
s in the URLs are obtained from the sample-lists
PedcBioPortal web API, with the following command, where you have to specify the PedcBio STUDY_NAME, ex: openpedcan_v11
, and your API-ACCESS-TOKEN:
curl -X GET "https://pedcbioportal.kidsfirstdrc.org/api/studies/STUDY_NAME/sample-lists" -H "accept: application/json" -H "Authorization: Bearer API-ACCESS-TOKEN" > ped_opentargets_pedcbio_case_set_ids.json
To update the case_set_id
s, rerun the curl
command with your own PedcBioPortal web API access token. The token can be requested and downloaded at https://pedcbioportal.kidsfirstdrc.org/webAPI. More information about the access token is at https://docs.cbioportal.org/2.2-authorization-and-authentication/authenticating-users-via-tokens#using-data-access-tokens.
Two more changes in column names and additions of three more columns are made for OT compatibility:
Gene_Ensembl_Id
is changed totargetFromSourceId
EFO
todiseaseFromSourceMappedId
Three additional columns added:
chop_uuid
includes 32 digit UUIDs that are unique for each row of the filedatatypeId
column withsomatic_mutation
in each rowdatasourceId
column withchop_gene_level_snv
in each row for gene-level table andchop_variant_level_snv
in each row for variant-level table.
Results are generated using PediatricOpenTargets/OpenPedCan-analysis data release v10.
The merged variant-level and gene-level SNV mutation frequency tables of all cancer_group_cohort
s is output in TSV and JSONL formats.
-
results/gene-level-snv-consensus-annotated-mut-freq.jsonl.gz
-
results/gene-level-snv-consensus-annotated-mut-freq.tsv.gz
-
results/variant-level-snv-consensus-annotated-mut-freq.jsonl.gz
-
results/variant-level-snv-consensus-annotated-mut-freq.tsv.gz
Note: When reading {variant,gene}-level-snv-consensus-annotated-mut-freq.tsv
using readr::read_tsv
, you can specify na = character(0)
to avoid converting blank string entries to NA
s, as suggested by @jharenza at #45 (review). This can be useful when converting {variant,gene}-level-snv-consensus-annotated-mut-freq.tsv
to JSON, because the PediatricOpenTargets/OpenPedCan-analysis project favors empty string (''
) as missing value over NA
/NaN
/NULL
.
- Change working directory to local
OpenPBTA-analysis
. - Download data using
bash download-data.sh
. - Run this analysis module in the continuous integration (CI) docker image using
./scripts/run_in_ci.sh bash analyses/snv-frequencies/run-snv-frequencies.sh
.
This script annotates each non-synonymous variant in snv-consensus-plus-hotspots.maf.tsv.gz
with mutation frequencies per (cohort, cancer_group, primary/relapse)
.
Usage:
Rscript --vanilla '01-snv-frequencies.R'
Input:
../../data/histologies.tsv
../../data/snv-consensus-plus-hotspots.maf.tsv.gz
../../data/independent-specimens.wgswxspanel.primary.prefer.wxs.tsv
../../data/independent-specimens.wgswxspanel.relapse.prefer.wxs.tsv
../../data/independent-specimens.wgswxspanel.primary.eachcohort.prefer.wxs.tsv
../../data/independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wxs.tsv
input/ped_opentargets_pedcbio_case_set_ids.json
Output:
results/gene-level-snv-consensus-annotated-mut-freq.json
results/gene-level-snv-consensus-annotated-mut-freq.tsv
results/variant-level-snv-consensus-annotated-mut-freq.json
results/variant-level-snv-consensus-annotated-mut-freq.tsv
Read the script. 01-snv-frequencies.R
contains the following sections:
- Function definitions. This section contains the definitions of all functions used in this script. These functions directly use column names of input data, so any future updates need to check whether the column names are changed in new data releases. This script is originally developed using v6 data release.
- Create output dir. This section creates output directory.
- Read data. This section reads all input data.
- Subset tumor samples and used columns in MAF table. This section subsets MAF table.
- Subset independent samples in histology table. This section subsets histology table and creates primary and relapse independent sample histology tables.
- Compute mutation frequencies. This section:
- computes mutation frequencies for 1) each cancer_group and each cohort and 2) each cancer_group and all cohorts
- add each-cohot and all-cohort PedcBio URLs
- creates gene-level and variant-level mutation frqeuency tables
- Add annotations to the output table. This section adds various annotations to the mutation frequency tables.
- Output TSV and JSON. This section outputs mutation frequency tables in TSV and JSON formats.
Common updates:
- Add a new annotation column:
- If the annotation is available in the
long-format-table-utils/annotator
:- Add new annotation column names in the
columns_to_add
parameter ofannotate_long_format_table
calls.
- Add new annotation column names in the
- If the annotation is not available in the the
long-format-table-utils/annotator
:- Load the annotation table in the Read data section. Make sure the
join_by
column has no NA or duplicate. - Add annotations to the result table by
dplyr::left_join
. ReplaceNA
s with''
.
- Load the annotation table in the Read data section. Make sure the
- Reorder new columns before output.
- If the annotation is available in the
- Change the notation for all_cohorts:
- Change
'all_cohorts'
to other values in theget_cohort_set_value()
function. - Change
tests/test_get_cohort_set_value.R
- Change
- Change the format of PedcBio URLs:
- Change the
get_pcb_pot_csi()
andget_pcb_pot_ploAt_url()
functions. - Change
tests/test_get_pcb_pot_csi.R
andtests/test_get_pcb_pot_plot_url.R
. - If
case_set_id
s are no longer used for linking PedcBio, the whole worflow needs to be redesigned.
- Change the
- Add new PedcBio URLs:
- Change the
add_cg_ch_pedcbio_pedot_plot_urls()
function. - Retain the new URL columns in the
get_cg_ch_var_level_mut_freq_tbl()
andget_cg_ch_gene_level_mut_freq_tbl()
functions.
- Change the
- Add new mutation frequency columns:
- Change the
get_opr_mut_freq_tbl()
function. - Change
tests/test_get_opr_mut_freq_tbl.R
. - Retain the new mutation frequency columns in the
get_cg_ch_var_level_mut_freq_tbl()
andget_cg_ch_gene_level_mut_freq_tbl()
functions.
- Change the
The stopifnot()
statements are assertions for the input data, so the output would be expected. If any assertion fails, additional code needs to be added mainly to remove NAs and handle duplicates in the data, so that the assertion pases.
The unit testing is implemented using the testthat
package version 2.1.1, as suggested by @jharenza and @NHJohnson in the reviews of PR #55.
To run all unit tests, run bash run-tests.sh
in the Docker image/container from any working directory. Following is an example run.
$ bash analyses/snv-frequencies/run-tests.sh
✔ | OK F W S | Context
✔ | 23 | tests/test_get_cohort_set_value.R
✔ | 31 | tests/test_get_opr_mut_freq_tbl.R [0.8 s]
✔ | 33 | tests/test_get_pcb_pot_csi.R
✔ | 42 | tests/test_get_pcb_pot_plot_url.R
✔ | 21 | tests/test_helper_import_function.R
✔ | 12 | tests/test_num_to_pct_chr.R
══ Results ═════════════════════════════
Duration: 1.0 s
OK: 162
Failed: 0
Warnings: 0
Skipped: 0
Done running run-tests.sh
To add more tests, create additional test*R
files under the tests
directory, with available test*R
files as reference.
Notes on the testthat
unit testing framework:
testthat::test_dir("tests")
finds alltest*R
files under thetests
directory to run, which is used inrun-tests.sh
.testthat::test_dir("tests")
also finds and runs allhelper*R
files under thetests
directory before running thetest*R
files.- The working directory is
tests
when running thehelper*R
andtest*R
files throughtestthat::test_dir("tests")
. - In order to import a funciton for testing from an R file without running the whole file, a helper function
import_function
is defined attests/helper_import_function.R
, and theimport_function
is also tested in thetests/test_helper_import_function.R
file. - Even though the
testthat
2.1.1 documentation of thefilter
parameter oftest_dir
function says that "Matching is performed on the file name after it's stripped of "test-" and ".R", the R code uses the following. Therefore, naming test files withtest_some_test_file.R
can be found by thetest_dir
function."^test.*\\.[rR]$"
for finding test files infind_test_scripts
sub("^test-?", "", test_names)
,sub("\\.[rR]$", "", test_names)
, andgrepl(filter, test_names, ...)
for filtering test files intestthat:::filter_test_scripts
.