OpenPedCan-analysis/analyses/snv-frequencies at dev · d3b-center/OpenPedCan-analysis

Name	Name	Last commit message	Last commit date
parent directory ..
input	input
results	results
tests	tests
.gitignore	.gitignore
01-snv-frequencies.R	01-snv-frequencies.R
README.md	README.md
run-snv-frequencies.sh	run-snv-frequencies.sh
run-tests.sh	run-tests.sh

Annotate SNV table with mutation frequencies

This analysis is DEPRECATED and was last run with OpenPedCan data release v12.

Module author: Yuanchao Zhang (@logstar)

Annotate SNV table with mutation frequencies

Purpose

Annotate each non-synonymous variant in snv-consensus-plus-hotspots.maf.tsv.gz with mutation frequencies per (cohort, cancer_group, primary/relapse).
Annotate each gene in snv-consensus-plus-hotspots.maf.tsv.gz with non-synonymous variant mutation frequencies per (cohort, cancer_group, primary/relapse).

Issues addressed:

Variant-level mutation frequencies: d3b-center/ticket-tracker-OPC#64
Gene-level mutation frequencies: d3b-center/ticket-tracker-OPC#91
d3b-center/ticket-tracker-OPC#8. This issue is no longer compatible with the purpose of this module. This module intends to compute mutation frequencies for each variant, but this issue intents to compute the mutation frequencies for each gene. This issue is listed here for future reference.
Change independent sample lists used in snv-frequencies: d3b-center/ticket-tracker-OPC#137.
Change biospecimen filtering procedure: d3b-center/ticket-tracker-OPC#287.
Updated analysis: update snv-frequencies for v11 data release: d3b-center/ticket-tracker-OPC#361 This issue updates the tables after the OpenPedCan-analysis v11 data release and also excludes CHOP DGD samples from inclusion in "All Cohorts"
Assert compatibility between histology and independent specimen lists: d3b-center/ticket-tracker-OPC#293
Updated analysis: rerun snv-frequencies with updated v11 pedcbio case IDs d3b-center/ticket-tracker-OPC#386
Updated analysis: snv-frequencies to add HGVSg d3b-center/ticket-tracker-OPC#315
update snv-frequencies tables for v12 d3b-center/ticket-tracker-OPC#529

Methods

Subset overlapping tumor samples between `snv-consensus-plus-hotspots.maf.tsv.gz` and `histologies.tsv`

Subset snv-consensus-plus-hotspots.maf.tsv.gz to keep only samples with sample_type == 'Tumor' and non-NA cancer_group and cohort values in histologies.tsv.

Subset histologies.tsv, ../../data/independent-specimens.wgswxspanel.primary.prefer.wxs.tsv and ../../data/independent-specimens.wgswxspanel.relapse.prefer.wxs.tsv for all cohorts independent samples, and ../../data/independent-specimens.wgswxspanel.primary.eachcohort.prefer.wxs.tsv and ../../data/independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wxs.tsv for each cohort independent samples to keep only samples that are in the snv-consensus-plus-hotspots.maf.tsv.gz subset.

Subset non-synonymous variants

Subset snv-consensus-plus-hotspots.maf.tsv.gz to keep only non-synonymous variants with the following code.

Variant_Classification %in% c('Frame_Shift_Del',
                              'Frame_Shift_Ins',
                              'Splice_Site',
                              'Nonsense_Mutation',
                              'Nonstop_Mutation',
                              'In_Frame_Del',
                              'In_Frame_Ins',
                              'Missense_Mutation',
                              'Fusion',
                              'Multi_Hit',
                              'Multi_Hit_Fusion',
                              'Hom_Deletion',
                              'Hem_Deletion',
                              'Amp',
                              'Del',
                              'Translation_Start_Site')

Generate variant-level and gene-level non-synonymous mutation frequencies

For vairant-level analysis, create a Variant_ID for each variant by concatenating non-NA Chromosome, Start_Position, Reference_Allele, and Tumor_Seq_Allele2 with '_'.

For each cancer_group, get each cohort and all cohorts. Call each cancer_group and cohort(s) combination as a cancer_group_cohort. For example,

cancer_group	cohort	n_samples
Neuroblastoma	CBTN	2
Neuroblastoma	GMKF	541
Neuroblastoma	TARGET	889
Neuroblastoma	CBTN&GMKF&TARGET	1432

For each cancer_group_cohort with n_samples >= 3, compute Frequency_in_overall_dataset, Frequency_in_primary_tumors, and Frequency_in_relapse_tumors as following:

Frequency_in_overall_dataset:
- For each unique variant/gene, count the number of patients (identified by Kids_First_Participant_ID) that have mutations at the variant/gene, and call this number Total_mutations.
- Count the total number of patients in the cancer_group_cohort, and call this number Patients_in_dataset.
- Frequency_in_overall_dataset = Total_mutations / Patients_in_dataset.
Frequency_in_primary_tumors:
- For each unique variant/gene, count the number of samples (identified by Kids_First_Biospecimen_ID) that have mutations at the variant/gene and are in the independent primary sample list, and call this number Total_primary_tumors_mutated.
- Count the total number of samples in the cancer_group_cohort that are also in the independent primary sample list, and call this number Primary_tumors_in_dataset.
- Frequency_in_primary_tumors = Total_primary_tumors_mutated / Primary_tumors_in_dataset.
Frequency_in_relapse_tumors:
- For each unique variant/gene, count the number of samples (identified by Kids_First_Biospecimen_ID) that have mutations at the variant/gene and are in the independent relapse sample list, and call this number Total_relapse_tumors_mutated.
- Count the total number of samples in the cancer_group_cohort that are also in the independent relapse sample list, and call this number Relapse_tumors_in_dataset.
- Frequency_in_relapse_tumors = Total_relapse_tumors_mutated / Relapse_tumors_in_dataset.

Format the SNV mutation frequency table according to the latest spreadsheet that is attached in d3b-center/ticket-tracker-OPC#64.

Merge the SNV mutation frequency tables of all cancer_group_cohorts.

Add annotations

Add the following annotation columns to variant-level and gene-level tables using long-format-table-utils/annotator. For more details about the annotation columns, see long-format-table-utils README.md.

Gene_full_name
Protein_RefSeq_ID
EFO
MONDO
OncoKB_cancer_gene
OncoKB_oncogene_TSG

Add Gene_type, PedcBio_PedOT_oncoprint_plot_URL and PedcBio_PedOT_mutations_plot_URL annotation columns to gene-level tables.

The PedcBioPortal case_set_ids in the URLs are obtained from the sample-lists PedcBioPortal web API, with the following command, where you have to specify the PedcBio STUDY_NAME, ex: openpedcan_v11, and your API-ACCESS-TOKEN:

curl -X GET "https://pedcbioportal.kidsfirstdrc.org/api/studies/STUDY_NAME/sample-lists" -H "accept: application/json" -H "Authorization: Bearer API-ACCESS-TOKEN" > ped_opentargets_pedcbio_case_set_ids.json

To update the case_set_ids, rerun the curl command with your own PedcBioPortal web API access token. The token can be requested and downloaded at https://pedcbioportal.kidsfirstdrc.org/webAPI. More information about the access token is at https://docs.cbioportal.org/2.2-authorization-and-authentication/authenticating-users-via-tokens#using-data-access-tokens.

Two more changes in column names and additions of three more columns are made for OT compatibility:

Gene_Ensembl_Id is changed to targetFromSourceId
EFO to diseaseFromSourceMappedId

Three additional columns added:

chop_uuid includes 32 digit UUIDs that are unique for each row of the file
datatypeId column with somatic_mutation in each row
datasourceId column with chop_gene_level_snv in each row for gene-level table and chop_variant_level_snv in each row for variant-level table.

Results

Results are generated using PediatricOpenTargets/OpenPedCan-analysis data release v10.

The merged variant-level and gene-level SNV mutation frequency tables of all cancer_group_cohorts is output in TSV and JSONL formats.

results/gene-level-snv-consensus-annotated-mut-freq.jsonl.gz
results/gene-level-snv-consensus-annotated-mut-freq.tsv.gz
results/variant-level-snv-consensus-annotated-mut-freq.jsonl.gz
results/variant-level-snv-consensus-annotated-mut-freq.tsv.gz

Note: When reading {variant,gene}-level-snv-consensus-annotated-mut-freq.tsv using readr::read_tsv, you can specify na = character(0) to avoid converting blank string entries to NAs, as suggested by @jharenza at #45 (review). This can be useful when converting {variant,gene}-level-snv-consensus-annotated-mut-freq.tsv to JSON, because the PediatricOpenTargets/OpenPedCan-analysis project favors empty string ('') as missing value over NA/NaN/NULL.

Usage

Change working directory to local OpenPBTA-analysis.
Download data using bash download-data.sh.
Run this analysis module in the continuous integration (CI) docker image using ./scripts/run_in_ci.sh bash analyses/snv-frequencies/run-snv-frequencies.sh.

Analysis scripts

`01-snv-frequencies.R`

This script annotates each non-synonymous variant in snv-consensus-plus-hotspots.maf.tsv.gz with mutation frequencies per (cohort, cancer_group, primary/relapse).

Usage:

Rscript --vanilla '01-snv-frequencies.R'

Input:

../../data/histologies.tsv
../../data/snv-consensus-plus-hotspots.maf.tsv.gz
../../data/independent-specimens.wgswxspanel.primary.prefer.wxs.tsv
../../data/independent-specimens.wgswxspanel.relapse.prefer.wxs.tsv
../../data/independent-specimens.wgswxspanel.primary.eachcohort.prefer.wxs.tsv
../../data/independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wxs.tsv
input/ped_opentargets_pedcbio_case_set_ids.json

Output:

results/gene-level-snv-consensus-annotated-mut-freq.json
results/gene-level-snv-consensus-annotated-mut-freq.tsv
results/variant-level-snv-consensus-annotated-mut-freq.json
results/variant-level-snv-consensus-annotated-mut-freq.tsv

Guide on how to update `01-snv-frequencies.R`

Read the script. 01-snv-frequencies.R contains the following sections:

Function definitions. This section contains the definitions of all functions used in this script. These functions directly use column names of input data, so any future updates need to check whether the column names are changed in new data releases. This script is originally developed using v6 data release.
Create output dir. This section creates output directory.
Read data. This section reads all input data.
Subset tumor samples and used columns in MAF table. This section subsets MAF table.
Subset independent samples in histology table. This section subsets histology table and creates primary and relapse independent sample histology tables.
Compute mutation frequencies. This section:
- computes mutation frequencies for 1) each cancer_group and each cohort and 2) each cancer_group and all cohorts
- add each-cohot and all-cohort PedcBio URLs
- creates gene-level and variant-level mutation frqeuency tables
Add annotations to the output table. This section adds various annotations to the mutation frequency tables.
Output TSV and JSON. This section outputs mutation frequency tables in TSV and JSON formats.

Common updates:

Add a new annotation column:
- If the annotation is available in the long-format-table-utils/annotator:
  - Add new annotation column names in the columns_to_add parameter of annotate_long_format_table calls.
- If the annotation is not available in the the long-format-table-utils/annotator:
  - Load the annotation table in the Read data section. Make sure the join_by column has no NA or duplicate.
  - Add annotations to the result table by dplyr::left_join. Replace NAs with ''.
- Reorder new columns before output.
Change the notation for all_cohorts:
- Change 'all_cohorts' to other values in the get_cohort_set_value() function.
- Change tests/test_get_cohort_set_value.R
Change the format of PedcBio URLs:
- Change the get_pcb_pot_csi() and get_pcb_pot_ploAt_url() functions.
- Change tests/test_get_pcb_pot_csi.R and tests/test_get_pcb_pot_plot_url.R.
- If case_set_ids are no longer used for linking PedcBio, the whole worflow needs to be redesigned.
Add new PedcBio URLs:
- Change the add_cg_ch_pedcbio_pedot_plot_urls() function.
- Retain the new URL columns in the get_cg_ch_var_level_mut_freq_tbl() and get_cg_ch_gene_level_mut_freq_tbl() functions.
Add new mutation frequency columns:
- Change the get_opr_mut_freq_tbl() function.
- Change tests/test_get_opr_mut_freq_tbl.R.
- Retain the new mutation frequency columns in the get_cg_ch_var_level_mut_freq_tbl() and get_cg_ch_gene_level_mut_freq_tbl() functions.

The stopifnot() statements are assertions for the input data, so the output would be expected. If any assertion fails, additional code needs to be added mainly to remove NAs and handle duplicates in the data, so that the assertion pases.

The unit testing is implemented using the testthat package version 2.1.1, as suggested by @jharenza and @NHJohnson in the reviews of PR #55.

To run all unit tests, run bash run-tests.sh in the Docker image/container from any working directory. Following is an example run.

$ bash analyses/snv-frequencies/run-tests.sh
✔ |  OK F W S | Context
✔ |  23       | tests/test_get_cohort_set_value.R
✔ |  31       | tests/test_get_opr_mut_freq_tbl.R [0.8 s]
✔ |  33       | tests/test_get_pcb_pot_csi.R
✔ |  42       | tests/test_get_pcb_pot_plot_url.R
✔ |  21       | tests/test_helper_import_function.R
✔ |  12       | tests/test_num_to_pct_chr.R

══ Results ═════════════════════════════
Duration: 1.0 s

OK:       162
Failed:   0
Warnings: 0
Skipped:  0
Done running run-tests.sh

To add more tests, create additional test*R files under the tests directory, with available test*R files as reference.

Notes on the testthat unit testing framework:

testthat::test_dir("tests") finds all test*R files under the tests directory to run, which is used in run-tests.sh.
testthat::test_dir("tests") also finds and runs all helper*R files under the tests directory before running the test*R files.
The working directory is tests when running the helper*R and test*R files through testthat::test_dir("tests").
In order to import a funciton for testing from an R file without running the whole file, a helper function import_function is defined at tests/helper_import_function.R, and the import_function is also tested in the tests/test_helper_import_function.R file.
Even though the testthat 2.1.1 documentation of the filter parameter of test_dir function says that "Matching is performed on the file name after it's stripped of "test-" and ".R", the R code uses the following. Therefore, naming test files with test_some_test_file.R can be found by the test_dir function.
- "^test.*\\.[rR]$" for finding test files in find_test_scripts
- sub("^test-?", "", test_names), sub("\\.[rR]$", "", test_names), and grepl(filter, test_names, ...) for filtering test files in testthat:::filter_test_scripts.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

snv-frequencies

snv-frequencies

README.md

Annotate SNV table with mutation frequencies

Purpose

Methods

Subset overlapping tumor samples between `snv-consensus-plus-hotspots.maf.tsv.gz` and `histologies.tsv`

Subset non-synonymous variants

Generate variant-level and gene-level non-synonymous mutation frequencies

Add annotations

Results

Usage

Analysis scripts

`01-snv-frequencies.R`

Guide on how to update `01-snv-frequencies.R`

Files

snv-frequencies

Directory actions

More options

Directory actions

More options

Latest commit

History

snv-frequencies

Folders and files

parent directory

README.md

Annotate SNV table with mutation frequencies

Purpose

Methods

Subset overlapping tumor samples between snv-consensus-plus-hotspots.maf.tsv.gz and histologies.tsv

Subset non-synonymous variants

Generate variant-level and gene-level non-synonymous mutation frequencies

Add annotations

Results

Usage

Analysis scripts

01-snv-frequencies.R

Guide on how to update 01-snv-frequencies.R

Subset overlapping tumor samples between `snv-consensus-plus-hotspots.maf.tsv.gz` and `histologies.tsv`

`01-snv-frequencies.R`

Guide on how to update `01-snv-frequencies.R`