Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotate SNV table with mutation frequencies #45

Merged
merged 55 commits into from
Jul 27, 2021

Conversation

logstar
Copy link

@logstar logstar commented Jun 30, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

Annotate each non-synonymous variant in snv-consensus-plus-hotspots.maf.tsv.gz with mutation frequencies per (cohort, cancer_group, primary/relapse).

What was your approach?

Subset snv-consensus-plus-hotspots.maf.tsv.gz to keep only samples with sample_type == 'Tumor'.

Subset snv-consensus-plus-hotspots.maf.tsv.gz to keep only non-synonymous variants with the following code.

Variant_Classification %in% c('Frame_Shift_Del',
                              'Frame_Shift_Ins',
                              'Splice_Site',
                              'Nonsense_Mutation',
                              'Nonstop_Mutation',
                              'In_Frame_Del',
                              'In_Frame_Ins',
                              'Missense_Mutation',
                              'Fusion',
                              'Multi_Hit',
                              'Multi_Hit_Fusion',
                              'Hom_Deletion',
                              'Hem_Deletion',
                              'Amp',
                              'Del',
                              'Translation_Start_Site')

Create a Variant_ID for each variant by concatenating Chromosome, Start_Position, Reference_Allele, and Tumor_Seq_Allele2 with '_'.

Add Gene_full_name and Protein_RefSeq_ID columns to each variant with annotations obtained from mygene.info.

For each cancer_group, get each cohort and all cohorts. Call each cancer_group and cohort(s) combination as a cancer_group_cohort. For example,

cancer_group cohort n_samples
Neuroblastoma CBTN 2
Neuroblastoma GMKF 541
Neuroblastoma TARGET 889
Neuroblastoma CBTN&GMKF&TARGET 1432

For each cancer_group_cohort with n_samples >= 5, compute Frequency_in_overall_dataset, Frequency_in_primary_tumors, and Frequency_in_relapse_tumors as following:

  • Frequency_in_overall_dataset:

    • For each unique variant, count the number of patients (identified by Kids_First_Participant_ID) that have the variant, and call this number Total_mutations.
    • Count the total number of patients in the cancer_group_cohort, and call this number Patients_in_dataset.
    • Frequency_in_overall_dataset = Total_mutations / Patients_in_dataset.
  • Frequency_in_primary_tumors:

    • For each unique variant, count the number of samples (identified by Kids_First_Biospecimen_ID) that are in the ../independent-samples/results/independent-specimens.wgs.primary.tsv, and call this number Total_primary_tumors_mutated.
    • Count the total number of samples in the cancer_group_cohort that are also in the ../independent-samples/results/independent-specimens.wgs.primary.tsv, and call this number Primary_tumors_in_dataset.
    • Frequency_in_primary_tumors = Total_primary_tumors_mutated / Primary_tumors_in_dataset.
  • Frequency_in_relapse_tumors:

    • For each unique variant, count the number of samples (identified by Kids_First_Biospecimen_ID) that are in the ../independent-samples/results/independent-specimens.wgs.relapse.tsv, and call this number Total_relapse_tumors_mutated.
    • Count the total number of samples in the cancer_group_cohort that are also in the ../independent-samples/results/independent-specimens.wgs.relapse.tsv, and call this number Relapse_tumors_in_dataset.
    • Frequency_in_relapse_tumors = Total_relapse_tumors_mutated / Relapse_tumors_in_dataset.

Format the SNV mutation frequency table according to the latest spreadsheet that is attached in d3b-center/ticket-tracker-OPC#64.

Merge the SNV mutation frequency tables of all cancer_group_cohorts.

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The get_cg_ch_mut_freq_tbl function for computing frequencies.

The get_cg_cs_tbl function for getting cacner_group_cohorts.

Is there anything that you want to discuss further?

Is it OK to retain the chr prefix in the Chromosome when creating Identifer hg38 (See ClinVar ex) (Variant_ID in the result table)? The example value is 10_102599545_G_A in the spreadsheet, but it will be chr10_102599545_G_A in the output table.

I will add RMTL and COSMIC annotations when CosmicMutantExportCensus.tsv​ and RMTL_Expanded_3_RMTL_for_OT.csv are available in future data releases.

I will work on d3b-center/ticket-tracker-OPC#82 to add oncoKB annotations to the result table.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Table

What is your summary of the results?

 wc -l results/snv-consensus-annotated-mut-freq.tsv 
  130644 results/snv-consensus-annotated-mut-freq.tsv

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile. Note: wait to be merged at Update Dockerfile #36.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Squashed commit of the following:

commit d87986b7ce1a517f4807430ce6beaac5950b50ca
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 17:21:59 2021 -0400

    Rename mutation-frequencies to snv-frequencies

    Rename module.

commit b2d2fd5c391b43214825e7b458d0edcb5ac22f1a
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 17:13:18 2021 -0400

    Annotate SNV table with mutation frequencies

    Issues addressed:

    - <d3b-center/ticket-tracker-OPC#64>
    - <d3b-center/ticket-tracker-OPC#8>.
      This issue is no longer compatible with the purpose of this module.
      This module intends to compute mutation frequencies for each variant,
      but this issue intents to compute the mutation frequencies for each gene.
      This issue is listed here for future reference.

commit 84cacf28927121037f4b9ba895e5baa5d12c7b31
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 16:23:20 2021 -0400

    [WIP] Update run-mutation-frequencies.sh

commit 29ae8ef19f2339ae08f78c26ab42e6cf75d3556e
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 16:14:50 2021 -0400

    [WIP] Generate annotated SNV frequency table

commit 2cb06741ca192f77a3043d03574649a184459b11
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 14:54:39 2021 -0400

    [WIP] Replace NA with blank string

    Also replaced HotSpot value 1 with Y and 0 with N.

commit 57776f61e576a5e3e2672370370fd1090f3aa478
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 14:02:13 2021 -0400

    [WIP] Use mygene.info to query gene IDs

    mygene.info seems to be actively maintained. The query results are more
    comprehensive than
    [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html).

    Relevant URLs:
    - <http://mygene.info/about>
    - <https://bioconductor.org/packages/release/bioc/html/mygene.html>

    mygene.info is suggested by @taylordm and  @jharenza.

commit 76bb0f5236378648adce429e45d3827009735b58
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 10:55:30 2021 -0400

    [WIP] Generate SNV frequency tables

    Issue addressed:
    d3b-center/ticket-tracker-OPC#64
Add snv-frequencies module in the table.
When there is no primary or relapse samples, mutation frequencies are
0/0 = NaN in R. Replace 'NaN%' in the output with '', in order to be
consistent with missing values in other columns.
@logstar
Copy link
Author

logstar commented Jul 1, 2021

Update to this PR:

Replace 'NaN%' frequencies with ''

When there is no primary or relapse sample in a cancer_group_cohort, mutation frequencies are computed with 0 / 0, which takes the value of NaN in R.

In the result table, replace 'NaN%' with empty string '' for the NaN frequencies, in order to be consistent with missing values in other columns.

Copy link
Member

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @logstar! Thanks for the quick PR for this table!

Now that #34 is in, will you update this with the files in v6? This will change your CBTN dataset/cohort to PBTA and will condense a few cancer groups. In addition, per our call on Wednesday, we want to add the EFO and MONDO columns to this file, and those can be added with a merge of efo-mondo-map.tsv. Note, this is not yet complete - we only have about half of the cancer groups in there so far, but it is a start.

For CBTN&GMKF&TARGET in Dataset, can you recode this to PedOT - we will use this general designation to mean inclusive of all datasets.

I also noticed that you have some NAs in SIFT, PolyPhen, and some other columns when I read this into R using read_tsv(), but it looks like if I do:

freq <- read_tsv("~/Documents/GitHub/ped-ot/OpenPedCan-analysis/analyses/snv-frequencies/results/snv-consensus-annotated-mut-freq.tsv",
                 na=character(0))

the blanks are retained, so I wanted to note that here because when we do the JSON conversions, which may be in staggered PRs, we will have to add that argument.

Overall, the format looks good, however, I am worried about the results - for instance in neuroblastoma, the expected number of ALK mutations in primary tumors should be ~10%, but we are only seeing ~5.85% and this doesn't align with the current GMKF pedcbio study which has a frequency of 9%. There are some mutations in pedcbio we are not capturing here, but I have a suspicion that the study loaded in pedcbio is only using Strelka2 data and not a consensus of 4 callers, so, I have to check into this a bit. Regardless, even with consensus, this number should be higher. cc @migbro @yuankunzhu

@logstar
Copy link
Author

logstar commented Jul 2, 2021

Thank you for the review @jharenza !

Hi @logstar! Thanks for the quick PR for this table!

Now that #34 is in, will you update this with the files in v6? This will change your CBTN dataset/cohort to PBTA and will condense a few cancer groups. In addition, per our call on Wednesday, we want to add the EFO and MONDO columns to this file, and those can be added with a merge of efo-mondo-map.tsv. Note, this is not yet complete - we only have about half of the cancer groups in there so far, but it is a start.

Yes. I will update this with the v6 data release files and add EFO and MONDO columns.

For CBTN&GMKF&TARGET in Dataset, can you recode this to PedOT - we will use this general designation to mean inclusive of all datasets.

Sure. Will do.

I also noticed that you have some NAs in SIFT, PolyPhen, and some other columns when I read this into R using read_tsv(), but it looks like if I do:

freq <- read_tsv("~/Documents/GitHub/ped-ot/OpenPedCan-analysis/analyses/snv-frequencies/results/snv-consensus-annotated-mut-freq.tsv",
                 na=character(0))

the blanks are retained, so I wanted to note that here because when we do the JSON conversions, which may be in staggered PRs, we will have to add that argument.

Good catch! I will add it in the README.md. It will be helpful for anyone who wants to read the tsv table.

For JSON conversion, I was wondering if I could add jsonlite::write_json(result_tibble, 'result.json') in the end of the R script to output an additional JSON file, so we can skip the read_tsv step and staggered PRs.

Overall, the format looks good, however, I am worried about the results - for instance in neuroblastoma, the expected number of ALK mutations in primary tumors should be ~10%, but we are only seeing ~5.85% and this doesn't align with the current GMKF pedcbio study which has a frequency of 9%. There are some mutations in pedcbio we are not capturing here, but I have a suspicion that the study loaded in pedcbio is only using Strelka2 data and not a consensus of 4 callers, so, I have to check into this a bit. Regardless, even with consensus, this number should be higher. cc @migbro @yuankunzhu

I see a similar ~6.10% ALK overall SNV frequency in my updated oncoprint-landscape module result table.

@logstar
Copy link
Author

logstar commented Jul 2, 2021

Updates to this PR:

  • Added EFO, MONDO, and RMTL. For RMTL, add a single column combining rmtl and version in the format of rmtl (version), i.e. Relevant Molecular Target (RMTL version 1) or Non-Relevant Molecular Target (RMTL version 1).
  • Renamed cohort PBTA&GMKF&TARGET with PedOT.
  • Updated README.md to note na=character(0).

@logstar
Copy link
Author

logstar commented Jul 3, 2021

Update to this PR:

  • Output SNV mutation frequency table in JSON format.

Output m_mut_freq_tbl in 01-snv-frequencies.R with the following code:

jsonlite::write_json(
  m_mut_freq_tbl,
  file.path(tables_dir, 'snv-consensus-annotated-mut-freq.json'))

gzip the JSON file in run-snv-frequencies.sh to save space with the following code:

# --no-name option stops the filename and timestamp from being stored in the
# output file. So rerun will have the same file.
gzip --no-name results/snv-consensus-annotated-mut-freq.json

There is one duplicate of ENSG-HUGO mapping.

Also added additional assertions.
Move RMTL column after Gene_symbol.

Rename Variant_ID to Variant_ID_hg38.
Use `%>% mutate_all(function(x) replace_na(x, replace = ''))` to replace
NA/NaNs in all columns with ''.

Remove the added column that is not used in the script.
In run-snv-frequencies.sh, remove existing results dir before running
01-snv-frequencies.R, so gzip will not ask for confirmation to
overwrite.
@kgaonkar6
Copy link

Hi @logstar just wanted to ask if you envision 1 module for each data type like snv-frequencies, cnv-frequencies fusion-frequencies etc OR can we rename this module so we can retain frequency/json files here for all data types?

@logstar
Copy link
Author

logstar commented Jul 6, 2021

Hi @logstar just wanted to ask if you envision 1 module for each data type like snv-frequencies, cnv-frequencies fusion-frequencies etc OR can we rename this module so we can retain frequency/json files here for all data types?

Thank you for asking. I think 1 module for each data type might be more efficient to implement, document, maintain, and review.

If we need to combine the tables into one in the future, we could add another module for combining snv, cnv, and fusion frequency tables together into one.

If there is any reason that we should have one module for all data types, we could also proceed with that.

Removed the `rm -r results` part in run-rna-seq-expression-summary-stats.sh,
as suggested by @jharenza at
d3b-center#27 (comment)
Copy link

@kgaonkar6 kgaonkar6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code overall is well documented and implements all the checks 👍

Do you think we could divide the 01 script to have all your functions in 1 or multiple utils scripts in this module to gather the counts given a Kids_First_Biospecimen_ID and a Variant_ID/Alt_ID and another 01 script to read mutation/annotation files, call this function and left_join the counts?

Just wanted to discuss this since I think this would a good way to have reusable function for other data types ( I did play around your code and seem to have one for fusion. which I can add as PR for your review) . Thoughts?

Update README.md for recent patches.
@logstar
Copy link
Author

logstar commented Jul 7, 2021

The code overall is well documented and implements all the checks 👍

Do you think we could divide the 01 script to have all your functions in 1 or multiple utils scripts in this module to gather the counts given a Kids_First_Biospecimen_ID and a Variant_ID/Alt_ID and another 01 script to read mutation/annotation files, call this function and left_join the counts?

Just wanted to discuss this since I think this would a good way to have reusable function for other data types ( I did play around your code and seem to have one for fusion. which I can add as PR for your review) . Thoughts?

Thank you for the review @kgaonkar6 !

I agree that we could reuse the workflow design of this module. I will also extract a function to generate the frequency related columns and put it in utils, in order to reuse it to generate the gene-level tables that we discussed this morning. If you need the function immediately, you could adapt the linked code for your own purposes, https://github.com/logstar/OpenPedCan-analysis/blob/f70645b6c7e4eb15ea29e45e9ebf0adeb5798b9b/analyses/snv-frequencies/01-snv-frequencies.R#L146-L227.

However, I woud prefer to use the code from other modules by adapting them rather than directly resuing via source('../another-module/utils/some_function.R'), so I do not have rely on any code that has no guarantee to not change interface in the future.

test_num_to_pct_chr.R tests the num_to_pct_chr function in
01-snv-frequencies.R.
Test get_cohort_set_value in tests/test_get_cohort_set_value.R.
get_pcb_pot_csi is tested in tests/test_get_pcb_pot_csi.R.
Test get_pcb_pot_plot_url in tests/test_get_pcb_pot_plot_url.R.
Put the imported function in the same environment as the import_function
being called, so the imported functions can call other imported
functions.
Require Kids_First_Biospecimen_ID and Kids_First_Participant_ID columns
exist and have no NA in overall_histology_df, primary_histology_df, and
relapse_histology_df.
get_opr_mut_freq_tbl is the core function to compute mutation
frequencies. get_opr_mut_freq_tbl is called by get_cg_ch_var_level_mut_freq_tbl and
get_cg_ch_gene_level_mut_freq_tbl to compute mutation frequencies.

get_cg_ch_var_level_mut_freq_tbl and get_cg_ch_gene_level_mut_freq_tbl
are not tested, because they only format the mutation frequency tables
generated by get_opr_mut_freq_tbl.
@logstar
Copy link
Author

logstar commented Jul 23, 2021

@jharenza @kgaonkar6 I have added unit testing for this PR and removed test case comments in 01-snv-frequencies.R.

All tests passed in the Docker image. See the commits from c58c8d5 for more details.

$ bash run-tests.sh
✔ |  OK F W S | Context
✔ |  23       | tests/test_get_cohort_set_value.R
✔ |  31       | tests/test_get_opr_mut_freq_tbl.R [0.8 s]
✔ |  33       | tests/test_get_pcb_pot_csi.R
✔ |  42       | tests/test_get_pcb_pot_plot_url.R
✔ |  21       | tests/test_helper_import_function.R
✔ |  12       | tests/test_num_to_pct_chr.R

══ Results ═════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 1.1 s

OK:       162
Failed:   0
Warnings: 0
Skipped:  0
Done running run-tests.sh

I will work on d3b-center/ticket-tracker-OPC#122 before incorporating the annotator API into this PR. If this PR is merged, I will create a new PR for refactoring the annotation part in this module.

In import_function, envir = parent.frame() makes the
environment(imported_function) to be the same as the import function
being called.

Even though the eval documentation says the default envir parameter is
parent.frame(), leaving envir = parent.frame() in the eval call will
surprisingly make the environment(imported_function) to be the same as
the environment of the import_function call. The tests also fail without
specifying envir = parent.frame(). Maybe this is caused by the place
where parent.frame() is evaluated? If specified, parent.frame() is
evaluated in the calling function; if not specified, parent.frame() is
evaluated in the eval call environment?

Examples executed in terminal R, in order to avoid RStudio customizations.
> print(environment())
<environment: R_GlobalEnv>
>
> foo <- function() {
+   print(parent.frame())
+   print(environment())
+   return(eval(quote(function() { return(2) })))
+ }
>
> bar <- foo()
<environment: R_GlobalEnv>
<environment: 0x5645ead67670>
> print(environment(bar))
<environment: 0x5645ead67670>
>
> baz <- function() {
+   print(parent.frame())
+   print(environment())
+   return(eval(quote(function() { return(2) }),
+               envir = parent.frame()))
+ }
>
> qux <- baz()
<environment: R_GlobalEnv>
<environment: 0x5645ead5fec0>
> print(environment(qux))
<environment: R_GlobalEnv>
@logstar
Copy link
Author

logstar commented Jul 26, 2021

Added WIP label to this PR.

I will use the annotator API to add annotations to the SNV tables. This will reduce the number of lines in analyses/snv-frequencies/01-snv-frequencies.R, which may reduce the workload for the reviewers to review this PR.

I will keep using v6 data and independent sample lists in this PR, and I will create a ticket for rerunning the snv-frequency module using the v7 data and independent sample lists.

@logstar
Copy link
Author

logstar commented Jul 26, 2021

@jharenza @kgaonkar6 Sorry for changing the plan.

This PR will not be further changed for integrating the annotator API, because the v6 independent sample lists and the annotator API cannot be both available in this PR without including the analyses/long-format-table-utils directory in the "Files changed" tab of this PR.

I will create a new PR for integrating the annotator API and rerunning for v7, after the v7 data release is available.

Following is the clarification for the forced push:

The forced push is to completely revert merging with the lft-utils-ann-r-cli branch. I merged the lft-utils-ann-r-cli rather than the dev branch, because dev branch has v7 independent lists, in order to get the annotator API. However, the files added in lft-utils-ann-r-cli showed up in the "Files changed" tab, even though the lft-utils-ann-r-cli has already been merged to the base dev branch of this PR, which would make the "Files changed" tab and the commit history confusing.

The forced push completely reverted this PR to the state of this commit 861be07, which is the latest one before merging with the lft-utils-ann-r-cli branch. Following are the commands used for this forced push.

I assumed no one has pulled the snv-freq branch from my own fork after the merging was pushed and before the forced push. Otherwise, a forced pull may be necessary to sync with this PR/branch.

A merge can also be reverted with git --revert, but it may cause complex issues to the commit history.

$ git reset 861be07959c8590c25711d0c16e55fa92b016e8e
Unstaged changes after reset:
M       analyses/README.md

$ git st
On branch snv-freq
Your branch is behind 'origin/snv-freq' by 102 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   README.md

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        long-format-table-utils/

$ git reset --hard
HEAD is now at 861be07 Specify envir = parent.frame() in eval call

# long-format-table-utils is not tracked, so rm -r is necessary to remove it
$ rm -r long-format-table-utils

$ git st
On branch snv-freq
Your branch is behind 'origin/snv-freq' by 102 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)

nothing to commit, working tree clean

$ git push -f
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/logstar/OpenPedCan-analysis.git
 + 4f1781f...861be07 snv-freq -> snv-freq (forced update)

Copy link
Member

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice job on this - two minor comments about removing redundant json files and renaming the variant level file, and once addressed, this can be merged.

analyses/README.md Show resolved Hide resolved
analyses/snv-frequencies/run-snv-frequencies.sh Outdated Show resolved Hide resolved
analyses/snv-frequencies/01-snv-frequencies.R Show resolved Hide resolved
Remove results/gene-level-snv-consensus-annotated-mut-freq.json.gz and
results/var-level-snv-consensus-annotated-mut-freq.json.gz.
As suggested by @jharenza at
<d3b-center#45 (comment)>,
use variant-level, rather than var-level, in the variant-level filenames.
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants