Proposed Analysis: Annotate SNV table of mutation frequencies #64

jharenza · 2021-06-24T17:18:50Z

What are the scientific goals of the analysis?

Annotate the SNV TSV table of mutation frequencies per cohort+cancer group+primary/relapse as will be created in #8 for conversion to JSON format.

What methods do you plan to use to accomplish the scientific goals?

Annotate the table with headings as below:
OT_SomaticTables_SNV_CNV.xlsx

Much of this can be achieved by leveraging MAF fields corresponding to the exact variant calls. For ClinVar, we may need to download a version of the database.

Update June 30th

- Instead of COSMIC frequency, annotate as COSMIC mutation census with tier from CosmicMutantExportCensus.tsv
- Proposed Analysis: add oncoKB annotation to consensus MAF #82
- Updated analysis: add pedcbio links to SNV tables #83
- Add RMTL (Y/N)

What input data are required for this analysis?

snv-consensus-plus-hotspots.maf.tsv.gz

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

2-3 days

Who will complete the analysis (please add a GitHub handle here if relevant)?

@logstar ?

What relevant scientific literature relates to this analysis?


logstar · 2021-06-29T15:18:26Z

Hi @jharenza. Thank you for the analysis description.

I have a few questions about how to generate the columns in the "Example Somatic Mutation Table" sheet of OT_SomaticTables_SNV_CNV.xlsx.

1. To generate Frequency in Overall Dataset (example value "4.17%"), should I use all samples, all independent-primary-plus samples, or another set of samples in a cancer_group/cancer_group_cohort?
2. To generate Mutations/Total Samples (example value ".1/24"), should I aggregate all variants of the corresponding gene and divide by "Total Samples"? Similarly, for "Total Samples", should I use all samples, all independent-primary-plus samples, or another set of samples in a cancer_group/cancer_group_cohort?
3. To generate Identifer hg38 (See ClinVar ex) (example value "10_102599545_G_A"), should I concatenate Chromosome, Start_Position, End_Position, Reference_Allele, and Tumor_Seq_Allele2 with "_"? I briefly went over the ClinVar identifier documentation and several examples, but I could not find any identifier exactly like "10_102599545_G_A".
4. To generate the following columns, do we have mapping tables available? If not, I will download them from their corresponding databases.
   - Protein Identifier/Name
   - Protein Refseq ID
   - Predicted Mutation Impact Score
   - Overall COSMIC frequency
   - OncoKB cancer gene
   - OncoKB oncogene/TS gene
5. ~~To generate Hotspot, should I use the HotSpotAllele column in the MAF file?~~ Update (Jun 29 17:39:21 2021): found the answer in doc/data-formats.md. I will use the HotSpotAllele column in the MAF file as the Hotspot column in OT_SomaticTables_SNV_CNV.xlsx.

I will work on the frequencies first.

jharenza · 2021-06-29T17:47:08Z

Hi @logstar

Let's tackle cancer_group_cohort first.

> To generate Frequency in Overall Dataset (example value "4.17%"), should I use all samples, or all independent-primary-plus samples, or other set of samples in a cancer_group/cancer_group_cohort?
>
> To generate Mutations/Total Samples (example value ".1/24"), should I aggregate all variants of the corresponding gene and divide by "Total Samples"? Similarly, for "Total Samples", should I use all samples, or all independent-primary-plus samples, or other set of samples in a cancer_group/cancer_group_cohort?

Ah, good question. I hadn't recalled this field. The way we would create it is to identify, within each cancer_group_cohort, the unique variants per patient. Right now the calls are at the sample level, but you could pull out the BS_IDs and the mutation metadata, merge them with the PT_IDs, drop the BS_IDs, and deduplicate to get patient-level variant calls. This would be the data that goes into Frequency in Overall Dataset. Instead of Mutations/Total Samples, I think we should split this column into two, Total mutations and Total patients in dataset; otherwise it would just be the fraction that corresponds to the percent.

Then, independent-primary will be used for Frequency in primary tumors and independent-relapse for Frequency in relapse tumors. That being said, I think we need another four columns: Total primary tumors mutated, Total primary tumors in dataset, Total relapse tumors mutated, and Total relapse tumors in dataset. Let me update the Excel file, too.
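For reference, the patient-level counting described above could be sketched roughly as follows. This is only an illustration in plain Python, not the module's actual code: the function name `patient_level_counts`, the `variant_id` key, and the shape of the two inputs are assumptions made for the example.

```python
from collections import defaultdict


def patient_level_counts(variant_calls, sample_meta):
    """Count patients carrying each variant within each cancer_group_cohort.

    variant_calls: iterable of (bs_id, variant_id) pairs (sample-level calls).
    sample_meta: dict mapping BS_ID -> (PT_ID, cancer_group_cohort).
    Both input shapes are hypothetical and chosen only for this sketch.
    """
    # Total patients per group come from the metadata, so patients
    # without any variant call still count toward the denominator.
    patients_in_group = defaultdict(set)
    for pt_id, group in sample_meta.values():
        patients_in_group[group].add(pt_id)

    # Map each BS_ID to its PT_ID, drop the BS_ID, and deduplicate with sets.
    patients_with_variant = defaultdict(set)
    for bs_id, variant_id in variant_calls:
        pt_id, group = sample_meta[bs_id]
        patients_with_variant[(group, variant_id)].add(pt_id)

    rows = []
    for (group, variant_id), patients in sorted(patients_with_variant.items()):
        rows.append({
            "cancer_group_cohort": group,
            "variant_id": variant_id,
            "total_mutations": len(patients),
            "total_patients": len(patients_in_group[group]),
        })
    return rows
```

The same function could be applied to the independent-primary and independent-relapse sample subsets to fill the primary and relapse columns, assuming those subsets are passed in as separate `sample_meta` mappings.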

> To generate Identifer hg38 (See ClinVar ex) (example value "10_102599545_G_A"), should I concatenate Chromosome, Start_Position, End_Position, Reference_Allele, and Tumor_Seq_Allele2 with "_"? I briefly went over ClinVar identifier documentation and several examples, but I could not find any identifier exactly like "10_102599545_G_A".

I think you have this almost right; there would be no end position in the above identifier.
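A small sketch of that identifier, built from the MAF fields named in the question minus End_Position. The helper name `hg38_identifier` is made up for illustration, and stripping a leading "chr" from Chromosome is an assumption based on the example value, not something confirmed in this thread.

```python
def hg38_identifier(chromosome, start_position, reference_allele, tumor_seq_allele2):
    """Join Chromosome, Start_Position, Reference_Allele and Tumor_Seq_Allele2 with "_"."""
    chrom = str(chromosome)
    # Assumption: the example value "10_102599545_G_A" has no "chr" prefix,
    # so drop it if the MAF Chromosome field carries one.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return "_".join([chrom, str(start_position), reference_allele, tumor_seq_allele2])


print(hg38_identifier("chr10", 102599545, "G", "A"))  # 10_102599545_G_A
```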

> To generate the following columns, do we have mapping tables available? If not, I will download them from their corresponding databases.
>
> - Protein Identifier/Name
> - Protein Refseq ID
> - Predicted Mutation Impact Score
> - Overall COSMIC frequency
> - OncoKB cancer gene
> - OncoKB oncogene/TS gene

Yes, let me update the Excel file; give me a few minutes.

> To generate Hotspot, should I use the HotSpotAllele column in the MAF file?

yes

logstar · 2021-06-29T18:08:43Z

@jharenza Thank you for the detailed reply.

I will work on cancer_group_cohort first.

I agree it is more informative to have the revised columns.

If I understand correctly, Frequency in Overall Dataset = Total mutations / Total patients in dataset, where Total mutations is the number of patients that have the variant in the corresponding row.

jharenza · 2021-06-29T18:10:54Z

> If I understand correctly, Frequency in Overall Dataset = Total mutations / Total patients in dataset. The Total mutations is the number of patients that has the corresponding variant of the row.

yes, where total mutations is total N of that specific mutation in the dataset
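As a worked check of that formula against the example value in the spreadsheet: 1 mutated patient out of 24 gives 4.17%. The helper below is a hypothetical sketch; returning a blank string when the denominator is zero is an assumption, not an agreed convention.

```python
def format_frequency(total_mutations, total_patients):
    """Frequency in Overall Dataset = Total mutations / Total patients in dataset."""
    if total_patients == 0:
        # Assumption: leave the cell blank when there are no patients.
        return ""
    # The ".2%" format multiplies by 100 and keeps two decimal places.
    return f"{total_mutations / total_patients:.2%}"


print(format_frequency(1, 24))  # 4.17%
```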

logstar · 2021-06-29T18:31:52Z

> > If I understand correctly, Frequency in Overall Dataset = Total mutations / Total patients in dataset. The Total mutations is the number of patients that has the corresponding variant of the row.
>
> yes, where total mutations is total N of that specific mutation in the dataset

Got it. Thank you for the quick reply. I will work on the analysis accordingly.

kgaonkar6 · 2021-06-30T14:18:29Z

Just wanted to make a note from the June 30th call (feel free to update/edit)

- Instead of COSMIC frequency, annotate as COSMIC mutation census with tier from CosmicMutantExportCensus.tsv
- Add RMTL (Y/N)

logstar · 2021-06-30T14:27:48Z

Thank you for the notes.

I wonder where I can get the CosmicMutantExportCensus.tsv and RMTL table for annotation. Are they going to be available in future data releases?

kgaonkar6 · 2021-06-30T14:47:06Z

- RMTL will (soon) be provided in v6.
- CosmicMutantExportCensus.tsv (originally from https://cancer.sanger.ac.uk/census, or a reusable version of the file with the same info) will also be provided in v6.

logstar · 2021-06-30T14:50:52Z

> RMTL will be (soon) provided in v6
> CosmicMutantExportCensus.tsv ( Originally from https://cancer.sanger.ac.uk/census or a reusable version of the file with the info) will also be provided in v6

Got it. Thank you for the quick reply.

jharenza · 2021-06-30T18:33:12Z

I am having problems downloading the CosmicMutantExportCensus.tsv file - I cannot access via the Qiagen website as they suggest, so I submitted an email asking for help there. So, for now, proceed without this annotation.

@taylordm

Squashed commit of the following:

commit d87986b7ce1a517f4807430ce6beaac5950b50ca
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 17:21:59 2021 -0400

Rename mutation-frequencies to snv-frequencies

Rename module.

commit b2d2fd5c391b43214825e7b458d0edcb5ac22f1a
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 17:13:18 2021 -0400

Annotate SNV table with mutation frequencies

Issues addressed:

- <d3b-center/ticket-tracker-OPC#64>
- <d3b-center/ticket-tracker-OPC#8>. This issue is no longer compatible with the purpose of this module. This module intends to compute mutation frequencies for each variant, but this issue intends to compute the mutation frequencies for each gene. This issue is listed here for future reference.

commit 84cacf28927121037f4b9ba895e5baa5d12c7b31
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 16:23:20 2021 -0400

[WIP] Update run-mutation-frequencies.sh

commit 29ae8ef19f2339ae08f78c26ab42e6cf75d3556e
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 16:14:50 2021 -0400

[WIP] Generate annotated SNV frequency table

commit 2cb06741ca192f77a3043d03574649a184459b11
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 14:54:39 2021 -0400

[WIP] Replace NA with blank string

Also replaced HotSpot value 1 with Y and 0 with N.

commit 57776f61e576a5e3e2672370370fd1090f3aa478
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 14:02:13 2021 -0400

[WIP] Use mygene.info to query gene IDs

mygene.info seems to be actively maintained. The query results are more comprehensive than [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html).

Relevant URLs:

- <http://mygene.info/about>
- <https://bioconductor.org/packages/release/bioc/html/mygene.html>

mygene.info is suggested by @taylordm and @jharenza.

commit 76bb0f5236378648adce429e45d3827009735b58
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 10:55:30 2021 -0400

[WIP] Generate SNV frequency tables

Issue addressed: d3b-center/ticket-tracker-OPC#64
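The value cleanup mentioned in the commit messages above (NA replaced with a blank string, HotSpot 1/0 recoded as Y/N) might look roughly like this. It is an illustrative sketch only: the function names are hypothetical, and treating `None` and the literal string "NA" as missing is an assumption about how the table encodes missing values.

```python
def clean_cell(value):
    """Replace a missing value with a blank string; pass other values through."""
    # Assumption: missing values appear as None or as the literal string "NA".
    if value is None or value == "NA":
        return ""
    return value


def recode_hotspot(value):
    """Recode HotSpot 1 as "Y" and 0 as "N"; anything else goes through clean_cell."""
    mapping = {"1": "Y", "0": "N"}
    return mapping.get(str(value), clean_cell(value))


print([recode_hotspot(v) for v in [1, 0, "1", "0", "NA", None]])
# ['Y', 'N', 'Y', 'N', '', '']
```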

logstar · 2021-07-27T21:20:42Z

@jharenza Is the following unfinished task included in the Gene_type annotation? The "COSMIC genes" are listed as a data source for the genelistreference.txt at https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/fusion_filtering.

> Instead of COSMIC frequency, annotate as COSMIC mutation census with tier from CosmicMutantExportCensus.tsv

If not, I could add the required annotation for the v7 annotator and snv-frequencies module updates.

jharenza · 2021-07-27T21:33:37Z

No, this was to use the COSMIC mutation evidence rather than the genes. I never heard back from them, so we can add it as a future ticket and enhancement if we hear back.

logstar · 2021-07-29T14:17:59Z

> No, this was to use the COSMIC mutation evidence rather than the genes. I never heard back from them, so we can add it as a future ticket and enhancement if we hear back.

Got it. I think we could leave this ticket open as a reference.

I will also submit two tickets for adding COSMIC mutation evidence to snv-frequencies and annotator, label them with blocked, and refer to this issue.

runjin326 · 2021-09-16T11:51:50Z

Closing with PR45 merged.

jharenza assigned logstar Jun 24, 2021

jharenza mentioned this issue Jun 24, 2021

Proposed Analysis: Create JSON files for SNV tables #69

Closed

This was referenced Jun 24, 2021

Proposed Analysis: Create mutation frequencies for Ped OT platform #8

Closed

[WIP] Update oncoprint-landscape module to generate mutation frequency table d3b-center/OpenPedCan-analysis#32

Closed

jharenza mentioned this issue Jun 30, 2021

Updated analysis: add pedcbio links to SNV tables #83

Closed

logstar mentioned this issue Jun 30, 2021

Update Dockerfile d3b-center/OpenPedCan-analysis#36

Merged

1 task

logstar mentioned this issue Jun 30, 2021

Annotate SNV table with mutation frequencies d3b-center/OpenPedCan-analysis#45

Merged

5 tasks

logstar mentioned this issue Jul 29, 2021

Updated analysis: add COSMIC mutation evidence annotation to annotator submodule #141

Closed

runjin326 closed this as completed Sep 16, 2021
