Proposed Analysis: Molecularly subtype ependymoma tumors #245

jharenza · 2019-11-08T15:21:27Z

Scientific goals

What are the scientific goals of the analysis?
Subtype ependymomas into ST-EPN-RELA, ST-EPN-YAP1, PF-EPN-A, and PF-EPN-B. Note: The publication listed below contains 9 subtypes of ependymomas, but only the 4 listed here are relevant to the OpenPBTA dataset (due to age at diagnosis) and will be discussed below.

Proposed methods

What methods do you plan to use to accomplish the scientific goals?

Review of copy number, mutation, and expression data.
Render results in tabular form in a notebook.

ST-EPN-RELA (Supratentorial Ependymoma, RELA fused)

Most commonly harbor C11orf95-RELA fusions due to chromothripsis of chr 11q13.1 (ref). Only subgroup to exhibit chromothripsis.
Other reported RELA fusions include: LTBP3-RELA
PTEN-TAS2R1 has been found in RELA-negative samples (results in loss of PTEN): "Interestingly, in one case of the ST-EPN-RELA subgroup, negative for RELA fusion types 1 and 2, we detected a PTEN-TAS2R1 fusion leading to a frame shift and subsequent disruption of PTEN (Table S1)."
Fusions result in activation of NF-κB target genes - See Figure 4c in above ref.
Genes over-expressed relative to other EPN tumors: RELA and L1CAM
Common CN changes: CDKN2A deletions. Chr9 or Chr9p loss

ST-EPN-YAP1 (Supratentorial Ependymoma, YAP1 fused)

Harbor C11orf95-YAP1, YAP1-MAMLD1, YAP1-MAMLD2, and YAP1-FAM118B fusions
Other reported fusions include: C11orf95-MAML2
Genes over-expressed relative to other EPN tumors: ARLD4 and CLDN1
Focal 11q most frequent CNA, with rest of genome stable

PF-EPN-A (Posterior Fossa Ependymoma, Type A)

Genes over-expressed relative to other EPN tumors: CXorf67 and TKTL1
1q gain most frequent CNA

PF-EPN-B (Posterior Fossa Ependymoma, Type B)

Genes over-expressed relative to other EPN tumors: GPBP1 and IFT46
Chr 6 loss most frequent CNA
Show highest degree of genome instability aside from ST-EPN-RELA subtype, with gains and losses of many chromosomes

May be able to determine brain regions using the primary_site column from pbta-histologies.tsv - Ref:

posterior fossa, aka infratentorial (e.g., cerebellum, tectum, fourth ventricle, and brain stem (midbrain, pons, and medulla))
supratentorial (e.g., cerebrum, lateral ventricle and third ventricle, choroid plexus, pineal gland, hypothalamus, pituitary gland, and optic nerve)
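
A rough sketch of how the two lists above could be turned into a recoding step. This is only an illustration: the keyword tuples are copied from the lists, while the function name, the substring matching, and the "mixed"/"undetermined" labels are assumptions, and the real primary_site strings may need different handling.

```python
# Keywords copied from the two region descriptions above.
POSTERIOR_FOSSA = (
    "cerebellum", "tectum", "fourth ventricle",
    "brain stem", "midbrain", "pons", "medulla",
)
SUPRATENTORIAL = (
    "cerebrum", "lateral ventricle", "third ventricle", "choroid plexus",
    "pineal gland", "hypothalamus", "pituitary gland", "optic nerve",
)


def recode_primary_site(primary_site: str) -> str:
    """Map a free-text primary_site value to a broad brain region.

    Hypothetical helper: matches keywords case-insensitively as substrings.
    """
    site = primary_site.lower()
    has_pf = any(keyword in site for keyword in POSTERIOR_FOSSA)
    has_st = any(keyword in site for keyword in SUPRATENTORIAL)
    if has_pf and has_st:
        return "mixed"  # sites from both compartments are listed
    if has_pf:
        return "posterior fossa"
    if has_st:
        return "supratentorial"
    return "undetermined"  # e.g. spinal or unlisted sites


regions = [recode_primary_site(s) for s in ("Cerebellum/Posterior Fossa", "Lateral Ventricle")]
```

Samples that come back as "mixed" or "undetermined" would need a manual look rather than an automatic call.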

Required input data

What input data will you use for this analysis?
RNA fusions, RNA expression, copy number, SVs, histologies file

Proposed timeline

What is the timeline for the analysis?
1 week

Relevant literature

If there is relevant scientific literature, put links to those items here.
Link to Molecular Classification of Ependymal Tumors across All CNS Compartments, Histopathological Grades, and Age Groups
Link to C11orf95-RELA fusions drive oncogenic NF-κB signalling in ependymoma.


naqvia · 2020-01-14T16:39:32Z

I will work on this. I will do this after the release of v13 (next week).

tkoganti · 2020-01-17T13:26:16Z

I will be working on this ticket starting next week.

jaclyn-taroni · 2020-01-21T11:44:13Z

@cansavvy there is the following note above:

Show highest degree of genome instability aside from ST-EPN-RELA subtype, with gains and losses of many chromosomes

Is there any output you can include (e.g., summary statistic) as part of chromosomal-instability (#394) that could be helpful for this? Right now it looks like the only outputs for that module are plots.

cansavvy · 2020-01-21T14:38:57Z

I can file a PR that saves total chromosomal breakpoint numbers per biospecimen to TSV file. But perhaps to make it more interpretable we need it to be divided by the size of the effectively surveyed region of the genome (i.e. WGS vs WXS)?

jaclyn-taroni · 2020-01-21T15:13:23Z

I can file a PR that saves total chromosomal breakpoint numbers per biospecimen to TSV file. But perhaps to make it more interpretable we need it to be divided by the size of the effectively surveyed region of the genome (i.e. WGS vs WXS)?

Moving discussion about the specifics over to #394 to keep this focused, but to close the loop - yes, we will want this information, divided by the size of the effectively surveyed genome, saved as a TSV.
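
To make the normalization concrete, here is a minimal sketch of "breakpoints divided by the size of the effectively surveyed genome". The surveyed sizes, biospecimen IDs, and breakpoint counts are made-up placeholders, and `breakpoint_density` is a hypothetical helper, not something from the chromosomal-instability module.

```python
def breakpoint_density(n_breaks: int, surveyed_bp: int) -> float:
    """Breakpoints per megabase of effectively surveyed genome."""
    if surveyed_bp <= 0:
        raise ValueError("surveyed_bp must be positive")
    return n_breaks / (surveyed_bp / 1_000_000)


# Placeholder sizes in base pairs (assumptions); real values would come from
# the callable regions for WGS or the capture regions for WXS.
SURVEYED_BP = {"WGS": 3_000_000_000, "WXS": 50_000_000}

samples = [
    {"biospecimen": "BS_EXAMPLE_1", "strategy": "WGS", "n_breaks": 600},
    {"biospecimen": "BS_EXAMPLE_2", "strategy": "WXS", "n_breaks": 10},
]
for sample in samples:
    sample["breaks_per_mb"] = breakpoint_density(
        sample["n_breaks"], SURVEYED_BP[sample["strategy"]]
    )
```

With these toy numbers the raw counts differ by 60x, but both samples end up with the same density, which is the reason for dividing by the surveyed size before comparing WGS and WXS samples.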

jaclyn-taroni · 2020-01-22T20:00:42Z

Hi @tkoganti, as we discussed in person I am filling in a bit of the how behind this ticket.

Now that I have revisited this ticket, I think analyses/molecular-subtyping-EPN is a good place to add this analysis.

Continuous integration

Continuous integration (CI) has some special considerations when we are working in the context of the molecular subtyping tickets.

For continuous integration, we use a set of files that only contains a limited number of participants to save on download time and the amount of RAM needed to run the analyses (Continuous Integration (CI) section of the README).

What this means for this issue (and any other subtyping issue) is that the CI files will often contain few, or sometimes none, of the relevant samples, and that will cause things to fail in CI. To get around this, I suggest that the first thing you add is a script that subsets all the files you will need for subtyping to only the ependymoma samples you are interested in subtyping. You will need to add and commit these files to the repository (perhaps in analyses/molecular-subtyping-EPN/subset-files) so that they can be checked out by the machine that runs continuous integration.

The molecular-subtyping-ATRT module is an example where we do this. Importantly, you will not be able to run the script that does the subsetting in CI (for the same reasons mentioned above). Because you will be committing the subset files to the repository, they will be associated with a particular data release (release-v13-20200116). To keep other folks from accidentally using outdated files, you can add the subsetting to a shell script that runs all the steps for subtyping EPN tumors, with an option to skip it in CI.

Here's the ATRT example following the passing variables only in CI instructions:

OpenPBTA-analysis/analyses/molecular-subtyping-ATRT/run-molecular-subtyping-ATRT.sh

Line 13 in d143e1a

SUBSET=${OPENPBTA_SUBSET:-1}

And then subsetting will be run by default (OPENPBTA_SUBSET=1):

OpenPBTA-analysis/analyses/molecular-subtyping-ATRT/run-molecular-subtyping-ATRT.sh

Line 22 in b4b5230

if [ "$SUBSET" -gt "0" ]; then

But it is skipped in CI with OPENPBTA_SUBSET=0:

OpenPBTA-analysis/.circleci/config.yml

Line 102 in d143e1a

    
command: OPENPBTA_SUBSET=0 ./scripts/run_in_ci.sh bash analyses/molecular-subtyping-ATRT/run-molecular-subtyping-ATRT.sh
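
If part of the pipeline driver were written in Python rather than bash, the same default-on, off-in-CI switch could be sketched as below. This is an illustration only, not code from the repository: `should_subset` is a hypothetical helper, and only the OPENPBTA_SUBSET variable name comes from the ATRT example above.

```python
import os


def should_subset(env=None) -> bool:
    """Mirror SUBSET=${OPENPBTA_SUBSET:-1} followed by [ "$SUBSET" -gt "0" ].

    An unset or empty variable falls back to "1", so subsetting runs by
    default and is skipped only when OPENPBTA_SUBSET=0 is passed (as in CI).
    """
    env = os.environ if env is None else env
    value = env.get("OPENPBTA_SUBSET") or "1"
    return int(value) > 0


if should_subset():
    print("running the subsetting step")
else:
    print("skipping the subsetting step")
```

The `or "1"` mirrors the `:-` form of parameter expansion, which also substitutes the default when the variable is set but empty.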

Where to find specific features

Now we need to figure out which files you will need to satisfy the goals of this issue. Some of these files you will need to subset; others you will not. More on that below.

Files you will not need to subset

Some files are small enough that we don't generate a subset of them for CI, or they are committed to the repository, so they are available in full.

For all fusions relevant to this ticket (e.g. C11orf95-RELA, YAP1-MAMLD1, etc.) you can use the fusion summary file: data/fusion_summary_ependymoma_foi.tsv. No need to subset, we copy the original file for testing in CI (ref).
Fusions result in activation of NF-κB target genes

For this, I would look at the GSVA scores for the HALLMARK_TNFA_SIGNALING_VIA_NFKB pathway. You can find the scores for the poly-A RNA-seq data and stranded RNA-seq data at analyses/gene-set-enrichment-analysis/results/gsva_scores_polya.tsv and analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv, respectively. These are committed to the repo, so no need to subset.
For broad copy number changes (e.g., Chr9 or Chr9p loss, 1q gain, etc.), you can use the broad_values_by_arm.txt files in the GISTIC output (data/pbta-cnv-cnvkit-gistic.zip). We copy the whole zipped file over for testing in CI (ref).
May be able to determine brain regions using the primary_site from data/pbta-histologies.tsv

The pbta-histologies.tsv is available in full in CI as well (ref). @cbethell has an example in the molecular-subtyping-ATRT module of recoding the primary_site to brain region you can see here:

OpenPBTA-analysis/analyses/molecular-subtyping-ATRT/01-ATRT-molecular-subtyping-data-prep.Rmd

Line 133 in d143e1a

# Define regions of the brain (using Anatomy of the Brain figure found at
Show highest degree of genome instability aside from ST-EPN-RELA subtype, with gains and losses of many chromosomes

Another committed file you can use to get the breakpoint density - analyses/chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv.
For gene-specific focal deletions (e.g., CDKN2A deletions) you can use committed files in focal-cn-file-preparation - I'd recommend using the CNVkit files to be consistent with what GISTIC was run on: analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz and analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_x_and_y.tsv.gz
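
For the NF-κB item above, pulling one pathway's scores out of a GSVA results table might look like the sketch below. It assumes a long-format TSV with Kids_First_Biospecimen_ID, hallmark_name, and gsea_score columns; the inline table, the biospecimen IDs, and the `nfkb_scores` helper are made up for illustration, and in practice the handle would be one of the committed gsva_scores_*.tsv files.

```python
import csv
import io

# Toy stand-in for a gsva_scores_*.tsv file (values are invented).
gsva_tsv = (
    "Kids_First_Biospecimen_ID\thallmark_name\tgsea_score\n"
    "BS_EXAMPLE_1\tHALLMARK_TNFA_SIGNALING_VIA_NFKB\t0.42\n"
    "BS_EXAMPLE_1\tHALLMARK_HYPOXIA\t-0.10\n"
    "BS_EXAMPLE_2\tHALLMARK_TNFA_SIGNALING_VIA_NFKB\t-0.31\n"
)


def nfkb_scores(handle, pathway="HALLMARK_TNFA_SIGNALING_VIA_NFKB"):
    """Return {biospecimen ID: score} for a single hallmark pathway."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Kids_First_Biospecimen_ID"]: float(row["gsea_score"])
        for row in reader
        if row["hallmark_name"] == pathway
    }


scores = nfkb_scores(io.StringIO(gsva_tsv))
```

The resulting dictionary can then be restricted to the EPN biospecimens and joined to the fusion and copy number columns of the final table.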

Files you will need to subset

I think the main things you will need to generate subset files for are the items that say

Genes over-expressed relative to other EPN tumors

So you will need to use the expression files (the collapsed expression files to be consistent with other subtyping modules): data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds and data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds. (Note I'm not sure if there are any EPN samples in the poly-A data.) I would recommend presenting the values for the genes mentioned as z-scores, where you scale each gene only once you have subset to the EPN samples.

Other gotchas that come to mind

We don't have any measure of chromothripsis quite yet - that's being worked on in Proposed Analysis: Chromothripsis analysis with ShatterSeek, SV signatures #393 and the related PR Add shatterseek #449.
You'll need to combine RNA-seq data (e.g., fusion) and DNA-seq data (e.g., GISTIC) from the same sample to get a complete picture. Here's documentation about how to approach this: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/data-formats.md#mapping-between-dna-seq-and-rna-seq-data-for-the-same-sample
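
As a rough illustration of the DNA/RNA pairing step, the sketch below groups histology rows by participant and sample. It assumes that Kids_First_Participant_ID plus sample_id identifies the same tumor sample across assays (check the linked documentation for the actual recommendation) and that every strategy other than RNA-Seq counts as DNA; the rows and the `pair_dna_rna` helper are made up.

```python
from collections import defaultdict

# Toy histology rows; IDs are invented.
histologies = [
    {"Kids_First_Biospecimen_ID": "BS_DNA_1", "Kids_First_Participant_ID": "PT_1",
     "sample_id": "S1", "experimental_strategy": "WGS"},
    {"Kids_First_Biospecimen_ID": "BS_RNA_1", "Kids_First_Participant_ID": "PT_1",
     "sample_id": "S1", "experimental_strategy": "RNA-Seq"},
    {"Kids_First_Biospecimen_ID": "BS_RNA_2", "Kids_First_Participant_ID": "PT_2",
     "sample_id": "S2", "experimental_strategy": "RNA-Seq"},
]


def pair_dna_rna(rows):
    """Group biospecimen IDs by (participant, sample), split into DNA and RNA."""
    grouped = defaultdict(lambda: {"DNA": [], "RNA": []})
    for row in rows:
        key = (row["Kids_First_Participant_ID"], row["sample_id"])
        # Assumption: anything that is not RNA-Seq is a DNA assay.
        kind = "RNA" if row["experimental_strategy"] == "RNA-Seq" else "DNA"
        grouped[key][kind].append(row["Kids_First_Biospecimen_ID"])
    return dict(grouped)


pairs = pair_dna_rna(histologies)
```

Samples with an empty DNA or RNA list (like PT_2 here) would only have part of the evidence available in the final table.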

If you want to see an example of what the output of one of these subtyping tickets looks like in terms of the final table and presentation, see what is linked in #435 (comment)

Let me know if you have any questions!

tkoganti · 2020-01-23T17:14:44Z

@jaclyn-taroni Thank you so much for the information!!! This really helps to get me started on this ticket.

I am listing all the files I am going to be using along with the exact filters. Please review this and rectify if any changes need to be made. Also @jharenza if you could answer the questions I wrote here, that would be helpful!

ST-EPN-RELA subtype criteria -

If any one of these fusions is present in the fusion_summary_ependymoma_foi.tsv file -
C11orf95-RELA
LTBP3-RELA
PTEN-TAS2R1 (Do we need samples with this fusion in this group? It says RELA negative have this fusion)
If hallmark_name column in gsva_scores_polya.tsv/gsva_scores_stranded.tsv says the following -
HALLMARK_TNFA_SIGNALING_VIA_NFKB
If the value in rows 9p and 9q, or in 9p alone, is <= -1 in this file - broad_values_by_arm.txt
Genes over-expressed relative to other EPN tumors: RELA and L1CAM
For this one I can -
- get the expression data from (pbta-gene-expression-rsem-fpkm.polya.rds/stranded) for genes (ENSEMBL IDs) ENSG00000173039.18_RELA and ENSG00000198910.12_L1CAM for all samples.
- filter this to only the 749 BSID’s in fusion_summary_ependymoma_foi.tsv file (I am using this file since I want expression data only for the ependymoma samples)
- Calculate the mean expression data from this for RELA and L1CAM
- If there are samples that meet all of the above criteria and also have higher-than-average expression of RELA and L1CAM, categorize them into this subgroup

ST-EPN-YAP1 subtype criteria (needs either one of the fusions or the CNA in 11q??) -

If any one of the following fusions is present in the fusion_summary_ependymoma_foi.tsv file -
C11orf95-YAP1, YAP1-MAMLD1, YAP1-MAMLD2, YAP1-FAM118B, and C11orf95-MAML2
If row 11q in file broad_values_by_arm.txt shows >= 1.0 and the rest of the columns show 0
Same as above for over-expressed genes, except the filter would be for ARLD4 (I did not find this gene in HGNC, any ENSEMBL ID I can use??) and CLDN1 (ENSG00000163347.5_CLDN1)

PF-EPN-A

If 1q row in broad_values_by_arm.txt file has a value of >=1.0
Same as above for gene over-expression, for CXorf67 (ENSG00000187690.3_CXorf67) and TKTL1 (ENSG00000007350.16_TKTL1)

PF-EPN-B

If the broad_values_by_arm.txt file shows a value <= -1 in rows 6p and/or 6q and there are at least 3 more CNA calls?? (I arbitrarily picked three to satisfy the “genome instability because of many chromosomal losses and gains” criterion. If there is anything specific I should look for to check instability, please let me know.)
Same as above for over-expressed genes, for GPBP1 (ENSG00000062194.15_GPBP1) and IFT46 (ENSG00000118096.7_IFT46)
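
The arm-level checks proposed above (1q gain, 6p/6q loss, 9p loss, and so on) could be sketched as follows. This assumes broad_values_by_arm.txt has one row per chromosome arm, with the arm name in the first column and one column per sample, and it uses the proposed >= 1.0 and <= -1 cutoffs, which are still up for discussion. The inline table and the `arm_calls` helper are made up for illustration.

```python
import csv
import io

# Toy stand-in for broad_values_by_arm.txt (values are invented).
broad_tsv = (
    "Chromosome Arm\tBS_EXAMPLE_1\tBS_EXAMPLE_2\n"
    "1q\t1.2\t0.0\n"
    "6q\t0.0\t-1.1\n"
    "9p\t-1.0\t0.0\n"
)


def arm_calls(handle, gain=1.0, loss=-1.0):
    """Return {sample: {arm: "gain" or "loss"}}, omitting neutral arms."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]
    calls = {sample_id: {} for sample_id in sample_ids}
    for row in reader:
        arm = row[0]
        for sample_id, raw in zip(sample_ids, row[1:]):
            value = float(raw)
            if value >= gain:
                calls[sample_id][arm] = "gain"
            elif value <= loss:
                calls[sample_id][arm] = "loss"
    return calls


calls = arm_calls(io.StringIO(broad_tsv))
n_altered_arms = {sample_id: len(arms) for sample_id, arms in calls.items()}
```

Counting the altered arms per sample, as in the last line, is one simple way to approach the "at least 3 more CNA calls" idea, though the breakpoint density file may be a better measure of instability.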

jaclyn-taroni · 2020-01-23T18:26:07Z

Hi @tkoganti - I'll clarify a couple things I'm able to comment on below.

If hallmark_name column in gsva_scores_polya.tsv/gsva_scores_stranded.tsv says the following -
HALLMARK_TNFA_SIGNALING_VIA_NFKB

There should be a row for each sample (Kids_First_Biospecimen_ID) for HALLMARK_TNFA_SIGNALING_VIA_NFKB - you will want to look for EPN samples with high gsea_score values to satisfy the "Fusions result in activation of NF-κB target genes" portion of ST-EPN-RELA subtype criteria.

For

Genes over-expressed relative to other EPN tumors: RELA and L1CAM

I would use the collapsed files: data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds and data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds. These files have already been processed to drop genes with zero counts, etc. (see: collapse-rnaseq) and gene symbols are stored as rownames.

filter this to only the 749 BSID’s in fusion_summary_ependymoma_foi.tsv file (I am using this file since I want expression data only for the ependymoma samples)

The 749 BSIDs correspond to more than the ependymoma samples. (There is a conversation over on #410 about that.) What you will want to do to identify ependymoma samples is to filter the pbta-histologies.tsv file to rows where disease_type_new is Ependymoma, and then if you further filter to rows where experimental_strategy is RNA-Seq, what's left in Kids_First_Biospecimen_ID is what you will need to subset the expression files. I would then log2(x + 1) transform and z-score the RELA and L1CAM values (gene-wise) so the final value gives you an idea of the expression relative to all other EPN samples.
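
Put together, the filter, transform, and scale steps could look something like this minimal sketch. The histology rows, FPKM values, and the `zscore_log2` helper are made up; the column names are the ones mentioned above, and the z-score uses the sample standard deviation (n - 1).

```python
import math
from statistics import mean, stdev


def zscore_log2(values):
    """log2(x + 1) transform, then z-score with the sample standard deviation."""
    logged = [math.log2(v + 1) for v in values]
    mu, sd = mean(logged), stdev(logged)
    if sd == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sd for x in logged]


# Toy histology rows; IDs are invented.
histologies = [
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_1", "disease_type_new": "Ependymoma",
     "experimental_strategy": "RNA-Seq"},
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_2", "disease_type_new": "Ependymoma",
     "experimental_strategy": "RNA-Seq"},
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_3", "disease_type_new": "Ependymoma",
     "experimental_strategy": "RNA-Seq"},
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_4", "disease_type_new": "Medulloblastoma",
     "experimental_strategy": "RNA-Seq"},
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_5", "disease_type_new": "Ependymoma",
     "experimental_strategy": "WGS"},
]

# Step 1: keep only the ependymoma RNA-Seq biospecimens.
epn_rna_ids = [
    row["Kids_First_Biospecimen_ID"]
    for row in histologies
    if row["disease_type_new"] == "Ependymoma"
    and row["experimental_strategy"] == "RNA-Seq"
]

# Step 2: subset the expression values, then scale within EPN only.
fpkm = {"RELA": {"BS_EXAMPLE_1": 0.0, "BS_EXAMPLE_2": 1.0,
                 "BS_EXAMPLE_3": 3.0, "BS_EXAMPLE_4": 100.0}}
rela_z = dict(zip(epn_rna_ids, zscore_log2([fpkm["RELA"][i] for i in epn_rna_ids])))
```

The order matters: because the medulloblastoma sample is dropped before scaling, its very high value does not pull the EPN z-scores down.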

For things like:

if row 11q in file broad_values_by_arm.txt shows >= 1.0 and the rest of the lines show 0

I would caution that the GISTIC results that are available were run on the output of one CNV method (CNVkit) and not on the consensus, so there may be instances right now where GISTIC doesn't call something as neutral but will when using the consensus file (#453). Something to keep in mind.

tkoganti · 2020-01-24T18:41:16Z

Hi @jaclyn-taroni and @jharenza!

There are 93 ependymoma samples in total (from the pbta-histologies.tsv file), and the fusion_summary_ependymoma_foi.tsv file only has 74 of them. These are the 19 samples missing from the fusion summary file. Should I assume the values in the rows for these samples would be zero?

'BS_0BXY0F9N',
'BS_0QYS36NR',
'BS_4FZS7TX4',
'BS_4SCWT0FX',
'BS_C80S5N37',
'BS_EMYET8F4',
'BS_GV3NZ9QD',
'BS_J8VX4D17',
'BS_PRAEF32W',
'BS_PSW27ZTE',
'BS_Q5WZYWCT',
'BS_RCFQ31XF',
'BS_RGPSEMHC',
'BS_SDCNP7MW',
'BS_TCGEZJ5F',
'BS_TXY8QYWA',
'BS_WRHDP7WF',
'BS_XQYHPBFS',
'BS_YE1MAQYJ'

jaclyn-taroni · 2020-01-24T18:47:58Z

@tkoganti can you check if those samples are in the Arriba and STARFusion files? I believe they would need to be in both of the original files to make it to the summary file. If they are not in both, I would consider that data to be missing, rather than the absence of fusions in those samples.

tkoganti · 2020-01-24T19:41:06Z

I checked for a few samples and they are present in both pbta-fusion-starfusion.tsv.gz and pbta-fusion-arriba.tsv.gz. Should I use those files as input then?

jaclyn-taroni · 2020-01-24T19:48:45Z

Can you file a data issue please @tkoganti and describe what you found? We should dig into whether there's an issue with the fusion summary file. I suspect what happened is that these samples are not represented in the putative oncogenic file (i.e., they have 0 fusions that meet the filtering criteria), so it's the equivalent of having a zero for EPN-relevant fusions if my suspicions are correct.
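
The distinction between "missing" and "zero" discussed in the last few comments can be written down as a small rule. This is a sketch under the suspicion stated above, not a confirmed behavior of the fusion-summary module; `epn_fusion_count` and the IDs are hypothetical.

```python
def epn_fusion_count(bsid, summary, starfusion_ids, arriba_ids):
    """Return the EPN fusion count for a biospecimen, or None if data are missing.

    - In the summary file: use the recorded value.
    - Seen by both callers but absent from the summary: treat as 0
      (assumes the sample had no fusions passing the filtering criteria).
    - Not seen by both callers: missing data, not an absence of fusions.
    """
    if bsid in summary:
        return summary[bsid]
    if bsid in starfusion_ids and bsid in arriba_ids:
        return 0
    return None


# Toy inputs; IDs are invented.
summary = {"BS_EXAMPLE_1": 1}
starfusion_ids = {"BS_EXAMPLE_1", "BS_EXAMPLE_2", "BS_EXAMPLE_3"}
arriba_ids = {"BS_EXAMPLE_1", "BS_EXAMPLE_2"}

counts = {
    bsid: epn_fusion_count(bsid, summary, starfusion_ids, arriba_ids)
    for bsid in ("BS_EXAMPLE_1", "BS_EXAMPLE_2", "BS_EXAMPLE_3")
}
```

Keeping None separate from 0 means a sample without usable fusion calls is not mistaken for a fusion-negative one when subtypes are assigned.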

jaclyn-taroni · 2020-01-27T23:28:12Z

Hi @tkoganti, now that #478 is merged, if you update the branch you are working on to be in sync with this master branch (see the command line instructions here) you can use the analyses/fusion-summary/results/fusion_summary_ependymoma_foi.tsv for this instead of the file in data. That won't get updated until the next release.

tkoganti · 2020-01-29T15:48:04Z

@jharenza I am not finding ARLD4 gene in the gene expression file and also in HGNC. Is there an alias I should be using?

jaclyn-taroni · 2020-01-29T16:17:52Z

Maybe: https://www.genecards.org/cgi-bin/carddisp.pl?gene=ARL4D&keywords=ARL4D ?

jharenza · 2020-01-29T16:55:17Z

AHhh, yes, it is ARL4D from this publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5867234/, Supp Figure 1. Sorry about that!

jharenza · 2020-03-09T18:33:30Z

Upon exploration of fusion data for another analysis @kgaonkar6 and I found that the meningioma sample BS_23M72ABG harbors a YAP1--FAM118B fusion. Per this publication, this is specific to ST-EPN-YAP1 subtype, so I think we should rework some criteria for this analysis to search the entire cohort for fusions specific to the ST-EPN-YAP1 and ST-EPN-RELA subtypes as noted above.

jaclyn-taroni · 2020-05-23T14:51:06Z

Closed by #555 / subsumed by #667. Any required updates should go in a new issue.

jharenza added the proposed analysis label Nov 8, 2019

jaclyn-taroni added cnv Related to or requires CNV data molecular subtyping Related to molecular subtyping of tumors snv Related to or requires SNV data transcriptomic Related to or requires transcriptomic data labels Nov 8, 2019

jharenza mentioned this issue Nov 8, 2019

Planned Analysis: Molecularly subtype all tumors #19

Closed

7 tasks

jaclyn-taroni added fusion Related to or requires fusion data sv Related to or requires SV data and removed snv Related to or requires SNV data labels Nov 10, 2019

kgaonkar6 mentioned this issue Dec 12, 2019

added script to add recurrent fusions per histology as results #315

Merged

2 tasks

This was referenced Jan 3, 2020

Proposed Analysis: visualization of CNV and SV data with Circos plot #397

Closed

Proposed Analysis: Fusion files specifically for consumption by molecular subtyping analyses #398

Closed

Added new fusion summary module #410

Merged

jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Jan 15, 2020

jaclyn-taroni mentioned this issue Jan 21, 2020

Proposed Analysis: chromosomal instability burden, recurrently altered genes #394

Closed

cansavvy mentioned this issue Jan 21, 2020

Reconfigure the chromosomal-instability analysis code #461

Merged

5 tasks

tkoganti mentioned this issue Jan 24, 2020

Missing ependymoma samples in fusion summary file #477

Closed

jaclyn-taroni mentioned this issue Jan 26, 2020

Update fusion-summary to include union of biospecimen IDs in fusion callers #478

Merged

5 tasks

tkoganti mentioned this issue Jan 30, 2020

Ependymoma subtyping #490

Merged

tkoganti mentioned this issue Feb 24, 2020

Ependymoma subgroupsamples #555

Merged

jharenza mentioned this issue Mar 5, 2020

Planned Data Release: V16 #601

Closed

5 tasks

jharenza mentioned this issue Mar 11, 2020

Updated analysis: ependymoma subtyping lesions table update #626

Closed

yuankunzhu mentioned this issue Mar 27, 2020

[deprecated] Planned Data Release: V17 #656

Closed

jaclyn-taroni mentioned this issue Apr 1, 2020

Updated analysis: Include EPN subtyping in molecular-subtyping-pathology #667

Closed

jaclyn-taroni closed this as completed May 23, 2020

baileyckelly mentioned this issue Jul 23, 2020

Planned Data Release: V17 #732

Closed

jaclyn-taroni mentioned this issue Sep 18, 2020

molecular subtyping EPN update #785

Merged

5 tasks

jaclyn-taroni mentioned this issue Aug 30, 2021

EPN subtype method added AlexsLemonade/OpenPBTA-manuscript#150

Merged

5 tasks
