This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: Molecularly subtype ependymoma tumors #245

Closed
jharenza opened this issue Nov 8, 2019 · 18 comments
Labels
cnv: Related to or requires CNV data
fusion: Related to or requires fusion data
in progress: Someone is working on this issue, but feel free to propose an alternative approach!
molecular subtyping: Related to molecular subtyping of tumors
proposed analysis
sv: Related to or requires SV data
transcriptomic: Related to or requires transcriptomic data

Comments

@jharenza
Collaborator

jharenza commented Nov 8, 2019

Scientific goals

What are the scientific goals of the analysis?
Subtype ependymomas into ST-EPN-RELA, ST-EPN-YAP1, PF-EPN-A, and PF-EPN-B. Note: The publication listed below contains 9 subtypes of ependymomas, but only the 4 listed here are relevant to the OpenPBTA dataset (due to age at diagnosis) and will be discussed below.

Proposed methods

What methods do you plan to use to accomplish the scientific goals?

  1. Review of copy number, mutation, and expression data.
  2. Render results in tabular form in a notebook.

ST-EPN-RELA (Supratentorial Ependymoma, RELA fused)

  • Most commonly harbor C11orf95-RELA fusions due to chromothripsis of chr 11q13.1 (ref). Only subgroup to exhibit chromothripsis.
  • Other reported RELA fusions include: LTBP3-RELA
  • PTEN-TAS2R1 has been found in RELA-negative samples (results in loss of PTEN): "Interestingly, in one case of the ST-EPN-RELA subgroup, negative for RELA fusion types 1 and 2, we detected a PTEN-TAS2R1 fusion leading to a frame shift and subsequent disruption of PTEN (Table S1)."
  • Fusions result in activation of NF-κB target genes - See Figure 4c in above ref.
  • Genes over-expressed relative to other EPN tumors: RELA and L1CAM
  • Common CN changes: CDKN2A deletions. Chr9 or Chr9p loss

ST-EPN-YAP1 (Supratentorial Ependymoma, YAP1 fused)

  • Harbor C11orf95-YAP1, YAP1-MAMLD1, YAP1-MAMLD2, and YAP1-FAM118B fusions
  • Other reported fusions include: C11orf95-MAML2
  • Genes over-expressed relative to other EPN tumors: ARLD4 and CLDN1
  • Focal 11q most frequent CNA, with rest of genome stable

PF-EPN-A (Posterior Fossa Ependymoma, Type A)

  • Genes over-expressed relative to other EPN tumors: CXorf67 and TKTL1
  • 1q gain most frequent CNA

PF-EPN-B (Posterior Fossa Ependymoma, Type B)

  • Genes over-expressed relative to other EPN tumors: GPBP1 and IFT46
  • Chr 6 loss most frequent CNA
  • Show highest degree of genome instability aside from ST-EPN-RELA subtype, with gains and losses of many chromosomes

We may be able to determine brain regions using the primary_site from pbta-histologies.tsv - Ref:

  • posterior fossa, aka infratentorial (e.g., cerebellum, tectum, fourth ventricle, and brain stem (midbrain, pons, and medulla))
  • supratentorial (e.g., cerebrum, lateral ventricle and third ventricle, choroid plexus, pineal gland, hypothalamus, pituitary gland, and optic nerve)
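As a rough illustration of the recoding described in the two bullets above, here is a minimal Python sketch. The keyword lists come straight from those bullets; the function name `brain_region`, the "mixed" and "undetermined" labels, and the substring matching are assumptions made for illustration, and the actual `primary_site` strings in the histologies file may be worded differently.

```python
# Keyword lists copied from the two bullets above. The real primary_site
# values may use different wording, so treat these as a starting point.
POSTERIOR_FOSSA = ("cerebellum", "tectum", "fourth ventricle", "brain stem",
                   "midbrain", "pons", "medulla")
SUPRATENTORIAL = ("cerebrum", "lateral ventricle", "third ventricle",
                  "choroid plexus", "pineal gland", "hypothalamus",
                  "pituitary gland", "optic nerve")


def brain_region(primary_site):
    """Return a coarse brain region for a free-text primary_site value."""
    site = primary_site.lower()
    in_pf = any(term in site for term in POSTERIOR_FOSSA)
    in_st = any(term in site for term in SUPRATENTORIAL)
    if in_pf and in_st:
        return "mixed"  # hypothetical label for sites spanning both regions
    if in_pf:
        return "posterior fossa"
    if in_st:
        return "supratentorial"
    return "undetermined"  # hypothetical label when no keyword matches
```

Samples that come back as "mixed" or "undetermined" would likely need manual review before a PF or ST subtype is assigned.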

Required input data

What input data will you use for this analysis?
RNA fusions, RNA expression, copy number, SVs, histologies file

Proposed timeline

What is the timeline for the analysis?
1 week

Relevant literature

If there is relevant scientific literature, put links to those items here.
Link to Molecular Classification of Ependymal Tumors across All CNS Compartments, Histopathological Grades, and Age Groups
Link to C11orf95-RELA fusions drive oncogenic NF-κB signalling in ependymoma.

jaclyn-taroni added the cnv, molecular subtyping, snv, and transcriptomic labels on Nov 8, 2019
jaclyn-taroni added the fusion and sv labels and removed the snv label on Nov 10, 2019
@naqvia

naqvia commented Jan 14, 2020

I will work on this. I will do this after the release of v13 (next week).

jaclyn-taroni added the in progress label on Jan 15, 2020
@tkoganti
Collaborator

I will be working on this ticket starting next week.

@jaclyn-taroni
Member

@cansavvy there is the following note above:

Show highest degree of genome instability aside from ST-EPN-RELA subtype, with gains and losses of many chromosomes

Is there any output you can include (e.g., summary statistic) as part of chromosomal-instability (#394) that could be helpful for this? Right now it looks like the only outputs for that module are plots.

@cansavvy
Collaborator

I can file a PR that saves total chromosomal breakpoint numbers per biospecimen to TSV file. But perhaps to make it more interpretable we need it to be divided by the size of the effectively surveyed region of the genome (i.e. WGS vs WXS)?

@jaclyn-taroni
Member

I can file a PR that saves total chromosomal breakpoint numbers per biospecimen to TSV file. But perhaps to make it more interpretable we need it to be divided by the size of the effectively surveyed region of the genome (i.e. WGS vs WXS)?

Moving discussion about the specifics over to #394 to keep this focused, but to close the loop - yes, we will want this information, divided by the size of the effectively surveyed genome, saved as a TSV.
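To make the normalization discussed here concrete, the following is a minimal Python sketch of a breakpoint density: the breakpoint count divided by the size of the effectively surveyed genome. The function name and the two region sizes are placeholders chosen for illustration only; the real surveyed sizes for WGS and WXS would have to come from the chromosomal-instability module.

```python
# Placeholder sizes in base pairs, NOT values taken from the project.
SURVEYED_BP = {"WGS": 3_000_000_000, "WXS": 50_000_000}


def breakpoint_density(n_breaks, surveyed_bp):
    """Return breakpoints per megabase of effectively surveyed genome."""
    if surveyed_bp <= 0:
        raise ValueError("surveyed_bp must be positive")
    return n_breaks / (surveyed_bp / 1_000_000)


# The same raw count gives very different densities for the two strategies.
wgs_density = breakpoint_density(30, SURVEYED_BP["WGS"])
wxs_density = breakpoint_density(30, SURVEYED_BP["WXS"])
```

Dividing by the surveyed size is what makes counts from WGS and WXS samples comparable in a single TSV column.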

@jaclyn-taroni
Member

jaclyn-taroni commented Jan 22, 2020

Hi @tkoganti, as we discussed in person I am filling in a bit of the how behind this ticket.

Now that I have revisited this ticket, I think analyses/molecular-subtyping-EPN is a good place to add this analysis.

Continuous integration

Continuous integration (CI) has some special considerations when we are working in the context of the molecular subtyping tickets.

For continuous integration, we use a set of files that only contains a limited number of participants to save on download time and the amount of RAM needed to run the analyses (Continuous Integration (CI) section of the README).

What this means for this issue (and any other subtyping issue) is that the CI files will often contain few, or sometimes none, of the relevant samples, and that will cause things to fail in continuous integration. To get around this, I suggest that the first thing you add is a script that subsets all the files you will need for subtyping to only the ependymoma samples you are interested in subtyping. You will need to add and commit these files to the repository (perhaps in analyses/molecular-subtyping-EPN/subset-files) so that they can be checked out by the machine that runs continuous integration.

The molecular-subtyping-ATRT module is an example where we do this. Importantly, you will not be able to run the script that does the subsetting in CI (for the same reasons mentioned above). Because you will be committing the subset files to the repository, they will be associated with a particular data release (release-v13-20200116). To keep other folks from accidentally using outdated files, you can add the subsetting to a shell script that runs all the steps for subtyping EPN tumors, with an option to skip it in CI:

Here's the ATRT example following the passing variables only in CI instructions:

And then subsetting will be run by default (OPENPBTA_SUBSET=1):

But it is skipped in CI with OPENPBTA_SUBSET=0:

command: OPENPBTA_SUBSET=0 ./scripts/run_in_ci.sh bash analyses/molecular-subtyping-ATRT/run-molecular-subtyping-ATRT.sh
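The ATRT module implements this toggle in a shell script. For readers who prefer to see the logic spelled out, here is a Python analogue of the same idea, offered only as a sketch: subsetting runs by default and is skipped when `OPENPBTA_SUBSET` is `0`. The function name `should_subset` is hypothetical and is not part of the repository.

```python
import os


def should_subset(environ=None):
    """Return True unless OPENPBTA_SUBSET is explicitly set to "0"."""
    env = os.environ if environ is None else environ
    # Default to "1" so that subsetting runs outside of CI.
    return env.get("OPENPBTA_SUBSET", "1") != "0"


if should_subset():
    # A real script would regenerate the subset files here.
    pass
```

Defaulting to "run" means a local user always regenerates the subset files from the current data release, while CI opts out explicitly.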

Where to find specific features

Now we need to figure out which files you will need to satisfy the goals of this issue. Some of these files you will need to subset; others you will not. More on that below.

Files you will not need to subset

Some files are small enough that we don't subset them for CI, or they are committed to the repository, so they are available in full.

  • For all fusions relevant to this ticket (e.g. C11orf95-RELA, YAP1-MAMLD1, etc.) you can use the fusion summary file: data/fusion_summary_ependymoma_foi.tsv. No need to subset, we copy the original file for testing in CI (ref).

  • Fusions result in activation of NF-κB target genes

    For this, I would look at the GSVA scores for the HALLMARK_TNFA_SIGNALING_VIA_NFKB pathway. You can find the scores for the poly-A RNA-seq data and stranded RNA-seq data at analyses/gene-set-enrichment-analysis/results/gsva_scores_polya.tsv and analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv, respectively. These are committed to the repo, so no need to subset.

  • For broad copy number changes (e.g., Chr9 or Chr9p loss, 1q gain, etc.), you can use the broad_values_by_arm.txt files in the GISTIC output (data/pbta-cnv-cnvkit-gistic.zip). We copy the whole zipped file over for testing in CI (ref).

  • May be able to determine brain regions using the primary_site from data/pbta-histologies.tsv

    The pbta-histologies.tsv is available in full in CI as well (ref). @cbethell has an example in the molecular-subtyping-ATRT module of recoding the primary_site to brain region you can see here:

    # Define regions of the brain (using Anatomy of the Brain figure found at

  • Show highest degree of genome instability aside from ST-EPN-RELA subtype, with gains and losses of many chromosomes

    Another committed file you can use to get the breakpoint density - analyses/chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv.

  • For gene-specific focal deletions (e.g., CDKN2A deletions) you can use committed files in focal-cn-file-preparation - I'd recommend using the CNVkit files to be consistent with what GISTIC was run on: analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz and analyses/focal-cn-file-preparation/results/cnvkit_annotated_cn_x_and_y.tsv.gz

Files you will need to subset

I think the main information you will need to generate subset files for are the instances that say

Genes over-expressed relative to other EPN tumors

So you will need to use the expression files (the collapsed expression files to be consistent with other subtyping modules): data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds and data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds. (Note I'm not sure if there are any EPN samples in the poly-A data.) I would recommend presenting the values for the genes mentioned as z-scores, where you scale each gene only once you have subset to the EPN samples.

Other gotchas that come to mind

If you want to see an example of what the output of one of these subtyping tickets looks like in terms of the final table and presentation, see what is linked in #435 (comment).

Let me know if you have any questions!

@tkoganti
Collaborator

tkoganti commented Jan 23, 2020

@jaclyn-taroni Thank you so much for the information!!! This really helps to get me started on this ticket

I am listing all the files I am going to be using along with the exact filters. Please review this and rectify if any changes need to be made. Also @jharenza if you could answer the questions I wrote here, that would be helpful!

ST-EPN-RELA subtype criterion -

  1. If any one of these fusions is present in the fusion_summary_ependymoma_foi.tsv file -
    C11orf95-RELA
    LTBP3-RELA
    PTEN-TAS2R1 (Do we need samples with this fusion in this group? It says RELA negative have this fusion)

  2. If hallmark_name column in gsva_scores_polya.tsv/gsva_scores_stranded.tsv says the following -
    HALLMARK_TNFA_SIGNALING_VIA_NFKB

  3. If the value in rows 9p and 9q, or in row 9p alone, is <= -1 in this file - broad_values_by_arm.txt

  4. Genes over-expressed relative to other EPN tumors: RELA and L1CAM
    For this one I can -
    - get the expression data from (pbta-gene-expression-rsem-fpkm.polya.rds/stranded) for genes (ENSEMBL IDs) ENSG00000173039.18_RELA and ENSG00000198910.12_L1CAM for all samples.
    - filter this to only the 749 BSID’s in fusion_summary_ependymoma_foi.tsv file (I am using this file since I want expression data only for the ependymoma samples)
    - Calculate the mean expression data from this for RELA and L1CAM
    - If there are samples that agree on all the above criteria and also have higher expression than the average in RELA and L1CAM, categorize into this subgroup

ST-EPN-YAP1 subtype criterion(Either needs one of the fusion or the CNA in 11q??) -

  1. If any one of the following fusions is present in the fusion_summary_ependymoma_foi.tsv file -
    C11orf95-YAP1, YAP1-MAMLD1, YAP1-MAMLD2, YAP1-FAM118B, and C11orf95-MAML2

  2. if row 11q in file broad_values_by_arm.txt shows >= 1.0 and the rest of the columns show 0

  3. Same as above. For over-expressed genes except the filter category would be for ARLD4(I did not find this gene in HGNC, any ENSEMBL ID I can use??) and CLDN1 (ENSG00000163347.5_CLDN1)

PF-EPN-A

  1. If 1q row in broad_values_by_arm.txt file has a value of >=1.0
  2. Same as above for gene over expression for CXorf67(ENSG00000187690.3_CXorf67) and TKTL1(ENSG00000007350.16_TKTL1)

PF-EPN-B

  1. If broad_values_by_arm.txt file shows value in rows 6p and/or 6q as <=-1 and there are at least 3 more CNA calls?? (I randomly picked three to satisfy the “genome instability because of many chromosomal losses and gains” criterion, if there is anything specifically I should look for to check instability, please let me know. )
  2. Same as above for over expressed samples in GPBP1(ENSG00000062194.15_GPBP1) and IFT46(ENSG00000118096.7_IFT46)
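The arm-level copy number checks in this proposal can be sketched as a small Python function. This is only an illustration of the thresholds listed above (which were still under discussion), not an agreed-upon classifier. It assumes the values for one sample have already been pulled out of broad_values_by_arm.txt into a dictionary keyed by arm name, and that an arm absent from the dictionary is treated as neutral (0); both are assumptions, as are the function and key names.

```python
def arm_level_flags(arm_values, loss_cutoff=-1.0, gain_cutoff=1.0):
    """Flag the proposed arm-level criteria for one sample.

    arm_values maps an arm name such as "9p" to that sample's value in
    broad_values_by_arm.txt. Missing arms are treated as 0 (an assumption).
    """
    def lost(arm):
        return arm_values.get(arm, 0.0) <= loss_cutoff

    def gained(arm):
        return arm_values.get(arm, 0.0) >= gain_cutoff

    # "Rest of the genome stable": every arm other than 11q is exactly 0.
    others_neutral = all(v == 0 for arm, v in arm_values.items()
                         if arm != "11q")
    return {
        "chr9_or_9p_loss": lost("9p"),                     # ST-EPN-RELA
        "11q_gain_only": gained("11q") and others_neutral,  # ST-EPN-YAP1
        "1q_gain": gained("1q"),                           # PF-EPN-A
        "chr6_loss": lost("6p") or lost("6q"),             # PF-EPN-B
    }
```

Returning one flag per criterion, rather than a single subtype label, keeps the copy number evidence separate from the fusion and expression evidence so that they can be combined in a final table.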

@jaclyn-taroni
Member

Hi @tkoganti - I'll clarify a couple things I'm able to comment on below.

If hallmark_name column in gsva_scores_polya.tsv/gsva_scores_stranded.tsv says the following -
HALLMARK_TNFA_SIGNALING_VIA_NFKB

There should be a row for each sample (Kids_First_Biospecimen_ID) for HALLMARK_TNFA_SIGNALING_VIA_NFKB - you will want to look for EPN samples with high gsea_score values to satisfy the "Fusions result in activation of NF-κB target genes" portion of ST-EPN-RELA subtype criteria.
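A minimal Python sketch of this lookup is shown below. It relies only on the three column names mentioned in this thread (Kids_First_Biospecimen_ID, hallmark_name, gsea_score); the function name, the toy biospecimen IDs, and the toy scores are made up for illustration.

```python
import csv
import io


def nfkb_scores(tsv_text, epn_ids,
                pathway="HALLMARK_TNFA_SIGNALING_VIA_NFKB"):
    """Map each EPN biospecimen ID to its score for one hallmark pathway."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Kids_First_Biospecimen_ID"]: float(row["gsea_score"])
        for row in reader
        if row["hallmark_name"] == pathway
        and row["Kids_First_Biospecimen_ID"] in epn_ids
    }


# Toy input: the IDs and scores below are invented, not real data.
example = (
    "Kids_First_Biospecimen_ID\thallmark_name\tgsea_score\n"
    "BS_TOY00001\tHALLMARK_TNFA_SIGNALING_VIA_NFKB\t0.42\n"
    "BS_TOY00001\tHALLMARK_HYPOXIA\t-0.10\n"
    "BS_TOY00002\tHALLMARK_TNFA_SIGNALING_VIA_NFKB\t-0.31\n"
)
scores = nfkb_scores(example, {"BS_TOY00001"})
```

What counts as a "high" score is left open here; one option would be to rank or z-score the values across EPN samples only.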

For

Genes over-expressed relative to other EPN tumors: RELA and L1CAM

I would use the collapsed files: data/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds and data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds. These files have already been processed to drop genes with zero counts, etc. (see: collapse-rnaseq) and gene symbols are stored as rownames.

  • filter this to only the 749 BSID’s in fusion_summary_ependymoma_foi.tsv file (I am using this file since I want expression data only for the ependymoma samples)

The 749 BSIDs correspond to more than the ependymoma samples. (There is a conversation over on #410 about that.) What you will want to do to identify ependymoma samples is to filter the pbta-histologies.tsv file to rows where disease_type_new is Ependymoma, and then if you further filter to rows where experimental_strategy is RNA-Seq, what's left in Kids_First_Biospecimen_ID is what you will need to subset the expression files. I would then log2(x + 1) transform and z-score the RELA and L1CAM values (gene-wise) so the final value gives you an idea of the expression relative to all other EPN samples.
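The filtering and scaling steps described here could look roughly like the following in Python. The column names are the ones named in this comment; the function names are hypothetical, and the histology rows are assumed to have been read into dictionaries already (for example with csv.DictReader). The z-score uses the sample standard deviation, which is also what R's scale() does by default.

```python
import math
import statistics


def epn_rnaseq_ids(histology_rows):
    """Return biospecimen IDs of ependymoma RNA-Seq samples."""
    return [row["Kids_First_Biospecimen_ID"] for row in histology_rows
            if row["disease_type_new"] == "Ependymoma"
            and row["experimental_strategy"] == "RNA-Seq"]


def log2_zscores(fpkm_by_sample):
    """log2(x + 1) transform one gene's FPKM values, then z-score them.

    fpkm_by_sample should contain EPN samples only (at least two), so the
    result is relative to other EPN tumors.
    """
    logged = {s: math.log2(v + 1) for s, v in fpkm_by_sample.items()}
    values = list(logged.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        return {s: 0.0 for s in logged}
    return {s: (v - mean) / sd for s, v in logged.items()}
```

Subsetting to the EPN samples before scaling is the important part: a z-score computed across the whole cohort would measure expression relative to all tumor types instead.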

For things like:

if row 11q in file broad_values_by_arm.txt shows >= 1.0 and the rest of the lines show 0

I would caution that the GISTIC results that are available were run on the output of one CNV method (CNVkit) and not on the consensus, so there may be instances right now where GISTIC doesn't call something as neutral but will when using the consensus file (#453). Something to keep in mind.

@tkoganti
Collaborator

tkoganti commented Jan 24, 2020

Hi @jaclyn-taroni and @jharenza!

There are 93 ependymoma samples in total (from the pbta-histologies.tsv file), and the fusion_summary_ependymoma_foi.tsv file only has 74 of them. These are the 19 samples missing from the fusion summary file. Should I assume the values in the rows for these samples would be zero?

'BS_0BXY0F9N',
'BS_0QYS36NR',
'BS_4FZS7TX4',
'BS_4SCWT0FX',
'BS_C80S5N37',
'BS_EMYET8F4',
'BS_GV3NZ9QD',
'BS_J8VX4D17',
'BS_PRAEF32W',
'BS_PSW27ZTE',
'BS_Q5WZYWCT',
'BS_RCFQ31XF',
'BS_RGPSEMHC',
'BS_SDCNP7MW',
'BS_TCGEZJ5F',
'BS_TXY8QYWA',
'BS_WRHDP7WF',
'BS_XQYHPBFS',
'BS_YE1MAQYJ'

@jaclyn-taroni
Member

@tkoganti can you check if those samples are in the Arriba and STARFusion files? I believe they would need to be in both of the original files to make it to the summary file. If they are not in both, I would consider that data to be missing, rather than the absence of fusions in those samples.
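The distinction drawn here, between a sample with no fusion data and a sample with data but no relevant fusion, can be kept explicit in code. The sketch below is illustrative only: the function name, the toy IDs, and the `C11orf95--RELA` column name are assumptions, and the summary is assumed to be loaded as a dictionary keyed by biospecimen ID.

```python
def fusion_call(summary, sample_id, fusion):
    """Return the 0/1 call for a fusion, or None if the sample has no row."""
    row = summary.get(sample_id)
    if row is None:
        return None  # missing data, not evidence that the fusion is absent
    return int(row[fusion])


# Toy summary keyed by biospecimen ID; IDs and column name are invented.
toy_summary = {
    "BS_TOY00001": {"C11orf95--RELA": "1"},
    "BS_TOY00002": {"C11orf95--RELA": "0"},
}
toy_epn_ids = ["BS_TOY00001", "BS_TOY00002", "BS_TOY00003"]

# EPN samples with no row in the summary file at all.
missing = sorted(set(toy_epn_ids) - set(toy_summary))
calls = {s: fusion_call(toy_summary, s, "C11orf95--RELA")
         for s in toy_epn_ids}
```

Whether a `None` should later be converted to 0 depends on why the row is absent, which is exactly what checking the Arriba and STARFusion files is meant to establish.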

@tkoganti
Collaborator

tkoganti commented Jan 24, 2020

I checked for a few samples and they are present in both pbta-fusion-starfusion.tsv.gz and pbta-fusion-arriba.tsv.gz. Should I use those files as input then?

@jaclyn-taroni
Member

Can you file a data issue please @tkoganti and describe what you found? We should dig into whether there’s an issue with the fusion summary file. I suspect what happened is that these samples are not represented in the putative oncogenic file (i.e., they have 0 fusions that meet the filtering criteria), so it’s the equivalent of having a zero for EPN-relevant fusions if my suspicions are correct.

@jaclyn-taroni
Member

Hi @tkoganti, now that #478 is merged, if you update the branch you are working on to be in sync with this master branch (see the command line instructions here) you can use the analyses/fusion-summary/results/fusion_summary_ependymoma_foi.tsv for this instead of the file in data. That won't get updated until the next release.

@tkoganti
Collaborator

tkoganti commented Jan 29, 2020

@jharenza I am not finding ARLD4 gene in the gene expression file and also in HGNC. Is there an alias I should be using?

@jaclyn-taroni
Member

Maybe: https://www.genecards.org/cgi-bin/carddisp.pl?gene=ARL4D&keywords=ARL4D ?

@jharenza
Collaborator Author

AHhh, yes, it is ARL4D from this publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5867234/, Supp Figure 1. Sorry about that!

@jharenza
Collaborator Author

jharenza commented Mar 9, 2020

Upon exploration of fusion data for another analysis @kgaonkar6 and I found that the meningioma sample BS_23M72ABG harbors a YAP1--FAM118B fusion. Per this publication, this is specific to ST-EPN-YAP1 subtype, so I think we should rework some criteria for this analysis to search the entire cohort for fusions specific to the ST-EPN-YAP1 and ST-EPN-RELA subtypes as noted above.
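One way to picture the cohort-wide search suggested here is the Python sketch below, which scans every row of a fusion summary regardless of histology. The fusion lists are the ones named earlier in this issue; writing them with a `--` separator as column names, the 0/1 encoding of the values, and the function name are all assumptions about the summary file rather than confirmed details.

```python
# Subtype-specific fusions listed earlier in this issue. Using "--" joined
# names as column names is an assumption about the summary file.
SUBTYPE_FUSIONS = {
    "ST-EPN-RELA": ["C11orf95--RELA", "LTBP3--RELA"],
    "ST-EPN-YAP1": ["C11orf95--YAP1", "YAP1--MAMLD1", "YAP1--MAMLD2",
                    "YAP1--FAM118B", "C11orf95--MAML2"],
}


def subtype_fusion_hits(summary_rows):
    """Return {sample ID: {subtype: [fusions found]}} over ALL samples."""
    hits = {}
    for row in summary_rows:
        for subtype, fusions in SUBTYPE_FUSIONS.items():
            # A column that is absent from the row is treated as 0.
            found = [f for f in fusions if int(row.get(f, 0)) > 0]
            if found:
                sample = row["Kids_First_Biospecimen_ID"]
                hits.setdefault(sample, {})[subtype] = found
    return hits
```

Because no histology filter is applied, a sample labeled as something other than ependymoma would still be reported if it carries one of these fusions.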

@jaclyn-taroni
Member

Closed by #555 / subsumed by #667. Any required updates should go in a new issue.
