This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: Molecularly subtype ATRT tumors #244

Closed
jharenza opened this issue Nov 8, 2019 · 14 comments
Labels
cnv Related to or requires CNV data in progress Someone is working on this issue, but feel free to propose an alternative approach! molecular subtyping Related to molecular subtyping of tumors proposed analysis snv Related to or requires SNV data sv Related to or requires SV data transcriptomic Related to or requires transcriptomic data

Comments

@jharenza
Collaborator

jharenza commented Nov 8, 2019

Scientific goals

What are the scientific goals of the analysis?
Subtype ATRTs into SHH, MYC, TYR.

Proposed methods

What methods do you plan to use to accomplish the scientific goals?
Suggestions:

  1. In the absence of methylation data, perform t-SNE or other clustering of expression data to determine whether the total N of ATRTs is high enough to enable subgroup clustering (See Supplemental Figure 1B in the manuscript below)
  2. Review of histologies file, copy number, SV, mutation, and gene expression data across all ATRT samples.

Summarize results in a table, which can be added to a notebook, with molecular subtype designated.

ATRT TYR

  • most are infratentorial localized
  • broad SMARCB1 deletions (most have chr22q loss/monosomy 22)
  • Major expression markers: overexpression of melanosomal genes (eg: TYR, MITF, DCT)
    • TYR is over-expressed in every case in this subgroup but not expressed in any other subgroup (biomarker)
  • overexpression of VEGFA
  • overexpression of genes involved in ciliogenesis (eg: DNAH11 and SPEF1)
    • Explore additional TF targets in Figure 6A in the manuscript below for overexpression if needed

ATRT SHH

  • infratentorial or supratentorial localized
  • focal SMARCB1 deletions
  • Major expression markers: overexpression of SHH pathway (can be accomplished by looking at single sample GSEA results or expression of SHH pathway genes - eg: MYCN, GLI2)
  • overexpression of CDK6
  • overexpression of ASCL1, HES5/6, and DLL1/3, indicating active NOTCH signaling
    • MYC is the most commonly overexpressed gene in this subgroup
    • Explore additional TF targets in Figure S6A in the manuscript below for overexpression if needed
  • can have SMARCA4 loss, but not required
  • this group has more total mutations than the other two

ATRT MYC

  • most are supratentorial localized
  • focal SMARCB1 deletions
  • Major expression markers: overexpression of MYC and HOX gene cluster - eg: MYC, HOTAIR
    • Explore additional TF targets in Figure S6B in the manuscript below for overexpression if needed

May be able to determine brain regions using the primary_site from pbta-histologies.tsv - Ref:

  • infratentorial aka posterior fossa (eg: cerebellum, tectum, fourth ventricle, and brain stem (midbrain, pons, and medulla))
    • more common in first 2 years of life
    • median survival 17 months
    • fewer total mutations, relative to supratentorial tumors
  • supratentorial (eg: cerebrum, lateral ventricle and third ventricle, choroid plexus, pineal gland, hypothalamus, pituitary gland, and optic nerve)
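
The recoding of primary_site into the two regions listed above could be sketched as follows. This is a minimal illustration, not the project's implementation: the keyword tuples are hypothetical and built only from the region names in this comment, and the actual primary_site strings in pbta-histologies.tsv may be spelled differently or name sites (for example individual lobes) that are not covered here.

```python
# Hypothetical keyword lists taken from the regions named in the comment above.
# Real primary_site values may use other spellings and would need review.
INFRATENTORIAL = ("posterior fossa", "cerebellum", "tectum", "fourth ventricle",
                  "brain stem", "midbrain", "pons", "medulla")
SUPRATENTORIAL = ("cerebrum", "lateral ventricle", "third ventricle",
                  "choroid plexus", "pineal", "hypothalamus", "pituitary",
                  "optic")


def location_summary(primary_site: str) -> str:
    """Recode a free-text primary_site value into a coarse brain region."""
    site = primary_site.lower()
    infra = any(key in site for key in INFRATENTORIAL)
    supra = any(key in site for key in SUPRATENTORIAL)
    if infra and supra:
        return "mixed"          # sites from both regions were reported
    if infra:
        return "infratentorial"
    if supra:
        return "supratentorial"
    return "undetermined"       # leave for manual review rather than guess


print(location_summary("Cerebellum/Posterior Fossa"))
```

Returning "mixed" and "undetermined" instead of forcing a call keeps ambiguous samples visible for someone with domain expertise.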

Required input data

What input data will you use for this analysis?

  • RNA-Seq
  • Copy number
  • SV
  • Somatic Mutations (eg: nonsense mutations, deleterious mutations)

Proposed timeline

What is the timeline for the analysis?
1 week

Relevant literature

If there is relevant scientific literature, put links to those items here.
Link to Atypical Teratoid/Rhabdoid Tumors Are Comprised of Three Epigenetic Subgroups with Distinct Enhancer Landscapes, specifically, Table S3 has a nice summary of genotyping the SMARCB1 locus.

@jaclyn-taroni jaclyn-taroni added cnv Related to or requires CNV data transcriptomic Related to or requires transcriptomic data sv Related to or requires SV data molecular subtyping Related to molecular subtyping of tumors snv Related to or requires SNV data labels Nov 8, 2019
@jaclyn-taroni
Member

jaclyn-taroni commented Nov 10, 2019

I'm adding what I think the table summarizing the results would contain here. From a cursory look, there are 30 samples that are classified as ATRT in the histologies file. That is a large enough sample size for what I'll suggest below. I agree that one of the first analyses would be unsupervised clustering or dimension reduction.

Tabular format

The goal of the table is to summarize all of the information above in a manner that would allow someone with domain expertise to quickly make relatively easy calls and to identify cases where more information is needed. So it should contain everything mentioned above.

| Kids_First_Participant_ID | Kids_First_Biospecimen_ID | age at diagnosis (days) | reported gender | primary site | location summary | TYR expression z-score | ... | ... | SHH ssGSEA score (rank?) | Notch ssGSEA score (rank?) | Focal SMARCB1 status | Focal SMARCA4 status | chr22q loss | Tumor Mutation Burden (rank?) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PT_XXXXXXXX | BS_XXXXXXXX | 800 | Female | Cerebellum/Posterior Fossa | infratentorial | 4.235 | ... | ... | 25 | 18 | loss | neutral | ... | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |

Notes on columns

  • age at diagnosis (days), reported gender, primary site can all be obtained from the pbta-histologies.tsv file. location summary would be a recoding of primary site into infratentorial or supratentorial.
  • For any of the genes mentioned in terms of overexpression, a column that contains the z-scores for ATRT samples should be included. This tells us about the expression of that gene in a sample relative to all other ATRT samples.
    • To obtain z-scores, I would expect an analyst to filter an expression matrix to only ATRT samples, log2(x + 1)-transform it, and then z-score the rows.
  • For pathways such as SHH, we can use ssGSEA values, but we want to represent them in a way that tells us about the pathway score in a sample relative to all other ATRT samples. Listing a rank or perhaps z-scores again comes to mind.
    • Notch signaling is captured in the hallmark gene sets used in analyses/ssgsea-hallmark, but different gene sets may need to be identified for SHH.
  • For focal copy number alterations, the files in analyses/focal-cn-file-preparation/results can be used. See also: SMARCB1 deletions in ATRT with current SEG to gene mapping #217
  • To my knowledge, the structural variant data is not yet in an easily consumable format analogous to the focal CN files above.
  • Tumor mutation burden is available as part of the consensus mutation files Consensus Mutation Files Release #207 (included in upcoming planned release Planned data release: v10 #254). This again seems to be something we want to represent relative to other ATRT samples.
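
The z-score step described in the notes above (filter to ATRT samples, log2(x + 1), then z-score each gene's row) can be sketched with the standard library alone. The nested-dict input is a toy stand-in for the expression matrix, and the function name and sample IDs are hypothetical; the sample standard deviation (n - 1) is an assumption, since the notes do not say which one to use.

```python
import math
from statistics import mean, stdev


def row_zscores(expression, atrt_samples):
    """Filter to ATRT samples, log2(x + 1) transform, then z-score each gene.

    expression: {gene: {sample_id: value}}, a toy stand-in for the matrix.
    atrt_samples: IDs to keep (at least two are needed for a standard deviation).
    """
    zscores = {}
    for gene, by_sample in expression.items():
        logged = {s: math.log2(by_sample[s] + 1) for s in atrt_samples}
        mu = mean(logged.values())
        sd = stdev(logged.values())  # sample standard deviation (n - 1), assumed
        # A gene with no variation across samples has no defined z-score.
        zscores[gene] = {s: (v - mu) / sd if sd > 0 else float("nan")
                         for s, v in logged.items()}
    return zscores


toy = {"TYR": {"BS_A": 0, "BS_B": 3, "BS_C": 15, "BS_OTHER": 1000}}
print(row_zscores(toy, ["BS_A", "BS_B", "BS_C"]))
```

Because the filter happens before the mean and standard deviation are computed, each z-score is relative to the other ATRT samples only, which is the comparison the notes ask for.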

@cbethell
Contributor

cbethell commented Nov 15, 2019

I am going to begin the work on this analysis by implementing the suggestions above.

To be more specific, my plan is as follows:

  • Perform unsupervised hierarchical clustering using Heatmap
  • Perform PCA and t-SNE
  • Use the pbta-histologies.tsv file (filtered for short_histology == "ATRT") to obtain primary_site, Kids_First_Participant_ID, Kids_First_Biospecimen_ID, age_at_diagnosis_days, and reported gender.
  • Use primary_site to define values for location_summary. These values will be infratentorial and supratentorial.
  • Calculate the z-scores on the expression data and join this with the metadata.
  • Use the focal CN files to determine focal SMARCB1 and SMARCA4 status, and add 2 columns denoting the respective status to the existing data.frame with location_summary and expression z-scores.
  • Obtain the relevant pathway information from the ssGSEA file and calculate the z-scores. Add this information to the existing data.frame.
  • Use the SV data to create a column denoting whether or not chr22q loss is present.
  • Use the consensus mutation files to rank Tumor Mutation burden.
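
Two of the steps in this plan, ranking a value such as tumor mutation burden among ATRT samples and adding columns to the existing table, might look like the sketch below. All names are hypothetical and plain dicts stand in for the data.frame. It joins on the biospecimen ID for simplicity; if expression and DNA-derived values come from different biospecimens of the same participant, the participant ID would probably be the key to join on instead.

```python
def rank_descending(values):
    """Return {sample_id: rank}, where rank 1 is the highest value.

    Tied values share the best (smallest) rank among them.
    """
    ordered = sorted(values.values(), reverse=True)
    return {sample: ordered.index(value) + 1 for sample, value in values.items()}


def join_columns(metadata, columns):
    """Left-join per-sample columns onto metadata rows.

    metadata: {sample_id: {column_name: value}}
    columns:  {column_name: {sample_id: value}}
    """
    table = {}
    for sample_id, row in metadata.items():
        merged = dict(row)
        for name, by_sample in columns.items():
            merged[name] = by_sample.get(sample_id)  # None marks missing data
        table[sample_id] = merged
    return table


tmb = {"BS_A": 1.2, "BS_B": 3.4, "BS_C": 1.2}
meta = {"BS_A": {"location_summary": "infratentorial"},
        "BS_B": {"location_summary": "supratentorial"}}
table = join_columns(meta, {"tmb_rank": rank_descending(tmb),
                            "SMARCB1_focal_status": {"BS_A": "loss"}})
print(table)
```

Keeping missing values as None rather than dropping the row means a sample without, say, a focal copy number call still appears in the final table and can be flagged for follow-up.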

@jaclyn-taroni jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Nov 15, 2019
@jharenza
Collaborator Author

jharenza commented Dec 10, 2019

@jaclyn-taroni do you and @cbethell want to check whether the GISTIC results for CNVkit (broad_values_by_arm.txt in s3://kf-openaccess-us-east-1-prd-pbta/data/2019-12-10-gistic-results-cnvkit.zip) are consistent with the SMARCB1 deletions found so far and are good enough for this analysis? If so, we can release these results in the next data release.

@cbethell
Contributor

> @jaclyn-taroni do you and @cbethell want to check whether the GISTIC results for CNVkit (broad_values_by_arm.txt in s3://kf-openaccess-us-east-1-prd-pbta/data/2019-12-10-gistic-results-cnvkit.zip) are consistent with the SMARCB1 deletions found so far and are good enough for this analysis? If so, we can release these results in the next data release.

@jharenza yes, I believe the gistic results would be good/useful for this analysis so I would like to see them in the next data release, if possible.

@jharenza jharenza mentioned this issue Dec 10, 2019
5 tasks
jharenza pushed a commit that referenced this issue Dec 17, 2019
### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results for focal CN analyses, eg: [#244](#244) and [#8](#8)
jaclyn-taroni pushed a commit that referenced this issue Dec 19, 2019
* Release V12 data

* Update release-notes.md

fix link

* Update data-files-description.md

fix GISTIC table sectioning

* Update data-files-description.md

fix spacing on data description table

* Update data-files-description.md

fix more spacing in data file description file

* Update download-data.sh

add new release date to download script

* Update the TMB file descriptions

* Update TMB file formats section

* Update fusion section of data formats

Also more specific description of the by sample file

* Add GISTIC file to data-formats

* Update download-data.sh

* Update download-data.sh

* data description md is also included in md5sum

* TMB exon -> coding sequence

* Coding TMB CDS, not exon
@jaclyn-taroni
Member

My understanding of what is left on this ticket:

  • Use the broad_values_by_arm.txt GISTIC file to obtain the chr22q loss information and add that to the final table.
  • Have someone with domain expertise look at the table we have and make tweaks or follow up as needed.
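
Pulling the chr22q status out of the arm-level GISTIC file could be sketched as below. The layout is an assumption (tab-separated, one row per chromosome arm with the arm name in the first column, one column per sample), and the cutoff of 0 for calling a loss is hypothetical; both should be checked against the released broad_values_by_arm.txt before use.

```python
import csv
import io


def chr22q_loss(broad_values_text, threshold=0.0):
    """Return {sample_id: bool} flagging 22q loss from an arm-level table.

    Assumes a tab-separated table: first column is the arm name, remaining
    columns are samples. threshold is a hypothetical cutoff; values below it
    are called a loss.
    """
    reader = csv.reader(io.StringIO(broad_values_text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    for row in reader:
        if row and row[0] == "22q":
            return {s: float(v) < threshold for s, v in zip(samples, row[1:])}
    raise ValueError("no 22q row found in the arm-level table")


toy = "Chromosome Arm\tBS_A\tBS_B\n1p\t0\t0.5\n22q\t-1\t0\n"
print(chr22q_loss(toy))
```

The resulting dict can be added to the final table as the chr22q loss column, one value per sample.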

@jharenza
Collaborator Author

jharenza commented Jan 6, 2020

@jaclyn-taroni was just looking at this today as well. Once the table is done, we have a clinician (probably via email) who can check out the data. It may be nice to point him to a notebook of the table with the final subtypes and columns of criteria used.

@jaclyn-taroni
Member

molecular-subtyping-ATRT will need to be rerun with annotated focal consensus calls (#186) and new GISTIC calls (#453).

@jharenza
Collaborator Author

@cbethell - started going through these results a bit. Can I request that you also add a column for OS_months, derived from OS_days to the results? Thanks!
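
For reference, deriving OS_months from OS_days is a one-line conversion once a month length is chosen. The sketch below assumes an average month of 365.25 / 12 days and one decimal of rounding; neither choice is specified in this thread, and the function name is hypothetical.

```python
DAYS_PER_MONTH = 365.25 / 12  # assumed average month length (about 30.44 days)


def os_days_to_months(os_days):
    """Convert overall survival from days to months; missing stays missing."""
    if os_days is None:
        return None
    return round(os_days / DAYS_PER_MONTH, 1)


print(os_days_to_months(517))
```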

@jharenza
Collaborator Author

jharenza commented Jan 21, 2020

@cbethell sorry for the separate comment - would you be able to also include expression of HES5, HES6, DLL1, and DLL3 and GSEA for http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING.html to assess Notch signaling? This is for ATRT SHH group, as noted above.

Looks like the other genes mentioned are all included.

Thanks!

@jaclyn-taroni
Member

> would you be able to also include expression of HES5, HES6, DLL1, and DLL3 and GSEA for http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING.html to assess Notch signaling? This is for ATRT SHH group, as noted above.

I'm going to pick this up right now @jharenza.

@jharenza
Collaborator Author

Thanks @jaclyn-taroni !

@jharenza
Collaborator Author

Here is a minimal gene list for expression/GSEA for each subgroup:

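| TYR | SHH | MYC |
|---|---|---|
| TYR | MYCN | MYC |
| MITF | GLI2 | HOTAIR |
| DCT | CDK6 | TEAD3 |
| VEGFA | ASCL1 | MYC GSEA |
| DNAH11 | HES5 | |
| SPEF1 | HES6 | |
| MSX2 | DLL1 | |
| STAT3 | DLL3 | |
| PRRX1 | LHX2 | |
| LMX1 | TEAD1 | |
| OTX2 | Notch GSEA | |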
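
One way to put this list next to the per-gene z-scores is a per-sample mean z-score for each subgroup's marker genes, sketched below. This is only a review aid under stated assumptions, not a subtype call or a classifier: the averaging is a swapped-in simplification, the function name is hypothetical, and the two GSEA entries are left out because they are pathway scores rather than single genes.

```python
from statistics import mean

# Gene-level markers copied from the minimal list above (GSEA entries omitted).
MARKERS = {
    "TYR": ["TYR", "MITF", "DCT", "VEGFA", "DNAH11", "SPEF1",
            "MSX2", "STAT3", "PRRX1", "LMX1", "OTX2"],
    "SHH": ["MYCN", "GLI2", "CDK6", "ASCL1", "HES5", "HES6",
            "DLL1", "DLL3", "LHX2", "TEAD1"],
    "MYC": ["MYC", "HOTAIR", "TEAD3"],
}


def marker_summary(sample_zscores):
    """Mean z-score of each subgroup's marker genes for one sample.

    sample_zscores: {gene: z-score}. Genes absent from the input are skipped;
    a subgroup with no measured markers gets None.
    """
    summary = {}
    for subgroup, genes in MARKERS.items():
        present = [sample_zscores[g] for g in genes if g in sample_zscores]
        summary[subgroup] = mean(present) if present else None
    return summary


print(marker_summary({"TYR": 2.0, "MITF": 1.0, "MYCN": -1.0, "MYC": 0.5}))
```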

@jharenza
Collaborator Author

Even using the minimal set of genes, the subtyping is not clear-cut based on these genes' expression, and after discussing with @jaclyn-taroni, I think a better approach would be to develop a classifier for ATRT subtyping, similar to what was done for MB here. Since this may not make it into the first submission of the paper, I will close this for now.
