Proposed Analysis: Molecularly subtype ATRT tumors #244

jharenza · 2019-11-08T13:31:19Z

Scientific goals

What are the scientific goals of the analysis?
Subtype ATRTs into SHH, MYC, TYR.

Proposed methods

What methods do you plan to use to accomplish the scientific goals?
Suggestions:

In the absence of methylation data, perform t-SNE or other clustering of expression data to determine whether the total N of ATRTs is high enough to enable subgroup clustering (see Supplemental Figure 1B in the manuscript below)
Review of histologies file, copy number, SV, mutation, and gene expression data across all ATRT samples.

Summarize results in a table, which can be added to a notebook, with molecular subtype designated.

ATRT TYR

most are infratentorial localized
broad SMARCB1 deletions (most have chr22q loss/monosomy 22)
Major expression markers: overexpression of melanosomal genes (eg: TYR, MITF, DCT)
- TYR is over-expressed in every case in this subgroup but not expressed in any other subgroup (biomarker)
overexpression of VEGFA
overexpression of genes involved in ciliogenesis (eg: DNAH11 and SPEF1)
- Explore additional TF targets in Figure 6A in the manuscript below for overexpression if needed

ATRT SHH

infratentorial or supratentorial localized
focal SMARCB1 deletions
Major expression markers: overexpression of SHH pathway (can be accomplished by looking at single sample GSEA results or expression of SHH pathway genes - eg: MYCN, GLI2)
overexpression of CDK6
overexpression of ASCL1, HES5/6, and DLL1/3, indicating active NOTCH signaling
- MYC is the most commonly overexpressed gene in this subgroup
- Explore additional TF targets in Figure S6A in the manuscript below for overexpression if needed
can have SMARCA4 loss, but not required
this group has more total mutations than the other two

ATRT MYC

most are supratentorial localized
focal SMARCB1 deletions
Major expression markers: overexpression of MYC and HOX gene cluster - eg: MYC, HOTAIR
- Explore additional TF targets in Figure S6B in the manuscript below for overexpression if needed

May be able to determine brain regions using the primary_site from pbta-histologies.tsv - Ref:

infratentorial aka posterior fossa (eg: cerebellum, tectum, fourth ventricle, and brain stem (midbrain, pons, and medulla))
- more common in first 2 years of life
- median survival 17 months
- fewer total mutations, relative to supratentorial tumors
supratentorial (eg: cerebrum, lateral ventricle and third ventricle, choroid plexus, pineal gland, hypothalamus, pituitary gland, and optic nerve)
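A minimal sketch of how primary_site could be recoded into these two regions. The keyword lists are copied from the examples above and are an assumption, not a validated anatomical mapping; `location_summary`, the `mixed` label, and the `undetermined` label are hypothetical names introduced only for illustration.

```python
# Hypothetical keyword lists taken from the examples in this issue.
# They are illustrative only and will not cover every primary_site value.
INFRATENTORIAL = ("posterior fossa", "cerebellum", "tectum", "fourth ventricle",
                  "brain stem", "midbrain", "pons", "medulla")
SUPRATENTORIAL = ("cerebrum", "lateral ventricle", "third ventricle",
                  "choroid plexus", "pineal", "hypothalamus", "pituitary",
                  "optic")


def location_summary(primary_site):
    """Recode a free-text primary_site value into a coarse location label."""
    site = primary_site.lower()
    infra = any(keyword in site for keyword in INFRATENTORIAL)
    supra = any(keyword in site for keyword in SUPRATENTORIAL)
    if infra and supra:
        return "mixed"  # sites from both regions are listed
    if infra:
        return "infratentorial"
    if supra:
        return "supratentorial"
    return "undetermined"  # needs manual review
```

Values that match neither list (for example a lobe name or spinal cord) fall through to `undetermined`, so they can be reviewed by hand rather than silently assigned.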

Required input data

What input data will you use for this analysis?

RNA-Seq
Copy number
SV
Somatic Mutations (eg: nonsense mutations, deleterious mutations)

Proposed timeline

What is the timeline for the analysis?
1 week

Relevant literature

If there is relevant scientific literature, put links to those items here.
Link to Atypical Teratoid/Rhabdoid Tumors Are Comprised of Three Epigenetic Subgroups with Distinct Enhancer Landscapes, specifically, Table S3 has a nice summary of genotyping the SMARCB1 locus.

jaclyn-taroni · 2019-11-10T21:37:53Z

I'm adding what I think the table summarizing the results would contain here. From a cursory look, there are 30 samples that are classified as ATRT in the histologies file. That is a large enough sample size for what I'll suggest below. I agree that one of the first analyses would be unsupervised clustering or dimension reduction.

Tabular format

The goal of the table is to summarize all of the information above in a manner that would allow someone with domain expertise to quickly make relatively easy calls and to identify cases where more information is needed. So it should contain everything mentioned above.

Kids_First_Participant_ID	Kids_First_Biospecimen_ID	age at diagnosis (days)	reported gender	primary site	location summary	TYR expression z-score	...	...	SHH ssGSEA score (rank?)	Notch ssGSEA score (rank?)	Focal SMARCB1 status	Focal SMARCA4 status	chr22q loss	Tumor Mutation Burden (rank?)
PT_XXXXXXXX	BS_XXXXXXXX	800	Female	Cerebellum/Posterior Fossa	infratentorial	4.235	...	...	25	18	loss	neutral	...
...	...	...	...	...	...	...	...	...	...	...	...	...	...

Notes on columns

age at diagnosis (days), reported gender, primary site can all be obtained from the pbta-histologies.tsv file. location summary would be a recoding of primary site into infratentorial or supratentorial.
For any of the genes mentioned in terms of overexpression, a column that contains the z-scores for ATRT samples should be included. This tells us about the expression of that gene in a sample relative to all other ATRT samples.
- To obtain z-scores, I would expect an analyst to filter an expression matrix to only ATRT samples, log2(x + 1) it and then z-score the rows.
For pathways such as SHH, we can use ssGSEA values, but we want to represent them in a way that tells us about the pathway score in a sample relative to all other ATRT samples. Listing a rank or perhaps z-scores again comes to mind.
- Notch signaling is captured in the hallmark gene sets used in analyses/ssgsea-hallmark, but different gene sets may need to be identified for SHH.
For focal copy number alterations, the files in analyses/focal-cn-file-preparation/results can be used. See also: SMARCB1 deletions in ATRT with current SEG to gene mapping #217
To my knowledge, the structural variant data is not yet in an easily consumable format analogous to the focal CN files above.
Tumor mutation burden is available as part of the consensus mutation files Consensus Mutation Files Release #207 (included in upcoming planned release Planned data release: v10 #254). This again seems to be something we want to represent relative to other ATRT samples.
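As a rough illustration of the z-score step described in these notes (filter to ATRT samples, apply log2(x + 1), then z-score each gene across samples), here is a sketch in plain Python. The nested-dict input layout, the function name `zscore_rows`, and the use of the sample standard deviation (n - 1) are assumptions made for the example; the toy values are not real data.

```python
import math
import statistics


def zscore_rows(expression):
    """Return per-gene z-scores of log2(x + 1) values across samples.

    `expression` maps gene -> {sample_id: value} and is assumed to be
    filtered to ATRT samples already, with at least two samples per gene.
    """
    scores = {}
    for gene, by_sample in expression.items():
        logged = {s: math.log2(v + 1) for s, v in by_sample.items()}
        mean = statistics.mean(logged.values())
        sd = statistics.stdev(logged.values())  # sample SD (n - 1), an assumption
        scores[gene] = {
            # A gene with no variation has no meaningful z-score.
            s: (x - mean) / sd if sd > 0 else float("nan")
            for s, x in logged.items()
        }
    return scores


# Toy values only, chosen so the log2(x + 1) values are 0, 2 and 4.
toy = {"TYR": {"BS_A": 0.0, "BS_B": 3.0, "BS_C": 15.0}}
z = zscore_rows(toy)
```

Because the mean and standard deviation are computed only over the samples passed in, each score reads as "how unusual is this sample among ATRTs", which is the comparison the table is meant to support.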

cbethell · 2019-11-15T13:44:20Z

I am going to begin the work on this analysis by implementing the suggestions above.

cbethell · 2019-11-15T21:13:46Z

I am going to begin the work on this analysis by implementing the suggestions above.

To be more specific, my plan is as follows:

Perform unsupervised hierarchical clustering using Heatmap
Perform PCA and t-SNE
Use the pbta-histologies.tsv file (filtered for short_histology == "ATRT") to obtain primary_site, Kids_First_Participant_ID, Kids_First_Biospecimen_ID, age_at_diagnosis_days, and reported gender.
Use primary_site to define values for location_summary. These values will be infratentorial and supratentorial.
Calculate the z-scores on the expression data and join this with the metadata.
Use the focal CN files to determine focal SMARCB1 and SMARCA4 status, and add 2 columns denoting the respective status to the existing data.frame with location_summary and expression z-scores.
Obtain the relevant pathway information from the ssGSEA file and calculate the z-scores. Add this information to the existing data.frame.
Use the SV data to create a column denoting whether or not chr22q loss is present.
Use the consensus mutation files to rank Tumor Mutation burden.
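The ranking step at the end of this plan could look something like the sketch below. It assumes tumor mutation burden has already been reduced to one number per ATRT biospecimen; `rank_descending` is a hypothetical helper, the tie handling (tied samples share the better rank) is one choice among several, and the values are made up.

```python
def rank_descending(values):
    """Rank samples so that 1 is the highest value; tied samples share a rank."""
    return {
        sample: 1 + sum(other > value for other in values.values())
        for sample, value in values.items()
    }


# Toy TMB values keyed by biospecimen ID (not real data).
tmb = {"BS_A": 0.8, "BS_B": 2.5, "BS_C": 0.8, "BS_D": 1.1}
ranks = rank_descending(tmb)
```

The same helper would apply to pathway scores, since those are also meant to be read relative to the other ATRT samples.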

jharenza · 2019-12-10T18:10:45Z

@jaclyn-taroni do you and @cbethell want to check whether the CNVkit GISTIC results (broad_values_by_arm.txt in s3://kf-openaccess-us-east-1-prd-pbta/data/2019-12-10-gistic-results-cnvkit.zip) agree with the SMARCB1 deletions found so far and would be good enough for this analysis? If so, we can release these results in the next data release.

cbethell · 2019-12-10T22:04:22Z

> @jaclyn-taroni do you and @cbethell want to check whether the CNVkit GISTIC results (broad_values_by_arm.txt in s3://kf-openaccess-us-east-1-prd-pbta/data/2019-12-10-gistic-results-cnvkit.zip) agree with the SMARCB1 deletions found so far and would be good enough for this analysis? If so, we can release these results in the next data release.

@jharenza yes, I believe the gistic results would be good/useful for this analysis so I would like to see them in the next data release, if possible.

### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8)

jaclyn-taroni · 2020-01-06T17:43:55Z

My understanding of what is left on this ticket:

Use the broad_values_by_arm.txt GISTIC file to obtain the chr22q loss information and add that to the final table.
Have someone with domain expertise look at the table we have and make tweaks or follow up as needed.
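A sketch of how the chr22q loss column might be derived from an arm-level file such as broad_values_by_arm.txt. It assumes a tab-delimited layout with arm names in the first column and one column per sample, and it treats any value below zero as a loss; both the layout and the threshold are assumptions to check against the released file, and the toy table is invented.

```python
import csv
import io


def chr22q_loss(handle, arm="22q", threshold=0.0):
    """Map sample ID -> True when the arm-level value is below `threshold`."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold the arm name
    for row in reader:
        if row and row[0] == arm:
            return {s: float(v) < threshold for s, v in zip(samples, row[1:])}
    raise KeyError(f"arm {arm!r} not found")


# Toy table standing in for the real file (values are made up).
toy = io.StringIO(
    "Chromosome Arm\tBS_A\tBS_B\tBS_C\n"
    "22p\t0\t0\t0\n"
    "22q\t-0.95\t0\t0.12\n"
)
calls = chr22q_loss(toy)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, and a stricter threshold may be worth trying if shallow losses turn out to be noisy.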

jharenza · 2020-01-06T20:28:24Z

@jaclyn-taroni was just looking at this today as well. Once the table is done, we have a clinician (probably via email) who can check out the data. It may be nice to point him to a notebook of the table with the final subtypes and columns of criteria used.

jaclyn-taroni · 2020-01-18T19:07:50Z

molecular-subtyping-ATRT will need to be rerun with annotated focal consensus calls (#186) and new GISTIC calls (#453).

jharenza · 2020-01-21T16:39:57Z

@cbethell - started going through these results a bit. Can I request that you also add a column for OS_months, derived from OS_days to the results? Thanks!
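For reference, the conversion could be as simple as the sketch below. The divisor (an average month of 365.25 / 12 days) is an assumption, since the thread does not say which convention to use, and `os_months` is a hypothetical helper name.

```python
DAYS_PER_MONTH = 365.25 / 12  # assumed average month length (about 30.44 days)


def os_months(os_days):
    """Convert overall survival from days to months; missing values stay missing."""
    if os_days is None:
        return None
    return os_days / DAYS_PER_MONTH
```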

jharenza · 2020-01-21T18:51:11Z

@cbethell sorry for the separate comment - would you be able to also include expression of HES5, HES6, DLL1, and DLL3 and GSEA for http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING.html to assess Notch signaling? This is for ATRT SHH group, as noted above.

Looks like the other genes mentioned are all included.

Thanks!

jaclyn-taroni · 2020-01-21T19:13:04Z

> would you be able to also include expression of HES5, HES6, DLL1, and DLL3 and GSEA for http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING.html to assess Notch signaling? This is for ATRT SHH group, as noted above.

I'm going to pick this up right now @jharenza.

jharenza · 2020-01-21T19:13:38Z

Thanks @jaclyn-taroni !

jharenza · 2020-01-21T19:49:11Z

Here is a minimal gene list for expression/GSEA for each subgroup:

TYR	SHH	MYC
TYR	MYCN	MYC
MITF	GLI2	HOTAIR
DCT	CDK6	TEAD3
VEGFA	ASCL1	MYC GSEA
DNAH11	HES5
SPEF1	HES6
MSX2	DLL1
STAT3	DLL3
PRRX1	LHX2
LMX1	TEAD1
OTX2	Notch GSEA

jharenza · 2020-01-29T13:36:23Z

Even using the minimal set of genes, the subtyping is not clear-cut based on these genes' expression. After discussing with @jaclyn-taroni, I think a better approach would be to develop a classifier for ATRT subtyping, similar to what was done for MB here. Since this may not make it into the first submission of the paper, I will close this for now.

jharenza added the proposed analysis label Nov 8, 2019

jaclyn-taroni added the cnv, transcriptomic, sv, molecular subtyping, and snv labels Nov 8, 2019

jharenza mentioned this issue Nov 8, 2019

Planned Analysis: Molecularly subtype all tumors #19

Closed

7 tasks

jaclyn-taroni added the in progress label Nov 15, 2019

cbethell mentioned this issue Nov 20, 2019

PR 1 of 2: Molecular Subtyping - ATRT (Data Prep) #284

Merged

8 tasks

cbethell mentioned this issue Dec 2, 2019

PR 2 of 2: Molecular Subtyping - ATRT (Plotting) #306

Merged

2 tasks

cbethell mentioned this issue Dec 10, 2019

Subset files for ATRT scripts #325

Merged

2 tasks

jharenza mentioned this issue Dec 10, 2019

Planned data release: V12 #326

Closed

5 tasks

jaclyn-taroni mentioned this issue Dec 12, 2019

SMARCB1 deletions in ATRT with current SEG to gene mapping #217

Closed

2 tasks

cbethell mentioned this issue Dec 17, 2019

Molecular Subtyping - ATRT Compare GISTIC results #344

Closed

2 tasks

jharenza mentioned this issue Dec 17, 2019

Release V12 data #347

Merged

jaclyn-taroni mentioned this issue Jan 1, 2020

Update ATRT molecular subtyping module to use GSVA scores #388

Merged

3 tasks

cbethell mentioned this issue Jan 8, 2020

Add chr22q loss variable to ATRT molecular subtyping #414

Merged

3 tasks

cbethell mentioned this issue Jan 21, 2020

Add OS_months to final ATRT molecular subtyping table #460

Merged

5 tasks

jaclyn-taroni mentioned this issue Jan 21, 2020

Update ATRT subtyping to use minimal set of genes #462

Merged

5 tasks

jharenza closed this as completed Jan 29, 2020
