Proposed Analysis: medulloblastoma subtyping #731

jharenza · 2020-07-18T17:08:04Z

What are the scientific goals of the analysis?

Subtype medulloblastoma samples into SHH, WNT, Group 3, and Group 4

What methods do you plan to use to accomplish the scientific goals?

https://github.com/d3b-center/medullo-classifier-package

Additionally, summarize whether the subtypes agree with clinical pathology where reported here.

What input data are required for this analysis?

RNA-Seq FPKM

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

1 week

Who will complete the analysis (please add a GitHub handle here if relevant)?

@komalsrathi has completed the analysis. @komalsrathi will you please create a PR with the analysis when you have a chance?

What relevant scientific literature relates to this analysis?

4 medulloblastoma subgroups: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4334443/

komalsrathi · 2020-07-18T17:53:03Z

@jharenza wasn't this done already?

Also - please make this 1 week instead of 1-2 days.

jharenza · 2020-07-18T20:04:26Z

@komalsrathi - yes, it was done early on, and since it was not within the repository, in discussions with @jaclyn-taroni, we thought we should add it as an official analysis PR to add transparency to the subtyping that ended up in the histologies file. Thank you - not a rush, but something we want to add!

komalsrathi · 2020-07-18T20:27:22Z

Ok makes sense. I'll finish it this week, thank you.

komalsrathi · 2020-07-23T11:53:56Z

@jharenza @jaclyn-taroni Just want to share the results for review before finalizing the code and creating a PR:

I am using the following filter to get 122 MB samples (1 polyA and 121 rRNA depleted) as input for the classification:

library(dplyr)

# Keep RNA-Seq medulloblastoma samples from the clinical table `clin`
clin.mb <- clin %>%
  filter(experimental_strategy == "RNA-Seq",
         integrated_diagnosis == "Medulloblastoma")

> plyr::count(clin.mb, c("RNA_library"))
  RNA_library freq
1      poly-A    1
2    stranded  121

The classifier was developed to take at least 2 samples as input, so I only have results for the 121 rRNA-depleted samples, which I merged with the clinical findings:

comparison.txt

jharenza · 2020-07-23T18:47:15Z

@komalsrathi thanks for this. I think if possible, we would like to add the poly-A sample's subtype. Out of curiosity, why are two samples required if they are analyzed independently?

I know @adamcresnick will prefer to have complete information when we can. One way around this could be to duplicate the sample (so long as each entry is analyzed independently) or to run the classifier on the entire poly-A matrix and retain only that sample. Neither is ideal, but both would give us the result.
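A rough sketch of the duplication workaround, for illustration only. It assumes the classifier scores each column independently, which is exactly the open question above. `pad_single_sample`, the toy matrix, and its gene and sample names are hypothetical and are not part of the classifier package.

```r
# Toy genes-by-samples expression matrix holding a single poly-A sample.
# Gene names, values, and the sample name are made up for illustration.
expr_polya <- matrix(
  c(5.2, 0.1, 3.3),
  ncol = 1,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), "polyA_sample")
)

# Hypothetical helper: if a matrix has exactly one sample, duplicate that
# column so the input meets a two-sample minimum. This is only sensible if
# samples are scored independently of one another.
pad_single_sample <- function(mat) {
  if (ncol(mat) == 1) {
    mat <- cbind(mat, mat)
    colnames(mat)[2] <- paste0(colnames(mat)[1], "_dup")
  }
  mat
}

padded <- pad_single_sample(expr_polya)
dim(padded)
colnames(padded)
```

After classification, only the result for the original column would be kept and the `_dup` entry dropped.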

I do have two comments on the output:

1. Can you change the "0" p-values to scientific notation so they are not zero? They are probably 2e-16?
2. Regarding the clinical data - this was put together by @jenn0307's team and I harmonized the fields a bit. If this is going into the PR, @jaclyn-taroni, do we want to release this as a file in the data release or @allisonheath capture this somehow in the histologies file? It is still a bit messy (free text) because it was information pulled from pathology reports, but the comparison is definitely worth doing, as I can see multiple instances of subtypes changed due to the classifier. cc @adamcresnick (input) and @yuankunzhu (data release) as well.

komalsrathi · 2020-07-23T19:35:20Z

Hi, what I will do is try a couple of additional approaches and compare the output:

1. Batch correct and merge the polyA and stranded medullo data into a single dataset to be used as input.
2. Use the entire matrix (non-MB samples as well) as input and do the classification individually on the polyA and stranded datasets.

I'll also convert the 0s to <2e-16 because that's what they represent. If we can get a "clean" version of the clinical findings, we can also do a statistical test between the two categorical values of observed and expected subtypes. I'll update you soon. Thanks!!
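For the p-value display, something along these lines might work. This is a minimal sketch, not the code used for the output file: `format_small_p` and its `p_floor` argument are hypothetical names, and the 2e-16 floor is taken from the discussion above rather than from the classifier itself.

```r
# Hypothetical helper: report p-values below a floor as "<2e-16" rather
# than 0, and print everything else in scientific notation.
format_small_p <- function(p, p_floor = 2e-16) {
  ifelse(
    p < p_floor,
    paste0("<", format(p_floor, scientific = TRUE)),
    formatC(p, format = "e", digits = 2)
  )
}

format_small_p(c(0, 1.5e-8, 0.03))
```

The result is a character vector, so it is best applied only when writing the final table, after any numeric filtering or sorting on the p-values.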

komalsrathi · 2020-07-24T15:36:48Z

@jharenza @jaclyn-taroni I did the following:

Merge the polyA MB (n = 1) and rRNA depleted MB (n = 121) data into a single dataset (n = 122) -> Batch correct on library -> Classify.

2. Use the entire matrix (non-MB samples as well) as input and do the classification individually on polyA and stranded datasets

Use the rRNA depleted MB matrix as is (n = 121) -> Classify -> Compare with batch corrected output above.

The results are exactly the same for rRNA depleted samples in 1 and 2. Also, the polyA sample from batch corrected input was correctly classified as: SHH (pathology comment: SAME PATIENT AS 7316-95, 7316-278 which were both correctly classified as SHH)

Here is the output:
batch_corrected_classified_MB.txt

I have added a column called Matches at the end of the output, which indicates whether the best fit matches the clinical data.
There are 21 matches and 101 non-matches.
90 of the 101 non-matches are due to NA in the clinical data. So technically, we mis-classified 11 samples.

> table(batch.class$Matches)
FALSE  TRUE 
  101    21 

      pathology_subtype freq
1 c("Group3", "Group4")    2
2               non-Wnt    1
3                   SHH    5
4                   WNT    3
5                  <NA>   90
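Since most of the FALSE values come from missing pathology data, a three-way status may be easier to read than TRUE/FALSE. The sketch below uses made-up toy data; `best_fit`, `status`, and the sample IDs are hypothetical names, and it assumes one subtype string per sample (entries like c("Group3", "Group4") or "non-Wnt" would need extra handling).

```r
# Toy data: sample IDs and calls are made up for illustration.
# `best_fit` is a hypothetical column name for the classifier call.
calls <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S5"),
  best_fit = c("SHH", "WNT", "Group3", "Group4", "SHH"),
  pathology_subtype = c("SHH", NA, "Group4", NA, "WNT"),
  stringsAsFactors = FALSE
)

# Three-way status: a missing pathology subtype is "not_reported",
# not a disagreement with the classifier.
calls$status <- ifelse(
  is.na(calls$pathology_subtype),
  "not_reported",
  ifelse(calls$best_fit == calls$pathology_subtype, "match", "discordant")
)

table(calls$status)
```

Applied to the numbers above, this would separate the 90 samples with no reported subtype from the 11 where the classifier and the pathology report disagree.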

If you have concerns regarding batch correction: I did some QC on housekeeping genes, and the t-SNE plots follow. Just to reiterate: only MB samples (polyA = 1, stranded = 121) were combined and batch corrected, and I used 6 housekeeping genes for the t-SNE:

[1] "ACTB"   "TUBA1A" "TUBB"   "GAPDH"  "LDHA"   "RPL19"

Please let me know your thoughts.

jharenza · 2020-07-24T16:53:27Z

@komalsrathi - this looks great, and I am glad the results were the same with batch correction. Regarding:

> So technically, we mis-classified 11 samples.

I would not say we mis-classified those samples - I think what this means is we have to go back to the pathologists and have them re-review. I am more inclined to think that the initial clinical information could have been wrong, ambiguous, or pathology shows one thing and expression shows another, which would make these good cases to focus on in the paper (cc: @adamcresnick).

I am OK with this method - I just want to be sure @jaclyn-taroni is also OK with it, since batch correction is something we have not yet added to any of the analyses. I am asking for her advice on how to handle it: whether that should be a separate PR (I think we discussed that as a future goal, but there might still be center batch issues not corrected yet), or lumped into this MB classifier PR.

jaclyn-taroni · 2020-07-24T17:02:53Z

> I would not say we mis-classified those samples - I think what this means is we have to go back to the pathologists and have them re-review. I am more inclined to think that the initial clinical information could have been wrong, ambiguous, or pathology shows one thing and expression shows another

Any idea with how those putative misclassifications (based on clinical data) might square (or not) with some of what I saw with unsupervised analysis (#730)? I can provide some output of the training materials if that's helpful.

komalsrathi · 2020-07-24T17:09:55Z

Ok, I just redid the classification on the combined polyA and rRNA depleted MB samples (n = 122) without batch correction, and the results are identical to those from the batch corrected data. So in case there are reservations about batch correction, we don't have to use it.
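One way to check that two runs give identical calls is sketched below with toy data. The data frames, `best_fit`, and the sample IDs are hypothetical, and this is not the code used for the comparison above.

```r
# Toy classifier outputs from two runs; IDs and calls are made up.
run_batch_corrected <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  best_fit = c("SHH", "WNT", "Group4"),
  stringsAsFactors = FALSE
)
run_uncorrected <- data.frame(
  sample_id = c("S3", "S1", "S2"),
  best_fit = c("Group4", "SHH", "WNT"),
  stringsAsFactors = FALSE
)

# Join on sample ID so that row order does not matter, then compare calls.
both <- merge(
  run_batch_corrected, run_uncorrected,
  by = "sample_id", suffixes = c(".corrected", ".uncorrected")
)
same_call <- both$best_fit.corrected == both$best_fit.uncorrected

nrow(both) == nrow(run_batch_corrected)  # every sample found in both runs
all(same_call)                           # TRUE if every call agrees
both$sample_id[!same_call]               # samples whose call changed, if any
```

Joining on the sample identifier rather than comparing columns positionally guards against the two outputs being sorted differently.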

jharenza · 2020-07-24T17:24:56Z

> > I would not say we mis-classified those samples - I think what this means is we have to go back to the pathologists and have them re-review. I am more inclined to think that the initial clinical information could have been wrong, ambiguous, or pathology shows one thing and expression shows another
>
> Any idea with how those putative misclassifications (based on clinical data) might square (or not) with some of what I saw with unsupervised analysis (#730)? I can provide some output of the training materials if that's helpful.

@komalsrathi - would you be able to see if those 11 are some of the samples that don't cluster as expected in the unsupervised analysis?

komalsrathi · 2020-07-24T17:51:09Z

@jharenza Would it be easier for the person who performed the unsupervised analysis to check that? The attached file has information on which ones were misclassified (Matches = TRUE or FALSE).

Here is the output:
batch_corrected_classified_MB.txt

jaclyn-taroni · 2020-08-21T16:04:27Z

Partially addressed by #738 and I've filed #742 to track next steps.

jharenza added the proposed analysis label Jul 18, 2020

jharenza assigned komalsrathi Jul 18, 2020

jharenza added the in progress label Aug 14, 2020

jharenza pushed a commit to jharenza/jharenza.github.io that referenced this issue Aug 19, 2020

add pbta-mb-exploration notebook

7e8dfc6

For ticket [here](AlexsLemonade/OpenPBTA-analysis#731)

jharenza mentioned this issue Aug 19, 2020

add pbta-mb-exploration notebook jharenza/jharenza.github.io#1

Merged

jharenza pushed a commit to jharenza/jharenza.github.io that referenced this issue Aug 19, 2020

add mb-exploration

5692acb

AlexsLemonade/OpenPBTA-analysis#731

komalsrathi mentioned this issue Aug 19, 2020

Mb subtypes #738

Merged

5 tasks

jaclyn-taroni mentioned this issue Aug 21, 2020

Updated analysis: Take consensus of two classifiers for medulloblastoma subtype labels #742

Closed

jaclyn-taroni closed this as completed Aug 21, 2020

jharenza mentioned this issue Aug 27, 2020

Updated analysis: medulloblastoma consensus subtypes #747

Closed
