This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: medulloblastoma subtyping #731

Closed
jharenza opened this issue Jul 18, 2020 · 13 comments
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach! proposed analysis


@jharenza
Collaborator

jharenza commented Jul 18, 2020

What are the scientific goals of the analysis?

Subtype medulloblastoma samples into SHH, WNT, Group 3, and Group 4

What methods do you plan to use to accomplish the scientific goals?

https://github.com/d3b-center/medullo-classifier-package

Additionally, summarize whether the subtypes agree with clinical pathology where reported here.

What input data are required for this analysis?

RNA-Seq FPKM

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

1 week

Who will complete the analysis (please add a GitHub handle here if relevant)?

@komalsrathi has completed the analysis. @komalsrathi will you please create a PR with the analysis when you have a chance?

What relevant scientific literature relates to this analysis?

4 medulloblastoma subgroups: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4334443/

@komalsrathi
Collaborator

komalsrathi commented Jul 18, 2020

@jharenza wasn't this done already?

Also - please make this 1 week instead of 1-2 days.

@jharenza
Collaborator Author

@komalsrathi - yes, it was done early on, and since it was not within the repository, in discussions with @jaclyn-taroni, we thought we should add it as an official analysis PR to add transparency to the subtyping that ended up in the histologies file. Thank you - not a rush, but something we want to add!

@komalsrathi
Collaborator

Ok makes sense. I'll finish it this week, thank you.

@komalsrathi
Collaborator

@jharenza @jaclyn-taroni Just want to share the results for review before finalizing the code and creating a PR:

I am using the following filter to select 122 medulloblastoma (MB) samples (1 poly-A and 121 rRNA-depleted, labeled stranded in RNA_library) as input for the classification:

library(dplyr)
# Keep RNA-Seq samples with an integrated diagnosis of medulloblastoma
clin.mb <- clin %>%
  filter(experimental_strategy == "RNA-Seq",
         integrated_diagnosis == "Medulloblastoma")

> plyr::count(clin.mb, c("RNA_library"))
  RNA_library freq
1      poly-A    1
2    stranded  121

The classifier was developed to take at least 2 samples as input, so I only have results for the 121 rRNA-depleted samples, which I merged with the clinical findings:

comparison.txt

@jharenza
Collaborator Author

@komalsrathi thanks for this. I think if possible, we would like to add the poly-A sample's subtype. Out of curiosity, why are two samples required if they are analyzed independently?

I know @adamcresnick will prefer to have complete information when we can. One way around this could be to duplicate the sample (so long as each entry is analyzed independently) or to run the classifier on the entire poly-A matrix and retain only that sample's result. Neither is ideal, but both would give us the result.
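
The duplication workaround could look roughly like the sketch below. It assumes the classifier scores each sample independently; `classify_stub`, the marker names, and the sample IDs are hypothetical stand-ins and are not the medullo-classifier-package API.

```r
# Hypothetical stand-in for a classifier that rejects single-sample input.
# It only mimics the ">= 2 samples" constraint; it is NOT the real package API.
classify_stub <- function(expr) {
  stopifnot(is.matrix(expr), ncol(expr) >= 2)
  # Toy rule: label each sample by its highest-expressed marker row.
  data.frame(sample = colnames(expr),
             best_fit = rownames(expr)[apply(expr, 2, which.max)],
             stringsAsFactors = FALSE)
}

# Toy expression matrix: rows are made-up markers, one poly-A sample as a column.
polya <- matrix(c(1.2, 8.5, 0.3), ncol = 1,
                dimnames = list(c("WNT_marker", "SHH_marker", "G3_marker"),
                                "polyA_1"))

# Workaround: duplicate the column so the input has two samples,
# then keep only the row for the original sample.
padded <- cbind(polya, polya)
colnames(padded) <- c("polyA_1", "polyA_1_dup")
res <- classify_stub(padded)
polya_call <- res[res$sample == "polyA_1", ]
print(polya_call)
```

If the classifier normalizes across the input cohort rather than per sample, a duplicated column would change the result, so this is only safe under the independence assumption stated above.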

I do have two comments on the output:

  1. Can you change the "0" p-values to scientific notation so they are not zero? They are probably 2e-16?
  2. Regarding the clinical data - this was put together by @jenn0307's team and I harmonized the fields a bit. If this is going into the PR, @jaclyn-taroni, do we want to release this as a file in the data release, or should @allisonheath capture this somehow in the histologies file? It is still a bit messy (free text) because the information was pulled from pathology reports, but the comparison is definitely worth doing, as I can see multiple instances where the classifier's subtype differs from the clinical one. cc @adamcresnick (input) and @yuankunzhu (data release) as well.
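
For the first comment, one possible approach in base R is `format.pval()`, which reports anything below a chosen threshold as "<threshold" instead of printing 0. This is a minimal sketch with made-up p-values; how the classifier output actually stores its p-values is an assumption.

```r
# Made-up p-values, including an exact 0 such as underflow might produce.
p <- c(0, 3.1e-20, 0.0004, 0.27)

# Values below `eps` are shown as "<eps" rather than as 0.
p_fmt <- format.pval(p, digits = 3, eps = .Machine$double.eps)
print(p_fmt)
```

The result is a character vector, so it is best applied as a last step just before writing the table out.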

@komalsrathi
Collaborator

komalsrathi commented Jul 23, 2020 via email

@komalsrathi
Collaborator

@jharenza @jaclyn-taroni I did the following:

  1. Merge the polyA MB (n = 1) and rRNA depleted MB (n = 121) data into a single dataset (n = 122) -> Batch correct on library -> Classify.
  2. Use the entire matrix (non-MB samples as well) as input and do the classification individually on polyA and stranded datasets.
  3. Use the rRNA depleted MB matrix as is (n = 121) -> Classify -> Compare with batch corrected output above.

The results are exactly the same for rRNA depleted samples in 1 and 2. Also, the polyA sample from batch corrected input was correctly classified as: SHH (pathology comment: SAME PATIENT AS 7316-95, 7316-278 which were both correctly classified as SHH)
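
A check that two runs give the same calls could be sketched as follows. The data are toy values, and the column names `sample` and `best_fit` are assumptions rather than the classifier's actual output schema.

```r
# Toy outputs from two classification runs; row order differs on purpose.
run_a <- data.frame(sample = c("s1", "s2", "s3"),
                    best_fit = c("SHH", "WNT", "Group4"),
                    stringsAsFactors = FALSE)
run_b <- data.frame(sample = c("s3", "s1", "s2"),
                    best_fit = c("Group4", "SHH", "WNT"),
                    stringsAsFactors = FALSE)

# Join on sample ID so the comparison does not depend on row order.
both <- merge(run_a, run_b, by = "sample", suffixes = c(".a", ".b"))
both$same_call <- both$best_fit.a == both$best_fit.b

# TRUE only if every sample in run_a was matched and all calls agree.
all_same <- nrow(both) == nrow(run_a) && all(both$same_call)
discordant <- both[!both$same_call, ]
print(all_same)
```

Joining by sample ID rather than comparing columns positionally guards against the two outputs being sorted differently.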

Here is the output:
batch_corrected_classified_MB.txt

I have added a column called Matches at the end of the output, which indicates whether the best fit matches the clinical data.
There are 21 matches and 101 non-matches.
90/101 non-matches were because of NA in clinical data. So technically, we mis-classified 11 samples.

> table(batch.class$Matches)
FALSE  TRUE 
  101    21 

      pathology_subtype freq
1 c("Group3", "Group4")    2
2               non-Wnt    1
3                   SHH    5
4                   WNT    3
5                  <NA>   90
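
The tally above separates true disagreements from samples that simply lack a pathology subtype. A minimal sketch of that split on toy data follows; the column name `best_fit` is an assumption, and exact string equality is used for simplicity.

```r
# Toy calls; NA means no subtype was reported in the pathology data.
calls <- data.frame(
  sample = paste0("s", 1:6),
  best_fit = c("SHH", "WNT", "Group3", "Group4", "SHH", "Group4"),
  pathology_subtype = c("SHH", NA, "Group4", "WNT", NA, "Group4"),
  stringsAsFactors = FALSE
)

# Missing pathology counts as a non-match, mirroring the Matches column.
calls$Matches <- !is.na(calls$pathology_subtype) &
  calls$best_fit == calls$pathology_subtype

# Split non-matches into "no pathology reported" versus real disagreement.
calls$status <- ifelse(is.na(calls$pathology_subtype), "no_pathology",
                       ifelse(calls$Matches, "concordant", "discordant"))
status_counts <- table(calls$status)
print(status_counts)
```

Free-text entries such as `non-Wnt` or a two-value entry like `c("Group3", "Group4")` would need extra handling, since plain equality treats them as disagreements even when they are compatible with the call.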

If you have concerns regarding batch correction: I did some QC on housekeeping genes, and the t-SNE plots follow. To reiterate: only MB samples (polyA = 1, stranded = 121) were combined and batch corrected, and I used 6 housekeeping genes for the t-SNE:

[1] "ACTB"   "TUBA1A" "TUBB"   "GAPDH"  "LDHA"   "RPL19" 

[image: t-SNE plots of the housekeeping genes for the combined MB samples]

Please let me know your thoughts.

@jharenza
Collaborator Author

@komalsrathi - this looks great, and I am glad the results were the same with batch correction. Regarding:

> So technically, we mis-classified 11 samples.

I would not say we mis-classified those samples - I think what this means is we have to go back to the pathologists and have them re-review. I am more inclined to think that the initial clinical information could have been wrong, ambiguous, or pathology shows one thing and expression shows another, which would make these good cases to focus on in the paper (cc: @adamcresnick).

I am OK with this method. I just want to be sure @jaclyn-taroni is also OK with it, since batch correction is something we have not yet added to any of the analyses. I would also like her advice on how to handle it: whether it should be a separate PR (I think we discussed that as a future goal, but there might still be center batch issues not yet corrected) or lumped into this MB classifier PR.

@jaclyn-taroni
Member

> I would not say we mis-classified those samples - I think what this means is we have to go back to the pathologists and have them re-review. I am more inclined to think that the initial clinical information could have been wrong, ambiguous, or pathology shows one thing and expression shows another

Any idea with how those putative misclassifications (based on clinical data) might square (or not) with some of what I saw with unsupervised analysis (#730)? I can provide some output of the training materials if that's helpful.

@komalsrathi
Collaborator

Ok, I just reran the classification on the combined polyA and rRNA depleted MB samples (n = 122) without batch correction, and the results are identical to those from the batch corrected data. So in case there are reservations about batch correction, we don't have to use it.

@jharenza
Collaborator Author

> > I would not say we mis-classified those samples - I think what this means is we have to go back to the pathologists and have them re-review. I am more inclined to think that the initial clinical information could have been wrong, ambiguous, or pathology shows one thing and expression shows another
>
> Any idea with how those putative misclassifications (based on clinical data) might square (or not) with some of what I saw with unsupervised analysis (#730)? I can provide some output of the training materials if that's helpful.

@komalsrathi - would you be able to see if those 11 are some of the samples that don't cluster as expected in the unsupervised analysis?

@komalsrathi
Collaborator

@jharenza would it be easier for the person who performed the unsupervised analysis to check that? The attached file has information on which ones were misclassified (Matches = TRUE or FALSE).

Here is the output:
batch_corrected_classified_MB.txt
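
Checking the discordant samples against the unsupervised analysis could be sketched as a set intersection. All IDs below are made up, and `unexpected_cluster_ids` is a hypothetical list that would have to come from the #730 analysis.

```r
# Toy version of the classifier output with the Matches column.
calls <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  Matches = c(TRUE, FALSE, FALSE, FALSE),
  pathology_subtype = c("SHH", NA, "WNT", "Group4"),
  stringsAsFactors = FALSE
)

# Discordant means Matches is FALSE and a pathology subtype was reported,
# which excludes non-matches caused only by missing clinical data.
discordant_ids <- calls$sample[!calls$Matches & !is.na(calls$pathology_subtype)]

# Hypothetical IDs of samples that clustered unexpectedly.
unexpected_cluster_ids <- c("s3", "s9")

overlap <- intersect(discordant_ids, unexpected_cluster_ids)
print(overlap)
```

Filtering on a non-missing pathology subtype matters here because Matches = FALSE alone also covers samples with no clinical subtype.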

@jharenza jharenza added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Aug 14, 2020
jharenza pushed a commit to jharenza/jharenza.github.io that referenced this issue Aug 19, 2020
jharenza pushed a commit to jharenza/jharenza.github.io that referenced this issue Aug 19, 2020
@komalsrathi komalsrathi mentioned this issue Aug 19, 2020
5 tasks
@jaclyn-taroni
Member

Partially addressed by #738 and I've filed #742 to track next steps.

3 participants