Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Molecular subtypes (MB) summary notebook #743

Merged
merged 45 commits into from
Aug 27, 2020

Conversation

komalsrathi
Copy link
Collaborator

Purpose/implementation Section

What scientific question is your analysis addressing?

Summarizes the following:

  1. Medulloblastoma subtype classification using the two classifiers MM2S and medulloPackage on batch corrected and uncorrected expression matrix along with the expected subtypes from pathology reports.
  2. Associated performance metric in terms of % accuracy for each classifier.
  3. Assignment of consensus subtype between the two classifiers as the molecular_subtype.

What was your approach?

As suggested in #742, I have added some details to the notebook:

  1. % Accuracy is currently being calculated by matching observed and expected subtypes where expected subtype info is available. In case of ambiguous subtypes, we treat it as a match if the observed subtype matches with any one of the expected subtypes
  2. Consensus tabs: Molecular subtype labels are assigned only when the two classifiers agree, leaving all other samples as unclassified.

What GitHub issue does your pull request address?

#742

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Accuracy calculation and consensus molecular subtype assignment.

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

.html output containing summary tables

What is your summary of the results?

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

komalsrathi and others added 30 commits August 18, 2020 09:07
CI related edits to MB subtype steps
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
Skip filtering and batch correction. in CI
@komalsrathi komalsrathi changed the title Mb class nb Molecular subtypes (MB) summary notebook Aug 21, 2020
@jaclyn-taroni
Copy link
Member

Looks like the total accuracy for the consensus calls is the same on the corrected vs. uncorrected matrices. Was the single poly-A sample one of the samples where the classifiers disagreed?

@komalsrathi
Copy link
Collaborator Author

komalsrathi commented Aug 22, 2020

Some more details:

# 25 samples with consensus classes matching pathology (corrected matrix)
corrected.diff <- consensus.corrected[which(consensus.corrected$match  == TRUE),'Kids_First_Biospecimen_ID']
length(corrected.diff)
[1] 25

# 25 samples with consensus classes matching pathology (uncorrected matrix)
uncorrected.diff <- consensus.uncorrected[which(consensus.uncorrected$match == TRUE),'Kids_First_Biospecimen_ID']
length(uncorrected.diff)
[1] 25

# what's the difference? in each case, a different sample does not agree
# consensus corrected matrix has a match but uncorrected does not
sample1 <- setdiff(corrected.diff, uncorrected.diff) # BS_HB03GSHF

# check sample1 in consensus uncorrected output
consensus.uncorrected %>%
  filter(Kids_First_Biospecimen_ID %in% sample1) %>%
  dplyr::select(Kids_First_Biospecimen_ID, pathology_subtype,  MM2S_best_fit, medulloPackage_best_fit, match)
Kids_First_Biospecimen_ID pathology_subtype MM2S_best_fit medulloPackage_best_fit match
1               BS_HB03GSHF               WNT           SHH                     WNT    NA

# consensus uncorrected matrix has a match but corrected does not
sample2 <- setdiff(uncorrected.diff, corrected.diff) # BS_V96WVE3Z

# check sample2 in consensus corrected output
consensus.corrected %>%
  filter(Kids_First_Biospecimen_ID %in% sample2) %>%
  dplyr::select(Kids_First_Biospecimen_ID, pathology_subtype,  MM2S_best_fit, medulloPackage_best_fit, match)
Kids_First_Biospecimen_ID pathology_subtype MM2S_best_fit medulloPackage_best_fit match
1               BS_V96WVE3Z               SHH        Group3                     SHH    NA

In both cases, the medulloPackage classification matches the pathology report but MM2S does not. Both samples are stranded

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @komalsrathi, thanks for sending this and providing a bit more background information! I had a few comments in service of making these results easier to revisit in the future if necessary, preparing the subtype labels for consumption elsewhere, and DRYing up the notebook a bit.

analyses/molecular-subtyping-MB/02-compare-classes.Rmd Outdated Show resolved Hide resolved
Comment on lines +32 to +34
mutate(pathology_subtype = replace(pathology_subtype,
pathology_subtype == "Group 3 or 4", "Group3, Group4")) %>%
mutate(pathology_subtype = gsub(" ", "", pathology_subtype))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would accomplish the same thing but I'm not sure if there are other subtypes separated by , in pathology_subtype

Suggested change
mutate(pathology_subtype = replace(pathology_subtype,
pathology_subtype == "Group 3 or 4", "Group3, Group4")) %>%
mutate(pathology_subtype = gsub(" ", "", pathology_subtype))
mutate(pathology_subtype = replace(pathology_subtype,
pathology_subtype == "Group 3 or 4", "Group3,Group4"))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for this, there is Group 3 and Group 4 in path reports and we have Group3 and Group4 which is why the gsub.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so pathology_subtype can have the values Group 3 or 4 or Group 3, Group 4 before you do any mutating - is that correct?

Copy link
Collaborator Author

@komalsrathi komalsrathi Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the unique values for expected types (pathology):

unique(dat$pathology_subtype)
[1] NA "WNT" "Group 3 or 4" "SHH" "non-WNT"
[6] "Group 4"

So I convert Group 3 or 4 to Group3, Group4 using replace (so that we can match either Group3 or Group4 predicted subtype) and gsub converts Group 4 to Group4 (because for observed types we have SHH, WNT, Group3 and Group4 i.e. no spaces)


```{r, echo = TRUE, warning = FALSE, message = FALSE}
# merge observed and expected subtypes
mm2s.corrected <- obs.class[[1]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an index here means we're relying on an analyst to know or remember what order the observed results come back in. There are no names in obs.class at the moment. I'd recommend making alterations to the upstream step (01-classify-mb.R) such that there is information about the method and dataset used in this object. That way someone who did not author the module could use this object "off-the-shelf" without much digging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes totally makes sense - will add names to the list items.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
```

#### Details:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add information about how many samples have pathology subtype labels and (briefly) what the process is for obtaining those labels (for you, not what goes into that for pathology! 🙂 ) please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would 'mirror' this information in the module README as you've done with these other points, too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated README with breakdown of samples and subtypes from path report:
https://github.com/komalsrathi/OpenPBTA-analysis/tree/mb-class-nb/analyses/molecular-subtyping-MB#02-compare-classesrmd

Just unsure of what the second part means: process of obtaining those labels. I am using a file that @jharenza provided as input.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically meant how do we get the labels from what I assume is a pathology report to the file you got and does it come from another database, for example. But you state it's from a pathology report, so I think that's sufficient for now. Thank you for the update!

Comment on lines 111 to 129
```{r, echo = TRUE, warning = FALSE, message = FALSE}
# merge observed and expected subtypes
medullo.classifier.corrected <- obs.class[[3]]
medullo.classifier.corrected <- exp.class %>%
inner_join(medullo.classifier.corrected, by = c('Kids_First_Biospecimen_ID' = 'sample')) %>%
mutate(match = str_detect(pathology_subtype, best.fit))

# % accuracy
medullo.classifier.corrected.acc <- medullo.classifier.corrected %>%
filter(!is.na(pathology_subtype)) %>%
group_by(match) %>%
summarise(n = n()) %>%
mutate(Accuracy = paste0(n/sum(n)*100, '%')) %>%
filter(match) %>%
.$Accuracy
print(paste0("Accuracy: ", medullo.classifier.corrected.acc))

# output table
viewDataTable(medullo.classifier.corrected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this code is repeated between different classifier-dataset combinations. That suggests to me that it may be useful to wrap up joining with the expected subtype information and accuracy calculation in a custom function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# output table
viewDataTable(consensus.uncorrected)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the output from this notebook to include a table of the consensus labels (preferably as a TSV).

In modules for subtyping other histologies, we've joined all the identifiers (DNA and RNA) into a single table: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/e8fbc7dc6aa8a36d7236fbe10a038657e3782e09/analyses/molecular-subtyping-HGG/results/HGG_molecular_subtype.tsv This particular step can come in a subsequent notebook if you'd prefer. But I do not expect this notebook to be overly long, particularly with the custom function changes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@komalsrathi
Copy link
Collaborator Author

komalsrathi commented Aug 24, 2020

Hi @jaclyn-taroni thanks again, I will incorporate these and update latest by tomorrow. was just handed over some high priority work, will try to update asap (latest by end of this week).

@komalsrathi
Copy link
Collaborator Author

@jaclyn-taroni I am doing a bit of a QC just to make sure what I have is correct - I will update you shortly when this is ready to review.

@komalsrathi
Copy link
Collaborator Author

komalsrathi commented Aug 27, 2020

@jaclyn-taroni I think this is ready now. I just added another tab with consensus comparison with some details. Total accuracy of consensus output remains the same for batch corrected and uncorrected consensus output with 25/32 matches to the reported pathology subtypes. Between the two consensus outputs, there are 24/25 matches. BS_V96WVE3Z is correctly predicted by consensus uncorrected output but not by batch-corrected output and BS_HB03GSHF is correctly predicted by consensus batch-corrected output but not by uncorrected output.

Added the above info to module README as well.

The only question remains how do we pick batch corrected vs uncorrected - I am not sure on what basis because the total accuracy remains the same.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for the updates! I agree that we have to make a somewhat arbitrary choice between uncorrected and batch-corrected here, but hopefully the review of the pathology reports may inform that decision in the future (related: #746, #747).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants