Molecular subtypes (MB) summary notebook #743
CI related edits to MB subtype steps
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
Skip filtering and batch correction in CI
Looks like the total accuracy for the consensus calls is the same on the corrected vs. uncorrected matrices. Was the single poly-A sample one of the samples where the classifiers disagreed?
Some more details:
In both cases, the medulloPackage classification matches the pathology report but MM2S does not. Both samples are stranded.
Hi @komalsrathi, thanks for sending this and providing a bit more background information! I had a few comments in service of making these results easier to revisit in the future if necessary, preparing the subtype labels for consumption elsewhere, and DRYing up the notebook a bit.
mutate(pathology_subtype = replace(pathology_subtype,
                                   pathology_subtype == "Group 3 or 4", "Group3, Group4")) %>%
mutate(pathology_subtype = gsub(" ", "", pathology_subtype))
I think this would accomplish the same thing, but I'm not sure if there are other subtypes separated by `, ` in `pathology_subtype`:
- mutate(pathology_subtype = replace(pathology_subtype,
-                                    pathology_subtype == "Group 3 or 4", "Group3, Group4")) %>%
- mutate(pathology_subtype = gsub(" ", "", pathology_subtype))
+ mutate(pathology_subtype = replace(pathology_subtype,
+                                    pathology_subtype == "Group 3 or 4", "Group3,Group4"))
So for this, the path reports contain `Group 3` and `Group 4`, whereas we have `Group3` and `Group4`, which is why the `gsub` is there.
Ah, so `pathology_subtype` can have the values `Group 3 or 4` or `Group 3, Group 4` before you do any mutating - is that correct?
These are the unique values for expected types (pathology):
unique(dat$pathology_subtype)
[1] NA             "WNT"          "Group 3 or 4" "SHH"          "non-WNT"
[6] "Group 4"
So I convert `Group 3 or 4` to `Group3, Group4` using `replace` (so that we can match either a Group3 or a Group4 predicted subtype), and `gsub` converts `Group 4` to `Group4` (because the observed types are SHH, WNT, Group3 and Group4, i.e. no spaces).
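The normalisation described above can be sketched in base R as follows. The helper name `normalize_pathology_subtype` is hypothetical and not part of the PR; the input vector is the set of unique values quoted in this thread.

```r
# Hypothetical helper (not from the PR): normalise pathology labels so they
# can be compared with classifier output such as "Group3" or "Group4".
normalize_pathology_subtype <- function(x) {
  # "Group 3 or 4" means either call is acceptable, so keep both labels.
  # which() skips NA entries instead of indexing with them.
  x <- replace(x, which(x == "Group 3 or 4"), "Group3,Group4")
  # Drop remaining spaces, e.g. "Group 4" -> "Group4"; NA stays NA.
  gsub(" ", "", x, fixed = TRUE)
}

# The unique pathology values reported in the comment above
pathology_subtype <- c(NA, "WNT", "Group 3 or 4", "SHH", "non-WNT", "Group 4")
normalized <- normalize_pathology_subtype(pathology_subtype)
print(normalized)
```

Writing `"Group3,Group4"` directly, as the reviewer suggests, gives the same result for these values, but the `gsub` step is still needed to turn a standalone `Group 4` into `Group4`.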

```{r, echo = TRUE, warning = FALSE, message = FALSE}
# merge observed and expected subtypes
mm2s.corrected <- obs.class[[1]]
Using an index here means we're relying on an analyst to know or remember what order the observed results come back in. There are no names in `obs.class` at the moment. I'd recommend making alterations to the upstream step (`01-classify-mb.R`) such that there is information about the method and dataset used in this object. That way someone who did not author the module could use this object "off-the-shelf" without much digging.
Yes totally makes sense - will add names to the list items.
Added names to the list items:
https://github.com/komalsrathi/OpenPBTA-analysis/blob/mb-class-nb/analyses/molecular-subtyping-MB/01-classify-mb.R#L49
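As a minimal sketch of what naming the list items buys, assuming a four-element list: the diff hunks in this review show that index 1 holds MM2S on the corrected matrix and index 3 holds the medulloPackage classifier on the corrected matrix. The names themselves, the placement of the uncorrected results at positions 2 and 4, and the toy data frames are all assumptions made for illustration.

```r
# Toy stand-ins for the classifier results produced in 01-classify-mb.R.
# Sample IDs and calls are placeholders, not real data.
obs.class <- list(
  data.frame(sample = "S1", best.fit = "Group3", stringsAsFactors = FALSE),
  data.frame(sample = "S1", best.fit = "Group3", stringsAsFactors = FALSE),
  data.frame(sample = "S1", best.fit = "Group4", stringsAsFactors = FALSE),
  data.frame(sample = "S1", best.fit = "Group4", stringsAsFactors = FALSE)
)

# Hypothetical names encoding method and dataset; positions 2 and 4 are
# assumed to be the uncorrected counterparts of positions 1 and 3.
names(obs.class) <- c(
  "mm2s_corrected",
  "mm2s_uncorrected",
  "medullo_classifier_corrected",
  "medullo_classifier_uncorrected"
)

# Access by name instead of by position
medullo.classifier.corrected <- obs.class[["medullo_classifier_corrected"]]
print(names(obs.class))
```

With names in place, a reordering of the upstream list no longer silently changes which result a downstream chunk picks up.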
}
```

#### Details:
Can you add information about how many samples have pathology subtype labels and (briefly) what the process is for obtaining those labels (for you, not what goes into that for pathology! 🙂 ) please?
I would 'mirror' this information in the module README as you've done with these other points, too.
Updated the README with a breakdown of samples and subtypes from the path report:
https://github.com/komalsrathi/OpenPBTA-analysis/tree/mb-class-nb/analyses/molecular-subtyping-MB#02-compare-classesrmd
I'm just unsure what the second part means: "process of obtaining those labels". I am using a file that @jharenza provided as input.
I basically meant how do we get the labels from what I assume is a pathology report to the file you got and does it come from another database, for example. But you state it's from a pathology report, so I think that's sufficient for now. Thank you for the update!
```{r, echo = TRUE, warning = FALSE, message = FALSE}
# merge observed and expected subtypes
medullo.classifier.corrected <- obs.class[[3]]
medullo.classifier.corrected <- exp.class %>%
  inner_join(medullo.classifier.corrected, by = c('Kids_First_Biospecimen_ID' = 'sample')) %>%
  mutate(match = str_detect(pathology_subtype, best.fit))

# % accuracy
medullo.classifier.corrected.acc <- medullo.classifier.corrected %>%
  filter(!is.na(pathology_subtype)) %>%
  group_by(match) %>%
  summarise(n = n()) %>%
  mutate(Accuracy = paste0(n/sum(n)*100, '%')) %>%
  filter(match) %>%
  .$Accuracy
print(paste0("Accuracy: ", medullo.classifier.corrected.acc))

# output table
viewDataTable(medullo.classifier.corrected)
A lot of this code is repeated between different classifier-dataset combinations. That suggests to me that it may be useful to wrap up joining with the expected subtype information and accuracy calculation in a custom function.
I have added individual functions here:
https://github.com/komalsrathi/OpenPBTA-analysis/blob/mb-class-nb/analyses/molecular-subtyping-MB/02-compare-classes.Rmd#L18
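One way such a function could look is sketched below in base R, so that it runs without extra packages. This is not the code at the link above: the name `compare_subtypes`, the return structure, and the toy inputs are hypothetical. The column names `Kids_First_Biospecimen_ID`, `sample`, `pathology_subtype`, and `best.fit` are taken from the diff under review.

```r
# Hypothetical helper: join expected (pathology) and observed (classifier)
# labels, flag matches, and compute accuracy over labelled samples only.
compare_subtypes <- function(expected, observed) {
  # merge() defaults to an inner join, like inner_join() in the notebook
  merged <- merge(expected, observed,
                  by.x = "Kids_First_Biospecimen_ID", by.y = "sample")
  # Substring match, mirroring str_detect(pathology_subtype, best.fit):
  # "Group3,Group4" matches a predicted "Group3" or "Group4".
  merged$match <- mapply(
    function(path, fit) if (is.na(path)) NA else grepl(fit, path, fixed = TRUE),
    merged$pathology_subtype, merged$best.fit,
    USE.NAMES = FALSE
  )
  # Samples without a pathology label are excluded from the denominator
  labelled <- merged[!is.na(merged$pathology_subtype), ]
  accuracy <- 100 * sum(labelled$match) / nrow(labelled)
  list(table = merged, accuracy = accuracy)
}

# Toy inputs with placeholder sample IDs
expected <- data.frame(
  Kids_First_Biospecimen_ID = c("S1", "S2", "S3", "S4"),
  pathology_subtype = c("WNT", "Group3,Group4", "SHH", NA),
  stringsAsFactors = FALSE
)
observed <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),
  best.fit = c("WNT", "Group4", "Group3", "SHH"),
  stringsAsFactors = FALSE
)

result <- compare_subtypes(expected, observed)
print(sprintf("Accuracy: %.1f%%", result$accuracy))
```

One caveat with substring matching, whether via `grepl` or `str_detect`: a predicted `WNT` is a substring of the pathology label `non-WNT`, so that pair would be counted as a match. Splitting the pathology label on commas and testing for exact equality may be safer, depending on how `non-WNT` is meant to be scored.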

# output table
viewDataTable(consensus.uncorrected)
```
I would expect the output from this notebook to include a table of the consensus labels (preferably as a TSV).
In modules for subtyping other histologies, we've joined all the identifiers (DNA and RNA) into a single table: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/e8fbc7dc6aa8a36d7236fbe10a038657e3782e09/analyses/molecular-subtyping-HGG/results/HGG_molecular_subtype.tsv This particular step can come in a subsequent notebook if you'd prefer. But I do not expect this notebook to be overly long, particularly with the custom function changes.
Updated the markdown to save the consensus subtype output, merged with RNA and DNA IDs, to TSV files:
https://github.com/komalsrathi/OpenPBTA-analysis/blob/mb-class-nb/analyses/molecular-subtyping-MB/02-compare-classes.Rmd#L256
and
https://github.com/komalsrathi/OpenPBTA-analysis/blob/mb-class-nb/analyses/molecular-subtyping-MB/02-compare-classes.Rmd#L279
Also updated the README at the analyses and module level.
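For readers unfamiliar with the requested output, writing a consensus table to TSV might look roughly like the sketch below. The data frame contents, the file name, and every column name other than `molecular_subtype` are assumptions for illustration; the notebook's actual columns may differ.

```r
# Hypothetical consensus table joining RNA and DNA identifiers per subject.
# All IDs are placeholders; column names other than molecular_subtype are
# assumed, not taken from the notebook.
consensus <- data.frame(
  Kids_First_Participant_ID = c("PT_0001", "PT_0002"),
  Kids_First_Biospecimen_ID_RNA = c("BS_RNA_0001", "BS_RNA_0002"),
  Kids_First_Biospecimen_ID_DNA = c("BS_DNA_0001", "BS_DNA_0002"),
  molecular_subtype = c("WNT", "Group4"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file without quotes or row names
out_file <- file.path(tempdir(), "MB_molecular_subtype.tsv")
write.table(consensus, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the file round-trips
round_trip <- read.delim(out_file, stringsAsFactors = FALSE)
print(round_trip)
```

A notebook that already loads the tidyverse could use `readr::write_tsv()` for the same purpose.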
Hi @jaclyn-taroni, thanks again,
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
@jaclyn-taroni I am doing a bit of QC just to make sure what I have is correct - I will update you shortly when this is ready to review.
@jaclyn-taroni I think this is ready now. I added another tab comparing the consensus outputs, with some details. Total accuracy is the same for the batch-corrected and uncorrected consensus outputs: each matches the reported pathology subtype for 25 of 32 samples. The two outputs agree on 24 of those 25 correct calls. BS_V96WVE3Z is correctly predicted by the uncorrected consensus but not by the batch-corrected one, and BS_HB03GSHF is correctly predicted by the batch-corrected consensus but not by the uncorrected one. I added this information to the module README as well. The only remaining question is how to choose between the batch-corrected and uncorrected output; I am not sure on what basis, because the total accuracy is the same.
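The counts in this comment can be checked with simple set operations. In the sketch below, the 24 shared sample IDs are placeholders invented for illustration; only the two named biospecimen IDs and the totals (25 correct out of 32 labelled samples) come from the comment.

```r
# Placeholder IDs for the 24 samples both consensus outputs get right
shared <- sprintf("SAMPLE_%02d", 1:24)

# Correctly predicted samples per consensus output
correct_uncorrected <- c(shared, "BS_V96WVE3Z")
correct_corrected <- c(shared, "BS_HB03GSHF")

# Number of samples with a pathology label, as stated in the comment
n_labelled <- 32

accuracy_uncorrected <- 100 * length(correct_uncorrected) / n_labelled
accuracy_corrected <- 100 * length(correct_corrected) / n_labelled
n_shared <- length(intersect(correct_uncorrected, correct_corrected))

print(c(uncorrected = accuracy_uncorrected, corrected = accuracy_corrected))
print(n_shared)
# Samples that only one of the two outputs gets right
print(setdiff(correct_uncorrected, correct_corrected))
print(setdiff(correct_corrected, correct_uncorrected))
```

Both outputs come to 78.125% accuracy, so total accuracy alone cannot separate them; the two discordant samples are where any further comparison would have to look.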
Purpose/implementation Section
What scientific question is your analysis addressing?
Summarizes the following:
`molecular_subtype`.
What was your approach?
As suggested in #742, I have added some details to the notebook:
What GitHub issue does your pull request address?
#742
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Accuracy calculation and consensus molecular subtype assignment.
Is there anything that you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
.html output containing summary tables
What is your summary of the results?
Reproducibility Checklist
Documentation Checklist
`README` and it is up to date.
`analyses/README.md` and the entry is up to date.