V12 mb subtyping (2/N) #322
Hi @jharenza, I will look into the best way to do this and discuss with you offline.
We do have a list of signature genes obtained after training the classifier. They are in the form of gene pairs (or ratios that were selected).
I can convert them into ENSG ids if that is preferable, but Ensembl identifiers also differ between Gencode versions. For example, by default Gencode appends 1 or 2 digits to the actual Ensembl identifier, which causes the difference. So I could ignore the appended digits and take only the actual Ensembl identifier. Following is the comparison of Gencode v27 and v39 (protein coding genes only):
As we know, Gencode identifiers are different between versions.
This can be resolved by comparing only Ensembl identifiers but there are some differences between them too:
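A minimal sketch of the comparison described above, assuming the appended digits are separated from the Ensembl identifier by a dot. The function names and the identifiers in the example are hypothetical and are not taken from the classifier or from either Gencode release.

```python
def strip_version(gene_id: str) -> str:
    """Drop everything after the first dot, keeping the bare Ensembl identifier."""
    return gene_id.split(".", 1)[0]


def compare_releases(ids_a, ids_b):
    """Compare two lists of gene identifiers on their unversioned form."""
    a = {strip_version(g) for g in ids_a}
    b = {strip_version(g) for g in ids_b}
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up identifiers, for illustration only
old_release = ["ENSG00000000001.4", "ENSG00000000002.10", "ENSG00000000003.1"]
new_release = ["ENSG00000000001.5", "ENSG00000000002.10", "ENSG00000000004.2"]
result = compare_releases(old_release, new_release)
```

Identifiers that land in `only_a` or `only_b` are the ones that would still need manual review after the appended digits are ignored.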
Thanks, @jharenza! The code update looks good to me. Reviewed the code, specifically in script 04 logic where you assign subtypes based on RNA-Seq and methylation. Results are reproducible.
Purpose/implementation Section
What scientific question is your analysis addressing?
This PR runs medulloblastoma (MB) subtyping for v12.
What was your approach?
Additionally, I had to revamp the subtyping output to include all bs_ids associated with each tumor that was subtyped. I made the following changes:
id, which consisted of sample_id + composition.
id were then assigned the respective subtypes.
MB, To be classified.
Note: there is another subtype called "MB_MYO" in the methylation data; this is medullomyoblastoma. It is not in WHO 2021, but I kept it nonetheless because it might be informative.
Note 2: This analysis is reviewable, but it may change if we need to update the medullo classifier because of updated gene symbols. cc @komalsrathi to advise on this; if an update is needed, I suggest we change the classifier to use ENSG ids to avoid any future symbol update needs.
Note 3: I think it would be good to assess the accuracy of the medullo classifier against the methylation subtypes and/or report which calls are discrepant. I will add this, but my initial check some months ago showed >90% accuracy. In the discrepant cases, I might also update to the methyl subtypes as the gold standard.
UPDATE: I did assess this and it was 99.35% accurate (154/155 correct). Added methyl subtype as another column in the output file.
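The arithmetic behind that figure can be sketched as below. This is an illustrative check, not the code in the PR: `concordance` is a hypothetical helper, and the subtype labels are placeholders arranged only to reproduce the 154/155 tally reported above.

```python
def concordance(pairs):
    """Compare (classifier_subtype, methylation_subtype) pairs.

    Returns the number of matches, the total, the percent accuracy,
    and the list of discrepant pairs with their positions.
    """
    pairs = list(pairs)
    discrepant = [(i, c, m) for i, (c, m) in enumerate(pairs) if c != m]
    matched = len(pairs) - len(discrepant)
    accuracy = 100 * matched / len(pairs) if pairs else float("nan")
    return matched, len(pairs), accuracy, discrepant


# Placeholder data: 154 agreeing calls and 1 disagreeing call
pairs = [("SHH", "SHH")] * 154 + [("Group3", "Group4")]
matched, total, accuracy, discrepant = concordance(pairs)
print(f"{matched}/{total} correct ({accuracy:.2f}%)")
```

Returning the discrepant pairs alongside the percentage makes it straightforward to report which samples disagree, as suggested in Note 3.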
What GitHub issue does your pull request address?
Part of v12 release
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Please review script 04 in detail. Did I capture everything accurately as described above?
Is there anything that you want to discuss further?
See notes above.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
Medulloblastoma subtype table output.
Note 4: I made this in long format so it will be easier to join to the histologies file later.
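A rough sketch of what such a long-format table could look like, with one row per bs_id. Everything here is an assumption made for illustration: the helper names, the column names, the underscore separator in the id, the example values, and the use of "MB, To be classified" as the fallback for tumors without a subtype call. The PR's script 04 may do this differently.

```python
def make_id(sample_id, composition):
    """Build a per-tumor id from sample_id and composition (separator is assumed)."""
    return f"{sample_id}_{composition}"


def to_long(samples, subtype_by_id, default="MB, To be classified"):
    """Emit one row per bs_id, carrying the subtype assigned to its tumor id."""
    rows = []
    for s in samples:
        tumor_id = make_id(s["sample_id"], s["composition"])
        rows.append({
            "bs_id": s["bs_id"],
            "id": tumor_id,
            "molecular_subtype": subtype_by_id.get(tumor_id, default),
        })
    return rows


# Made-up biospecimens: two bs_ids share one tumor, a third has no call
samples = [
    {"bs_id": "BS_A", "sample_id": "S1", "composition": "Solid Tissue"},
    {"bs_id": "BS_B", "sample_id": "S1", "composition": "Solid Tissue"},
    {"bs_id": "BS_C", "sample_id": "S2", "composition": "Solid Tissue"},
]
subtype_by_id = {"S1_Solid Tissue": "MB, SHH"}
long_rows = to_long(samples, subtype_by_id)
```

Because each row is keyed by a single bs_id, the table can be joined to another per-biospecimen file on that column without unpacking lists of ids.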
What is your summary of the results?
885 MB bs ids
267 patients
286 tumors subtyped:
Reproducibility Checklist
Documentation Checklist
README and it is up to date.
analyses/README.md and the entry is up to date.