Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V12 mb subtyping (2/N) #322

Merged
merged 383 commits into from
Apr 30, 2023
Merged

V12 mb subtyping (2/N) #322

merged 383 commits into from
Apr 30, 2023

Conversation

jharenza
Copy link
Member

@jharenza jharenza commented Feb 26, 2023

Purpose/implementation Section

What scientific question is your analysis addressing?

This PR runs MB subtyping for v12

What was your approach?

Additionally, I had to revamp the subtyping output to include all associated bs_ids to each tumor which was subtyped. I made the following changes:

  • Updated script 03 to only do accuracy assessment rather than export of subtypes
  • I moved the MB expression subset file to be created to scratch rather than the output directory
  • I redid script 04 totally and that focuses on assigning subtypes.
    • I preferentially chose the RNA-Seq classifier subtype and if there was any discrepancy for two bs_ids of the same tumor, I used methylation to assign the subtype. I did this using a match id called id which consisted of sample_id+composition.
    • If no RNA-Seq subtype, but high confidence methylation subtype, the tumor was assigned the methyl subtype.
    • All non-RNA-Seq and non-methyl samples matching the id were then assigned the respective subtypes.
    • All samples with NA in molecular subtype at the end of this were deemed MB, To be classified.
  • The pathology "true positive" file from OpenPBTA data was not the actual file being used in 03 for input (there was an RDS in here), so I swapped them. Results are the same.
  • The README was wildly out of date, so I updated it.

Note: there is another subtype called "MB_MYO" in the methylation data - this is medullomyoblastoma. This is not in WHO 2021, but I kept it nonetheless, because it might be informative.

Note 2: This analysis is reviewable, but may change if we need to update the medullo classifier because of updated gene symbols being used. cc @komalsrathi to inform on this and if yes, I suggest we update the classifier to use ENSG ids to avoid any future symbol update needs.

Note 3: I think it will be good to assess accuracy of medullo classifier using the methylation subtypes and/or report which are discrepant. I will add this, but my initial check some months ago was >90%. In these cases, I might also update to methyl subtypes as the gold standard.
UPDATE: I did assess this and it was 99.35% accurate (154/155 correct). Added methyl subtype as another column in the output file.

What GitHub issue does your pull request address?

Part of v12 release

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please review script 04 in detail. Did I capture everything accurately as described above?

Is there anything that you want to discuss further?

See notes above.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

medulloblastoma subtype table output.

Note 4: I made this in long format so it will be easier to join to the histologies file later.

What is your summary of the results?

885 MB bs ids
267 patients
286 tumors subtyped:

  molecular_subtype        n
  <chr>                <int>
1 MB, Group3              71
2 MB, Group4             114
3 MB, SHH                 77
4 MB, To be classified    55
5 MB, WNT                 24

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@jharenza jharenza changed the base branch from dev to v12-analysis-files February 26, 2023 21:58
@komalsrathi
Copy link

Note 2: This analysis is reviewable, but may change if we need to update the medullo classifier because of updated gene symbols being used. cc @komalsrathi to inform on this and if yes, I suggest we update the classifier to use ENSG ids to avoid any future symbol update needs.

Hi @jharenza I will look into the best way to do this and discuss with you offline.

@komalsrathi
Copy link

We do have a list of signature genes obtained after training the classifier. They are in the form of gene pairs (or ratios that were selected)

> bestFeaturesNew %>% head()
[1] "ATP8A1_NRN1"    "PAPSS1_PTP4A1"  "AXIN2_FAM171A2" "PAPSS1_UNCX"    "AXIN2_BRPF3"    "AXIN2_CDK5R1"  

So I can convert them into ENSG if that is preferable, but I know that Ensembl identifiers between Gencode versions also differ. For e.g. by default Gencode has 1 or 2 digits appended to the actual Ensembl identifier which causes the difference. So I could ignore the appended digits and only take the actual Ensembl identifier.

Following is the comparison of Gencode v27 and v39 (protein coding genes only):

# version 27
gencode_gtf_v27 <- 'data/gencode.v27.primary_assembly.annotation.gtf.gz'
gencode_gtf_v27 <- rtracklayer::import(con = gencode_gtf_v27)
gencode_gtf_v27 <- as.data.frame(gencode_gtf_v27)
gencode_gtf_v27 <- gencode_gtf_v27 %>%
  filter(gene_type == "protein_coding") %>%
  dplyr::select(gene_name, gene_id) %>%
  unique()

# version 39
gencode_gtf_v39 <- data/gencode.v39.primary_assembly.annotation.gtf.gz'
gencode_gtf_v39 <- rtracklayer::import(con = gencode_gtf_v39)
gencode_gtf_v39 <- as.data.frame(gencode_gtf_v39)
gencode_gtf_v39 <- gencode_gtf_v39 %>%
  filter(gene_type == "protein_coding") %>%
  dplyr::select(gene_name, gene_id) %>%
  unique()

# compare protein coding genes in both versions
comparison <- gencode_gtf_v39 %>%
  dplyr::mutate(ensembl_id_v39 = gsub("[.].*", "", gene_id)) %>% # remove appended digits from Gencode to get Ensembl id
  dplyr::rename("gencode_id_v39" = "gene_id") %>%
  inner_join(gencode_gtf_v27 %>%
               dplyr::mutate(ensembl_id_v27 = gsub("[.].*", "", gene_id)) %>% # remove appended digits from Gencode to get Ensembl id
               dplyr::rename("gencode_id_v27" = "gene_id"), by = c("gene_name"))

# common gene symbols between both versions
dim(comparison)
[1] 18676     5

As we know, Gencode identifiers are different between versions

comparison[which(comparison$gencode_id_v39 != comparison$gencode_id_v27),] %>% nrow()
[1] 17493

comparison[which(comparison$gencode_id_v39 != comparison$gencode_id_v27),] %>% head()
gene_name     gencode_id_v39  ensembl_id_v39     gencode_id_v27  ensembl_id_v27
1     OR4F5  ENSG00000186092.7 ENSG00000186092  ENSG00000186092.5 ENSG00000186092
2    OR4F29  ENSG00000284733.2 ENSG00000284733  ENSG00000284733.1 ENSG00000284733
3    OR4F16  ENSG00000284662.2 ENSG00000284662  ENSG00000284662.1 ENSG00000284662
4    SAMD11 ENSG00000187634.13 ENSG00000187634 ENSG00000187634.11 ENSG00000187634
5     NOC2L ENSG00000188976.11 ENSG00000188976 ENSG00000188976.10 ENSG00000188976
6    KLHL17 ENSG00000187961.15 ENSG00000187961 ENSG00000187961.13 ENSG00000187961

This can be resolved by comparing only Ensembl identifiers but there are some differences between them too:

# Ensembl id difference between both versions
comparison[which(comparison$ensembl_id_v39 != comparison$ensembl_id_v27),] %>% nrow()
[1] 43

# first few genes where Ensembl identifiers are different
comparison[which(comparison$ensembl_id_v39 != comparison$ensembl_id_v27),] %>% head()
gene_name     gencode_id_v39  ensembl_id_v39     gencode_id_v27  ensembl_id_v27
835      BTBD8 ENSG00000189195.15 ENSG00000189195  ENSG00000284413.2 ENSG00000284413
1026   PPIAL4D  ENSG00000289549.1 ENSG00000289549  ENSG00000256374.2 ENSG00000256374
1059   PPIAL4C  ENSG00000288867.1 ENSG00000288867  ENSG00000263464.2 ENSG00000263464
1827      TBCE  ENSG00000285053.1 ENSG00000285053 ENSG00000116957.12 ENSG00000116957
1828      TBCE  ENSG00000284770.2 ENSG00000284770 ENSG00000116957.12 ENSG00000116957
2166     STPG4 ENSG00000239605.11 ENSG00000239605  ENSG00000273269.2 ENSG00000273269

Copy link

@ewafula ewafula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jharenza! The code update looks good to me. Reviewed the code, specifically in script 04 logic where you assign subtypes based on RNA-Seq and methylation. Results are reproducible.

@jharenza jharenza merged commit 0e7e47c into v12-analysis-files Apr 30, 2023
@jharenza jharenza deleted the v12-mb branch April 30, 2023 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants