V12 mb subtyping (2/N) #322

jharenza · 2023-02-26T21:57:20Z

Purpose/implementation Section

What scientific question is your analysis addressing?

This PR runs medulloblastoma (MB) subtyping for v12.

What was your approach?

I also had to revamp the subtyping output to include all bs_ids associated with each subtyped tumor. I made the following changes:

- Updated script 03 to only do accuracy assessment rather than also exporting subtypes.
- Moved the MB expression subset file so it is created in scratch rather than in the output directory.
- Rewrote script 04 completely; it now focuses on assigning subtypes:
  - I preferentially chose the RNA-Seq classifier subtype. If there was any discrepancy between two bs_ids of the same tumor, I used methylation to assign the subtype. Tumors were matched using a match ID called id, which consists of sample_id + composition.
  - If there was no RNA-Seq subtype but there was a high-confidence methylation subtype, the tumor was assigned the methylation subtype.
  - All non-RNA-Seq and non-methylation samples matching the id were then assigned the respective subtypes.
  - All samples with NA in molecular subtype at the end of this were deemed "MB, To be classified".
- The pathology "true positive" file from OpenPBTA data was not the actual file being used as input in 03 (there was an RDS in here), so I swapped them. Results are the same.
- The README was wildly out of date, so I updated it.
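
The script 04 assignment rules listed above can be sketched roughly as follows. This is a minimal base R illustration, not the code in script 04: the column names (`bs_id`, `id`, `rna_subtype`, `methyl_subtype`), the toy data, and the helper `assign_mb_subtype()` are all hypothetical, and `methyl_subtype` is assumed to already be restricted to high-confidence calls.

```r
# Toy input: one row per biospecimen; `id` stands in for sample_id + composition.
# rna_subtype is NA for non-RNA-Seq samples; methyl_subtype is NA unless a
# high-confidence methylation call exists (an assumption of this sketch).
samples <- data.frame(
  bs_id          = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5", "BS_6", "BS_7"),
  id             = c("A", "A", "B", "B", "B", "C", "D"),
  rna_subtype    = c("MB, SHH", NA, "MB, Group3", "MB, Group4", NA, NA, NA),
  methyl_subtype = c(NA, NA, NA, NA, "MB, Group4", "MB, WNT", NA),
  stringsAsFactors = FALSE
)

assign_mb_subtype <- function(df) {
  per_id <- lapply(split(df, df$id), function(x) {
    rna    <- unique(x$rna_subtype[!is.na(x$rna_subtype)])
    methyl <- unique(x$methyl_subtype[!is.na(x$methyl_subtype)])
    subtype <- if (length(rna) == 1) {
      rna                      # a single, concordant RNA-Seq call is preferred
    } else if (length(methyl) == 1) {
      methyl                   # RNA-Seq discrepant or absent: use methylation
    } else {
      "MB, To be classified"   # nothing usable for this tumor
    }
    data.frame(id = x$id[1], molecular_subtype = subtype,
               stringsAsFactors = FALSE)
  })
  per_id <- do.call(rbind, per_id)
  # Propagate the tumor-level call to every bs_id sharing the match id
  merge(df[, c("bs_id", "id")], per_id, by = "id")
}

result <- assign_mb_subtype(samples)
print(result[order(result$bs_id), ])
```

In this toy data, tumor B has two conflicting RNA-Seq calls, so its methylation call is used for all three of its bs_ids, and tumor D ends up as "MB, To be classified".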

Note: there is another subtype called "MB_MYO" in the methylation data - this is medullomyoblastoma. This is not in WHO 2021, but I kept it nonetheless, because it might be informative.

Note 2: This analysis is reviewable, but may change if we need to update the medullo classifier because of updated gene symbols being used. cc @komalsrathi to inform on this and if yes, I suggest we update the classifier to use ENSG ids to avoid any future symbol update needs.

Note 3: I think it would be good to assess the accuracy of the medullo classifier against the methylation subtypes and/or report which calls are discrepant. I will add this; my initial check some months ago showed >90% accuracy. In discrepant cases, I might also switch to the methyl subtypes as the gold standard.
UPDATE: I did assess this and it was 99.35% accurate (154/155 correct). Added methyl subtype as another column in the output file.
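
A concordance check of this kind could look like the sketch below. The `paired` table and its column names are hypothetical toy data, not the module's output; only the final line uses the counts reported above.

```r
# Hypothetical paired calls for tumors that have both an RNA-Seq classifier
# subtype and a methylation subtype.
paired <- data.frame(
  id             = c("A", "B", "C", "D"),
  rna_subtype    = c("MB, SHH", "MB, Group3", "MB, Group4", "MB, WNT"),
  methyl_subtype = c("MB, SHH", "MB, Group4", "MB, Group4", "MB, WNT"),
  stringsAsFactors = FALSE
)

concordant <- paired$rna_subtype == paired$methyl_subtype
accuracy   <- 100 * mean(concordant)   # percent of tumors whose calls agree
discrepant <- paired[!concordant, ]    # tumors to report or re-check

# The figure reported above follows the same arithmetic (154 of 155 correct):
reported_accuracy <- round(100 * 154 / 155, 2)
print(reported_accuracy)
```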

What GitHub issue does your pull request address?

Part of v12 release

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please review script 04 in detail. Did I capture everything accurately as described above?

Is there anything that you want to discuss further?

See notes above.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

medulloblastoma subtype table output.

Note 4: I made this in long format so it will be easier to join to the histologies file later.
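
As a rough illustration of why the long format helps, a table with one row per bs_id can be left-joined to a histologies-style table on a single key. Both data frames and their column names below are hypothetical stand-ins, not the real files.

```r
# Hypothetical long-format subtype table: one row per biospecimen.
subtypes <- data.frame(
  bs_id             = c("BS_1", "BS_2", "BS_3"),
  molecular_subtype = c("MB, SHH", "MB, SHH", "MB, To be classified"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the histologies file.
histologies <- data.frame(
  bs_id                 = c("BS_1", "BS_2", "BS_3", "BS_4"),
  experimental_strategy = c("RNA-Seq", "WGS", "Methylation", "WGS"),
  stringsAsFactors = FALSE
)

# Left join: keep every histologies row; biospecimens without a subtype get NA
joined <- merge(histologies, subtypes, by = "bs_id", all.x = TRUE)
print(joined)
```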

What is your summary of the results?

885 MB bs ids
267 patients
286 tumors subtyped:

  molecular_subtype        n
  <chr>                <int>
1 MB, Group3              71
2 MB, Group4             114
3 MB, SHH                 77
4 MB, To be classified    55
5 MB, WNT                 24
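
A quick arithmetic check on the table above, using only the counts shown. The four assigned subtypes sum to 286, which would match the "tumors subtyped" figure if "MB, To be classified" is excluded from it; that reading is an assumption, not something stated in this PR.

```r
# Counts copied from the summary table above
subtype_counts <- c(
  "MB, Group3"           = 71,
  "MB, Group4"           = 114,
  "MB, SHH"              = 77,
  "MB, To be classified" = 55,
  "MB, WNT"              = 24
)

total    <- sum(subtype_counts)
assigned <- sum(subtype_counts[names(subtype_counts) != "MB, To be classified"])
print(c(total = total, assigned = assigned))
```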

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

komalsrathi · 2023-02-27T13:16:55Z

> Note 2: This analysis is reviewable, but may change if we need to update the medullo classifier because of updated gene symbols being used. cc @komalsrathi to inform on this and if yes, I suggest we update the classifier to use ENSG ids to avoid any future symbol update needs.

Hi @jharenza I will look into the best way to do this and discuss with you offline.

komalsrathi · 2023-02-27T13:50:15Z

We do have a list of signature genes obtained after training the classifier. They are in the form of gene pairs (the ratios that were selected):

> bestFeaturesNew %>% head()
[1] "ATP8A1_NRN1"    "PAPSS1_PTP4A1"  "AXIN2_FAM171A2" "PAPSS1_UNCX"    "AXIN2_BRPF3"    "AXIN2_CDK5R1"
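
Before any conversion to ENSG ids, the pairs would need to be split into individual symbols. A minimal sketch using the six features shown above, assuming each feature is exactly two symbols joined by a single underscore (true for these six, not verified for the full list); the variable names are hypothetical:

```r
# The six features printed above
best_features <- c("ATP8A1_NRN1", "PAPSS1_PTP4A1", "AXIN2_FAM171A2",
                   "PAPSS1_UNCX", "AXIN2_BRPF3", "AXIN2_CDK5R1")

# Split each pair into its two gene symbols (one row per pair)
pairs <- do.call(rbind, strsplit(best_features, "_", fixed = TRUE))
colnames(pairs) <- c("gene_a", "gene_b")

# Unique symbols that would need an ENSG mapping
signature_genes <- sort(unique(as.vector(pairs)))
print(signature_genes)
```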

So I can convert them into ENSG ids if that is preferable, but I know that Ensembl identifiers also differ between Gencode versions. For example, by default Gencode appends a version suffix (a dot followed by one or two digits) to the actual Ensembl identifier, which causes the difference. So I could ignore the appended digits and take only the actual Ensembl identifier.
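
For illustration, stripping the suffix could look like this in base R. The ids are taken from the comparison output in this comment, and the pattern is the same `"[.].*"` used in the comparison code; the variable names are hypothetical.

```r
# Versioned Gencode ids for OR4F5 (v39 and v27) and SAMD11 (v39)
gencode_ids <- c("ENSG00000186092.7", "ENSG00000186092.5", "ENSG00000187634.13")

# Drop everything from the first dot onward to recover the bare Ensembl id
ensembl_ids <- sub("[.].*", "", gencode_ids)
print(ensembl_ids)
```

The two OR4F5 ids differ as versioned Gencode ids but become identical once the suffix is removed.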

Following is the comparison of Gencode v27 and v39 (protein coding genes only):

library(dplyr)  # provides %>%, filter(), and inner_join() used below

# version 27
gencode_gtf_v27 <- 'data/gencode.v27.primary_assembly.annotation.gtf.gz'
gencode_gtf_v27 <- rtracklayer::import(con = gencode_gtf_v27)
gencode_gtf_v27 <- as.data.frame(gencode_gtf_v27)
gencode_gtf_v27 <- gencode_gtf_v27 %>%
  filter(gene_type == "protein_coding") %>%
  dplyr::select(gene_name, gene_id) %>%
  unique()

# version 39
gencode_gtf_v39 <- 'data/gencode.v39.primary_assembly.annotation.gtf.gz'
gencode_gtf_v39 <- rtracklayer::import(con = gencode_gtf_v39)
gencode_gtf_v39 <- as.data.frame(gencode_gtf_v39)
gencode_gtf_v39 <- gencode_gtf_v39 %>%
  filter(gene_type == "protein_coding") %>%
  dplyr::select(gene_name, gene_id) %>%
  unique()

# compare protein coding genes in both versions
comparison <- gencode_gtf_v39 %>%
  dplyr::mutate(ensembl_id_v39 = gsub("[.].*", "", gene_id)) %>% # remove appended digits from Gencode to get Ensembl id
  dplyr::rename("gencode_id_v39" = "gene_id") %>%
  inner_join(gencode_gtf_v27 %>%
               dplyr::mutate(ensembl_id_v27 = gsub("[.].*", "", gene_id)) %>% # remove appended digits from Gencode to get Ensembl id
               dplyr::rename("gencode_id_v27" = "gene_id"), by = c("gene_name"))

# common gene symbols between both versions
dim(comparison)
[1] 18676     5

As we know, Gencode identifiers are different between versions:

comparison[which(comparison$gencode_id_v39 != comparison$gencode_id_v27),] %>% nrow()
[1] 17493

comparison[which(comparison$gencode_id_v39 != comparison$gencode_id_v27),] %>% head()
  gene_name     gencode_id_v39  ensembl_id_v39     gencode_id_v27  ensembl_id_v27
1     OR4F5  ENSG00000186092.7 ENSG00000186092  ENSG00000186092.5 ENSG00000186092
2    OR4F29  ENSG00000284733.2 ENSG00000284733  ENSG00000284733.1 ENSG00000284733
3    OR4F16  ENSG00000284662.2 ENSG00000284662  ENSG00000284662.1 ENSG00000284662
4    SAMD11 ENSG00000187634.13 ENSG00000187634 ENSG00000187634.11 ENSG00000187634
5     NOC2L ENSG00000188976.11 ENSG00000188976 ENSG00000188976.10 ENSG00000188976
6    KLHL17 ENSG00000187961.15 ENSG00000187961 ENSG00000187961.13 ENSG00000187961

This can be resolved by comparing only Ensembl identifiers, but there are some differences between them too:

# Ensembl id difference between both versions
comparison[which(comparison$ensembl_id_v39 != comparison$ensembl_id_v27),] %>% nrow()
[1] 43

# first few genes where Ensembl identifiers are different
comparison[which(comparison$ensembl_id_v39 != comparison$ensembl_id_v27),] %>% head()
     gene_name     gencode_id_v39  ensembl_id_v39     gencode_id_v27  ensembl_id_v27
835      BTBD8 ENSG00000189195.15 ENSG00000189195  ENSG00000284413.2 ENSG00000284413
1026   PPIAL4D  ENSG00000289549.1 ENSG00000289549  ENSG00000256374.2 ENSG00000256374
1059   PPIAL4C  ENSG00000288867.1 ENSG00000288867  ENSG00000263464.2 ENSG00000263464
1827      TBCE  ENSG00000285053.1 ENSG00000285053 ENSG00000116957.12 ENSG00000116957
1828      TBCE  ENSG00000284770.2 ENSG00000284770 ENSG00000116957.12 ENSG00000116957
2166     STPG4 ENSG00000239605.11 ENSG00000239605  ENSG00000273269.2 ENSG00000273269
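
One way to see whether these differences matter for the classifier would be to intersect the signature symbols with the discrepant genes and to flag symbols that map to more than one Ensembl id. The sketch below uses only the six rows and six feature pairs shown in this comment, so an empty intersection here says nothing about the remaining 37 discrepant genes or the rest of the signature; the variable names are hypothetical.

```r
# The six discrepant rows printed above (bare Ensembl ids only)
discrepant <- data.frame(
  gene_name      = c("BTBD8", "PPIAL4D", "PPIAL4C", "TBCE", "TBCE", "STPG4"),
  ensembl_id_v39 = c("ENSG00000189195", "ENSG00000289549", "ENSG00000288867",
                     "ENSG00000285053", "ENSG00000284770", "ENSG00000239605"),
  ensembl_id_v27 = c("ENSG00000284413", "ENSG00000256374", "ENSG00000263464",
                     "ENSG00000116957", "ENSG00000116957", "ENSG00000273269"),
  stringsAsFactors = FALSE
)

# Symbols from the six feature pairs shown earlier in this comment
signature_genes <- c("ATP8A1", "NRN1", "PAPSS1", "PTP4A1", "AXIN2",
                     "FAM171A2", "UNCX", "BRPF3", "CDK5R1")

# Signature genes whose Ensembl id changed between versions
affected <- intersect(signature_genes, discrepant$gene_name)

# Symbols that map to more than one v39 Ensembl id
n_ids     <- tapply(discrepant$ensembl_id_v39, discrepant$gene_name,
                    function(x) length(unique(x)))
ambiguous <- names(n_ids)[n_ids > 1]
print(ambiguous)
```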

ewafula

Thanks, @jharenza! The code update looks good to me. I reviewed the code, specifically the script 04 logic where you assign subtypes based on RNA-Seq and methylation. Results are reproducible.

jharenza requested review from ewafula and zzgeng February 26, 2023 21:58

jharenza changed the base branch from dev to v12-analysis-files February 26, 2023 21:58

jharenza added the ready for review label Feb 27, 2023

ewafula approved these changes Feb 27, 2023

zzgeng approved these changes Feb 27, 2023

jharenza and others added 22 commits April 13, 2023 20:53

update Other tumors (all histiocytic JXG)

6b9ec8b

update SEGA subtype SEGA, To be classified

6955143

Merge remote-tracking branch 'origin/v12-lgg' into v12-path

74666fb

rerun

a9f7a7b

Merge branch 'v12-path' into v12-integrate

5a55eed

rerun module

53c2678

add missing methyl samples, rerun

b9f5e3a

Merge branch 'v12-hgg' into v12-atrt

ea37734

Merge branch 'v12-atrt' into v12-path

eb1f2eb

rerun with hgg changes

f3f325c

Merge branch 'v12-path' into v12-integrate

f72feed

rerun with hgg changes

bc46a14

get rid of discrepancies, rerun

7643718

Merge branch 'v12-hgg' into v12-atrt

b8ac633

Merge branch 'v12-atrt' into v12-path

dcab142

rerun

5f44f10

Merge branch 'v12-path' into v12-integrate

a3e6867

add mol subtype methyl in compiled file

045c297

Merge branch 'v12-path' into v12-integrate

893ad37

include non-subtyped LGG methyl

21c2d83

update 04-subtype file

59702c3

include non-subtyped LGG methyl

bae82b3

Ubuntu added 7 commits April 26, 2023 04:18

update samples 7316-6047

7fb9c92

Merge remote-tracking branch 'origin/v12-path' into v12-integrate

2b0db64

update samples 7316-6047

5ee4b93

Merge remote-tracking branch 'origin/v12-integrate' into v12-independ…

aa08934

…ent-samples

rerun with updates final histologies

64d20a3

v12 CI subset files

1552211

update dockerfile to use debian stretch archive

5171fff

ewafula force-pushed the v12-mb branch from e61bfaf to 29140b2 Compare April 27, 2023 19:43

Ubuntu and others added 18 commits April 27, 2023 22:12

update molecular-subtyping-MB to run with CI subsets

3105cbe

update subtyping module to run with CI subsets

ff817f8

update tp53 to run with CI subsets

dbad3aa

Merge pull request #355 from PediatricOpenTargets/v12-subset-files

0c1eb9b

v12 CI subset files (16/N)

exclude immuned deconvo from GA checks

1dd2a39

Merge pull request #344 from PediatricOpenTargets/v12-independent-sam…

a278add

…ples

Merge pull request #336 from PediatricOpenTargets/v12-integrate

68ed62e

Merge pull request #335 from PediatricOpenTargets/v12-path

646986a

Merge pull request #333 from PediatricOpenTargets/v12-lgg

a917d2b

Merge pull request #332 from PediatricOpenTargets/v12-atrt

719ab9b

Merge pull request #331 from PediatricOpenTargets/v12-hgg

1e67747

Merge pull request #330 from PediatricOpenTargets/v12-nbl

4315f4e

Merge pull request #329 from PediatricOpenTargets/v12-neuro

e5d69d1

V12 neurocytoma (8/N)

Merge pull request #328 from PediatricOpenTargets/v12-emb

b4165ac

Merge pull request #327 from PediatricOpenTargets/v12-ews

ee0aec1

Merge pull request #326 from PediatricOpenTargets/v12-chordoma

ba21ca3

V12 chordoma (5/N)

Merge pull request #325 from PediatricOpenTargets/v12-cranio

ffbbb9f

Merge pull request #324 from PediatricOpenTargets/v12-epn

f71a265

jharenza added merge next and removed ready for review labels Apr 30, 2023

jharenza merged commit 0e7e47c into v12-analysis-files Apr 30, 2023

jharenza deleted the v12-mb branch April 30, 2023 11:37
