Molecular subtyping nbl #264

Merged
merged 151 commits into from
Feb 10, 2023

@adilahiri adilahiri commented Oct 6, 2022

Purpose/implementation Section

What scientific question is your analysis addressing?

To molecularly subtype neuroblastoma, ganglioneuroblastoma, and ganglioneuroma samples into MYCN amplified or MYCN non-amplified.

What was your approach?

To obtain the NBL samples, we first filtered the histology file on pathology_free_text_diagnosis, sample_type, and experimental_strategy. We only consider the following values in each column:

| Pathology_Diagnosis | sample_type | experimental_strategy |
| --- | --- | --- |
| Neuroblastoma | Tumor | WGS |
| Ganglioneuroblastoma | | WXS |
| Ganglioneuroblastoma, nodular | | Targeted Sequencing |
| Ganglioneuroblastoma, intermixed | | RNA-Seq |
| Ganglioneuroma, maturing subtype OR Ganglioneuroblastoma, well differentiated | | |
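
As a rough illustration of this filter (not the module's actual code, which lives in the 00-Analysis-RMD file), the sketch below keeps a record only when every one of the three columns holds an allowed value. It assumes the column names match the table header, that records are plain dictionaries, and that the last diagnosis entry is a single literal string; `is_nbl_record` and `filter_histology` are hypothetical helper names.

```python
# Allowed values per column, copied from the table above.
# Assumption: the last diagnosis entry is one literal value, not two alternatives.
ALLOWED_VALUES = {
    "Pathology_Diagnosis": {
        "Neuroblastoma",
        "Ganglioneuroblastoma",
        "Ganglioneuroblastoma, nodular",
        "Ganglioneuroblastoma, intermixed",
        "Ganglioneuroma, maturing subtype OR Ganglioneuroblastoma, well differentiated",
    },
    "sample_type": {"Tumor"},
    "experimental_strategy": {"WGS", "WXS", "Targeted Sequencing", "RNA-Seq"},
}


def is_nbl_record(row):
    """Return True when every filtered column holds one of its allowed values."""
    return all(row.get(col) in values for col, values in ALLOWED_VALUES.items())


def filter_histology(rows):
    """Keep only the histology records that pass all three column filters."""
    return [row for row in rows if is_nbl_record(row)]
```

A record with a missing column is dropped, because `row.get` returns `None`, which is not in any allowed set.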

Next, we filter consensus_wgs_plus_cnvkit_wxs.tsv.gz and gene-expression-rsem-tpm-collapsed.rds for the gene symbol MYCN and join the results with the filtered histology file. We use this composite file to get the DNA and RNA biospecimen IDs for each record and then subtype the records based on the following criteria:

Subtyping criteria:

case 1:
If pathology_free_text_diagnosis is amplified and status is amplified, assign the subtype NBL, MYCN amplified.

case 2:
If pathology_free_text_diagnosis is non-amp and status is amplified, assign the subtype NBL, MYCN amplified.

case 3:
If pathology_free_text_diagnosis is non-amp and status is non-amp, assign the subtype NBL, MYCN non-amplified.

case 4:
If pathology_free_text_diagnosis is amplified and status is non-amp:
For such samples, if a TPM value exists, check whether the TPM is above or below the Suggested_Cutoff established in the image results/TPM_Biospecimen_All_Samples_With_TMP.png and assign the subtype NBL, MYCN amplified or NBL, MYCN non-amplified, respectively. If there is no TPM value, assign the subtype NA.

case 5:
If samples are not yet subtyped but have a TPM value, assign them a subtype based on the Suggested_Cutoff.

case 6:
All remaining samples are not subtyped, and the subtype field is left as NA.
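
The six cases can be sketched as one decision function. This is a minimal illustration rather than the code in the 00-Analysis-RMD file. It assumes the pathology call and the copy-number status have already been reduced to the strings "amplified" and "non-amp" (or `None` when missing), that a TPM exactly equal to the cutoff counts as non-amplified, and that the cutoff is passed in by the caller. The function names are hypothetical.

```python
from typing import Optional

AMP = "NBL, MYCN amplified"
NON_AMP = "NBL, MYCN non-amplified"


def subtype_by_tpm(tpm: Optional[float], cutoff: float) -> Optional[str]:
    """Subtype from MYCN TPM alone; None (NA) when no TPM value exists."""
    if tpm is None:
        return None
    # Assumption: "above the cutoff" is strict, so tpm == cutoff is non-amplified.
    return AMP if tpm > cutoff else NON_AMP


def assign_subtype(pathology: Optional[str], status: Optional[str],
                   tpm: Optional[float], cutoff: float) -> Optional[str]:
    """Apply cases 1-6 in order and return the subtype, or None for NA."""
    # Cases 1 and 2: an amplified status wins whatever pathology says.
    if status == "amplified" and pathology in ("amplified", "non-amp"):
        return AMP
    # Case 3: pathology and status agree on non-amplified.
    if status == "non-amp" and pathology == "non-amp":
        return NON_AMP
    # Case 4 (pathology amplified, status non-amp) and case 5 (anything
    # still unsubtyped) both fall back to TPM; with no TPM the result is
    # None, which also covers case 6.
    return subtype_by_tpm(tpm, cutoff)
```

For example, with a made-up cutoff of 100.0, `assign_subtype("amplified", "non-amp", 250.0, 100.0)` falls into case 4 and is resolved by TPM, while the same call with `tpm=None` is left as NA.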

What GitHub issue does your pull request address?

Issue#417

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The code and logic are explained throughout the 00-Analysis-RMD file; additional information is provided in the module README. Please review the data filtering steps in lines 77-224 and lines 257-329. We would like to make sure we are not missing any NBL samples.

In the plot plot/TPM_Biospecimen_All_Samples_With_TMP.png, we establish a Suggested_Cutoff for TPM values and use it to subtype samples that fall under cases 4 and 5. Please confirm that this cutoff is appropriate.

Please also review the results in the table NBL_MYCN_Subtype.tsv and the QC results in QC_table.tsv.

Is there anything that you want to discuss further?

When finding samples that have both DNA and RNA IDs, we encountered the following two issues with repeated records:

  1. Some biospecimens have the same DNA and RNA IDs but differing copy numbers and status, as mentioned in Issue#436 and comment. For the cases mentioned in the issue, we retained the record with the higher copy number.

  2. In addition, we found two other repeated records with the same DNA and RNA IDs but differing aliquot_id.
    These records are:

| DNA_ID | RNA_ID | aliquot_id |
| --- | --- | --- |
| BS_0XC02E11 | BS_2HM4AE24 | ET_FD9T78QE_DGD_STNGS_29 |
| BS_0XC02E11 | BS_2HM4AE24 | ET_FD9T78QE_DGD_STNGS_64 |
| BS_1F8J25Q1 | BS_2HM4AE24 | ET_FD9T78QE_DGD_STNGS_29 |
| BS_1F8J25Q1 | BS_2HM4AE24 | ET_FD9T78QE_DGD_STNGS_64 |

To resolve these duplicates, we retained the copies with aliquot_id ET_FD9T78QE_DGD_STNGS_64. Please review this and provide your feedback. Issues 1 and 2 are handled in lines 196-224 of the 00-Analysis-RMD file.
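
A minimal sketch of both de-duplication rules follows, for reviewers who want to reason about them outside the notebook. It is not the code in lines 196-224. It assumes records are dictionaries with `DNA_ID`, `RNA_ID`, `aliquot_id`, and `copy_number` keys, and it combines the two rules into a single ranking (preferred aliquot first, then higher copy number), which is an illustrative design choice rather than something the analysis states. `rank` and `deduplicate` are hypothetical names.

```python
# Aliquot retained for the repeated records listed above.
PREFERRED_ALIQUOT = "ET_FD9T78QE_DGD_STNGS_64"


def rank(record):
    """Sort key: preferred aliquot first, then higher copy number."""
    # Assumption: a missing copy number is treated as 0 for comparison.
    copy_number = record.get("copy_number") or 0
    return (record.get("aliquot_id") == PREFERRED_ALIQUOT, copy_number)


def deduplicate(records):
    """Keep one record per (DNA_ID, RNA_ID) pair, choosing the best-ranked."""
    best = {}
    for record in records:
        key = (record["DNA_ID"], record["RNA_ID"])
        kept = best.get(key)
        # Strict comparison: on a tie, the first record seen is retained.
        if kept is None or rank(record) > rank(kept):
            best[key] = record
    return list(best.values())
```

Applied to the four rows in the table above, this keeps one record per DNA_ID, both with aliquot_id ET_FD9T78QE_DGD_STNGS_64.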

results/Alteration_Table.tsv: This table is similar to NBL_MYCN_Subtype.tsv but has additional columns with information on MYCN_TPM, copy_number, status, and pathology_free_text_diagnosis. Furthermore, the subtype column in this table gives more detail on the samples that have subtype NA in NBL_MYCN_Subtype.tsv (samples that could not be subtyped). Samples that fell into case 4 but did not have a TPM value are labeled Pathology-amp,Status-non-amp,TPM-NA. Samples that did not fall into any of the above cases are labeled Unclassified due to insufficient info.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

Tables and Figures

What is your summary of the results?

The table NBL_MYCN_Subtype.tsv has 1168 samples, of which only 509 were assigned a subtype: 402 are amplified and 107 are non-amplified. The remaining 659 could not be subtyped due to missing information.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • [X] This analysis has been added to continuous integration.

Documentation Checklist

  • [X] This analysis module has a README and it is up to date.
  • [X] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • [X] The analytical code is documented and contains comments.

@adilahiri
Author

@jharenza: Thank you for your feedback. I have cleaned up the results folder and moved some of the intermediate files (tables) to the input directory. These intermediate files can be deleted if required. I have also updated the module README and the individual scripts to include information on MYCN being on 2p, the TPM cutoff, and the QC checks. The NA subtypes are relabeled to NBL, to be classified.

@ewafula: Thank you for pointing out the GA errors and your feedback.

ewafula commented Feb 9, 2023

@jharenza, @afarrel, this is ready to review.

@afarrel afarrel left a comment

Thanks for working on this. The code looks much cleaner, the results look reasonable after the recent changes, and the code added to fix the previously discussed issues looks okay. The code ran well on EC2.

Successfully merging this pull request may close these issues.

Proposed Analysis: Neuroblastoma (NBL) molecular subtyping