Molecular subtyping nbl #264
@jharenza: Thank you for your feedback. I have cleaned up the results folder and moved some of the intermediate files (tables) to the input directory. These intermediate files can be deleted if required. I have further updated the module README and the individual scripts to include information regarding MYCN being on 2p, the TPM cutoff, and the QC checks.
@ewafula: Thank you for pointing out the GA errors and for your feedback.
- also condensed code in script 01-subset-for-NBL
Thanks for working on this. The code looks much cleaner, the results look reasonable after the recent changes, and the added code to fix the previously discussed issues looks okay. The code ran well on EC2.
Purpose/implementation Section
What scientific question is your analysis addressing?
To molecularly subtype neuroblastoma, ganglioneuroblastoma, and ganglioneuroma samples into MYCN amplified or MYCN non-amplified.
What was your approach?
To obtain the NBL samples, we first filtered the histology file based on pathology_free_text_diagnosis, sample, and experimental_strategy. We only consider the following values in each column:
Next, we filter consensus_wgs_plus_cnvkit_wxs.tsv.gz and gene-expression-rsem-tpm-collapsed.rds for the gene symbol MYCN and then join the results with the filtered histology file. We use this composite file to get the DNA and RNA biospecimen IDs for the records and then subtype them based on the criteria below.
Subtyping criteria:
case 1:
If pathology_free_text_diagnosis is amplified and status is amplified, assign subtype as "NBL, MYCN amplified".
case 2:
If pathology_free_text_diagnosis is non-amp and status is amplified, assign subtype as "NBL, MYCN amplified".
case 3:
If pathology_free_text_diagnosis is non-amp and status is non-amp, assign subtype as "NBL, MYCN non-amplified".
case 4:
If pathology_free_text_diagnosis is amplified and status is non-amp: for such samples, if a TPM value exists, evaluate whether the TPM is above or below the Suggested_Cutoff established in the image results/TPM_Biospecimen_All_Samples_With_TMP.png and assign subtype as "NBL, MYCN amplified" or "NBL, MYCN non-amplified", respectively. If there is no TPM value, assign the subtype as NA.
case 5:
If there are samples that are not yet subtyped but have a TPM value, assign them a subtype based on the Suggested_Cutoff.
case 6:
Other remaining samples are not subtyped and the subtype field is left as NA.
What GitHub issue does your pull request address?
Issue#417
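The six subtyping cases listed under "What was your approach?" can be sketched as a small decision function. This is only an illustration in Python, not the code from the 00-Analysis-RMD file. It assumes the pathology and copy-number calls have already been normalized to the strings "amplified" and "non-amp" (or None when missing). The function names and the cutoff passed in are hypothetical, and treating a TPM exactly equal to the cutoff as non-amplified is an assumption, since the description only says "above or below".

```python
from typing import Optional

AMP = "NBL, MYCN amplified"
NON_AMP = "NBL, MYCN non-amplified"


def subtype_from_tpm(tpm: Optional[float], cutoff: float) -> Optional[str]:
    """Subtype from MYCN expression alone; None (NA) when there is no TPM value."""
    if tpm is None:
        return None
    # Assumption: strictly above the cutoff is amplified; a tie is non-amplified.
    return AMP if tpm > cutoff else NON_AMP


def assign_subtype(
    pathology: Optional[str],
    status: Optional[str],
    tpm: Optional[float],
    cutoff: float,
) -> Optional[str]:
    """Apply cases 1-6 to one sample.

    pathology, status: "amplified", "non-amp", or None when missing.
    """
    # Cases 1 and 2: the copy-number status is amplified.
    if status == "amplified" and pathology in ("amplified", "non-amp"):
        return AMP
    # Case 3: both sources agree on non-amplified.
    if status == "non-amp" and pathology == "non-amp":
        return NON_AMP
    # Case 4 (pathology amplified, status non-amp) and case 5 (not yet
    # subtyped) both fall back to the TPM cutoff; with no TPM value the
    # result is None, which also covers case 6.
    return subtype_from_tpm(tpm, cutoff)
```

For example, with a made-up cutoff of 100.0, `assign_subtype("amplified", "non-amp", 150.0, 100.0)` falls into case 4 and is resolved by the TPM value, while the same call with `tpm=None` stays unsubtyped.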
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
The code and logic are explained throughout the 00-Analysis-RMD file; additional information is also provided in the module README. Please review the data filtering steps in lines 77-224 and lines 257-329. Basically, we would like to make sure we are not missing any NBL samples.
In the plot plot/TPM_Biospecimen_All_Samples_With_TMP.png, we establish a Suggested_Cutoff for TPM values. We use this value for subtyping samples that fall under cases 4 and 5. Please check whether this cutoff is appropriate.
Also review the results in the table NBL_MYCN_Subtype.tsv and the QC results in QC_table.tsv.
Is there anything that you want to discuss further?
When finding samples that have both DNA and RNA IDs, we encountered the following 2 issues with repeating records:
1. Some of the biospecimens have the same DNA and RNA IDs but differing copy numbers and status, as mentioned in Issue#436 and comment. For the cases mentioned in the issue, we retained the record with the higher copy number.
2. In addition, we found two other repeating records with the same DNA and RNA IDs but differing aliquot_id.
These records are
We retained the copies with aliquot_id ET_FD9T78QE_DGD_STNGS_64 to tackle the issue of duplicates. Please review this and provide your feedback.
Issues 1 and 2 are tackled in lines 196-224 of the 00-Analysis-RMD file.
results/Alteration_Table.tsv: This table is similar to the table NBL_MYCN_Subtype.tsv. However, it has additional columns that contain information on MYCN_TPM, copy_number, status, and pathology_free_text_diagnosis. Furthermore, the subtype column in this table provides more insight into the samples in NBL_MYCN_Subtype.tsv that had a subtype of NA (samples that could not be subtyped). If a sample fell into case 4 but did not have a TPM value, it is subtyped as "Pathology-amp,Status-non-amp,TPM-NA". If a sample did not fall into any of the above cases, it is subtyped as "Unclassified due to insufficient info".
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
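The handling of the first duplicate issue described under "Is there anything that you want to discuss further?" (same DNA and RNA IDs, differing copy numbers, keep the higher copy number) can be sketched as follows. This is an illustrative Python sketch, not the code from lines 196-224 of the 00-Analysis-RMD file. The field names dna_id, rna_id, and copy_number are hypothetical stand-ins for the module's actual columns, and keeping the first record on a tie is an assumption.

```python
from typing import Dict, List, Tuple


def keep_highest_copy_number(records: List[dict]) -> List[dict]:
    """Keep one record per (DNA ID, RNA ID) pair: the one with the highest copy number."""
    best: Dict[Tuple[str, str], dict] = {}
    for rec in records:
        key = (rec["dna_id"], rec["rna_id"])
        # Replace the stored record only when the new one has a strictly
        # higher copy number, so the first record wins on a tie (assumption).
        if key not in best or rec["copy_number"] > best[key]["copy_number"]:
            best[key] = rec
    # Dicts preserve insertion order, so pairs come back in first-seen order.
    return list(best.values())


# Made-up records for illustration only; these are not real biospecimen IDs.
example = [
    {"dna_id": "DNA_1", "rna_id": "RNA_1", "copy_number": 2, "status": "neutral"},
    {"dna_id": "DNA_1", "rna_id": "RNA_1", "copy_number": 8, "status": "amplification"},
    {"dna_id": "DNA_2", "rna_id": "RNA_2", "copy_number": 2, "status": "neutral"},
]
deduplicated = keep_highest_copy_number(example)
```

The second duplicate issue (records differing only in aliquot_id) was resolved by retaining a specific aliquot, so it would need an explicit rule rather than this copy-number comparison.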
Results
What types of results are included (e.g., table, figure)?
Tables and Figures
What is your summary of the results?
The table NBL_MYCN_Subtype.tsv has 1168 samples, of which only 509 were assigned a subtype: 402 are amplified and 107 are non-amplified. The remaining 659 could not be subtyped due to missing information.
Reproducibility Checklist
Documentation Checklist
The module has a README and it is up to date.
The analysis is recorded in analyses/README.md and the entry is up to date.