Module authors: Krutika Gaonkar, Jaclyn Taroni, Eric Wafula, Jo Lynne Rokita; code adapted from Gregory Way (@gwaygenomics)
This module is adapted from marislab/pdx-classification, now published in Rokita et al. Cell Reports. 2019.
In brief, TP53 inactivation, NF1 inactivation, and Ras activation classifiers are applied to the OpenPedCan RNA-seq data.
The classifiers were trained on TCGA PanCan data (Way et al. Cell Reports. 2018, Knijnenburg et al. Cell Reports. 2018.).
See 01-apply-classifier.py for more information about the procedure.
To evaluate the classifier scores, we use 02-evaluate-classifier.py. The script takes SNV data as input to identify true TP53/NF1 loss samples, compares classifier scores for real and shuffled data against these true calls, and plots ROC curves.
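The comparison described above can be sketched as follows. This is a minimal standard-library illustration, not the code in the evaluation script: the scores and labels are made-up values, and `roc_points` and `auroc` are hypothetical helper names. It computes ROC points against the true calls and the area under the curve, which is expected to be higher for real scores than for scores from shuffled data.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Each distinct score, from high to low, serves as a decision threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auroc(labels, scores):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy data (hypothetical): 1 = sample with a TP53 loss call from SNV data.
labels = [1, 1, 1, 0, 0, 0]
real_scores = [0.91, 0.78, 0.40, 0.55, 0.22, 0.10]
shuffled_scores = [0.52, 0.47, 0.50, 0.49, 0.51, 0.48]

print("real AUROC:", auroc(labels, real_scores))
print("shuffled AUROC:", auroc(labels, shuffled_scores))
```

An AUROC near 0.5 for the shuffled scores is what a random baseline would be expected to give; the gap between the two values is the quantity of interest.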
The analysis can be run with the following (assuming you are in the analysis directory):
bash run_classifier.sh
- snv-consensus-plus-hotspots.maf.tsv.gz: consensus SNV calls present in all 3 callers (strelka2, mutect2, and lancet), plus hotspot-rescued calls
- gene-expression-rsem-tpm-collapsed.rds: TPM values per gene, by biospecimen_id
- consensus_seg_with_status.tsv: created by analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd; available through data download
- cnvkit_with_status.tsv: created by analyses/focal-cn-file-preparation/01-add-ploidy-cnvkit.Rmd; available through data download
- sv-manta.tsv.gz: structural variants called by Manta
00-tp53-nf1-alterations.R produces TP53_NF1_snv_alteration.tsv, which contains information about the presence or absence of coding SNVs in TP53 and NF1 for the purpose of evaluating the classifier results.
For evaluation purposes, a coding SNV 1) is found in a CDS region and 2) is not a silent mutation or in an intron, as indicated by the Variant_Classification column of the consensus mutation file.
NF1 positive examples are additionally filtered to remove missense mutations, as these are not annotated with OncoKB (#381 (comment)).
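As an illustration of the filtering rules above, here is a minimal Python sketch. It is not a port of the R script: the records are made-up, `is_positive_example` and `EXCLUDED_CLASSES` are hypothetical names, and the CDS-region check is omitted, so only the Variant_Classification part of the definition is shown.

```python
# Variant_Classification values excluded per the description above.
# The set name is hypothetical; the CDS-overlap check is NOT modeled here.
EXCLUDED_CLASSES = {"Silent", "Intron"}


def is_positive_example(record):
    """Return True if a MAF-like record counts as a TP53/NF1 coding SNV."""
    gene = record["Hugo_Symbol"]
    variant_class = record["Variant_Classification"]
    if gene not in {"TP53", "NF1"}:
        return False
    if variant_class in EXCLUDED_CLASSES:
        return False
    # NF1 missense mutations are dropped because they lack OncoKB annotation.
    if gene == "NF1" and variant_class == "Missense_Mutation":
        return False
    return True


# Toy records (hypothetical), using standard MAF column names.
records = [
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Silent"},
    {"Hugo_Symbol": "NF1", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "NF1", "Variant_Classification": "Nonsense_Mutation"},
]
kept = [r for r in records if is_positive_example(r)]
print(kept)
```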
01-apply-classifier.py produces results/gene-expression-rsem-fpkm-collapsed.stranded_classifier_scores.tsv, which contains all 3 classifier scores for all RNA data and for shuffled (i.e., random) RNA data.
- output file of this notebook: TP53_NF1_snv_alteration.tsv
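To make the idea of scoring real and shuffled RNA data concrete, here is a small sketch. It assumes the classifier is a logistic model with one weight per gene, which is an assumption for illustration; the gene names, weights, and helper names are hypothetical, and any scaling or preprocessing done by 01-apply-classifier.py is not reproduced. Classifier genes missing from a sample are skipped.

```python
import math
import random


def classifier_score(expression, weights, intercept=0.0):
    """Logistic score using only genes present in both sample and classifier."""
    shared = [gene for gene in weights if gene in expression]
    linear = intercept + sum(weights[gene] * expression[gene] for gene in shared)
    return 1.0 / (1.0 + math.exp(-linear))


def shuffle_expression(expression, rng):
    """Randomly reassign expression values among genes to build a null sample."""
    genes = list(expression)
    values = [expression[gene] for gene in genes]
    rng.shuffle(values)
    return dict(zip(genes, values))


# Hypothetical classifier weights and one hypothetical (already scaled) sample.
weights = {"GENE_A": 1.5, "GENE_B": -2.0, "GENE_C": 0.5}
sample = {"GENE_A": 0.8, "GENE_B": -0.3, "GENE_D": 1.1}  # GENE_C is absent

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
shuffled = shuffle_expression(sample, rng)

print("score:", classifier_score(sample, weights))
print("shuffled score:", classifier_score(shuffled, weights))
```

Because only shared genes contribute, a score computed this way is best read as a relative value rather than a calibrated probability.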
02-qc-rna_expression_score.Rmd compares expression of the TP53 gene to the TP53 classifier score. The notebook runs the comparison for each RNA library type in the dataset. Currently, two RNA library types, stranded and polya stranded, show a strong negative correlation between TP53 expression and TP53 inactivation score, while polya samples do not show a strong correlation. Thus, for stranded and polya stranded, expression and classifier score together might be able to predict function; for polya, the correlation is not statistically significant.
- output file of this notebook: gene-expression-rsem-tpm-collapsed_classifier_scores.tsv
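A per-library-type comparison of this kind can be sketched as below. The notebook's actual statistic is not stated here, so Pearson correlation is an assumption; the values are made-up, and `pearson` and `correlation_by_library` are hypothetical helper names.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_by_library(rows):
    """Group (library, expression, score) rows and correlate within each group."""
    grouped = defaultdict(lambda: ([], []))
    for library, expression, score in rows:
        grouped[library][0].append(expression)
        grouped[library][1].append(score)
    return {lib: pearson(xs, ys) for lib, (xs, ys) in grouped.items()}


# Toy rows (hypothetical): library type, TP53 expression, TP53 classifier score.
rows = [
    ("stranded", 1.0, 0.9), ("stranded", 2.0, 0.7),
    ("stranded", 3.0, 0.5), ("stranded", 4.0, 0.2),
    ("polya", 1.0, 0.5), ("polya", 2.0, 0.6),
    ("polya", 3.0, 0.4), ("polya", 4.0, 0.5),
]
result = correlation_by_library(rows)
print(result)
```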
03-tp53-cnv-loss-domain.Rmd compares the copy_number of regions overlapping TP53 functional domains to the TP53 classifier score. We find that tumors with <= 1 copy of TP53 have higher TP53 classifier scores, so we only retain biospecimens with <= 1 copy of TP53 as high-confidence TP53 loss. NOTE: WGS and WXS CNV calls are now processed differently (WGS uses the 2/3 consensus workflow and WXS only uses WXS samples), so the inputs are read in, subset, and then merged for analysis.
- output file of this notebook: loss_overlap_domains_tp53.tsv
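The retain-and-merge step can be illustrated with a short sketch. The records, the `biospecimen_id` field, and the helper name `high_confidence_loss` are hypothetical; only the `copy_number` column name and the <= 1 cutoff come from the description above, and the notebook itself is written in R.

```python
def high_confidence_loss(calls, max_copies=1):
    """Keep calls whose copy_number over TP53 domains is at most max_copies."""
    return [call for call in calls if call["copy_number"] <= max_copies]


# Toy CNV calls (hypothetical), kept separate per experimental strategy
# because the two sources are processed differently upstream.
wgs_calls = [
    {"biospecimen_id": "BS_A", "copy_number": 1},
    {"biospecimen_id": "BS_B", "copy_number": 2},
]
wxs_calls = [
    {"biospecimen_id": "BS_C", "copy_number": 0},
    {"biospecimen_id": "BS_D", "copy_number": 3},
]

# Filter each source on its own, then merge for downstream analysis.
merged = high_confidence_loss(wgs_calls) + high_confidence_loss(wxs_calls)
print([call["biospecimen_id"] for call in merged])
```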
04-tp53-sv-loss.Rmd investigates structural variant breakpoints overlapping TP53 for CNV loss or low expression, to gather high-confidence TP53 loss calls via structural variants.
- output files of this notebook: fusion_bk_tp53_loss.tsv and sv_overlap_tp53.tsv
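The overlap logic can be sketched as a simple interval test. Everything concrete here is hypothetical: the coordinates are toy values rather than the real TP53 locus, the field names are invented for the example, and how the notebook defines "low expression" is not specified, so it appears only as a precomputed flag.

```python
# Toy gene region (hypothetical coordinates, NOT the real TP53 locus).
GENE_REGION = ("chrN", 1000, 2000)


def breakpoint_overlaps(sv, region=GENE_REGION):
    """True if the SV interval intersects the gene region on the same chromosome."""
    chrom, start, end = region
    return sv["chrom"] == chrom and sv["start"] <= end and sv["end"] >= start


def supported_loss(sv, region=GENE_REGION):
    """Overlapping SV that is also backed by CNV loss or low expression."""
    return breakpoint_overlaps(sv, region) and (sv["cnv_loss"] or sv["low_expression"])


# Toy structural variants (hypothetical).
svs = [
    {"id": "sv1", "chrom": "chrN", "start": 1500, "end": 1600,
     "cnv_loss": True, "low_expression": False},
    {"id": "sv2", "chrom": "chrN", "start": 1900, "end": 2500,
     "cnv_loss": False, "low_expression": False},
    {"id": "sv3", "chrom": "chrN", "start": 3000, "end": 3100,
     "cnv_loss": True, "low_expression": True},
    {"id": "sv4", "chrom": "chrM", "start": 1500, "end": 1600,
     "cnv_loss": True, "low_expression": True},
]
print([sv["id"] for sv in svs if supported_loss(sv)])
```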
05-tp53-altered-annotation.Rmd takes a deeper look at TP53 altered status with respect to the number of SNVs/CNVs, which can suggest bi-allelic mutations, or with respect to cancer_predisposition and TP53 classifier scores.
- output file of this notebook: tp53_altered_status.tsv
Columns | Description |
---|---|
sample_id | 7316-XXXX id used to match DNA and RNA Kids_First_Biospecimen_IDs |
Kids_First_Biospecimen_ID | Kids_First_Biospecimen_ID |
match_id | match id created to match experimental strategies |
cancer_predispositions | Germline cancer predisposition status |
tp53_score | TP53 loss classifier score |
SNV_indel_counts | Number of deleterious SNVs found in DNA sample |
CNV_loss_counts | Number of CNV losses found in DNA sample |
HGVSp_Short | Short format of protein level change used as SNV evidence |
CNV_loss_evidence | copy_number of CNV overlapping functional domains of TP53 used as evidence |
hotspot | Any 1 SNV shown in HGVSp_Short overlaps MSKCC cancer hotspot database |
activating | Any 1 SNV shown in HGVSp_Short overlaps TP53 activating mutations R273C and R248W |
overlap_domain | Any 1 SNV causing a deletion shown in HGVSp_Short overlaps TP53 DNA binding domain. |
tp53_altered | Combined evidence, cancer predisposition and score based tp53 status |
06-evaluate-classifier.py evaluates the classifier scores against TP53 alterations (non-synonymous SNVs and all status == "loss" calls in the consensus CNV file, from 00-tp53-nf1-alterations.R).
Because some of the classifier genes are not present in the OpenPedCan dataset, the scores should be interpreted as continuous values representing relative gene alterations and not as probabilities.
ROC curves for TP53 classifier scores are saved in the results folder. We iterate through all RNA library types and generate one plot per type. With TARGET samples, there are 3 RNA library types, so 3 plots are generated:
- poly-A_TP53.png
- stranded_TP53.png
- exome_capture_TP53.png