tp53_nf1_score


Apply classifiers trained on TCGA RNA-seq data

Module authors: Krutika Gaonkar, Jaclyn Taroni, Eric Wafula, Jo Lynne Rokita; code adapted from Gregory Way (@gwaygenomics)

This module is adapted from: marislab/pdx-classification. Now published in Rokita et al. Cell Reports. 2019.

In brief, TP53 inactivation, NF1 inactivation, and Ras activation classifiers are applied to the OpenPedCan RNA-seq data. The classifiers were trained on TCGA PanCan data (Way et al. Cell Reports. 2018; Knijnenburg et al. Cell Reports. 2018). See 01-apply-classifier.py for more information about the procedure. To evaluate the classifier scores, we use 06-evaluate-classifier.py with SNV data to identify true TP53/NF1 loss samples, compare the scores of real and shuffled data against these true calls, and plot ROC curves.

Running the analysis

The analysis can be run with the following (assuming you are in the analysis directory):

bash run_classifier.sh

Inputs from data download

  • snv-consensus-plus-hotspots.maf.tsv.gz: consensus SNV calls that are present in all 3 callers (strelka2, mutect2, and lancet), plus hotspot-rescued calls
  • gene-expression-rsem-tpm-collapsed.rds: TPM values per gene for each biospecimen_id
  • consensus_seg_with_status.tsv: created by analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd; available through data download
  • cnvkit_with_status.tsv: created by analyses/focal-cn-file-preparation/01-add-ploidy-cnvkit.Rmd; available through data download
  • sv-manta.tsv.gz: structural variants called by manta

Order of analysis

00-tp53-nf1-alterations.R produces TP53_NF1_snv_alteration.tsv, which contains information about the presence or absence of coding SNVs in TP53 and NF1 for the purpose of evaluating the classifier results. For evaluation purposes, a coding SNV 1) is found in a CDS region and 2) is not a silent mutation or in an intron as indicated by the Variant_Classification column of the consensus mutation file. NF1 positive examples are additionally filtered to remove missense mutations, as these are not annotated with OncoKB (#381 (comment)).
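The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the logic of 00-tp53-nf1-alterations.R: the set of excluded classification labels, the helper names, and the example records are assumptions, and the CDS-region check is approximated here by excluding non-coding Variant_Classification values.

```python
# Hypothetical sketch of the coding-SNV filter described above.
# ASSUMPTION: this set of excluded labels is illustrative; the real script
# also checks overlap with CDS regions, which is not reproduced here.
NON_CODING_OR_SILENT = {
    "Silent", "Intron", "3'UTR", "5'UTR", "3'Flank", "5'Flank", "IGR", "RNA",
}


def is_coding_snv(record):
    """Return True if a MAF-style record counts as a coding SNV for evaluation."""
    gene = record["Hugo_Symbol"]
    classification = record["Variant_Classification"]
    if gene not in {"TP53", "NF1"}:
        return False
    if classification in NON_CODING_OR_SILENT:
        return False
    # NF1 positive examples additionally drop missense mutations.
    if gene == "NF1" and classification == "Missense_Mutation":
        return False
    return True


def altered_samples(records):
    """Collect the sample IDs with at least one qualifying SNV, per gene."""
    altered = {"TP53": set(), "NF1": set()}
    for record in records:
        if is_coding_snv(record):
            altered[record["Hugo_Symbol"]].add(record["Tumor_Sample_Barcode"])
    return altered


# Made-up records for illustration only.
example_records = [
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Missense_Mutation", "Tumor_Sample_Barcode": "BS_A"},
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Silent", "Tumor_Sample_Barcode": "BS_B"},
    {"Hugo_Symbol": "NF1", "Variant_Classification": "Nonsense_Mutation", "Tumor_Sample_Barcode": "BS_C"},
    {"Hugo_Symbol": "NF1", "Variant_Classification": "Missense_Mutation", "Tumor_Sample_Barcode": "BS_D"},
]
example_altered = altered_samples(example_records)
```

Samples absent from both sets would be treated as negative examples when the classifier scores are evaluated.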

01-apply-classifier.py produces results/gene-expression-rsem-fpkm-collapsed.stranded_classifier_scores.tsv, which contains all 3 classifier scores for all RNA data and for shuffled (i.e., random) RNA data.

  • output file of this notebook: TP53_NF1_snv_alteration.tsv
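As a rough illustration of how a score and its shuffled counterpart might be computed, the sketch below assumes a logistic-regression-style classifier defined by per-gene weights. The function names, gene names, weights, and the choice to shuffle values within one sample are all assumptions for illustration; see 01-apply-classifier.py for the actual procedure.

```python
import math
import random


def classifier_score(expression, weights, intercept=0.0):
    """Logistic score computed from the genes present in both inputs.

    ASSUMPTION: genes missing from the expression data are simply skipped.
    """
    shared_genes = [gene for gene in weights if gene in expression]
    linear = intercept + sum(weights[g] * expression[g] for g in shared_genes)
    return 1.0 / (1.0 + math.exp(-linear))


def shuffle_expression(expression, rng):
    """Return a copy of one sample with its values randomly reassigned to genes."""
    genes = list(expression)
    values = [expression[g] for g in genes]
    rng.shuffle(values)
    return dict(zip(genes, values))


# Made-up, already-scaled expression values and weights (hypothetical genes).
sample_expression = {"GENE_A": 1.2, "GENE_B": -0.4, "GENE_C": 0.3}
example_weights = {"GENE_A": 0.8, "GENE_B": -1.1, "GENE_D": 0.5}

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
real_score = classifier_score(sample_expression, example_weights)
shuffled_score = classifier_score(shuffle_expression(sample_expression, rng), example_weights)
```

Scoring the shuffled data gives a baseline: a classifier that carries real signal should separate altered from unaltered samples on the real data but not on the shuffled data.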

02-qc-rna_expression_score.Rmd compares TP53 gene expression to the TP53 classifier score. The notebook runs this comparison separately for each RNA library type in the dataset. Currently, two RNA library types, stranded and polya stranded, show a strong negative correlation between TP53 expression and the TP53 inactivation score, while polya samples do not show a strong correlation. Thus, for stranded and polya stranded samples, expression and classifier score together might be able to predict function; for polya samples, however, the correlation is not statistically significant.

  • output file of this notebook: gene-expression-rsem-tpm-collapsed_classifier_scores.tsv
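The per-library comparison can be sketched as a correlation computed within each RNA library group. This is a minimal sketch assuming a Pearson correlation and hypothetical field names (RNA_library, tp53_expression, tp53_score); the notebook may use a different statistic, and the values below are made up.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_library(rows):
    """Group rows by RNA library type and correlate expression with score."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        expression_values, score_values = grouped[row["RNA_library"]]
        expression_values.append(row["tp53_expression"])
        score_values.append(row["tp53_score"])
    return {lib: pearson(xs, ys) for lib, (xs, ys) in grouped.items()}


# Made-up values: expression falls as the score rises in the stranded group only.
example_rows = [
    {"RNA_library": "stranded", "tp53_expression": e, "tp53_score": s}
    for e, s in [(1.0, 0.9), (2.0, 0.7), (3.0, 0.4), (4.0, 0.2)]
] + [
    {"RNA_library": "polya", "tp53_expression": e, "tp53_score": s}
    for e, s in [(1.0, 0.5), (2.0, 0.6), (3.0, 0.4), (4.0, 0.5)]
]
example_correlations = correlation_by_library(example_rows)
```

A strongly negative coefficient in a group corresponds to the pattern described above, where lower TP53 expression accompanies a higher inactivation score.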

03-tp53-cnv-loss-domain.Rmd compares the copy_number of regions overlapping TP53 functional domains to the TP53 classifier score. Tumors with <= 1 copy of TP53 have higher TP53 classifier scores, so we retain only biospecimens with <= 1 copy of TP53 as high-confidence TP53 loss. NOTE: because WGS and WXS CNV calls are now processed differently (WGS uses 2/3 consensus workflow and WXS only uses WXS samples), the inputs are read in, subset, and then merged for analysis.

  • output file of this notebook: loss_overlap_domains_tp53.tsv
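The subset, merge, and filter steps can be sketched as below. The field names (gene_symbol, biospecimen_id, copy_number) and the example calls are hypothetical, and the functional-domain overlap performed by the notebook is not reproduced; this only illustrates the <= 1 copy threshold applied after merging the two call sets.

```python
def subset_tp53(calls):
    """Keep only the CNV calls annotated to TP53 (hypothetical gene_symbol field)."""
    return [call for call in calls if call["gene_symbol"] == "TP53"]


def high_confidence_tp53_loss(wgs_calls, wxs_calls, max_copies=1):
    """Subset each call set, merge them, and keep biospecimens at or below max_copies."""
    merged = subset_tp53(wgs_calls) + subset_tp53(wxs_calls)
    return sorted(
        {call["biospecimen_id"] for call in merged if call["copy_number"] <= max_copies}
    )


# Made-up calls for illustration only.
example_wgs = [
    {"biospecimen_id": "BS_1", "gene_symbol": "TP53", "copy_number": 1},
    {"biospecimen_id": "BS_2", "gene_symbol": "TP53", "copy_number": 2},
    {"biospecimen_id": "BS_3", "gene_symbol": "NF1", "copy_number": 0},
]
example_wxs = [
    {"biospecimen_id": "BS_4", "gene_symbol": "TP53", "copy_number": 0},
]
example_loss = high_confidence_tp53_loss(example_wgs, example_wxs)
```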

04-tp53-sv-loss.Rmd investigates structural variant breakpoints overlapping TP53 for CNV loss or low expression, in order to identify high-confidence TP53 loss via structural variants.

  • output files of this notebook: fusion_bk_tp53_loss.tsv and sv_overlap_tp53.tsv

05-tp53-altered-annotation.Rmd takes a deeper look at TP53 altered status, both with respect to the number of SNVs/CNVs (which may suggest bi-allelic mutations) and with respect to cancer_predisposition and TP53 classifier scores.

  • output file of this notebook: tp53_altered_status.tsv

| Column | Description |
|---|---|
| sample_id | 7316-XXXX id used to match DNA and RNA Kids_First_Biospecimen_IDs |
| Kids_First_Biospecimen_ID | Kids_First_Biospecimen_ID |
| match_id | Match id created to match experimental strategies |
| cancer_predispositions | Germline cancer predisposition status |
| tp53_score | TP53 loss classifier score |
| SNV_indel_counts | Number of deleterious SNVs found in DNA sample |
| CNV_loss_counts | Number of CNV losses found in DNA sample |
| HGVSp_Short | Short format of protein-level change used as SNV evidence |
| CNV_loss_evidence | copy_number of CNV overlapping functional domains of TP53 used as evidence |
| hotspot | Any 1 SNV shown in HGVSp_Short overlaps the MSKCC cancer hotspot database |
| activating | Any 1 SNV shown in HGVSp_Short overlaps TP53 activating mutations R273C and R248W |
| overlap_domain | Any 1 SNV causing a deletion shown in HGVSp_Short overlaps the TP53 DNA binding domain |
| tp53_altered | Combined evidence, cancer predisposition, and score-based TP53 status |

06-evaluate-classifier.py evaluates the classifier scores against TP53 alterations (non-synonymous SNVs and all status == "loss" calls in the consensus CNV file, from 00-tp53-nf1-alterations.R).

Because some of the classifier genes are not present in the OpenPedCan dataset, the scores should be interpreted as continuous values representing relative gene alterations and not as probabilities.

ROC curves for the TP53 classifier scores are saved in the results folder. We iterate through all RNA library types and save one plot per type. With TARGET samples, there are 3 RNA library types, so 3 plots are generated: poly-A_TP53.png, stranded_TP53.png, and exome_capture_TP53.png.
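To illustrate what the ROC evaluation measures, the sketch below computes ROC points and the area under the curve for real and shuffled scores against true alteration labels. It is a plain-Python illustration with made-up scores and labels, not the code in 06-evaluate-classifier.py, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    The threshold is swept from the highest score to the lowest; tied scores
    are handled together so they contribute a single point.
    """
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    previous_score = None
    for score, label in pairs:
        if previous_score is not None and score != previous_score:
            points.append((false_pos / n_neg, true_pos / n_pos))
        if label:
            true_pos += 1
        else:
            false_pos += 1
        previous_score = score
    points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up labels (1 = altered) and scores for illustration only.
true_labels = [1, 1, 0, 0]
real_scores = [0.9, 0.8, 0.3, 0.2]
shuffled_scores = [0.6, 0.4, 0.7, 0.3]

real_auroc = auroc(roc_points(real_scores, true_labels))
shuffled_auroc = auroc(roc_points(shuffled_scores, true_labels))
```

An area near 1 for the real scores alongside an area near 0.5 for the shuffled scores is the pattern expected from an informative classifier. Because the scores are relative values rather than probabilities, a rank-based measure like this is a natural fit.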