This analysis is DEPRECATED and was last run with OpenPedCan data release v12.
Summarize preprocessed Illumina Infinium Human Methylation array measurements produced by the OpenPedCan methylation-preprocessing module and Illumina Infinium methylation array CpG probe coordinates lifted over from GRCh37 to GRCh38 and annotated with the GENCODE v39 release currently used in OpenPedCan data analyses.
01-calculate-tpm-medians.R
script calculates representative cancer group gene-level and isoform-level TPM expression medians for subjects with both RNA-Seq and methylation data in a cohort.
Usage: Rscript --vanilla 01-calculate-tpm-medians.R [options]
Options:
--histologies=CHARACTER
Histologies file
--rnaseq_matrix=CHARACTER
OpenPedCan rnaseq tpm gene or isoform matrix file
--methyl_probe_annot=CHARACTER
Methyl gencode array probe annotations
--methyl_independent_samples=CHARACTER
OpenPedCan methyl independent biospecimen list file
--methyl_independent_samples=CHARACTER
OpenPedCan rnaseq independent biospecimen list file
--exp_values_values=CHARACTER
OpenPedCan expression matrix values: gene (default) and isoform
-h, --help
Show this help message and exit
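As a rough illustration of the median step, the sketch below groups samples by cohort and cancer group and takes the per-gene median TPM. The in-memory layout (plain dictionaries) and the function name `median_tpm_by_group` are assumptions made for this example only; the R script itself works from the histologies, matrix, and independent-sample files listed above.

```python
from collections import defaultdict
from statistics import median


def median_tpm_by_group(tpm, sample_groups):
    """Return {(cohort, cancer_group): {gene: median TPM}}.

    tpm: {gene: {sample_id: TPM}} (hypothetical in-memory layout)
    sample_groups: {sample_id: (cohort, cancer_group)} for samples that
        have both RNA-Seq and methylation data
    """
    # Collect each gene's TPM values per (cohort, cancer_group).
    values = defaultdict(lambda: defaultdict(list))
    for gene, per_sample in tpm.items():
        for sample_id, value in per_sample.items():
            group = sample_groups.get(sample_id)
            if group is not None:  # skip samples without a matched group
                values[group][gene].append(value)
    return {
        group: {gene: median(vals) for gene, vals in genes.items()}
        for group, genes in values.items()
    }


# Made-up values; sample S4 has no matched methylation data and is ignored.
tpm = {"MYCN": {"S1": 10.0, "S2": 30.0, "S3": 20.0, "S4": 5.0}}
sample_groups = {
    "S1": ("TARGET", "Neuroblastoma"),
    "S2": ("TARGET", "Neuroblastoma"),
    "S3": ("TARGET", "Neuroblastoma"),
}
medians = median_tpm_by_group(tpm, sample_groups)
print(medians[("TARGET", "Neuroblastoma")]["MYCN"])
```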
02-calculate-methly-quantiles.R
script calculates probe-level quantiles for either Beta-values or M-values methylation matrix.
Usage: Rscript --vanilla 02-calculate-methly-quantiles.R [options]
Options:
--histologies=CHARACTER
Histologies file
--methyl_matrix=CHARACTER
OpenPedCan methyl beta-values or m-values matrix file
--methyl_probe_annot=CHARACTER
Methyl gencode array probe annotations
--independent_samples=CHARACTER
OpenPedCan methyl independent biospecimen list file
--methyl_values=CHARACTER
OpenPedCan methyl matrix values: beta (default) and m
-h, --help
Show this help message and exit
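A minimal sketch of probe-level quantiles across samples follows. The quantile levels, the linear-interpolation rule, and the names `quantile` and `probe_quantiles` are assumptions for illustration; this README does not state which levels or interpolation the R script uses.

```python
import math


def quantile(sorted_vals, p):
    """Quantile of pre-sorted values by linear interpolation, 0 <= p <= 1."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = p * (len(sorted_vals) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_vals[lower] + (sorted_vals[upper] - sorted_vals[lower]) * frac


def probe_quantiles(values_by_probe, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {probe_id: [quantile at each level]}, ignoring missing values.

    values_by_probe: {probe_id: [Beta- or M-values across samples]}
    levels: hypothetical quantile levels, not taken from the R script
    """
    result = {}
    for probe_id, values in values_by_probe.items():
        present = sorted(v for v in values if v is not None)
        if present:  # probes with no measured values are dropped
            result[probe_id] = [quantile(present, p) for p in levels]
    return result


# Made-up Beta-values for one hypothetical probe; None marks a missing value.
betas = {"probe_a": [0.5, 0.1, None, 0.3, 0.2, 0.4]}
quantiles = probe_quantiles(betas)
print(quantiles["probe_a"])
```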
03-methyl-tpm-correlation.py
script calculates representative array probe to gene locus correlations and array probe to gene locus-isoforms correlations between RNA-Seq TPM-values and either Beta-values or M-values for each cancer group within a cohort for patients who have both datasets.
usage: python3 03-methyl-tpm-correlation.py [-h] [-m {beta,m}] [-e {gene,isoform}]
[-v] HISTOLOGY_FILE RNA_INDEPENDENT_SAMPLES METHYL_INDEPENDENT_SAMPLES
METHLY_MATRIX EXP_MATRIX PROBE_ANNOT
positional arguments:
HISTOLOGY_FILE OpenPedCan histologies file
RNA_INDEPENDENT_SAMPLES
OpenPedCan rnaseq independent biospecimen list file
METHYL_INDEPENDENT_SAMPLES
OpenPedCan methyl independent biospecimen list file
METHLY_MATRIX OpenPedCan methyl beta-values or m-values matrix file
EXP_MATRIX OpenPedCan expression matrix file
PROBE_ANNOT Methyl gencode array probe annotations
optional arguments:
-h, --help show this help message and exit
-m {beta,m}, --methyl_values {beta,m}
OpenPedCan methyl matrix values: beta (default) and m
-e {gene,isoform}, --exp_values {gene,isoform}
OpenPedCan expression matrix values: gene (default) and isoform
-v, --version Print the current 03-methyl-tpm-correlation.py version and exit
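The sketch below illustrates one way to correlate a probe's methylation values with a gene's TPM values over the patients present in both datasets. It assumes a Pearson correlation, which is not confirmed by this README, and the helper names are hypothetical. The Beta-to-M conversion uses the standard definition M = log2(Beta / (1 - Beta)).

```python
import math


def beta_to_m(beta):
    """Convert a Beta-value (0 < beta < 1) to an M-value."""
    return math.log2(beta / (1.0 - beta))


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def probe_gene_correlation(methyl, tpm):
    """Correlate {patient_id: value} maps over the patients found in both."""
    shared = sorted(set(methyl) & set(tpm))
    return pearson([methyl[p] for p in shared], [tpm[p] for p in shared])


# Made-up values: P4 lacks RNA-Seq data and P5 lacks methylation data.
methyl = {"P1": 0.2, "P2": 0.5, "P3": 0.8, "P4": 0.9}
tpm = {"P1": 40.0, "P2": 25.0, "P3": 10.0, "P5": 3.0}
r = probe_gene_correlation(methyl, tpm)
print(round(r, 6))
```

To correlate M-values instead, the Beta-values can be passed through `beta_to_m` before calling `probe_gene_correlation`.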
04-tpm-transcript-representation.py
script calculates the RNA-Seq expression (TPM) gene isoform (transcript) representation for patients who have samples in both the RNA-Seq and methylation datasets, as follows:
a) First calculate a Z score for each sample
$$Z_s = \frac{S_{G\_TPM} - \mu_{G\_TPM}}{sd_{G\_TPM}}$$
Where $Z_s$ is the sample gene expression Z-score, $S_{G\_TPM}$ is the gene expression value of that sample in TPM, $\mu_{G\_TPM}$ is the mean TPM expression of the gene in that cancer group, and $sd_{G\_TPM}$ is the standard deviation of the TPM expression of the gene in that cancer group.
b) Then calculate the weight. The inverse exponential function of the absolute Z-score gives a weight that decreases the further the sample's gene expression deviates from the mean. This way, outliers will not distort the calculation.
$$W_s = \frac{1}{e^{|Z_s|}}$$
Where $W_s$ is the weight assigned to the sample, and $Z_s$ is the sample's gene expression Z-score calculated in the previous step.
c) Finally, apply the weights to the transcript expressions and sum them to calculate the percent expression.
$$\mathrm{Transcript\_Representation} = \frac{\sum_s W_s \cdot TPM_{s,\mathrm{transcript}}}{\sum_s W_s \cdot TPM_{s,\mathrm{total}}}$$
Where $W_s$ is the weight assigned to the sample calculated in the previous step, $TPM_{s,\mathrm{transcript}}$ is the expression of that sample's individual transcript in TPM, and $TPM_{s,\mathrm{total}}$ is the total expression in TPM of all transcripts of that gene in that sample.
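Steps a) to c) can be sketched for a single gene within one cancer group as below. This is an illustration under stated assumptions, not the script's code: the dictionary layout and function name are hypothetical, the population standard deviation is assumed (the text does not say which estimator is used), and a zero standard deviation is treated as Z = 0 for every sample.

```python
import math
from statistics import mean, pstdev


def transcript_representation(gene_tpm, transcript_tpm):
    """Weighted transcript representation for one gene in one cancer group.

    gene_tpm: {sample_id: gene-level TPM}
    transcript_tpm: {transcript_id: {sample_id: transcript-level TPM}}
    Returns {transcript_id: weighted fraction of the gene's expression}.
    """
    samples = sorted(gene_tpm)
    gene_values = [gene_tpm[s] for s in samples]
    mu = mean(gene_values)
    sd = pstdev(gene_values)  # population SD: an assumption of this sketch

    # a) Z-score per sample, b) weight = 1 / e^|Z|
    weights = {}
    for s in samples:
        z = (gene_tpm[s] - mu) / sd if sd > 0 else 0.0
        weights[s] = math.exp(-abs(z))

    # Total TPM of all transcripts of the gene in each sample
    total = {
        s: sum(per_sample.get(s, 0.0) for per_sample in transcript_tpm.values())
        for s in samples
    }

    # c) weighted transcript sum over weighted total sum
    denom = sum(weights[s] * total[s] for s in samples)
    if denom == 0:
        return {t: 0.0 for t in transcript_tpm}
    return {
        t: sum(weights[s] * per_sample.get(s, 0.0) for s in samples) / denom
        for t, per_sample in transcript_tpm.items()
    }


# Made-up values: sample S3 is an expression outlier dominated by T2.
gene_tpm = {"S1": 10.0, "S2": 12.0, "S3": 50.0}
transcript_tpm = {
    "T1": {"S1": 8.0, "S2": 9.0, "S3": 5.0},
    "T2": {"S1": 2.0, "S2": 3.0, "S3": 45.0},
}
rep = transcript_representation(gene_tpm, transcript_tpm)
print({t: round(v, 3) for t, v in rep.items()})
```

The function returns fractions that sum to 1 across a gene's transcripts; multiplying by 100 gives a percentage. Because the outlier sample S3 is down-weighted, T1 receives a larger share than its unweighted share of 22/72.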
usage: python3 04-tpm-transcript-representation.py [-h] [-m {beta,m}] [-e {gene,isoform}]
[-v] HISTOLOGY_FILE RNA_INDEPENDENT_SAMPLES METHYL_INDEPENDENT_SAMPLES
GENE_EXP_MATRIX ISOFORM_EXP_MATRIX PROBE_ANNOT
positional arguments:
HISTOLOGY_FILE OpenPedCan histologies file
RNA_INDEPENDENT_SAMPLES
OpenPedCan rnaseq independent biospecimen list file
METHYL_INDEPENDENT_SAMPLES
OpenPedCan methyl independent biospecimen list file
GENE_EXP_MATRIX OpenPedCan gene expression matrix file
ISOFORM_EXP_MATRIX OpenPedCan isoform expression matrix file
PROBE_ANNOT Methyl gencode array probe annotations
-v, --version Print the current 04-tpm-transcript-representation.py version and exit
05-create-methyl-summary-table.R
script summarizes array probe quantiles, Beta/M-values correlations and gene annotations into gene locus and gene locus-isoforms methylation summary tables. The OpenPedCan API utilizes the summary tables to dynamically generate methylation plots displayed on the NCI MTP portal with the following columns:
- Gene_Symbol: gene symbol
- targetFromSourceId: Ensembl locus ID
- transcript_id: Ensembl locus-isoform ID (for isoform-level summary table only)
- Gene_Feature: GENCODE gene feature, i.e., promoter, 5' UTR, exon, intron, 3' UTR, and intergenic
- Dataset: OpenPedCan cohort, e.g., TARGET
- Disease: OpenPedCan cancer_group, e.g., Neuroblastoma
- diseaseFromSourceMappedId: EFO ID of OpenPedCan cancer_group
- MONDO: MONDO_ID of OpenPedCan cancer_group
- Median_TPM: representative cancer_group gene-level or isoform-level TPM expression medians in a cohort
- RNA_Correlation: array probe-level correlation between methylation Beta-values and RNA-Seq TPM values
- Transcript_Representation: RNA-Seq expression (tpm) percent transcript representation (for isoform-level summary table only)
- Probe_ID: Illumina Infinium HumanMethylation array probe ID for the CpG site
- Chromosome: chromosome for CpG site, e.g., chr1
- Location: genomic location of the CpG site
- Beta_Q1: array probe-level Beta Q1 quantile
- Beta_Q2: array probe-level Beta Q2 quantile
- Beta_Median: array probe-level Beta Q3 quantile (the median)
- Beta_Q4: array probe-level Beta Q4 quantile
- Beta_Q5: array probe-level Beta Q5 quantile
- datatypeId: pediatric_cancer
- chop_uuid: generated UUID
- datasourceId: chop_gene_level_methylation or chop_isoform_level_methylation
Usage: 05-create-methyl-summary-table.R [options]
Options:
--methyl_tpm_corr=CHARACTER
Methyl beta/m-values to tpm-values correlations results file
--methyl_probe_qtiles=CHARACTER
Methyl array probe beta/m-values quantiles results file
--methyl_probe_annot=CHARACTER
Methyl gencode array probe annotations
--rnaseq_tpm_medians=CHARACTER
RNA-Seq gene-level or isoform-level tpm median expression results file
--tpm_transcript_rep=CHARACTER
RNA-Seq expression (tpm) gene isoform (transcript) representation results file
--efo_mondo_annot=CHARACTER
OpenPedCan EFO and MONDO annotation file
--exp_values=CHARACTER
OpenPedCan expression matrix values: gene (default) and isoform
--methyl_values=CHARACTER
OpenPedCan methyl matrix values: beta (default) and m
-h, --help
Show this help message and exit
06-methly-summary-tsv2jsonl.py
script transforms tab-delimited methylation summary tables to JSONL (JSON-Line) format required for usage on the NCI MTP portal.
usage: python3 06-methly-summary-tsv2jsonl.py [-h] [-m {beta,m}] [-v]
GENE_SUMMARY_FILE ISOFORM_SUMMARY_FILE
positional arguments:
GENE_SUMMARY_FILE Gene-level methyl summary TSV file
ISOFORM_SUMMARY_FILE Isoform-level methyl summary TSV file
optional arguments:
-h, --help show this help message and exit
-m {beta,m}, --methyl_values {beta,m}
OpenPedCan methyl matrix values: beta (default) and m
-v, --version Print the current 06-methly-summary-tsv2jsonl.py version and exit
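The core of a TSV-to-JSONL conversion can be sketched as follows: each table row becomes one JSON object on its own line, keyed by the column names. The function name and the sample row are made up for illustration, and this sketch keeps every value as a string, which may differ from how the actual script types its fields.

```python
import csv
import io
import json


def tsv_to_jsonl(tsv_handle, jsonl_handle):
    """Write one JSON object per TSV row; return the number of rows written."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    count = 0
    for row in reader:
        jsonl_handle.write(json.dumps(row) + "\n")
        count += 1
    return count


# Made-up single-row table using a few of the summary table column names.
tsv_text = "Gene_Symbol\tProbe_ID\tBeta_Median\nMYCN\tprobe_a\t0.42\n"
out = io.StringIO()
n_rows = tsv_to_jsonl(io.StringIO(tsv_text), out)
first = json.loads(out.getvalue().splitlines()[0])
print(n_rows, first)
```

For the gzip-compressed files in the results directory, the same function can be given handles opened with `gzip.open(path, "rt")` for reading and `gzip.open(path, "wt")` for writing.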
run-methylation-summary.sh
is a wrapper bash script for executing all the other analysis scripts in the module. All file paths set in this script are relative to the module directory. Therefore, this script should always be run as if it were being called from the directory it lives in, the module directory (OpenPedCan-analysis/analyses/methylation-summary).
bash run-methylation-summary.sh
- Analyses involving 850k arrays with a large number of samples representing OpenPedCan cancer groups (as in the CBTN cohort) need a substantial amount of memory to run successfully. Where possible we utilized Rsqlite3 to reduce the memory footprint.
- On some computers the system /tmp is too small to hold the temporary files generated by the R scripts during analysis. Users are advised to create a ./tmp directory in the module directory and then run each R script with the TMP/TMPDIR environment variables prepended, as illustrated in the wrapper bash script, run-methylation-summary.sh.
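For readers driving the R scripts from Python rather than from the wrapper, the temporary-directory workaround might look like the sketch below. The function name, the `dry_run` switch, and the example option are hypothetical; the only behaviour relied on is that R picks its temporary directory from the TMPDIR, TMP, or TEMP environment variables.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def rscript_with_local_tmp(script, args, module_dir=".", dry_run=False):
    """Run an R script with TMP/TMPDIR pointing at <module_dir>/tmp.

    Returns the command list and the environment used, so that a dry run
    can be inspected without R installed.
    """
    tmp_dir = Path(module_dir).resolve() / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ, TMP=str(tmp_dir), TMPDIR=str(tmp_dir))
    cmd = ["Rscript", "--vanilla", script, *args]
    if not dry_run:
        subprocess.run(cmd, env=env, cwd=module_dir, check=True)
    return cmd, env


# Dry run inside a throwaway directory, so nothing is executed or left behind.
with tempfile.TemporaryDirectory() as module_dir:
    cmd, env = rscript_with_local_tmp(
        "02-calculate-methly-quantiles.R",
        ["--methyl_values=beta"],
        module_dir=module_dir,
        dry_run=True,
    )
    print(cmd)
    print(env["TMPDIR"])
```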
The methylation beta-values and M-values matrices are available on the CHOP HPC Isilon server (location: /mnt/isilon/opentargets/wafulae/methylation-preprocessing/results/). Please contact Avin Farrel (@afarrel) for access if they are not already available for download using the OpenPedCan data release download script.
../../data/infinium.gencode.v39.probe.annotations.tsv.gz
../../data/independent-specimens.rnaseqpanel.eachcohort.tsv
../../data/independent-specimens.methyl.eachcohort.tsv
../../data/gene-expression-rsem-tpm-collapsed.rds
../../data/rna-isoform-expression-rsem-tpm.rds
../../data/methyl-beta-values.rds
../../data/efo-mondo-map.tsv
../../data/histologies.tsv
Most analysis result files exceed the size limit for pushing to a GitHub repository. All result files are available on the CHOP HPC Isilon server (location: /mnt/isilon/opentargets/wafulae/methylation-summary/results/). Please contact Avin Farrel (@afarrel) for access.
results/methyl-probe-annotations.tsv.gz
results/methyl-probe-beta-quantiles.tsv.gz
results/gene-methyl-probe-beta-tpm-correlations.tsv.gz
results/isoform-methyl-probe-beta-tpm-correlations.tsv.gz
results/gene-median-tpm-expression.tsv.gz
results/isoform-median-tpm-expression.tsv.gz
results/methyl-tpm-transcript-representation.tsv.gz
results/gene-methyl-beta-values-summary.rds
results/gene-methyl-beta-values-summary.tsv.gz
results/gene-methyl-beta-values-summary.jsonl.gz
results/isoform-methyl-beta-values-summary.rds
results/isoform-methyl-beta-values-summary.tsv.gz
results/isoform-methyl-beta-values-summary.jsonl.gz