In this repo we host the code to generate the data and figures for the paper "Reproducible processing of TCGA regulatory networks".
All the data is generated with the tcga-data-nf workflow. This folder holds sample files and analyses that can be run thanks to the pipeline.
.
├── LICENSE
├── README.md
├── config # sample configuration files
├── data
│ ├── conf
│ │ └── coad-subtype/ # configuration files for the COAD subtype application in the paper
│ └── external
│ ├── coad-subtype # subtype assignment for each TCGA-COAD sample
│ └── reactome_slim # reactome SLIM pathways used in the paper
├── envs # conda environments
├── notebooks
│ ├── colon_subtype_dragon.ipynb # DRAGON results in the paper
│ ├── colon_subtype_panda.ipynb # PANDA results in the paper
│ └── src # reusable functions
└── results # folder where all results are generated
First, we ran the full tcga-data-nf workflow with the configuration in coad_subtype.config and the metadata in full_coad_subtypes.json.
$ nextflow run tcga-data-nf -profile conda --pipeline full -c coad_subtype.config
Results are stored in the results/batch-coad-subtype-20240510/ folder, which has the following structure:
├── tcga_coad_cms1
│ ├── analysis
│ │ ├── dragon
│ │ └── panda
│ ├── data_download
│ │ ├── clinical
│ │ ├── cnv
│ │ ├── methylation
│ │ ├── mutations
│ │ └── recount3
│ └── data_prepared
│ ├── methylation
│ └── recount3
├── tcga_coad_cms2
│ ...
├── tcga_coad_cms3
│ ...
└── tcga_coad_cms4
...
For each subtype, you'll find the downloaded data (data_download), the prepared data (data_prepared), and the networks (analysis).
The notebooks reproduce the results in the paper. To run the code in them, you need the pre-processed DRAGON and PANDA networks.
You can either download the batch-coad-subtype-20240510 folder or run the workflow again to generate all the data.
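Before opening the notebooks, it can help to confirm that the network folders are in place. The sketch below is only an illustration: the names `SUBTYPES`, `NETWORK_SUBFOLDERS` and `missing_network_folders` are hypothetical, and it assumes that every subtype folder contains the same analysis/dragon and analysis/panda sub-folders shown for tcga_coad_cms1.

```python
from pathlib import Path

# Subtype folders listed in the results tree.
SUBTYPES = ["cms1", "cms2", "cms3", "cms4"]

# Assumption: each subtype has the same network sub-folders as tcga_coad_cms1.
NETWORK_SUBFOLDERS = ["analysis/dragon", "analysis/panda"]


def missing_network_folders(batch_dir):
    """Return the expected network folders that do not exist under batch_dir."""
    batch_dir = Path(batch_dir)
    missing = []
    for subtype in SUBTYPES:
        for rel in NETWORK_SUBFOLDERS:
            folder = batch_dir / f"tcga_coad_{subtype}" / rel
            if not folder.is_dir():
                missing.append(folder)
    return missing
```

Calling `missing_network_folders("results/batch-coad-subtype-20240510")` returns an empty list when all eight folders are present; otherwise it lists the paths that still need to be downloaded or regenerated.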
The data for this repo can be found on the Harvard Dataverse: Replication Data for: tcga-data-nf
@data{DVN/MCSSYJ_2024,
author = {Fanfani, Viola},
publisher = {Harvard Dataverse},
title = {{Replication Data for: tcga-data-nf}},
UNF = {UNF:6:TYixGNR1fJyPs/vReFVaPQ==},
year = {2024},
version = {V1},
doi = {10.7910/DVN/MCSSYJ},
url = {https://doi.org/10.7910/DVN/MCSSYJ}
}
Data on AWS: tcga-data-nf-procumputed.
To view and download this data, you need an active AWS account (a free-tier one should suffice). For any additional help, please contact vfanfani@hsph.harvard.edu.
We'll keep an up-to-date list of example configuration files inside the config folder.
For the most recent structure of the configuration files, always refer to the tests inside the tcga-data-nf repository.
For examples of configuration files for a full analysis you can refer to those we used for the colon cancer application:
- Pipeline configurations:
data/conf/coad-subtype/coad_subtype.config
- Data configurations:
data/conf/coad-subtype/full_coad_subtypes.json
We list here the configuration files we used to download data from TCGA. These are also available alongside the data on AWS.
First round downloads:
:warning: These configuration files follow an older metadata structure, but they still include all the information needed to understand what was downloaded.
- Clinical data:
config/download_clinical_tcgabiolinks_firstround.config
- Gene Expression:
config/download_expression_recount3_firstround.config
- Mutations:
config/download_mutation_tcgabiolinks_firstround.config
- Methylation:
config/download_methylation_firstround.config
Files are at:
New Methylation:
GDC data went through some ID changes/downgrading to legacy, so we re-downloaded and prepared all methylation data:
Configuration file: conf/download_methylation.json
We have pre-processed gene expression data for the following tumor types: BRCA, COAD, DLBC, KIRC, LAML, LIHC, PRAD, PAAD, SKCM, STAD, LUAD, LUSC.
Configuration file (tcga-data-nf (0.0.10)): conf/expression_prepare.conf
Output file names encode the parameters used to generate them, for example:
recount3_tcga_coad_purity06_normlogtpm_mintpm1_fracsamples000001_tissuetumor_batchtcgagdcplatform_adjtcgagdcplatform.txt
For instance, the file above is in logTPM, keeps genes with at least 1 TPM in at least a 0.000001 fraction of samples (in practice this filters out only 'all-zero' genes), and has been corrected for gdc-platform.
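As a rough illustration of this naming scheme, the sketch below splits such a file name into its parameter tokens. The pattern is inferred from the single example above and is not taken from the tcga-data-nf code; `FILENAME_PATTERN` and `parse_prepared_filename` are hypothetical names. Values are kept as raw strings (for example `"000001"`) because the rule for turning them back into numbers is not stated here.

```python
import re

# Assumption: all prepared expression files follow the layout of the example name.
FILENAME_PATTERN = re.compile(
    r"^recount3_tcga_(?P<tumor>[a-z0-9]+)"
    r"_purity(?P<purity>\d+)"
    r"_norm(?P<norm>[a-z]+)"
    r"_mintpm(?P<mintpm>\d+)"
    r"_fracsamples(?P<fracsamples>\d+)"
    r"_tissue(?P<tissue>[a-z]+)"
    r"_batch(?P<batch>[a-z]+)"
    r"_adj(?P<adj>[a-z]+)\.txt$"
)


def parse_prepared_filename(name):
    """Return the parameter tokens encoded in a prepared-expression file name."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    # Raw strings only: no attempt is made to decode them into numbers.
    return match.groupdict()
```

For the example file, the parser yields `tumor="coad"`, `norm="logtpm"`, `tissue="tumor"`, and `batch` and `adj` both equal to `"tcgagdcplatform"`.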
We have pre-processed methylation data for the following tumor types: BRCA, COAD, DLBC, KIRC, LAML, LIHC, PRAD, PAAD, SKCM, STAD, LUAD, LUSC.
Configuration file (tcga-data-nf (0.0.13)): conf/methylation_prepare.conf
We generated PANDA and LIONESS networks for 10 solid cancers: BRCA, COAD, KIRC, LIHC, LUAD, LUSC, PAAD, PRAD, SKCM, STAD.
We have used the prepared data with:
- purity: 03
- normalization: logcpm
- gene filters: mintpm1, fracsamples01
- tissues: tissueall
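
If you need to pick out the prepared files that match these settings, a small filter like the one below may help. It is a sketch under assumptions: it presumes the settings appear in file names as underscore-separated tokens, spelled as in the list above with `purity` and `norm` prefixes added by analogy with other prepared file names. `NETWORK_INPUT_TOKENS` and `uses_network_settings` are hypothetical names, and the file name in the usage note is made up for illustration.

```python
# Settings listed above, written as file-name tokens (assumed spelling).
NETWORK_INPUT_TOKENS = [
    "purity03",
    "normlogcpm",
    "mintpm1",
    "fracsamples01",
    "tissueall",
]


def uses_network_settings(filename, tokens=NETWORK_INPUT_TOKENS):
    """Return True if every settings token appears in the file name."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension, if any
    parts = stem.split("_")
    return all(token in parts for token in tokens)
```

For example, a hypothetical name such as `recount3_tcga_brca_purity03_normlogcpm_mintpm1_fracsamples01_tissueall.txt` passes the filter, while a file prepared with `purity06` does not.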
- Viola Fanfani, vfanfani@hsph.harvard.edu