Komal Rathi, Run Jin, Yuanchao Zhang, Eric Wafula, Sangeeta Shukla, Jo Lynne Rokita
There are two modes for generating independent sample lists:
-
Release mode: Pre-release independent sample lists are generated using the
histologies-base.tsv
file, and these lists are used to run modules specific to release (eg:fusion_filtering
).OPENPBTA_BASE_SUBTYPING=1 bash run-independent-samples.sh
-
Analysis mode: Independent sample lists are generated per cohort and per cohort and cancer group, which requires
histolologies.tsv
.OPENPBTA_BASE_SUBTYPING=0 bash run-independent-samples.sh
01-generate-independent-specimens-wgs-only.R
: Generate tables of WGS-only independent specimens where no two specimens are chosen from the same individual.01-generate-independent-specimens-wgs-preferred.R
: Generate tables of WGS-preferred independent specimens where no two specimens are chosen from the same individual.01-generate-independent-specimens-wxs-preferred.R
: Generate tables of WXS-preferred independent specimens where no two specimens are chosen from the same individual.02-generate-independent-rnaseq.R
: Generate tables of independent rna-seq specimens.03-qc-independent-samples.Rmd
: Markdown to tabulate number of biospecimen ids for same participant ids from each output file.04-generate-independent-specimens-rnaseq-pre-release.R
: Generate tables of RNA-Seq-only independent specimens where no two specimens are chosen from the same individual. (Note: these tables will only be used for thefusion_filtering
module runs pre-release).05-generate-independent-specimens-methyl.R
: Generate tables of DNA methylation array independent specimens where no two specimens are chosen from the same individual.
.
├── 00-repeated-samples.Rmd
├── 00-repeated-samples.nb.html
├── 01-generate-independent-specimens-wgs-only.R
├── 01-generate-independent-specimens-wgs-preferred.R
├── 01-generate-independent-specimens-wxs-preferred.R
├── 02-generate-independent-rnaseq.R
├── 03-qc-independent-samples.Rmd
├── 03-qc-independent-samples.nb.html
├── 04-generate-independent-specimens-rnaseq-pre-release.R
├── 05-generate-independent-specimens-methyl.R
├── README.md
├── results
│ ├── independent-specimens.methyl.primary-plus.eachcohort.tsv
│ ├── independent-specimens.methyl.primary-plus.tsv
│ ├── independent-specimens.methyl.primary.eachcohort.tsv
│ ├── independent-specimens.methyl.primary.tsv
│ ├── independent-specimens.methyl.relapse.eachcohort.tsv
│ ├── independent-specimens.methyl.relapse.tsv
│ ├── independent-specimens.rnaseq.primary-plus-pre-release.tsv
│ ├── independent-specimens.rnaseq.primary-pre-release.tsv
│ ├── independent-specimens.rnaseq.relapse-pre-release.tsv
│ ├── independent-specimens.rnaseqpanel.primary-plus.eachcohort.tsv
│ ├── independent-specimens.rnaseqpanel.primary-plus.tsv
│ ├── independent-specimens.rnaseqpanel.primary.eachcohort.tsv
│ ├── independent-specimens.rnaseqpanel.primary.tsv
│ ├── independent-specimens.rnaseqpanel.relapse.eachcohort.tsv
│ ├── independent-specimens.rnaseqpanel.relapse.tsv
│ ├── independent-specimens.wgs.primary-plus.eachcohort.tsv
│ ├── independent-specimens.wgs.primary-plus.tsv
│ ├── independent-specimens.wgs.primary.eachcohort.tsv
│ ├── independent-specimens.wgs.primary.tsv
│ ├── independent-specimens.wgs.relapse.eachcohort.tsv
│ ├── independent-specimens.wgs.relapse.tsv
│ ├── independent-specimens.wgswxspanel.primary-plus.eachcohort.prefer.wgs.tsv
│ ├── independent-specimens.wgswxspanel.primary-plus.eachcohort.prefer.wxs.tsv
│ ├── independent-specimens.wgswxspanel.primary-plus.prefer.wgs.tsv
│ ├── independent-specimens.wgswxspanel.primary-plus.prefer.wxs.tsv
│ ├── independent-specimens.wgswxspanel.primary.eachcohort.prefer.wgs.tsv
│ ├── independent-specimens.wgswxspanel.primary.eachcohort.prefer.wxs.tsv
│ ├── independent-specimens.wgswxspanel.primary.prefer.wgs.tsv
│ ├── independent-specimens.wgswxspanel.primary.prefer.wxs.tsv
│ ├── independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wgs.tsv
│ ├── independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wxs.tsv
│ ├── independent-specimens.wgswxspanel.relapse.prefer.wgs.tsv
│ └── independent-specimens.wgswxspanel.relapse.prefer.wxs.tsv
├── run-independent-samples.sh
└── util
├── independent-dna-samples.R
├── independent_rna_samples.R
└── independent-methyl-samples.R
Many analyses that involve mutation frequencies or co-occurence require that all samples be independent. However, the OpenPedCan
data set includes many cases where multiple biospecimens were taken from a single participant. This analysis creates lists of samples such that there are no cases where more than one biospecimen is included from each participant.
As different analyses may require different sets of data, we generate a few different sets, stored in the results
subdirectory. We run the analyses based on different independent_level
, either each-cohort
or all-cohorts
. For each-cohort
analysis, we generate a list of independent samples using Kids_First_Participant_ID
unique within each cohort + cancer_group
combination. In this case, same Kids_First_Participant_ID
across different cohorts is treated as independent
. For all-cohorts
analysis, we generate a list of independent samples using Kids_First_Participant_ID
regardless of cohort
or cancer_group
and only one occurence of Kids_First_Participant_ID
is chosen from the entire dataset.
These lists contain only WGS samples:
- All-cohorts specific lists
- Primary specimens only with whole genome sequence (WGS):
independent-specimens.wgs.primary.tsv
- Relapse specimens with WGS:
independent-specimens.wgs.relapse.tsv
- Primary and relapse specimens with WGS:
independent-specimens.wgs.primary-plus.tsv
- Each-cohort specific lists
- Primary specimens only with whole genome sequence (WGS):
independent-specimens.wgs.primary.eachcohort.tsv
- Relapse specimens with WGS:
independent-specimens.wgs.relapse.eachcohort.tsv
- Primary and relapse specimens with WGS:
independent-specimens.wgs.primary-plus.eachcohort.tsv
For WGS-preferred lists, when a Kids_First_Participant_ID
is associated with multiple experimental_strategy
values i.e. WGS
, WXS
or Targeted Sequencing
, priority is given to a single randomly chosen WGS
biospecimen first, followed by a single randomly chosen WXS
and lastly a single randomly chosen Targeted Sequencing
sample.
After randomizing the histology file subset to tumor samples:
tumor_samples <- histology_df %>%
dplyr::filter(sample_type == "Tumor",
composition != "Derived Cell Line",
experimental_strategy %in% c("WGS", "WXS", "Targeted Sequencing"),
!grepl("Metastatic secondary tumors", pathology_diagnosis, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE))
For WGS-preferred lists, we first subset the tumor samples to WGS
samples only, pass the subsetted samples to the independent-samples.R
function to generate a WGS-specific
list. This list only contains a single occurence of Kids_First_Participant_ID
associated to the experimental_strategy = WGS
. Next, we subset the tumor samples to WXS
and Targeted Sequencing
samples only, pass the subsetted samples to the independent-samples.R
function to generate a WXS/Panel specific
list. This list only contains a single occurence of Kids_First_Participant_ID
associated to either experimental_strategy = WXS
or experimental_strategy = Targeted Sequencing
. Then we merge the two lists (keeping WGS
list first and WXS/Panel
as second) and take a dplyr::distinct
to only get the first occurence of Kids_First_Participant_ID
. Because we keep the WGS
specific list first when calling dplyr::distinct
, the WGS
associated biospecimens will be preferred over other biospecimens for multiple occurrences of Kids_First_Participant_ID
.
- All-cohorts specific lists
- Primary specimens only with either WGS or whole exome sequence (WXS) or Panel:
independent-specimens.wgswxspanel.primary.prefer.wgs.tsv
- Relapse specimens only with either WGS or whole exome sequence (WXS) or Panel:
independent-specimens.wgswxspanel.relapse.prefer.wgs.tsv
- Primary and relapse specimens with WGS or WXS or Panel:
independent-specimens.wgswxspanel.primary-plus.prefer.wgs.tsv
- Each-cohort specific lists
- Primary specimens only with either WGS or whole exome sequence (WXS) or Panel:
independent-specimens.wgswxspanel.primary.eachcohort.prefer.wgs.tsv
- Relapse specimens only with either WGS or whole exome sequence (WXS) or Panel:
independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wgs.tsv
- Primary and relapse specimens with WGS or WXS or Panel:
independent-specimens.wgswxspanel.primary-plus.eachcohort.prefer.wgs.tsv
Additionally, similar independent lists that consist of all DNA experimental_strategy
were generated prioriting WXS
specimens when both WXS and WGS samples were available - this independent list will be used for analyzing SNV datasets since WXS gives higher coverage for coding regions and hence, better chance of detecting SNV.
For WXS-preferred lists, when a Kids_First_Participant_ID
is associated with multiple experimental_strategy
values i.e. WGS
, WXS
or Targeted Sequencing
, priority is given to a single randomly chosen WXS
biospecimen first, followed by a single randomly chosen WGS
and lastly a single randomly chosen Targeted Sequencing
sample.
After randomizing the histology file subset to tumor samples:
tumor_samples <- histology_df %>%
dplyr::filter(sample_type == "Tumor",
!composition %in% c("Derived Cell Line", "PDX"),
experimental_strategy %in% c("WGS", "WXS", "Targeted Sequencing"),
!grepl("Metastatic secondary tumors", pathology_diagnosis, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE))
For WXS-preferred lists, we first subset the tumor samples to WXS
samples only, pass the subsetted samples to the independent-samples.R
function to generate a WXS-specific
list. This list only contains a single occurence of Kids_First_Participant_ID
associated to the experimental_strategy = WXS
. Next, we subset the tumor samples to WGS
and Targeted Sequencing
samples only, pass the subsetted samples to the independent-samples.R
function to generate a WGS/Panel specific
list. This list only contains a single occurence of Kids_First_Participant_ID
associated to either experimental_strategy = WGS
or experimental_strategy = Targeted Sequencing
. Then we merge the two lists (keeping WXS
list first and WGS/Panel
as second) and take a dplyr::distinct
to only get the first occurence of Kids_First_Participant_ID
. Because we keep the WXS
specific list first when calling dplyr::distinct
, the WXS
associated biospecimens will be preferred over other biospecimens for multiple occurrences of Kids_First_Participant_ID
.
- All-cohorts specific lists
- Primary specimens only with either whole exome sequence (WXS) or WGS or Panel:
independent-specimens.wgswxspanel.primary.prefer.wxs.tsv
- Relapse specimens only with either whole exome sequence (WXS) or WGS or Panel:
independent-specimens.wgswxspanel.relapse.prefer.wxs.tsv
- Primary and relapse specimens with WXS or WGS or Panel:
independent-specimens.wgswxspanel.primary-plus.prefer.wxs.tsv
- Each-cohort specific lists
- Primary specimens only with either whole exome sequence (WXS) or WGS or Panel:
independent-specimens.wgswxspanel.primary.eachcohort.prefer.wxs.tsv
- Relapse specimens only with either whole exome sequence (WXS) or WGS or Panel:
independent-specimens.wgswxspanel.relapse.eachcohort.prefer.wxs.tsv
- Primary and relapse specimens with WXS or WGS or Panel:
independent-specimens.wgswxspanel.primary-plus.eachcohort.prefer.wxs.tsv
Simiarly, for independent RNA samples, we also run with either all-cohorts
or each-cohort
.
When run with each-cohort
, independent DNA samples ran with each-cohort
was used as starting point (see code for details) and when run with all-cohorts
, independent DNA samples ran with all-cohorts
was used as starting point.
When multiple RNA-Seq samples exist per participant, the script matches the independent whole genome or whole exome sample_ids to gather matched RNA-Seq sample. If participant has only RNA-Seq sample then a primary (and relapse if applicable) sample is randomly selected per participant per cancer group per cohort.
- All-cohorts specific lists
- Primary RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq:
independent-specimens.rnaseqpanel.primary.tsv
andindependent-specimens.rnaseq.primary.pre-release.tsv
- Relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq:
independent-specimens.rnaseqpanel.relapse.tsv
andindependent-specimens.rnaseq.relapse.pre-release.tsv
- Primary and Relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq:
independent-specimens.rnaseqpanel.primary-plus.tsv
andindependent-specimens.rnaseq.primary-plus.pre-release.tsv
Note: *pre-release.tsv
files are generated pre-release, require histologies-base.tsv
, and are only being used to run fusion_filtering
.
- Each-cohort specific lists
- Primary RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq:
independent-specimens.rnaseqpanel.primary.eachcohort.tsv
- Relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq:
independent-specimens.rnaseqpanel.relapse.eachcohort.tsv
- Primary and Relapse RNA-Seq specimens matching WGS/WXS/Panel independent sample_ids plus only-RNA-Seq:
independent-specimens.rnaseqpanel.primary-plus.eachcohort.tsv
These lists contain independent methylation array samples:
- All-cohorts specific lists
- Primary methylation array specimens:
independent-specimens.methyl.primary.tsv
- Relapse methylation array specimens:
independent-specimens.methyl.relapse.tsv
- Primary and relapse methylation array specimens:
independent-specimens.methyl.primary-plus.tsv
- Each-cohort specific lists
- Primary specimens methylation array specimens:
independent-specimens.methyl.primary.eachcohort.tsv
- Relapse specimens methylation array specimens:
independent-specimens.methyl.relapse.eachcohort.tsv
- Primary and relapse methylation array specimens:
independent-specimens.methyl.primary-plus.eachcohort.tsv
To generate the independent sample lists and associated analysis of redundancies in the overall data set, run the following script from the project root directory:
bash analyses/independent-samples/run-independent-samples.sh
- When presented with more than one specimen from a given individual with a specific cancer group and cohort, the script selects the first occurrence of the individual so as to include only one specimen, with preference for primary tumors and whole genome sequences where available.
- The input histology file is randomized using a seed before using as input in order to avoid any selection bias. Using a seed allows for reproducibility of the randomized histology file.
- There is also a preference for the earliest collected samples, but as this data is not currently available, that code is currently deleted.