This repository contains the simulated data, the results of all the methods considered in the comparison, and the results of HATCHet (HATCHet repository) on the published whole-genome multi-sample tumor sequencing datasets. All of these are described in the HATCHet paper:
Simone Zaccaria and Ben Raphael, 2018
All the simulated data are included in the folder `simulation`. The simulated data comprise 256 mixed samples with 2-3 tumor clones for 64 patients (3-5 samples per patient), half with a whole-genome duplication (WGD). These simulated data have been generated using MASCoTE, which is also described in the HATCHet paper and is available here:
Due to space limitations, we are unable to publish the sequencing reads from all samples in this repository. Instead, for every patient we provide a BB file which encodes the read-depth ratio (RDR) and B-allele frequency (BAF) for every genomic bin of the reference genome in every sample. The BB files are produced by the pre-processing steps of HATCHet and summarize all the basic input data needed by CNA-inference methods.
All these data are reported in the folder `data`. The patients are divided between those with a tumor without a WGD, in the `noWGD` folder, and those with a tumor with a WGD, in the `WGD` folder. In both cases, a dataset corresponds to a collection of clones and is named in the format `dataset_nX_sY`, where `X` is the number of tumor clones (in addition to a normal diploid clone) and `Y` is the random seed used by MASCoTE for reproducibility. The copy-number profiles of the tumor clones and the corresponding phylogenetic tree with CNAs and WGDs are reported in the following two files, contained in the related subfolder `tumor`:
Filename | Description | Format |
---|---|---|
`copynumber.cvs` | The allele and clone-specific copy-number profiles resulting from the CNAs and WGDs simulated by MASCoTE | The file is a tab-separated file with the following fields: |
`tumor.dot` | The phylogenetic tree describing the tumor evolution, with a node for every clone and edges labeled by the corresponding CNAs and WGDs | The phylogenetic tree is encoded in the DOT format. The mutations are given in the following formats: |
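The dataset naming convention `dataset_nX_sY` described above can be unpacked programmatically. The following is a minimal sketch, assuming that `X` and `Y` are always written as plain non-negative integers; the helper name `parse_dataset_name` is hypothetical and not part of HATCHet or MASCoTE.

```python
import re

# Pattern for the folder naming convention dataset_nX_sY.
# Assumption: X and Y are plain non-negative integers.
_DATASET_RE = re.compile(r"^dataset_n(\d+)_s(\d+)$")


def parse_dataset_name(name):
    """Return (number of tumor clones, MASCoTE random seed) for a dataset folder name."""
    match = _DATASET_RE.match(name)
    if match is None:
        raise ValueError(f"not a dataset folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))
```

For instance, a folder named `dataset_n2_s4541` (a made-up name used only for illustration) would yield 2 tumor clones and seed 4541.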
Each dataset includes two patients and, for each patient, a BB file describes the RDR and BAF of every genomic bin in all samples of that patient. The name of each BB file specifies the number of samples for the patient, the number of clones, and the clone proportions. More specifically, the BB filename is a `_`-separated list where the first element, preceded by the letter `k`, specifies the number of samples and each other element specifies the clone proportions of one sample, listed such that the first is the proportion of the normal diploid clone and the proportions of the tumor clones follow in order. The name of a sample is a `_`-separated list which starts with the word `bulk`, and each following element specifies the proportion (without the dot) of one clone.
For example, `k4_01090_02008_00506035_00504055.bb.gz` is a BB file for a patient with 4 samples which include 2 tumor clones (`clone0` and `clone1`) and a normal diploid clone `normal`. In particular, the samples have the following clonal compositions:
Name of sample | `normal` proportion | `clone0` proportion | `clone1` proportion |
---|---|---|---|
`bulk_01normal_09clone0_Noneclone1` | 0.1 | 0.9 | Not present |
`bulk_02normal_Noneclone0_08clone1` | 0.2 | Not present | 0.8 |
`bulk_005normal_06clone0_035clone1` | 0.05 | 0.6 | 0.35 |
`bulk_005normal_04clone0_055clone1` | 0.05 | 0.4 | 0.55 |
As another example, `k7_040600_010090_020008_0103060_0205003_0100504_01030303.bb.gz` is a BB file for a patient with 7 samples which include 3 tumor clones (`clone0`, `clone1`, and `clone2`) and a normal diploid clone `normal`. In particular, the samples have the following clonal compositions:
Name of sample | `normal` proportion | `clone0` proportion | `clone1` proportion | `clone2` proportion |
---|---|---|---|---|
`bulk_04normal_06clone0_Noneclone1_Noneclone2` | 0.4 | 0.6 | Not present | Not present |
`bulk_01normal_Noneclone0_09clone1_Noneclone2` | 0.1 | Not present | 0.9 | Not present |
`bulk_02normal_Noneclone0_Noneclone1_08clone2` | 0.2 | Not present | Not present | 0.8 |
`bulk_01normal_03clone0_06clone1_Noneclone2` | 0.1 | 0.3 | 0.6 | Not present |
`bulk_02normal_05clone0_Noneclone1_03clone2` | 0.2 | 0.5 | Not present | 0.3 |
`bulk_01normal_Noneclone0_05clone1_04clone2` | 0.1 | Not present | 0.5 | 0.4 |
`bulk_01normal_03clone0_03clone1_03clone2` | 0.1 | 0.3 | 0.3 | 0.3 |
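The sample naming convention illustrated by the two examples above can be decoded mechanically. The following is a minimal sketch, assuming that every element after `bulk` is either `None` or a digit string followed by `normal` or `cloneN`, and that the dot omitted from a proportion always belongs after its first digit; the helper name `parse_sample_name` is hypothetical.

```python
import re

# One element of a sample name: "None" or the digits of a proportion,
# followed by the clone label ("normal" or "clone<N>").
_ELEMENT_RE = re.compile(r"^(None|\d+)(normal|clone\d+)$")


def parse_sample_name(sample):
    """Map each clone label to its proportion (None when the clone is absent)."""
    parts = sample.split("_")
    if parts[0] != "bulk":
        raise ValueError(f"sample name does not start with 'bulk': {sample!r}")
    composition = {}
    for element in parts[1:]:
        match = _ELEMENT_RE.match(element)
        if match is None:
            raise ValueError(f"unexpected element {element!r} in {sample!r}")
        digits, clone = match.groups()
        if digits == "None":
            composition[clone] = None
        else:
            # Assumption: the dot goes back after the first digit ("005" -> 0.05).
            composition[clone] = float(digits[0] + "." + digits[1:])
    return composition
```

Decoding the sample names rather than the BB filename avoids an ambiguity: in the filename the proportions of a sample are concatenated without separators, so splitting them requires knowing the number of clones.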
Each BB file corresponds to a patient and is a tab-separated file describing the RDR and BAF of every genomic bin in all samples in the following format:
Field | Description |
---|---|
`CHR` | Name of a chromosome |
`START` | Starting genomic position of a genomic bin in `CHR` |
`END` | Ending genomic position of a genomic bin in `CHR` |
`SAMPLE` | Name of a tumor sample |
`RD` | RDR of the bin in `SAMPLE` |
`#SNPS` | Number of SNPs present in the bin in `SAMPLE` |
`COV` | Average coverage in the bin in `SAMPLE` |
`ALPHA` | Alpha parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically the total number of reads from the A allele |
`BETA` | Beta parameter related to the binomial model of BAF for the bin in `SAMPLE`, typically the total number of reads from the B allele |
`BAF` | BAF of the bin in `SAMPLE` |
Due to space limitations, each BB file has been compressed using `gzip` with compression level 9. The file can be decompressed with the command `gzip -d BBFILE`.
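A BB file can also be read without decompressing it on disk. The following is a minimal sketch using only the Python standard library. It assumes that the columns appear in the order of the table above and that a header line, if present, starts with `#`; both are assumptions to verify against the actual files. The function name `read_bb` is hypothetical.

```python
import csv
import gzip

# Column order taken from the field table above (assumed to match the files).
BB_FIELDS = ["CHR", "START", "END", "SAMPLE", "RD", "#SNPS", "COV", "ALPHA", "BETA", "BAF"]
_INT_FIELDS = ("START", "END")
_FLOAT_FIELDS = ("RD", "#SNPS", "COV", "ALPHA", "BETA", "BAF")


def read_bb(path):
    """Yield one dict per (genomic bin, sample) row of a gzip-compressed BB file."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and an assumed '#'-prefixed header line.
            if not row or row[0].startswith("#"):
                continue
            record = dict(zip(BB_FIELDS, row))
            for field in _INT_FIELDS:
                record[field] = int(record[field])
            for field in _FLOAT_FIELDS:
                record[field] = float(record[field])
            yield record
```

Because the function is a generator, rows are streamed one at a time, which keeps memory use low for whole-genome files with many bins and samples.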
HATCHet has been compared with 4 current state-of-the-art methods for CNA inference:
Method | Reference | Repository |
---|---|---|
Battenberg | (Nik-Zainal et al., Cell, 2012) | cgpBattenberg and Wedge-Oxford Battenberg |
TITAN | (Ha et al., Genome Research, 2014) | TitanCNA |
THetA | (Oesper et al., Genome Biology, 2013) | THetA/THetA2 |
cloneHD | (Fischer et al., Cell Reports, 2014) | cloneHD |
Each of these methods and HATCHet has been applied on the simulated samples. More specifically, Battenberg, TITAN, and THetA have been applied on each sample individually, cloneHD has been applied jointly on all samples from the same patient, and HATCHet has been applied both on each sample individually (single-sample HATCHet) and jointly on all samples from the same patient. We consider two different settings when executing the methods on simulated data.
First, every method has been applied on all 128 samples of the 32 patients without a WGD by providing the true value of the main parameters, including tumor ploidy, number of clones, and maximum copy number. In this case, the results obtained by every method are reported in the folder `fixed`, in the subfolder of the corresponding dataset. The results of Battenberg, TITAN, and THetA are reported for every sample, the results of cloneHD are reported for every patient, and the results of HATCHet are reported both for every sample (when obtained by executing HATCHet on each sample individually) and for every patient (when obtained by executing HATCHet jointly on all samples from the same patient).
Second, every method has been applied on all 256 samples of the 64 patients with and without a WGD, requiring that each method infers all the relevant parameters, including tumor ploidy and number of clones, and setting the maximum copy number to 8. THetA has been excluded from this analysis as it does not automatically infer the presence or absence of a WGD. In this case, the results obtained by every method are reported in the folder `free`, in the subfolder of the corresponding dataset; the datasets are divided according to the presence or absence of a WGD. The results of Battenberg and TITAN are reported for every sample, the results of cloneHD are reported for every patient, and the results of HATCHet are reported both for every sample (when obtained by executing HATCHet on each sample individually) and for every patient (when obtained by executing HATCHet jointly on all samples from the same patient).
For every method, the most important and relevant output files are reported. Due to space limitations, the largest of these files have been compressed using the command `gzip -9`, and they can be decompressed with the corresponding command `gzip -d`.
HATCHet has been applied on two whole-genome multi-sample tumor sequencing datasets; the first dataset comprises 10 prostate cancer patients analyzed in (Gundem et al., Nature, 2015) and the second dataset comprises 4 pancreas cancer patients described in (Makohon-Moore et al., Nature genetics, 2017).
The data for all prostate cancer patients are contained in the subfolder `prostate`. For each of the 10 prostate cancer patients (A10, A12, A17, A21, A22, A24, A29, A31, A32, and A34), the results inferred by HATCHet are reported in a subfolder with the corresponding name. More specifically, the following files encode the results inferred by HATCHet for each prostate cancer patient:
Name | Description | Format |
---|---|---|
`best.seg.ucn` | Clone and allele-specific copy-number profiles and clone proportions for every genomic segment | The format is described in the HATCHet repository here |
`best.bbc.ucn.gz` | Clone and allele-specific copy-number profiles and clone proportions for every clustered bin with the corresponding RDR and BAF | The format is described in the HATCHet repository here. Due to space limitations, this file is compressed |
`chosen.diploid.seg.ucn` | The best result inferred by HATCHet assuming there is no WGD | The format is the same as that of `best.seg.ucn` |
`chosen.tetraploid.seg.ucn` | The best result inferred by HATCHet assuming there is a WGD | The format is the same as that of `best.seg.ucn` |
The mutations inferred from all samples of every prostate cancer patient are reported in a subfolder `mutations`. The SNVs and small indels are reported in two comma-separated files, `indel_hc.csv` and `snv_hc.csv`, with the following fields:
Name | Description |
---|---|
`Patient` | The name of a patient |
`Sample` | A sample from the patient `Patient` |
`chrom` | The name of a chromosome |
`position` | The genomic position of a somatic point mutation in `chrom` |
`ref` | Number of sequencing reads covering `position` with the reference allele |
`var` | Number of sequencing reads covering `position` with the alternate allele, i.e. harboring the mutation |
`normal_reads1` | Reads supporting the reference allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`normal_reads2` | Reads supporting the variant allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`normal_var_freq` | Variant-allele frequency of `position` in the matched-normal sample |
`normal_gt` | Genotype call for `position` in the matched-normal sample |
`tumor_reads1` | Reads supporting the reference allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`tumor_reads2` | Reads supporting the variant allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`tumor_var_freq` | Variant-allele frequency (VAF) of `position` in the tumor sample `Sample` |
`tumor_gt` | Genotype call for `position` in the tumor sample `Sample` |
`somatic_status` | Status of the variant (Germline, Somatic, or LOH). Here, all the mutations are Somatic |
`variant_p_value` | Significance of variant read count compared to baseline error rate |
`somatic_p_value` | Significance of tumor read count compared to normal read count |
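The `ref` and `var` read counts listed above are enough to recompute an observed variant-allele frequency per mutation as var / (ref + var). The following is a minimal sketch, assuming that the first row of each CSV file is a header carrying the field names of the table and that `ref` and `var` are integers; the function names are hypothetical, and the result is not guaranteed to coincide with the `tumor_var_freq` column.

```python
import csv


def observed_vaf(ref_reads, var_reads):
    """Fraction of reads carrying the variant allele; None when there is no coverage."""
    total = ref_reads + var_reads
    if total == 0:
        return None
    return var_reads / total


def vaf_per_mutation(path):
    """Yield ((Patient, Sample, chrom, position), VAF) for every row of a mutation CSV."""
    with open(path, newline="") as handle:
        # Assumption: the first row is a header with the field names listed above.
        for row in csv.DictReader(handle):
            key = (row["Patient"], row["Sample"], row["chrom"], row["position"])
            yield key, observed_vaf(int(row["ref"]), int(row["var"]))
```

A mutation covered by 30 reference reads and 10 variant reads, for example, has an observed VAF of 10 / 40 = 0.25.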
The data for all pancreas cancer patients are contained in the subfolder `pancreas`. For each of the 4 pancreas cancer patients (Pam01, Pam02, Pam03, and Pam04), the results inferred by HATCHet are reported in a subfolder with the corresponding name. More specifically, the following files encode the results inferred by HATCHet for each pancreas cancer patient:
Name | Description | Format |
---|---|---|
`best.seg.ucn` | Clone and allele-specific copy-number profiles and clone proportions for every genomic segment | The format is described in the HATCHet repository here |
`best.bbc.ucn.gz` | Clone and allele-specific copy-number profiles and clone proportions for every clustered bin with the corresponding RDR and BAF | The format is described in the HATCHet repository here. Due to space limitations, this file is compressed |
`chosen.diploid.seg.ucn` | The best result inferred by HATCHet assuming there is no WGD | The format is the same as that of `best.seg.ucn` |
`chosen.tetraploid.seg.ucn` | The best result inferred by HATCHet assuming there is a WGD | The format is the same as that of `best.seg.ucn` |
The mutations inferred from all samples of every pancreas cancer patient are reported in a subfolder `mutations`. The SNVs and small indels are reported in two comma-separated files, `indel_hc.csv` and `snv_hc.csv`, with the following fields:
Name | Description |
---|---|
`Patient` | The name of a patient |
`Sample` | A sample from the patient `Patient` |
`chrom` | The name of a chromosome |
`position` | The genomic position of a somatic point mutation in `chrom` |
`ref` | Number of sequencing reads covering `position` with the reference allele |
`var` | Number of sequencing reads covering `position` with the alternate allele, i.e. harboring the mutation |
`normal_reads1` | Reads supporting the reference allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`normal_reads2` | Reads supporting the variant allele of `position` in the matched-normal sample (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`normal_var_freq` | Variant-allele frequency of `position` in the matched-normal sample |
`normal_gt` | Genotype call for `position` in the matched-normal sample |
`tumor_reads1` | Reads supporting the reference allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`tumor_reads2` | Reads supporting the variant allele of `position` in the tumor sample `Sample` (the corresponding fields with plus/minus are specific to reads belonging to the +/- strand) |
`tumor_var_freq` | Variant-allele frequency (VAF) of `position` in the tumor sample `Sample` |
`tumor_gt` | Genotype call for `position` in the tumor sample `Sample` |
`somatic_status` | Status of the variant (Germline, Somatic, or LOH). Here, all the mutations are Somatic |
`variant_p_value` | Significance of variant read count compared to baseline error rate |
`somatic_p_value` | Significance of tumor read count compared to normal read count |
This section contains the tools which have been used to obtain the analyses presented in the HATCHet paper.
Analysis | Tool | Requirement |
---|---|---|
Compute mutated copies, predicted VAF, CCF, and explanation of mutations | `explainMutationsCCF.py` | The tool requires as input a SEG file with allele and clone-specific copy-number states and proportions, and a CSV file with the following fields (whose names must be specified in the first-row header): |
The tool for the analysis and clustering of SNVs computes and outputs the following fields (specified in the header, which starts with the symbol `#`):
Field | Description |
---|---|
`CHR` | The name of a chromosome |
`POS` | The genomic position of the mutation in `CHR` |
`PATIENT-SAMPLE` | Patient-sample name in the format `P-S`, where `P` is the name of the patient and `S` is the name of the sample |
`TOOL` | The name of the method which inferred the copy numbers |
`COV` | Total number of reads covering the mutation |
`COUNTS` | Comma-separated numbers of reads without and with the mutation |
`ObservedVAF` | Observed variant-allele frequency of the mutation |
`predicted_VAF` | Predicted VAF when considering the given copy-number states and clone proportions |
`Error` | Error in the prediction of VAF |
`CNStates` | Given copy-number states and clone proportions for the mutation in `POS`. The state and proportion of the mutation in every clone i are reported in a comma-separated list (where clones are sorted according to the input) and the entry for clone i is in the format `A_i |
`MutatedCopies` | Inferred number of mutated copies for every clone, reported in a comma-separated list in which the clones are sorted in the same order as in the input |
`CCF` | Computed cancer-cell fraction for the mutation |
`Explained` | True or False to indicate whether the mutation is explained |
`SNVState` | Name of the cluster of the mutation based on its SNVState, which is defined by the unique combination of its `CNStates` and `MutatedCopies` |
`SPRUCEState` | The state of the mutation as defined in the SPRUCE model. More specifically, this corresponds to a comma-separated list with an element for every clone i equal to `MAJ_i |
`SPRUCECluster` | Name of the cluster of the mutation based on the unique values of `SPRUCEState` |