
Methods

Roy Straver edited this page Dec 23, 2016 · 1 revision

Preparation

Unchanged

Data is converted from BAM files to read depth per bin. RETRO filtering is then applied to remove read towers; this filter was previously described in the WISECONDOR paper.

Training

Changed

After preparation, training samples are combined into a 2-dimensional array by concatenating the autosomal chromosomes of each sample, resulting in one sample per row. This array is then reduced by removing all bins that are empty in every training sample, i.e. masking out columns where sum == 0.
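The array construction and masking can be sketched as follows. This is a minimal illustration, not the project's actual code: the per-sample dictionary layout and the function names `build_training_matrix` and `mask_empty_bins` are assumptions made for the example.

```python
import numpy as np

def build_training_matrix(samples, chromosomes):
    """Concatenate per-chromosome bin counts into one row per sample."""
    return np.array([
        np.concatenate([sample[chrom] for chrom in chromosomes])
        for sample in samples
    ], dtype=float)

def mask_empty_bins(matrix):
    """Drop columns (bins) whose read count sums to zero over all samples."""
    mask = matrix.sum(axis=0) != 0
    return matrix[:, mask], mask

# Toy data (assumed layout): two samples, two chromosomes with 3 and 2 bins.
samples = [
    {"1": [5, 0, 7], "2": [3, 0]},
    {"1": [6, 0, 8], "2": [4, 0]},
]
matrix = build_training_matrix(samples, ["1", "2"])
masked, mask = mask_empty_bins(matrix)
```

The boolean `mask` is kept alongside the reduced array so that the same bins can be removed from a test sample later.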

PCA is fit to the masked array; the means and components are stored for later use. The masked array is mapped to PCA space using the first 3 components, then reconstructed back to the original space. The ratio between the original and the reconstructed value per bin (actual_readcount/pca_corrected_readcount) is considered the corrected value per bin.
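A rough sketch of this correction is given below. It uses NumPy's SVD as a stand-in for whichever PCA implementation the tool uses, which the text does not specify; the names `fit_pca` and `pca_correct` and the random toy data are assumptions for illustration only.

```python
import numpy as np

def fit_pca(masked, n_components=3):
    """Fit PCA via SVD; return per-bin means and the leading components."""
    mean = masked.mean(axis=0)
    # Rows of vt are the principal axes of the centred data.
    _, _, vt = np.linalg.svd(masked - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_correct(data, mean, components):
    """Divide observed read counts by their PCA reconstruction."""
    projected = (data - mean) @ components.T        # map to PCA space
    reconstructed = projected @ components + mean   # map back to bin space
    return data / reconstructed

rng = np.random.default_rng(0)
# Hypothetical masked training array: 20 samples, 50 non-empty bins.
masked = rng.poisson(lam=100, size=(20, 50)).astype(float)
mean, components = fit_pca(masked)
corrected = pca_correct(masked, mean, components)
```

Values in `corrected` close to 1 mean that the first three components explain the observed read depth of that bin well.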

Unchanged

For every sample and every bin, the squared difference to every other bin is calculated. Bins on the same chromosome as the targeted bin are ignored. The distances obtained for any pair of bins are summed over all samples, providing one value that indicates the read depth distance between that pair of bins over all reference samples. For every bin, the 100 closest bins in terms of this read depth distance are taken as reference bins.
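The reference bin selection described above might look like the sketch below. The function name `find_reference_bins`, the `bin_chroms` label array and the toy values are assumptions; the summed squared differences are computed through the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b to avoid building a samples × bins × bins array.

```python
import numpy as np

def find_reference_bins(corrected, bin_chroms, n_refs=100):
    """For each bin, pick the n_refs bins on other chromosomes with the
    smallest squared read-depth difference summed over all samples."""
    # dist[i, j] = sum over samples of (corrected[s, i] - corrected[s, j])**2
    sq = (corrected ** 2).sum(axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (corrected.T @ corrected)
    # Bins on the same chromosome as the target bin are ignored.
    same_chrom = bin_chroms[:, None] == bin_chroms[None, :]
    dist[same_chrom] = np.inf
    order = np.argsort(dist, axis=1)
    return order[:, :n_refs], dist

# Toy data: 2 samples, 5 bins; the first three bins lie on chromosome "1".
corrected = np.array([
    [1.0, 1.1, 2.0, 1.0, 2.1],
    [1.0, 0.9, 2.0, 1.1, 1.9],
])
bin_chroms = np.array(["1", "1", "1", "2", "2"])
refs, dist = find_reference_bins(corrected, bin_chroms, n_refs=1)
```

In this toy case each bin is paired with the bin on the other chromosome that behaves most like it across both samples, for example bin 2 with bin 4.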

Testing

Changed

After preparation, autosomal chromosomes are concatenated into a 1-dimensional array, bins found empty in the reference set are removed, and the test sample is PCA corrected using the correction described in Training. The PCA that was fit on the training data and stored is reused here; it is not refit on the test sample. After mapping the test sample to this predefined PCA space and reconstructing the signal, the ratio between the original and the reconstructed signal is used (actual_readcount/pca_corrected_readcount).
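Applying the stored mask and PCA to a single test sample could be sketched like this. The function name `correct_test_sample` and the hand-made stored values are assumptions; a single component is used only to keep the arithmetic easy to follow, whereas the text uses the first 3.

```python
import numpy as np

def correct_test_sample(test_bins, mask, mean, components):
    """Apply the stored training mask and PCA to a 1-D test sample."""
    masked = np.asarray(test_bins, dtype=float)[mask]
    projected = (masked - mean) @ components.T      # map to stored PCA space
    reconstructed = projected @ components + mean   # map back to bin space
    return masked / reconstructed

# Hypothetical values stored during training.
mask = np.array([True, False, True, True])
mean = np.array([10.0, 20.0, 30.0])
components = np.array([[1.0, 0.0, 0.0]])

test_sample = [12, 0, 22, 27]
corrected = correct_test_sample(test_sample, mask, mean, components)
```

Here the first bin is fully explained by the component and gets a corrected value of 1.0, while the other two bins keep their deviation from the training mean (1.1 and 0.9).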

Unchanged

After removing reference bins that strongly deviate from their target bins, a z-score is obtained per bin by applying the reference set created in training. This is documented in the WISECONDOR paper. Several cycles are then applied to determine aberrant bins, remove them from the reference sets, and recompute the z-scores.
With a z-score per bin, the results at this point are close to those of the original WISECONDOR single-bin method.
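The sketch below illustrates the general idea of a within-sample z-score with repeated exclusion of aberrant bins. It is a simplification under stated assumptions, not the procedure from the WISECONDOR paper: the threshold of 3, the cycle limit, and the names `bin_z_scores` and `iterative_z_scores` are all hypothetical, and the initial removal of deviating reference bins is left out.

```python
from statistics import mean, stdev

def bin_z_scores(values, refs, excluded=frozenset()):
    """z-score of each bin against its reference bins within one sample."""
    scores = []
    for target, ref_idx in enumerate(refs):
        ref_vals = [values[j] for j in ref_idx if j not in excluded]
        if len(ref_vals) < 2:
            scores.append(0.0)  # too few references to estimate a spread
            continue
        spread = stdev(ref_vals)
        z = (values[target] - mean(ref_vals)) / spread if spread else 0.0
        scores.append(z)
    return scores

def iterative_z_scores(values, refs, threshold=3.0, max_cycles=5):
    """Recompute z-scores after dropping aberrant bins from reference sets."""
    excluded = set()
    scores = bin_z_scores(values, refs, excluded)
    for _ in range(max_cycles):
        aberrant = {i for i, z in enumerate(scores) if abs(z) > threshold}
        new = aberrant - excluded
        if not new:
            break
        excluded |= new
        scores = bin_z_scores(values, refs, excluded)
    return scores, excluded

# Toy sample: bin 5 is elevated; every bin uses all other bins as references.
values = [1.0, 1.02, 0.98, 1.01, 0.99, 1.5]
refs = [[j for j in range(len(values)) if j != i] for i in range(len(values))]
scores, excluded = iterative_z_scores(values, refs)
```

Once the elevated bin is excluded from the reference sets, it no longer inflates the spread estimate for the remaining bins.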

Changed

A segmentation algorithm is applied to find the most significant stretches of bins per chromosome in the sample. Ignoring bins previously masked out, Stouffer's z-score is used to determine the combined z-score for any stretch of bins. When the maximum value over all possible stretches is significant, the stretch is called and the regions to its left and right are analysed in the same way. This is repeated until no more significant calls can be made.
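A brute-force sketch of this recursive segmentation is shown below. Stouffer's combined z-score for n bins is the sum of their z-scores divided by √n. The function name `segment`, the use of `None` for masked bins and the significance threshold of 5 are assumptions for the example; the exhaustive search over stretches is quadratic in the number of bins and is meant for illustration only.

```python
from math import sqrt

def segment(z_scores, start=0, end=None, threshold=5.0):
    """Recursively call the most significant stretch in z_scores[start:end].

    Masked bins are given as None and ignored. Returns a list of
    (first_bin, last_bin_exclusive, stouffer_z) tuples in genomic order.
    """
    if end is None:
        end = len(z_scores)
    best, best_span = 0.0, None
    for i in range(start, end):
        if z_scores[i] is None:
            continue  # a stretch should not start on a masked bin
        total, count = 0.0, 0
        for j in range(i, end):
            if z_scores[j] is not None:
                total += z_scores[j]
                count += 1
            combined = total / sqrt(count)  # Stouffer's z for bins i..j
            if abs(combined) > abs(best):
                best, best_span = combined, (i, j + 1)
    if best_span is None or abs(best) < threshold:
        return []
    # Call the stretch, then analyse the regions to its left and right.
    left = segment(z_scores, start, best_span[0], threshold)
    right = segment(z_scores, best_span[1], end, threshold)
    return left + [(best_span[0], best_span[1], best)] + right

# Toy chromosome: bins 2 to 5 are elevated, bin 4 is masked out.
z = [0.1, -0.3, 4.0, 5.0, None, 4.5, 0.2, -0.1]
calls = segment(z)
```

For this toy input a single stretch covering bins 2 through 5 is called, and neither flank reaches the threshold.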