cfDNA_GCcorrection is an easy-to-use tool to determine and correct GC biases in cell-free DNA (cfDNA) samples. Based on fragment length and GC distributions, it calculates read-level correction values and makes them available as tags in SAM/BAM format. This additional information can be used during signal extraction in various metrics (e.g., allelic balance, coverage, read ends/midpoints, WPS), while preserving the original read coverage patterns for specific analyses.
Install by cloning this repository:
You can install cfDNA_GCcorrection on command line (linux/mac) by cloning this git repository :
git clone https://github.com/kircherlab/cfDNA_GCcorrection.git
cd cfDNA_GCcorrection
pip install -e .
note: pybedtools needs a locally installed version of bedtools! Please install it by other means. See the official documentation.
The GCbiascorrection described in this repo works in two steps.
First the GC bias is calculated with the help of the computeGCBias_readlen
script, which returns a tab-separated file containing the expected, observed GC x fragment length distribution and the ratio of these measures.
Given a .bam
file with WGS data from a liquid biopsy and a genome reference in .2bit
format:
computeGCBias_readlen -b <INPUT BAMFILE> \
-g <2BIT GENOME> \
-I \
-p <NUMBER OF CPU CORES> \
--GCbiasFrequenciesFile <GCBIAS_OUTPUT_FILE>
For more options use the --help
flag.
Note: Computation of expected GC x fragment length distribution is expensive. If multiple samples are to be processed, genome specific background profiles can be reused. For more details see this section
The second step uses these precomputed values to compute correction values and attach them as bam/sam tag to each read according to its length and GC content. This can be done by executing the correctGCBias_readlen
with the output of the previous script.
correctGCBias_readlen -b <INPUT BAMFILE> \
--effectiveGenomeSize <EFFECTIVEGENOMESIZE> \
-g <2BIT GENOME> \
--GCbiasFrequenciesFile <GCBIAS_OUTPUT_FILE> \
-p <NUMBER OF CPU CORES> \
-w \
-o <GCWEIGHTED_OUTPUT_BAM>
For more options use the --help
flag.
Effective genome size:
The effective genome size is the portion of the genome that is mappable. This parameter will be used for creating the sampling distribution.
The most used values are:
Genome | Effective size |
---|---|
GRCh37 | 2864785220 |
GRCh38 | 2913022398 |
A more extensive documentation on determining the effective genome size can be found here.
getting 2bit genome files :
2bit versions of the most common genomes can be downloaded from UCSC and are indicated by the .2bit extension
.
If the genome file is not available fasta files can be converted to 2bit by using faToTwoBit.
The computation of expected GC x fragment length distribution (background profile) is expensive. As background profile is only dependent on the genome, it can be precalculated and reused for multiple samples mapped to the same genome build. This approach reduces the runtime of the GC bias calculation significantly, if multiple samples are to be processed.
The background profile can be calculated using the computeGCBias_background
script.
computeGCBias_background -b <INPUT BAMFILE> \
-g <2BIT GENOME> \
-I \
-p <NUMBER OF CPU CORES> \
--output <GCBIAS_background_FILE>
The resulting output file can be used as an additional input for the computeGCBias_readlen
script by specifying it using the --precomputed_background
flag.
computeGCBias_readlen -b <INPUT BAMFILE> \
-g <2BIT GENOME> \
-I \
--precomputed_background <GCBIAS_background_FILE> \
-p <NUMBER OF CPU CORES> \
--GCbiasFrequenciesFile <GCBIAS_OUTPUT_FILE>
If a precomputed background profile was specified, the computeGCBias_readlen
script will load it, check its integrity and use it during the GC bias calculations, effectively skipping the expensive computation of the background profile.
In recent years, many approaches were developed to extract information from cfDNA. Methods range from identification of allelic differences at known disease markers, coverage alterations indicative of structural changes in tumor tissues, fragmentation differences to measuring methylation state. These methods measure different signals, but all rely on the quantification of read distributions and subtle changes in read recovery affect their results.
A common bias in sequencing and DNA library preparation is related to the Guanine+Cytosine (GC) content of fragments, leading to under- or overrepresentation of certain read populations (Fig 1).
Figure 1. Measurement of GC biases in cfDNA data. Heatmaps show the theoretical fragment representation in genome (left), distribution in the sample (middle) and the ratio (right).From these values we can calculate readlength and GC dependent correction values, which can be used as weights fro each read. A correction of over- and underrepresented reads in a sample with an average GC content of 43% can be seen in Figure 2, showing the uncorrected ratio of Figure 1 on the left and the corrected ratio on the right.
Figure 2. GC bias ratio before and after correction. Shown is the ratio as in Figure 1 after the initial analysis of the sample (left) and after applying the respective correction (right).One potential application is signal extraction around transcription afctor binding sites (TFBS). Firgure 3 shows the coverage signal around the TFBS of LYL1, a transcription factor associated with hematopoiesis.
Figure 3 shows the compound signal of the LYL1 TFBS from normalized coverage. The expected signal is shown on the left side (sample with 41% average GC), with higher coverage in close flanking regions and a drop in coverage at the TFBS indicative of higher accessibility compared to regional background. The signal on the right side of the plot (sample with 43% average GC) is distorted due to a technical GC bias, but can be rescued by applying our correction values.
Figure 3. Effects of GC bias correction on coverage signals from cell-free DNA. Effects of GC bias correction in cfDNA samples with different GC contents illustrated for an overlay of experimentally identified binding sites of the transcription factor LYL1.cfDNA_GCcorrection started as a fork of deepTools, which was used as a skeleton for developing a GCcorrection method tailored towards cell-free DNA samples.