Skip to content
dpryan79 edited this page Jan 20, 2016 · 29 revisions

For instructions on using the latest deepTools versions, please go here. This page only applies to deepTools 1.5

Quality controls of aligned reads

These tools work on BAM files that contain read-related information (e.g. [read][] DNA sequence, sequencing quality, mapping quality etc.) from the same organism. They are typically generated by [read][] alignment programs such as [bowtie2][], bwa, novoalign etc.

The following tools will allow you to inspect your BAM files more closely.

bamCorrelate

This tool is useful to assess the overall similarity of different BAM files. A typical application is to check the correlation between replicates or published data sets, but really, you can apply it to any inquiry that boils down to the question: "How (dis)similar are these BAM files?".

What it does

bamCorrelate computes the overall similarity between two or more BAM files based on [read][] coverage (number of reads) within genomic regions, i.e. for each pair of BAM files, the reads overlapping with the same genomic intervals are counted. Then, the correlation methods are used to determine how similar the counts are for the pairs of BAM files, i.e. if a region contains many reads in file A, does it also contain many reads in file B (see the image below)?

The result of the correlation computation is a table of correlation coefficients that will be visualized as a heatmap. The correlation coefficient indicates how "strong" the relationship between the two samples is and it will consist of numbers between -1 and 1. (-1 indicates perfect anti-correlation, 1 perfect correlation.)

We offer two different functions for the correlation computation: Pearson or Spearman.

The Pearson method measures the metric differences between samples and is therefore influenced by outliers. The Spearman method is based on rankings. If you imagine a race with 3 participants where the winner and runner-up are very close together while the third person broke her leg and comes in way, way after the first two, then Pearson would be strongly influenced by the fact that the third person had a great distance to the first ones while Spearman would only care about the fact that person 1 came in first, person 2 came in second and person 3 got the third rank, the distances between them are ignored.

In short, Pearson is an appropriate measure for data that follows a normal distribution, while Spearman does not make this assumption and is generally less driven by outliers, but with the caveat of also being less sensitive.

NOTE: bamCorrelate usually takes a long time to finish, thus it is advisable to run the tool for a tiny region (using the --region option) to adjust plotting parameters like colors and labels before running the whole computation.

Important parameters

bamCorrelate has 2 modes: bins and BED-file

In bins mode, the correlation is based on read coverage over consecutive bins of equal size (10k bp by default). This mode is useful to assess the overall similarity of BAM files. The bin size and the distance between bins can be adjusted.

In BED-file mode, the user has to supply a list of genomic regions in BED format in addition to the BAM files. bamCorrelate subsequently uses this list to compare the [read][] coverages for these regions only. This can be used, for example, to compare ChIP-seq coverages of two different samples for a set of peak regions.

For an overview of all the available command line options, go here, for an example command and results, see below.

Output files:

  • diagnostic plot the plot produced by bamCorrelate is a clustered heatmap displaying the values for each pair-wise correlation, see below for an example
  • data matrix (optional) in case you want to plot the correlation values using a different program, e.g. R, this matrix can be used

Example Figures

Here is a result of running bamCorrelate: heatmaps where the pairwise correlation coefficients are depicted by varying color intensities and are clustered using hierarchical clustering.

For the two example plots below, we supplied BAM files of RNA-seq data from different human cell lines that we had downloaded from the ENCODE project and a list of genes from RefSeq (Note that you can supply any number of BAM files that you would like to compare. In Galaxy, you just click "Add BAM file", in the command line you simply list all files one after the other, giving meaningful names via the --label option). We then calculated the pair-wise correlations of read numbers for the different genes, once with Spearman correlation, once with Pearson correlation.

As you can see, both correlation calculations more or less agree on which samples are nearly identical (the replicates, indicated by 1 or 2 at the end of the label). The Spearman correlation, however, seems to be more robust and meets our expectations more closely as the two different cell types (HUVEC and IMR90) are clearly separated.

This is the command that was used to generate the plot on the left-hand side:

$ deepTools-1.5.7/bin/bamCorrelate BED-file \
--BED RefSeq_Genes.bed \
--bamfiles wgEncodeCshlLongRnaSeqImr90CellPapAlnRep1.bam wgEncodeCshlLongRnaSeqImr90CellPapAlnRep2.bam wgEncodeCshlLongRnaSeqImr90CellTotalAlnRep1.bam	wgEncodeCshlLongRnaSeqImr90CellTotalAlnRep2.bam  wgEncodeCshlLongRnaSeqHuvecCellPapAlnRep1.bam wgEncodeCshlLongRnaSeqHuvecCellPapAlnRep2.bam wgEncodeCshlLongRnaSeqHuvecCytosolPapAlnRep3.bam wgEncodeCshlLongRnaSeqHuvecCytosolPapAlnRep4.bam \
--labels IMR90_WC1 IMR90_WC2 IMR90_WC_totalRNA1 IMR90_WC_totalRNA2 HUVEC_WC1 HUVEC_WC2 HUVEC_Cytosol1 HUVEC_Cytosol2 \
--binSize 1000 --corMethod spearman -f 200 \
--colorMap Reds --zMin 0.5 --zMax 1 \
-o correlation_spearman.pdf

Here is another example, this time for ChIP-seq samples of two different histone marks (the histone marks are abbreviated H3K27me3 and H3K27ac and have been shown to be indicative of inactive and active chromatin, respectively). For our example, H3K27ac was ChIPed by the same experimentator for different cell populations while H3K27me3 was performed with the same antibody, but at different times. You can see that the correlation between the H3K27ac replicates is much higher than for the H3K27me3 samples, however, for both histone marks, the ChIP-seq experiments are more similar to each other than to the other ChIP or to the control (the [input][] sample). In fact, the signals of H3K27ac and H3K27me3 are almost not correlated at all which supports the notion that their biological function is also quite opposing.

computeGCbias

This tool computes the GC bias using the method proposed by Benjamini and Speed.

What it does

The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition. In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference.

computeGCbias will first calculate the expected GC profile by counting the number of DNA fragments of a fixed size per GC fraction (GC fraction is defined as the number of G's or C's in a genome region of a given length) (a). This profile is then compared to the observed GC profile by counting the number of sequenced reads per GC fraction.

(a) The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (average GC content of the mouse genome: 45 %) than for genome fragments from Plasmodium falciparum (average genome GC content P. falciparum: 20%).

Excluding regions from the read distribution calculation

In some cases, it will make sense to exclude certain regions from the calculation of the read distributions to increase the accuracy of the computation. There are several kinds of regions that are either not expected to show a background read distribution or where the uncertainty of the reference genome might be too big. Please consider the following points:

  • repetitive regions: if multi-reads (reads that map to more than one genomic position) were excluded from the BAM file, it will help to exclude known repetitive regions. You can get BED files of known repetitive regions from UCSC Table Browser (see the screenshot below for an example of human repetitive elements).
  • regions of low mappability: these are regions where the mapping of the reads notoriously fails and we recommend to exclude known regions with mappability issues from the GC computation. You can download the mappability tracks for different read lengths from UCSC, e.g. for mouse and human. In the github deepTools folder "scripts", you can find a shell script called mappabilityBigWig_to_unmappableBed.sh which will turn the bigWig mappability file from UCSC into a BED file.

  • ChIP-seq peaks: in ChIP-seq samples it is expected that certain regions should show more reads than expected based on the background distribution, therefore it makes absolute sense to exclude those regions from the GC bias calculation. We recommend to run a simple, non-conservative peak calling on the uncorrected BAM file first to obtain a BED file of peak regions that should then subsequently be supplied to computeGCbias.

Output files

  • Diagnostic plot
    • box plot of absolute [read][] numbers per genomic GC fraction
    • x-y plot of observed/expected [read][] ratios per genomic GC fraction (ideally, ratio should always be 1 (log2(1) = 0))
  • Data matrix
    • tabular matrix file
    • to be used for GC correction with correctGCbias

What the plots tell you

In an ideal sample without GC bias, the ratio of observed/expected values should be close to 1 for all GC content bins.

However, due to PCR (over)amplifications, the majority of ChIP samples usually shows a significant bias towards reads with high GC content (>50%) and a depletion of reads from GC-poor regions.

Example figures

Let's start with an ideal case. The following plots were generated with computeGCbias using simulated reads from the Drosophila genome. The command looked like this (for more information on the individual command options, go to our page with All command line options):

$ deepTools-1.5.2/bin/computeGCBias -b simulatedReads.bam \
--effectiveGenomeSize 121400000 --genome dm3.2bit \
--fragmentLength 200 --biasPlot dm3_simulatedReads.png \
--GCbiasFrequenciesFile simulatedReads_frequencies.txt 

As you can see, both plots based on simulated reads do not show enrichments or depletions for specific GC content bins, there is an almost flat line at log2ratio of 0 (= ratio(observed/expected) of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the Drosophila genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary.

Now, let's have a look at real-life data from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase.

bamFingerprint

This quality control will most likely be of interest for you if you are dealing with ChIP-seq samples as a pressing question in ChIP-seq experiments is "Did my ChIP work?", i.e. did the antibody-treatment enrich sufficiently so that the ChIP signal can be separated from the background signal? (After all, around 90 % of all DNA fragments in a ChIP experiment will represent the genomic background). We use bamFingerprint routinely to monitor the outcome of ChIP-seq experiments.

What it does

This tool is based on a method developed by Diaz et al. and it determines how well the signal in the ChIP-seq sample can be differentiated from the background distribution of reads in the control sample. For factors that will enrich well-defined, rather narrow regions (e.g. transcription factors such as p300), the resulting plot can be used to assess the strength of a ChIP, but the broader the enrichments are to be expected, the less clear the plot will be. Vice versa, if you do not know what kind of signal to expect, the bamFingerprint plot will give you a straight-forward indication of how careful you will have to be during your downstream analyses to separate biological noise from meaningful signal.

Similar to bamCorrelate, bamFingerprint randomly samples genome regions (bins) of a specified length and counts the reads from indexed BAM files that overlap with those regions. These counts are then sorted according to their rank and the cumulative sum of read counts is plotted. These counts are then sorted according to their rank and the cumulative sum of [read][] counts is plotted.

Output files:

  • Diagnostic plot
  • Data matrix of raw counts (optional)

What the plots tell you

An ideal [input][] with perfect uniform distribution of reads along the genome (i.e. without enrichments in open chromatin etc.) should generate a straight diagonal line. A very specific and strong ChIP enrichment will be indicated by a prominent and steep rise of the cumulative sum towards the highest rank. This means that a big chunk of reads from the ChIP sample is located in few bins which corresponds to high, narrow enrichments seen for transcription factors.

Example figures

Here you see 3 different fingerprint plots. We chose these examples to show you how the nature of the ChIP signal (narrow and high vs. wide and not extremely high) is reflected in the "fingerprint" plots. Please note that these plots go by the name of "fingerprints" in our facility because we feel that they help us tremendously in judging individual files, but the idea underlying these plots came from Diaz et al.


[read]: https://github.com/fidelram/deepTools/wiki/Glossary#terminology "the DNA piece that was actually sequenced ("read") by the sequencing machine (usually between 30 to 100 bp long, depending on the read-length of the sequencing protocol)" [input]: https://github.com/fidelram/deepTools/wiki/Glossary#terminology "confusing, albeit commonly used name for the 'no-antibody' control sample for ChIP experiments" [bowtie2]: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml "a popular read alignment program"