NCBI-Hackathons/ATACFlow

This pipeline performs ATAC-seq analysis using Nextflow.

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  • Sra-tools --version 2.8.2 - converting SRA files to FASTQ files
  • Trim_Galore --version 0.4.4 - adapter trimming and quality control
  • FastQC --version v0.11.7 - read quality control
  • MultiQC --version 1.5 - report describing the results of the whole pipeline
  • Bowtie2-build --version 2.3.0 - building the reference genome index
  • Bowtie2 --version 2.3.0 - mapping reads to the reference genome
  • Samtools --version 1.3.1 - manipulating alignments in SAM/BAM files
  • Bedtools --version 2.25.0 - genome arithmetic
  • Igvtools --version 2.3.75 - preprocessing the data for visualization
  • MACS2 --version 2.1.1.20160309 - calling peaks
  • DAStk - differential ATAC-Seq analysis

FastQC

FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads and the per-base sequence content (%T/A/G/C), as well as adapter contamination and other overrepresented sequences.

For further reading and documentation see the FastQC help.

Output directory: results/fastqc

  • sample_fastqc.html
    • FastQC report, containing quality metrics for your untrimmed raw fastq files
  • zips/sample_fastqc.zip
    • zip file containing the FastQC report, tab-delimited data file and plot images
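
As a rough sketch of how the zipped report could be inspected programmatically: the archive is assumed to contain a `summary.txt` with one tab-separated line per module (status, module name, input file name), which is the layout FastQC 0.11.x is understood to use. The function names and the example data are hypothetical and not part of the pipeline.

```python
import zipfile


def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: status <TAB> module name <TAB> input file name.
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def read_fastqc_zip(zip_path):
    """Read summary.txt out of a FastQC zip archive (layout assumed)."""
    with zipfile.ZipFile(zip_path) as archive:
        # The summary is assumed to sit at <sample>_fastqc/summary.txt.
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        return parse_fastqc_summary(archive.read(name).decode())


# Made-up summary content, used instead of a real archive.
example = (
    "PASS\tBasic Statistics\tsample.fastq.gz\n"
    "WARN\tAdapter Content\tsample.fastq.gz\n"
)
flagged = {m: s for m, s in parse_fastqc_summary(example).items() if s != "PASS"}
```

On a real run, `read_fastqc_zip("results/fastqc/zips/sample_fastqc.zip")` would be the corresponding call, assuming the output directory shown above.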

MultiQC

MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report, and further statistics are available within the report data directory.

Output directory: results/multiqc

  • Project_multiqc_report.html
    • MultiQC report - a standalone HTML file that can be viewed in your web browser
  • Project_multiqc_data/
    • Directory containing parsed statistics from the different tools used in the pipeline

For more information about how to use MultiQC reports, see http://multiqc.info

TrimGalore

TrimGalore is used for removal of adapter contamination and trimming of low quality regions. TrimGalore uses Cutadapt for adapter trimming and runs FastQC after it finishes.

MultiQC reports the percentage of bases removed by TrimGalore in the General Statistics table, along with a line plot showing where reads were trimmed.

Output directory: results/trimgalore

Contains FastQ files with quality and adapter trimmed reads for each sample, along with a log file describing the trimming.

  • sample_val_1.fq.gz, sample_val_2.fq.gz
    • Trimmed FastQ data, reads 1 and 2.
  • sample_val_1.fastq.gz_trimming_report.txt
    • Trimming report (describes which parameters were used)
  • sample_val_1_fastqc.html
  • sample_val_1_fastqc.zip
    • FastQC report for trimmed reads

Single-end data will have slightly different file names and only one FastQ file per sample:

  • sample_trimmed.fq.gz
    • Trimmed FastQ data
  • sample.fastq.gz_trimming_report.txt
    • Trimming report (describes which parameters were used)
  • sample_trimmed_fastqc.html
  • sample_trimmed_fastqc.zip
    • FastQC report for trimmed reads
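
The naming scheme above can be captured in a small helper, for example to check that a run produced every expected file. This is only a sketch: the function name is hypothetical, and the file names are copied verbatim from the two lists above rather than from TrimGalore itself.

```python
def trimgalore_outputs(sample, paired_end=True):
    """Return the TrimGalore output file names listed above for one sample."""
    if paired_end:
        return [
            f"{sample}_val_1.fq.gz",
            f"{sample}_val_2.fq.gz",
            f"{sample}_val_1.fastq.gz_trimming_report.txt",
            f"{sample}_val_1_fastqc.html",
            f"{sample}_val_1_fastqc.zip",
        ]
    # Single-end data: one trimmed FastQ file per sample.
    return [
        f"{sample}_trimmed.fq.gz",
        f"{sample}.fastq.gz_trimming_report.txt",
        f"{sample}_trimmed_fastqc.html",
        f"{sample}_trimmed_fastqc.zip",
    ]


paired_files = trimgalore_outputs("sample")
single_files = trimgalore_outputs("sample", paired_end=False)
```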

bowtie2

Bowtie2 is used to align reads and produce raw alignment files, followed by various filtering steps (mappability and quality) to produce filtered BAM files.

Output directory: results/bowtie2

  • sample.sam
    • Alignment sam file

SAMtools

SAMtools is used for sorting and indexing the output BAM files from Bowtie2. In addition, the numbers of features are counted with the idxstats option.

Output directory: results/samtools

  • sample.sorted.bam
    • Sorted bam file
  • sample.sorted.bam.flagstat
    • Flagstat of the bam file
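
A minimal sketch of pulling the mapping rate out of a flagstat file, assuming the usual `<passed> + <failed> <description>` line layout printed by `samtools flagstat`. The function name and the example counts are made up for illustration.

```python
import re


def parse_flagstat(text):
    """Return QC-passed read counts keyed by a shortened flagstat description."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if not match:
            continue
        passed, _failed, label = match.groups()
        # Drop the trailing parenthetical, e.g. "(95.00% : N/A)".
        key = label.split(" (")[0]
        # Keep the first occurrence when two descriptions share a prefix.
        counts.setdefault(key, int(passed))
    return counts


# Made-up flagstat excerpt.
example = (
    "2000 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "1900 + 0 mapped (95.00% : N/A)\n"
)
counts = parse_flagstat(example)
mapped_fraction = counts["mapped"] / counts["in total"]
```

Computing the fraction from the raw counts avoids depending on how a particular SAMtools version formats the percentage.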

bedtools

bedtools is used to generate BedGraph copies for the downstream analysis.

Output directory: results/bedtools

  • sample.sorted.bed
    • BedGraph copies (easier for data analysis)

Igvtools

igvtools toTDF converts a sorted data input file to a binary tiled data (.tdf) file.

Output directory: results/igvtools

  • sample.tdf
    • binary tiled tdf data file

MACS2

MACS2 is a program for detecting regions of genomic enrichment. Though designed for ChIP-seq, it works just as well on ATAC-seq and other genome-wide enrichment assays that have narrow peaks. The main program in MACS2 is callpeak, and its output files are described below. As input, MACS2 takes the alignment files produced in the previous steps. However, it is important to remember that the read alignments indicate only a portion of the DNA fragments generated by the ATAC-seq assay. Therefore, we must consider how we want MACS2 to interpret the alignments.

Output directory: results/macs2

  • sample_peaks.xls
    • Tabular file which contains information about called peaks. Information includes:
      • chromosome name
      • start position of peak
      • end position of peak
      • length of peak region
      • absolute peak summit position
      • pileup height at peak summit
      • -log10(pvalue) for the peak summit (e.g. if pvalue = 1e-10, this value should be 10)
      • fold enrichment for this peak summit against random Poisson distribution with local lambda
      • -log10(qvalue) at peak summit
  • sample_peaks.narrowPeak
    • BED6+4 format file which contains the peak locations together with peak summit, pvalue and qvalue.
  • sample_summits.bed
    • BED format file which contains the peak summit location for every peak.
  • sample_peaks.broadPeak
    • BED6+3 format file which is similar to narrowPeak file, except for missing the column for annotating peak summits.
  • sample_peaks.gappedPeak
    • BED12+3 format file which contains both the broad region and narrow peaks.
  • sample_model.r
    • R script that can be run to produce a PDF image of the model built from your data.
  • .bdg
    • bedGraph format files which can be imported to UCSC genome browser or be converted into even smaller bigWig files.

Refer to https://github.com/taoliu/MACS for the specifications of the output fields.
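
To make the narrowPeak layout concrete, here is a small sketch that reads one line and recovers the absolute summit position and the plain p-value and q-value. It assumes the standard BED6+4 column order (chrom, start, end, name, score, strand, signalValue, -log10 pvalue, -log10 qvalue, summit offset from start). The `Peak` tuple, the function name and the example line are hypothetical.

```python
from collections import namedtuple

Peak = namedtuple("Peak", "chrom start end name summit p_value q_value")


def parse_narrowpeak_line(line):
    """Parse one narrowPeak line (BED6+4 column order assumed)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
    # Columns 8 and 9 hold -log10(pvalue) and -log10(qvalue).
    neg_log10_p, neg_log10_q = float(fields[7]), float(fields[8])
    # Column 10 is the summit offset relative to the peak start.
    summit = start + int(fields[9])
    return Peak(chrom, start, end, name, summit,
                10 ** -neg_log10_p, 10 ** -neg_log10_q)


# Made-up peak: a -log10(pvalue) of 10 corresponds to pvalue = 1e-10.
example = "chr1\t1000\t1500\tsample_peak_1\t250\t.\t8.5\t10.0\t7.0\t200"
peak = parse_narrowpeak_line(example)
peak_length = peak.end - peak.start
```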

Downstream analysis

DAStk

DAStk is a differential ATAC-seq toolkit that can be used to identify changes in transcription factor (TF) activity across differential ATAC-seq datasets.

Output directory: results/md_scores

  • sample_Treatment_md_scores.txt
    • MD-scores of the differential analysis on ATAC-seq datasets
  • MA plot that labels the most significant TF activity changes, at a p-value cutoff of 1e-7. Note that the condition names (DMSO and Treatment) were the same ones used earlier as the second half of the prefix.
  • Barcode plot for each of these statistically significant motif differences, depicting how close the ATAC-seq peak centers were to the motif centers, within a 1500 base-pair radius of the motif center.
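
For intuition, the sketch below computes a motif displacement (MD) style score from a list of signed distances between motif centers and ATAC-seq peak centers. It assumes the score is the fraction of hits inside the 1500 base-pair window that also fall within a smaller inner radius. The 150 base-pair inner radius, the function name and the example distances are assumptions for illustration, not values taken from this pipeline or from DAStk's code.

```python
def md_score(distances, inner_radius=150, outer_radius=1500):
    """Fraction of hits within outer_radius that also lie within inner_radius."""
    within_outer = [d for d in distances if abs(d) <= outer_radius]
    if not within_outer:
        return 0.0
    within_inner = [d for d in within_outer if abs(d) <= inner_radius]
    return len(within_inner) / len(within_outer)


# Made-up distances in base pairs; 3000 falls outside the outer window.
distances = [-20, 5, 140, -600, 1200, 1499, 3000]
score = md_score(distances)
```

A score near 1 would mean that peak centers cluster tightly around the motif, which is what a dense central band in the barcode plot reflects.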