ALFA provides a global overview of features distribution composing NGS dataset(s). Given a set of aligned reads (BAM files) and an annotation file (GTF format with biotypes), the tool produces plots of the raw and normalized distributions of those reads among genomic categories (stop codon, 5'-UTR, CDS, intergenic, etc.) and biotypes (protein coding genes, miRNA, tRNA, etc.). Whatever the sequencing technique, whatever the organism.
ALFA produces two outputs: plots and count files.
Legend: Two images display the nucleotides distributions among the different features, one for the categories (A) and another one for the biotypes (B). Each image is composed of two plots: the top ones (1) represents the raw nucleotides fraction of the reads that mapped to each feature in the samples whereas the bottom ones (2) represents the feature enrichment/depletion relative to the genome ”surface” occupied by this feature.
For each input sample, a count table with the following information is produced:
- category/biotype pair
- count of nucleotides mapping to this feature pair in the sample
- count of nucleotides belonging to this feature pair in the genome
Using -c/--count option, this file can be used to plot again the results and avoid the whole program execution.
You can either download the code by clicking the ZIP link on this webpage or clone the project using:
git clone https://github.com/biocompibens/ALFA
You can install ALFA through the command:
pip install alfa
You can install ALFA through the command (might need to add bioconda channel first with "conda config --add channels bioconda"):
conda install -c biocomp alfa
You can run ALFA on Galaxy, the instance administrator can find the progam on the mail tool shed.
RAM: 1 GB should be sufficient for a BAM file aligned on the human genome
Dependencies:
- Bedtools suite (v2.20.0 and above)
- python libs: argparse, pysam, copy, subprocess, matplotlib, re, progressbar2, multiprocessing
There is a quick start in the respository in order to test the tool installation. To do so, one can go in the directory and run the following command:
python alfa.py -a quick_start.gtf --chr_len quick_start.chr_len.txt -g quick_start --bam quick_start.bam quick_start
Here is an illustrated detailed example produced by ALFA from fake input files.
The BAM file contains 10 reads fully mapped to unique genomic categories referring to a GTF file describing a genome made of only one gene (without introns). The figure shows an illustration of the input BAM file reads distribution on the genome (A). These reads are converted to proportions on the top plot produced by ALFA (B). As an example, 60% of the reads (n=6) are mapped to a CDS region. This plot is then normalized according to the "surface" that each category occupies within a given genome described in the input GTF file (C). Finally, the bottom plot produced by ALFA shows the enrichment/depletion of the different categories (D). For instance, the CDS regions are enriched by a factor of 1.2 since 60% of the nucleotides from the reads map to this feature although the genome is only composed of 50% of CDS regions.
ALFA can be installed from pip or conda, in this case, an executable is provided meaning that the user can call the program with "alfa" directly. Otherwise, if ALFA is installed from a GitHub clone, the user has to type "python alfa.py".
The basic ALFA workflow consists in 2 steps performed at once or separately:
- Generating ALFA genome index files (stranded and unstranded)
The user supplies an annotations file (GTF format) to generate ALFA indexes that will be used in the 2nd step. The ALFA genome index files are saved and need to be generated only once for each annotation file.
- Intersecting mapped reads with the ALFA genome index files
The user provides the previously generated ALFA genome indexes as well as mapped reads file(s) (BAM format) that are intersected to determine the proportion of reads (at nucleotide resolution) within each of the genomic features. As an output, ALFA produces a count file for each input as well as plots based on these files.
Usage:
alfa -a GTF_FILE [-g GENOME_INDEX] [--chr_len CHR_LENGTHS_FILE] [-p NB_CPUS]
Arguments:
- -a/--annotation specifies the path to the genomic annotation file (GTF format).
- -g/--genome_index defines ALFA index files basename. In absence of this option, the annotation file basename will be used.
- --chr_len specifies the path to the tab-delimited text file defining chromosome names and lengths. In absence of this option, lengths will be estimated using the GTF file. Chr_len file example:
“Chr12 100000”
Important: the GTF file has to be sorted by position. Otherwise, you can use the following command line to sort it:
sort -k1,1 -k4,4n -k5,5nr GTF_FILE > SORTED_GTF_FILE
Usage:
alfa -g GENOME_INDEX --bam BAM_FILE1 LABEL1 [BAM_FILE2 LABEL2 …]
[-s STRAND] [-d {1,2,3,4}] [-t YMIN YMAX]
[-n] [--{pdf, svg, png} output.{pdf, svg, png}]
[--keep_ambiguous] [-p NB_CPUS]
Arguments:
- -g/--genome_index specifies path and basename of existing ALFA index files
- --bam specifies BAM files paths and associated labels (labels are used within output filenames and plots legends)
- -s/--strandness specifies the strandness of the library. Authorized values are: ‘unstranded’ (default), ’forward’/’fr-firststrand’ and ‘reverse’/’fr-secondstrand’
- -d/--categories_depth specifies the depth for the categories (see Categories depth)
- -t/--threshold set the y-axis limits
- --pdf or --svg or --png specifies the path to save the plots in the chosen format
- -n/--no_display do not create and show the plots
- --keep_ambiguous specifies that the nucleotides mapping to more than one feature (category or biotype) is equally split between the different features instead of being discarded
Important: BAM files have to be sorted by position. Otherwise, you can use the 'sort' module of samtools suite:
samtools sort BAM_FILE -T aln.sorted -O bam -o SORTED_BAM_FILE
- Indexing + processing
create the ALFA genome indexes and process your BAM files at once using both -a/--annotation and --bam options. Example:
alfa -a quick_start.gtf -g quick_start --bam quick_start.bam quick_start
- Processing bedgraph files
provide the data in the coverage bedgraph format to skip the bam to bedgraph and coverageBed steps using --bedgraph flag. Example (unstranded and stranded):
alfa -g quick_start --bedgraph quick_start.bedgraph quick_start
alfa -g quick_start --bedgraph quick_start.plus.bedgraph quick_start.plus.bedgraph quick_start -s forward
- Running the tool from counts
specify the count files previously generated to avoid running the script again on already processed datasets using the -c/--counts option (instead of --bam). Example:
alfa -c quick_start.feature_counts.tsv
- -p/--processors specifies the number of processors used
- -o specifies an output directory where the files are created (default is the current directory, it will be created if it doesn't exist)
ALFA can assign categories to nucleotides according to different hierarchical levels considered in the GTF file using the -d/--categories_depth option. Here are the features considered in the 4 different levels:
- Gene / intergenic
- Exon / intron / intergenic
- (Default) 5’-UTR / CDS / 3’-UTR / intron / intergenic
- 5’-UTR / start_codon / CDS / stop_codon / 3’-UTR / intron / intergenic
Warning: using a non-homogeneous GTF file in term of deep level annotations may lead to inconsistent results due to the fact that, for instance, reads mapping to genes without UTR annotation will increase the CDS category count whereas on the other genes, the UTR categories may be increased.
By default, as GTF files are built on a hierarchical way, some assumptions are made on categories priorities:
start_codon/stop_codon > five_prime_utr/three_prime_utr/CDS > exon > transcript > gene
This means, for example, that a nucleotide associated to an exon as well as to a CDS will be accounted for the CDS category.
In case of a nucleotide mapping to overlapping categories, by default, it is discarded. Although with the option --keep_ambiguous, the count will be split equally between the different categories. Overlapping biotype priorities are first solved with the associated category, in case of an equality, the previous principle is applied.
If ALFA meets a category that is not referenced in its code, it will be ignored since its priority is tricky to assess. However, an unknown biotype is added on the fly and will be processed.
We thank Felipe Delestro Matos from IBENS Bioinformatics platform for the figures and plots design.