GC-aware species abundance estimation from metagenomic data
GuaCAMOLE is a tool for estimating species abundance from metagenomic data, considering GC content. This package implements various scripts to generate reference distributions, correct for GC bias, and estimate species abundances.
- Create GC reference distributions from Kraken2 databases.
- Estimate species abundances using GC-aware correction.
- Generate detailed plots for data visualization.
You can install the package directly from a folder:
pip install .
Or from GitHub:
pip install git+https://github.com/Cibiv/GuaCAMOLE.git
To install the quadratic programming solver:
pip install qpsolvers['cvxopt']
GuaCAMOLE relies on the library.fna files downloaded by Kraken2 to generate the reference distributions. The Kraken2 database needs to be downloaded with the --no-masking
option with the kraken2-build
command!
To create a GC reference distribution from a Kraken2 database run the following command with the parameters of your sample:
create-reference-dist --lib_path path/to/kraken_db --read_len 150 --fragment_len 400 --ncores 20
Specifying the fragment_len
parameters for paired-end sequencing samples can help with the accuracy of GuaCAMOLE, especially if the fragments are a lot longer than the reads. If specified, GuaCAMOLE will use the mean GC content of the read pair as an approximation to the GC content of the fragment. If not specified just single reads.
For GuaCAMOLE to run it also requires a Bracken database with the correct read length to exist in the Kraken database folder.
To estimate species abundances from your data, run:
guacamole --kraken_report path/to/report --kraken_file path/to/kraken --kraken_db path/to/kraken_db --read_len 150 --output result.txt --read_files path/to/reads_1.fastq path/to/reads_2.fastq
--kraken_report
: Kraken2 report file (required)--kraken_file
: Kraken2 file with classifications (required)--kraken_db
: Path to the Kraken2 database (required)--read_len
: Read length (required)--output
: Output file name (required)--read_files
: Path to input read files (required)--threshold
: Minimum number of reads found for a species to estimate its abundance, default=500--length_correction
, genome size correction (taxonomic vs. read abundance), default=False--plot
, True if detailed plots should be generated--fp_cycles
, Number of iterations for false positive removal, default=4--reg_weight
, Determines how strong the regularization should be [between 0 and 1]', default=0.01--fragment_len
, length of the fragment if paired end and known', default=None--fasta
, True if reads are in fasta format, false if fastq', default=False
The Output is the same as the tab-delimited Bracken output file. Three additional columns are added:
Bracken_estimate
, the abundance estimate from Bracken (iflength_correction=True
they are genome length corrected the same as the GuaCAMOLE estimates)GuaCAMOLE_estimate
, the abundances estimated using the abundance parameter from the GuaCAMOLE algorithmGuaCMAOLE_est_eff
, the abundances computed using the estimated efficiencies by GuaCAMOLE (this does also include esimates for the taxa that were labelled as false positives by GuaCAMOLE)GC_content
, the GC content of the taxon's genome
To try out GuaCAMOLE you can download the fastq files of one of the replicates of the sample GH
sequenced for a publication by Tourlouss et al. (2021) (for more details see their and our manuscripts). To download the data install sra-tools
and run:
fasterq-dump SRR12996167
You need to have a kraken2 database installed with the --no-masking
flag set. Download the Kraken2 database using:
kraken2-build --standard --db demo --threads 24 --no-masking
Now you can classify the reads using
kraken2 --db standard --threads 24 --report SRR12996167_report.txt --paired SRR12996167_1.fastq SRR12996167_2.fastq > SRR12996167.kraken
Then build a Bracken database for the read length of the sequencing data which is 150 bp:
bracken-build -d standard -t 24 -l 150
Now build a GuaCAMOLE database for the corresponding read and fragment length (should take around an hour):
crate-reference-dist --lib_path standard --read_len 150 --ncores 20 --fragment_len 300
Now run GuaCAMOLE using (should take around 20 minutes):
guacamole --kraken_report SRR12996167_report.txt --kraken_file SRR12996167.kraken --kraken_db standard --read_len 150 fragment_len 150 --length_correction True --output SRR12996167_guacamole.out --read_files SRR12996167_1.fastq SRR12996167_2.fastq