_ __ __ _ ___
/\︵︵/\ | | | \/ | /_\ / __|
(◕('人')◕) | |__| |\/| |/ _ \\__ \
|︶| |____|_| |_/_/ \_\___/
Last Metagenomic Assembler Standing
The de novo assembly of raw sequence data is a key process when analysing data from shotgun metagenomic sequencing. It allows recovering draft genomes from a pool of mixed raw reads, yielding longer sequences that offer contextual information and afford a more complete picture of the microbial community. It also represents one of the greatest bottlenecks when obtaining trustworthy, reproducible results.
LMAS is an automated workflow enabling the benchmarking of traditional and metagenomic prokaryotic de novo assembly software using defined mock communities. The results are presented in an interactive HTML report where selected global and reference-specific performance metrics can be explored.
All components of LMAS are executed in Docker containers, which means that you'll need to have a container engine installed. Any container engine supported by Nextflow can be used, such as Docker, Singularity or Shifter.
If you already have any one of these installed, you are good to go as the provided docker containers are compatible with all engines available. If not, you’ll need to install one.
LMAS can be installed through Conda, an open source package and environment management system that runs on Windows, macOS and Linux. Once Conda is installed, LMAS is available on Bioconda and can be installed with:
conda install -c bioconda lmas
To install LMAS manually you'll first have to install Nextflow.
Nextflow (version 20.01.0 or higher) can be used on any POSIX compatible system (Linux, OS X, etc). It requires BASH and Java 8 (or higher) to be installed. More instructions are available here.
You can clone this repository with git clone git@github.com:cimendes/LMAS.git, and all files will be on your local machine.
Downloading the container images is usually handled by Nextflow during its execution, but if for whatever reason you need the images before you run LMAS, the pull_images.sh script will do that for you.
To download all images required to run LMAS simply run:
sh pull_images.sh
This script supports shifter, singularity and docker, recognizing the available software without needing user input.
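The kind of detection described above can be sketched in a few lines of POSIX shell. This is only an illustration: the function name detect_engine and the order in which the engines are probed are assumptions, not taken from pull_images.sh.

```shell
# Hypothetical helper: print the first container engine found on PATH.
# The probing order is an assumption, not necessarily the one used by pull_images.sh.
detect_engine() {
    for engine in docker singularity shifter; do
        if command -v "$engine" >/dev/null 2>&1; then
            printf '%s\n' "$engine"
            return 0
        fi
    done
    # Nothing found: report it and signal failure to the caller.
    printf 'none\n'
    return 1
}

engine=$(detect_engine) || true
echo "Container engine detected: $engine"
```

If the result is none, install one of the supported engines before running LMAS.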
To run LMAS you can simply call it with:
LMAS <options>
If no option or --help is provided, LMAS will display its help message. Otherwise, the --fastq and --reference options are mandatory. By default they are set to 'data/fastq/*_{1,2}.*' and 'data/reference/*.fasta' respectively.
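When passing these options explicitly, the path expressions should be quoted so that the shell does not expand the glob before the workflow sees it. The sketch below only prints the resulting command (a dry run); the variable names are illustrative, and the values are the defaults quoted above.

```shell
# Single quotes keep the shell from expanding the glob, so the pattern
# reaches the workflow intact.
fastq_pattern='data/fastq/*_{1,2}.*'
reference_pattern='data/reference/*.fasta'

# Build and print the command instead of running it (dry run).
lmas_cmd="LMAS --fastq '$fastq_pattern' --reference '$reference_pattern'"
printf '%s\n' "$lmas_cmd"
```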
Alternatively you can call LMAS directly with Nextflow:
nextflow run main.nf <options>
To use LMAS the following options are available:
_ __ __ _ ___
/\︵︵/\ | | | \/ | /_\ / __|
(◕('人')◕) | |__| |\/| |/ _ \\__ \
|︶| |____|_| |_/_/ \_\___/
Last Metagenomic Assembler Standing
Input parameters:
--fastq Path expression to paired-end fastq files.
(default: data/fastq/*_{1,2}.*)
--reference Path to the genome reference fasta file.
(default: data/reference/*.fasta)
--md Path to markdown with input sample description for report (optional).
(default: data/*.md)
Mapping and filtering parameters:
--minLength Value for minimum contig length, in basepairs.
(default: 1000)
--mapped_reads_threshold Value for the minimum percentage of a read aligning to the
contig to be considered as mapped.
(default: 0.75)
Assembly quality assessment parameters:
--n_target Target value for the N, NA and NG metrics, ranging from 0 to 1.
(default: 0.5)
--l_target Target value for the L metric, ranging from 0 to 1.
(default: 0.5)
--plot_scale Scale of x-axis for the L, NA and NG metrics plots.
Allowed values: 'linear' or 'log'.
(default: log)
Assembly execution parameters:
--abyss Boolean controlling the execution of the ABySS assembler.
(default: true)
--abyssKmerSize K-mer size for the ABySS assembler, as an integer.
(default: 96)
--abyssBloomSize Bloom filter size for the ABySS assembler.
It must be a string with a value and a unit.
(default: 2G)
--gatb_minia Boolean controlling the execution of the GATB Minia Pipeline assembler.
(default: true)
--gatbKmerSize K-mer sizes for the GATB Minia Pipeline assembler.
It must be a string with the values separated by commas.
(default: 21,61,101,141,181)
--gatb_besst_iter Number of iterations during BESST scaffolding for the
GATB Minia Pipeline assembler.
(default: 10000)
--gatb_error_correction Boolean controlling whether to skip error correction for the
GATB Minia Pipeline assembler.
(default: false)
--idba Boolean controlling the execution of the IDBA-UD assembler.
(default: true)
--metahipmer2 Boolean controlling the execution of the MetaHipMer2 assembler.
(default: true)
--metahipmer2KmerSize K-mer sizes for the MetaHipMer2 assembler.
It must be a string with the values separated by commas.
(default: 21,33,55,77,99)
--minia Boolean controlling the execution of the minia assembler.
(default: true)
--miniaKmerSize K-mer size for the minia assembler, as an integer.
(default: 31)
--megahit Boolean controlling the execution of the MEGAHIT assembler.
(default: true)
--megahitKmerSize K-mer sizes for the MEGAHIT assembler.
It must be a string with the values separated by commas.
(default: 21,29,39,59,79,99,119,141)
--metaspades Boolean controlling the execution of the metaSPAdes assembler.
(default: true)
--metaspadesKmerSize K-mer sizes for the metaSPAdes assembler.
It must be a string containing either 'auto' or the values separated by spaces.
(default: auto)
--spades Boolean controlling the execution of the SPAdes assembler.
(default: true)
--spadesKmerSize K-mer sizes for the SPAdes assembler.
It must be a string containing either 'auto' or the values separated by spaces.
(default: auto)
--skesa Boolean controlling the execution of the SKESA assembler.
(default: true)
--strainxpress Boolean controlling the execution of the StrainXpress assembler.
(default: true)
--unicycler Boolean controlling the execution of the Unicycler assembler.
(default: true)
--velvetoptimiser Boolean controlling the execution of the VelvetOptimiser assembler.
(default: true)
--velvetoptimiser_hashs Starting K-mer size for the VelvetOptimiser assembler, as an integer.
(default: 19)
--velvetoptimiser_hashe End K-mer size for the VelvetOptimiser assembler, as an integer.
(default: 31)
Execution resources parameters:
--cpus Number of CPUs for the assembly and mapping processes, as an integer.
This resource is doubled on each retry until max_cpus is reached.
(default: 8)
--memory Memory for the assembly and mapping processes, in the format of
'value'.'unit'.
This resource is doubled on each retry until max_memory is reached.
(default: 32 GB)
--time Time limit for the assembly and mapping processes, in the format of
'value'.'unit'.
This resource is doubled on each retry until max_time is reached.
(default: 1d)
--max_cpus Maximum number of CPUs for the assembly and mapping processes,
as an integer. It overrides the --cpus parameter.
(default: 32)
--max_memory Maximum memory for the assembly and mapping processes, in the format of
'value'.'unit'. It overrides the --memory parameter.
(default: 100 GB)
--max_time Maximum time for the assembly and mapping processes, in the format of
'value'.'unit'. It overrides the --time parameter.
(default: 3d)
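The retry behaviour of the resource parameters can be illustrated with a little shell arithmetic. The sketch assumes that the value doubles on every retry (attempt 1 uses the base value, attempt 2 twice that, and so on) and is capped at the corresponding maximum; scale_resource is a hypothetical helper, not part of LMAS.

```shell
# scale_resource BASE ATTEMPT MAX
# Assumed rule: the value doubles on every retry and never exceeds MAX.
scale_resource() {
    value=$1
    attempt=$2
    max=$3
    i=1
    while [ "$i" -lt "$attempt" ]; do
        value=$((value * 2))
        i=$((i + 1))
    done
    # Cap the result at the configured maximum.
    if [ "$value" -gt "$max" ]; then
        value=$max
    fi
    printf '%s\n' "$value"
}

# Defaults from the help message: --cpus 8 / --max_cpus 32,
# --memory 32 GB / --max_memory 100 GB.
for attempt in 1 2 3 4; do
    echo "attempt $attempt: $(scale_resource 8 "$attempt" 32) CPUs, $(scale_resource 32 "$attempt" 100) GB"
done
```

Under these assumptions a process that keeps failing would request 8, 16 and then 32 CPUs, and 32, 64 and then 100 GB of memory.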
The reference sequences, in a single file, can be passed with the --reference parameter, and --fastq receives the raw data for assembly.
The raw data is a collection of sequence fragments from the references, and can be either generated in silico or obtained from real sequencing platforms.
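Since the default --fastq pattern ('data/fastq/*_{1,2}.*') expects paired-end files, it can be worth checking that every forward file has its mate before starting a run. The following is a minimal sketch; check_pairs is a hypothetical helper and the file names in the demonstration are made up.

```shell
# check_pairs DIR: report every *_1.* file in DIR that lacks a *_2.* mate.
# Returns 0 when all pairs are complete, 1 otherwise.
check_pairs() {
    dir=$1
    status=0
    for forward in "$dir"/*_1.*; do
        [ -e "$forward" ] || continue       # unmatched glob stays literal
        name=${forward##*/}                 # strip the directory
        sample=${name%%_1.*}                # sample ID before "_1."
        suffix=${name#"${sample}_1."}       # extension after "_1."
        if [ ! -e "$dir/${sample}_2.$suffix" ]; then
            echo "missing mate for $name" >&2
            status=1
        fi
    done
    return "$status"
}

# Demonstration on a throwaway directory with one incomplete pair.
demo=$(mktemp -d)
touch "$demo/sampleA_1.fq.gz" "$demo/sampleA_2.fq.gz" "$demo/sampleB_1.fq.gz"
check_pairs "$demo" || echo "incomplete pairs found"
rm -rf "$demo"
```

For real data, call check_pairs data/fastq (or whichever directory holds your reads).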
Users can customize the workflow execution either by using command line options or by modifying a simple plain-text configuration file (conf/params.config), where parameters are set as key-value pairs. The version of tools used can also be changed by providing new container tags in the appropriate configuration file (conf/containers.config).
Users can select which profile to use with the -profile option. Several configurations are available in the profile configuration file (conf/profiles.config). For a local execution we recommend running LMAS with either -profile docker or -profile singularity. HPC compatibility is available for SLURM, SGE, LSF, among others.
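Profiles and parameter overrides can be combined on the command line. Note that Nextflow's own options take a single dash (-profile), while the workflow parameters listed in the help message take two (--minLength, --abyss). The sketch below prints such a command rather than running it; the chosen values are purely illustrative.

```shell
# Nextflow options use one dash; LMAS workflow parameters use two.
profile=singularity     # any profile defined in conf/profiles.config
min_length=2000         # illustrative value; the documented default is 1000

# Build and print the command instead of running it (dry run).
run_cmd="nextflow run main.nf -profile $profile --minLength $min_length --abyss false"
printf '%s\n' "$run_cmd"
```

The same overrides can be made permanent by editing the corresponding keys in conf/params.config.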
The output files are stored in the results/ folder in the directory where the workflow was executed.
The Nextflow log file for the execution of the pipeline can be found in the directory of execution. Log files for each of the components in the workflow are stored inside the results/ folder.
LMAS creates an interactive HTML report, stored in the report/ folder in the directory where the workflow was executed. To view the report, open the index.html file and it will load in your default browser.
LMAS comes pre-packaged with the JS source code for the interactive report, available in the resources/ folder.
The source code for the report is available in the LMAS.js repository.
A bash script to download and structure the ZymoBIOMICS data to be used as input is provided (get_data.sh).
sh get_data.sh
Running this script downloads the eight bacterial genomes and four plasmids of the ZymoBIOMICS Microbial Community Standards, which are used as reference. The reference contains complete sequences for the following species:
- Bacillus subtilis
- Enterococcus faecalis
- Escherichia coli
- Escherichia coli plasmid
- Lactobacillus fermentum
- Listeria monocytogenes
- Pseudomonas aeruginosa
- Salmonella enterica
- Staphylococcus aureus
- Staphylococcus aureus plasmid 1
- Staphylococcus aureus plasmid 2
- Staphylococcus aureus plasmid 3
It also downloads the raw sequence data of the mock communities, with an even (ERR2984773) and a logarithmic (ERR2935805) distribution of species, along with the complete reference sequences.
Simulated samples of the evenly and log distributed reads, with and without error, generated from the genomes in the ZymoBIOMICS standard with InSilicoSeq (version 1.5.2), are also included:
- ENN - Evenly distributed sample with no error model
- EMS - Evenly distributed sample with Illumina MiSeq error model
- LNN - Log distributed sample with no error model
- LHS - Log distributed sample with Illumina HiSeq error model
After downloading the data you can run LMAS with default parameters using the following command:
LMAS -profile docker
or
nextflow run main.nf -profile docker
LMAS is developed at the Molecular Microbiology and Infection Unit (UMMI) at the Instituto de Medicina Molecular Joao Antunes, in collaboration with the Microbiology, Advanced Genomics and Infection Control Applications Laboratory (MAGICAL) at the Faculty of Health Sciences, Ben-Gurion University of the Negev.
This project is licensed under the GPLv3 license.
If you use LMAS please cite this repository.