Thanks to the increased cost-effectiveness of high-throughput technologies, the number of studies focusing on microorganisms (bacteria, archaea, microbial eukaryotes, fungi, and viruses) and on their connections with human health and diseases has surged, and, consequently, a plethora of approaches and software has been made available for their study, making it difficult to select the best methods and tools.
Here we present Yet Another Metagenomic Pipeline (YAMP), which starts from the raw sequencing data and, with a strong focus on quality control, processes them up to the functional annotation within hours (please refer to the YAMP wiki for more information).
YAMP is constructed on Nextflow, a framework based on the dataflow programming model, which allows writing workflows that are highly parallel, easily portable (including on distributed systems), and very flexible and customisable, characteristics which have been inherited by YAMP. New modules can be added easily and the existing ones can be customised -- even though we have already provided default parameters deriving from our own experience.
YAMP is accompanied by a Docker container, which saves users the hassle of installing the required software and, at the same time, increases the reproducibility of the YAMP results (see Using Docker or Singularity).
- Citation
- Dependencies
- Installation
- Other requirements
- Usage
- Using Docker or Singularity
- Troubleshooting
- Changelog
- License
- Acknowledgements
Please cite YAMP as:
Visconti A., Martin T.C., and Falchi M., "YAMP: a containerised workflow enabling reproducibility in metagenomics research", GigaScience (2018), https://doi.org/10.1093/gigascience/giy072
To run YAMP you will need to install Nextflow (version 0.29.x or higher), as explained here. Please note that Nextflow requires BASH and Java 7 or higher to be installed. Both should be already available in most of the POSIX compatible systems (Linux, Solaris, OS X, etc). However, as of October 2017, the latest release of Java (SE9) introduces some breaking changes in Nextflow, and should not be used (see here for details).
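The Nextflow requirement above can be checked before launching anything. The following is a minimal sketch, not part of YAMP: the function names are made up for illustration, the version comparison relies on `sort -V` from GNU coreutils, and it assumes that the first dotted number printed by `nextflow -version` is the release number.

```shell
#!/usr/bin/env bash
# Pre-flight check (sketch): is Nextflow on the PATH and recent enough?

# Minimum Nextflow version stated in this README.
MIN_NEXTFLOW="0.29.0"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Reports whether a suitable Nextflow is installed (hypothetical helper).
check_nextflow() {
    if ! command -v nextflow >/dev/null 2>&1; then
        echo "nextflow not found in PATH"
        return 1
    fi
    # Assumption: the first x.y.z number in the output is the release number.
    found="$(nextflow -version 2>&1 | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
    if version_ge "$found" "$MIN_NEXTFLOW"; then
        echo "nextflow $found is recent enough"
    else
        echo "nextflow $found is older than $MIN_NEXTFLOW"
        return 1
    fi
}
```

After sourcing the snippet, running `check_nextflow` prints a one-line verdict and returns a non-zero status when Nextflow is missing or too old.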
If you are using the containerised version of YAMP (as we strongly suggest), you should also install Docker or Singularity, as explained here and here, respectively. In fact, Nextflow orchestrates, in a transparent fashion, the flow of the pipeline by wrapping and executing each step using the Docker/Singularity run command. Thus, Nextflow lies outside the container, which it is responsible for instantiating. You can find more information about Docker/Singularity containers and Nextflow here and here, respectively.
Once you have either Docker or Singularity up and running, you will not need to install any additional tools, since all the pieces of software are already available in the Docker container released with the YAMP pipeline, which you can find on DockerHub. Please refer to Using Docker or Singularity for more details.
For expert users only. If you do not want to use the containerised version of YAMP, you will need to install several additional tools for YAMP to work properly. All of them should either be in the system path with execute and read permission, or be made available within a multi-image scenario such as the one we describe in the multi-image scenario tutorial.
The list of tools that should be available includes:
- fastQC v0.11.2+ (http://www.bioinformatics.babraham.ac.uk/projects/fastqc)
- BBmap v36.92+ (https://sourceforge.net/projects/bbmap)
- Samtools v1.3.1 (http://samtools.sourceforge.net)
- MetaPhlAn2 v2.0+ (https://bitbucket.org/biobakery/metaphlan2)
- QIIME v1.9.1+ (http://qiime.org)
- HUMAnN2 v0.9.9+ (https://bitbucket.org/biobakery/humann2)
Following the links, you will find detailed instructions on how to install them, as explained by their developers. Notably, MetaPhlAn2, QIIME, and HUMAnN2 are also available in bioconda.
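When installing the tools by hand, it may help to confirm that they are reachable from the system path. The sketch below is an illustration only: `missing_tools` is a made-up helper, and the executable names passed to it (`fastqc`, `bbmap.sh`, `samtools`, `metaphlan2.py`, `humann2`) are assumptions that may differ in your installation.

```shell
#!/usr/bin/env bash
# Print, one per line, every command in the argument list that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call; the executable names are assumptions, adjust them to your setup.
missing_tools fastqc bbmap.sh samtools metaphlan2.py humann2
```

An empty output means that every listed executable was found. This checks presence only, not versions or permissions.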
Clone the YAMP repository in a directory of your choice:
git clone https://github.com/alesssia/YAMP.git
The repository includes:
- the Nextflow script, YAMP.nf,
- the configuration file, nextflow.config,
- a folder (bin) containing two helper scripts (fastQC.sh and logQC.sh),
- a folder (yampdocker) containing the Docker file used to build the Docker image (Dockerfile).
Note: the nextflow.config file includes the parameters that are used in our tutorials (check the YAMP wiki!).
YAMP requires a set of databases that are queried during its execution. Some of them should be automatically downloaded when installing the tools listed in the dependencies (or using specialised scripts, such as those available with HUMAnN2), whilst others should be created by the user. Specifically, you will need:
- a FASTA file listing the adapter sequences to remove in the trimming step. This file should be available within the BBmap installation. If not, please download it from here;
- two FASTA files describing synthetic contaminants. These files (sequencing_artifacts.fa.gz and phix174_ill.ref.fa.gz) should be available within the BBmap installation. If not, please download them from here;
- a FASTA file describing the contaminating genome(s). This file should be created by the users according to the contaminants present in their dataset. When analysing human metagenomes, we suggest always including the human genome. Please note that this file should be indexed beforehand. This can be done with BBMap, using the following command:
  bbmap.sh -Xmx24G ref=my_contaminants_genomes.fa.gz
  We suggest downloading the FASTA file provided by Brian Bushnell for removing human contamination, following the instructions available here;
- the BowTie2 database file for MetaPhlAn2. This file should be available within the MetaPhlAn2 installation. If not, please download it from here;
- the ChocoPhlAn and UniRef databases, which can be downloaded directly by HUMAnN2, as explained here;
- [optional] a phylogenetic tree used by QIIME to compute a set of alpha-diversity measures (see here for details).
You can find an example of the folder layout in this wiki page.
You can also download all these files (please note that it might be necessary to edit this file list according to the analysis at hand) either from Zenodo (https://zenodo.org/record/1068229#.Wh7a3rTQqL4), or using the following command:
wget https://zenodo.org/record/1068229/files/YAMP_resources_20171128.tar.gz
If you use this data file, please note that, before running YAMP, the FASTA file describing the human (contaminating) genome should be indexed with the following command:
bbmap.sh -Xmx24G ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz
Please also note that the size of this compressed data file is 16.7 GB.
- Modify the nextflow.config file, specifying the necessary parameters, such as the path to the aforementioned databases.
- From a terminal window, run the YAMP.nf script using the following command (when the library layout is 'paired'):
  nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE
  where R1 and R2 represent the paths to the raw data (two compressed paired-end FASTQ files), mysample is a prefix that will be used to label all the resulting files, outputdir is the directory where the results will be stored, and MODE is any of the following: < QC, characterisation, complete >; or the following command (when the library layout is 'single'):
  nextflow run YAMP.nf --reads1 R --prefix mysample --outdir outputdir --mode MODE --librarylayout single
  where R represents the path to the raw data (a compressed single-end FASTQ file), --librarylayout single specifies that single-end reads are at hand, and the other parameters are as above.
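When several paired-end samples are at hand, the command above can be generated once per sample. The following is a minimal sketch under stated assumptions: `yamp_paired_cmd` is a made-up helper, the mates are assumed to be named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, and `raw_data` and `results` are hypothetical folders. It only prints the commands (a dry run), so they can be reviewed before being executed.

```shell
#!/usr/bin/env bash
# Print the YAMP command for one paired-end sample, given the path to its R1 file.
# Assumption: mates are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
yamp_paired_cmd() {
    r1="$1"       # path to the first mate
    outdir="$2"   # directory where the results will be stored
    mode="$3"     # one of: QC, characterisation, complete
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    prefix="$(basename "$r1" _R1.fastq.gz)"
    echo "nextflow run YAMP.nf --reads1 $r1 --reads2 $r2 --prefix $prefix --outdir $outdir --mode $mode"
}

# Dry run: print one command per sample found in the (hypothetical) raw_data folder.
for file in raw_data/*_R1.fastq.gz; do
    [ -e "$file" ] || continue   # skip the unexpanded pattern when nothing matches
    yamp_paired_cmd "$file" results QC
done
```

The sample prefix is derived from the file name, so each sample gets distinctly labelled output files. Paths containing spaces would need extra quoting before the printed commands are run.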
Does it seem complicated? In the YAMP wiki there are some tutorials and a TL;DR if you are in a hurry!
To use the tools made available through the Docker container within Docker, one can either pull the pre-built image from DockerHub, using the following command:
docker pull alesssia/yampdocker
or build a local image using the file Dockerfile in the yampdocker folder. To build a local image, one should first enter the yampdocker folder and then run the following command (be careful to add the dot!):
docker build -t yampdocker .
In both cases, the image can be used by YAMP by running the command presented above, adding -with-docker followed by the image name (yampdocker):
nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE -with-docker yampdocker
where R1 and R2 represent the paths to the raw data (two compressed FASTQ files), mysample is a prefix that will be used to label all the resulting files, outputdir is the directory where the results will be stored, and MODE is any of the following: < QC, characterisation, complete >.
YAMP can also fetch the Docker container directly from DockerHub:
nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE -with-docker docker://alesssia/yampdocker
so, even simpler!
YAMP can use a Docker image with Singularity (again without pulling the image) by adding the -with-singularity option followed by the image path (-with-singularity docker://alesssia/yampdocker), that is, the following command:
nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE -with-singularity docker://alesssia/yampdocker
Please note that Nextflow is not included in the Docker container and should be installed as explained here.
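The choice between the two container engines can be scripted. The sketch below is illustrative only: `detect_engine` and `container_option` are made-up helpers, and the options they print are the ones shown above (a locally available image named yampdocker for Docker, and the DockerHub path for Singularity).

```shell
#!/usr/bin/env bash
# Print the name of the first container engine found on the PATH.
detect_engine() {
    if command -v docker >/dev/null 2>&1; then
        echo docker
    elif command -v singularity >/dev/null 2>&1; then
        echo singularity
    else
        return 1
    fi
}

# Print the Nextflow option for the requested engine, as used in this README.
container_option() {
    case "$1" in
        docker)      echo "-with-docker yampdocker" ;;
        singularity) echo "-with-singularity docker://alesssia/yampdocker" ;;
        *)           echo "unknown engine: $1" >&2; return 1 ;;
    esac
}
```

For instance, `container_option "$(detect_engine)"` prints the option to append to the `nextflow run YAMP.nf ...` command, and fails when neither engine is installed.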
We have listed all known issues and solutions on this wiki page. Please report any issue using the GitHub platform.
Fixes:
- Solved problem in loading data in 'complete' mode
Notes:
- YAMP now requires Nextflow version 0.29.x or higher
Enhancements:
- QC'd files are now compressed (fq.gz) before being saved when keepQCtmpfile is true
Fixes:
- Solved problem in loading data when using single library layout
- Solved problem in loading data in 'characterisation' mode
Enhancements:
- Improved logs
- Version and help message printed upon request
Enhancements:
- Users no longer need to specify the number of threads and the maximum amount of memory -- both values are now read from the nextflow.config file
Enhancements:
- YAMP can now handle both paired-end and single-end reads
- The de-duplication step is now optional and can be skipped (default: true)
Enhancements:
- YAMP can now be run in three "modes": < QC, characterisation, complete >.
YAMP is licensed under GNU GPL v3.
Alessia would like to thank Brian Bushnell for his helpful suggestions about how to successfully use the BBmap suite in a metagenomics context and for providing several useful resources, and Paolo Di Tommaso, for helping her use Nextflow properly! Alessia would also like to thank all the users for their valuable feedback (and mostly Richard Davies @richardjdavies).