This repository contains workflows for metagenomic filtering, assembly, and mapping, and a combined workflow which runs all components together. The workflows are written in Workflow Definition Language (WDL) and are executed with Cromwell, a workflow management system developed by the Broad Institute.

Workflows are intended to run on 2x150 bp Illumina datasets.

Included WDLs

rqcfilter2.wdl - performs quality control filtering on Illumina datasets using BBTools

metagenome_assy.wdl - performs external error correction using bbcms from BBTools, followed by assembly with metaSPAdes

mapping.wdl - performs mapping of short reads to an assembly using bbmap from BBTools

metagenome_filtering_assembly_and_alignment.wdl - executes filtering, error correction, assembly, and mapping

External bioinformatics tools used

BBTools

SPAdes

Installation instructions


  • Git version 2.18.4 or higher
  • Cromwell version 47 or higher
  • Docker version 19.03.6-ce or higher
  • Java version 1.8.0_152-release or higher
  • wget
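
The requirements above can be checked with a short script before installing anything else. This is a minimal sketch, not part of the repository: the function name `check_tools` is hypothetical, and it only tests whether each command is on the PATH. Versions still need to be compared by hand against the list above.

```shell
# Report which of the given command-line tools are missing from PATH.
# Returns 0 if every tool is found, 1 otherwise.
# (check_tools is a hypothetical helper, not part of this repository.)
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the requirements list. Presence only; check versions with
# e.g. `git --version`, `docker --version`, and `java -version`.
check_tools git cromwell docker java wget || echo "Install the missing tools before continuing."
```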

Recommended test dataset: ZymoBIOMICS mock-community DNA control (SRR7877884)

Example instructions for installing and running the metagenome assembly pipeline on Amazon Web Services

  • Launch an instance using an AMI that includes Docker (e.g., amzn-ami-2018.03.20200205-amazon-ecs-optimized, ami-0683e2d253e41f366). The test data provided can be run on an r5.4xlarge instance with 300 GB of storage mounted to / (root). See current pricing.
  • Connect to your EC2 instance using SSH.
# Install wget.
sudo yum install wget -y

# Download and install Miniconda.
bash -b -p $PWD/miniconda3
source miniconda3/etc/profile.d/ && conda activate

# Install git, cromwell, and jq.
conda install -c conda-forge git cromwell jq -y

# Change directory to /tmp and download code.
cd /tmp; git clone

# Fetch test data.

# Interleave the paired FASTQ files with reformat.sh from BBTools.
docker run --volume $PWD:/data -w /data bryce911/bbtools:38.86 reformat.sh in=SRR7877884_1.fastq.gz in2=SRR7877884_2.fastq.gz out=SRR7877884.fastq.gz

# Download rqcfilter dataset (~2 hours).
mkdir data; cd data; wget -O - | tar -xf - ; cd ..

# Make an inputs.json file containing the path to the test data.
echo '{"metagenome_filtering_assembly_and_alignment.input_files": ["/tmp/SRR7877884.fastq.gz"]}' > inputs.json

# Run pipeline.
cromwell -Dconfig.file=jgi_meta_wdl/local.conf run -i inputs.json -m output.metadata.json jgi_meta_wdl/metagenome_filtering_assembly_and_alignment.wdl

# Display outputs.
jq .outputs output.metadata.json
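
When running the pipeline on a dataset other than the test data, the inputs.json step can be wrapped so that a missing or empty FASTQ is caught before Cromwell starts. The following is a minimal sketch under stated assumptions: `make_inputs` is a hypothetical helper, not part of this repository, and it assumes the file path contains no characters that need JSON escaping (such as double quotes or backslashes).

```shell
# Write an inputs.json for the combined workflow, refusing to do so if the
# interleaved FASTQ file is missing or empty.
# (make_inputs is a hypothetical helper, not part of this repository.)
# Usage: make_inputs <interleaved.fastq.gz> [output.json]
make_inputs() {
    local fastq=$1
    local out=${2:-inputs.json}
    if [ ! -s "$fastq" ]; then
        echo "error: $fastq is missing or empty" >&2
        return 1
    fi
    # Assumes the path has no characters that require JSON escaping.
    printf '{"metagenome_filtering_assembly_and_alignment.input_files": ["%s"]}\n' "$fastq" > "$out"
}

# Example (same result as the echo command above):
#   make_inputs /tmp/SRR7877884.fastq.gz inputs.json
```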

Output files

Final contig and scaffold FASTA files:

  • call-assy/metagenome_assy.metagenome_assy/*/call-create_agp/assembly.contigs.fasta
  • call-assy/metagenome_assy.metagenome_assy/*/call-create_agp/assembly.scaffolds.fasta

Alignment files of input reads mapped to contigs:

  • call-mapping/mapping.mapping/*/call-finalize_bams/pairedMapped_sorted.bam
  • call-mapping/mapping.mapping/*/call-finalize_bams/pairedMapped_sorted.bam.cov
  • call-mapping/mapping.mapping/*/call-finalize_bams/pairedMapped_sorted.bam.flagstat
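
Because each path above contains a run-specific directory (the `*`), it can be convenient to locate the files with `find`. This is a sketch, not part of the repository: `list_outputs` is a hypothetical helper, and the default root `cromwell-executions` is Cromwell's standard execution directory, which is an assumption here since local.conf may point elsewhere. The patterns follow the paths exactly as listed above.

```shell
# List the final assembly and alignment files under a Cromwell execution root.
# (list_outputs is a hypothetical helper, not part of this repository.)
# Usage: list_outputs [execution_root]
list_outputs() {
    # Assumption: Cromwell's default root; adjust if local.conf overrides it.
    local root=${1:-cromwell-executions}
    find "$root" -type f \( \
        -path '*/call-create_agp/assembly.contigs.fasta' -o \
        -path '*/call-create_agp/assembly.scaffolds.fasta' -o \
        -path '*/call-finalize_bams/pairedMapped_sorted.bam*' \
    \) | sort
}

# Example:
#   list_outputs cromwell-executions
```

The trailing `*` in the last pattern also matches the `.cov` and `.flagstat` files that sit next to the sorted BAM.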