Using cwltool

Alejandro Barrera edited this page May 10, 2018 · 7 revisions

Workflow and CLT inputs

When using cwltool, workflows and command-line tools (CLTs) can be preceded by configuration options (for more details, run cwltool -h). After the workflow or CLT, users have the option of specifying the expected arguments directly or supplying a JSON file describing the input files and parameters. Conceptually,

cwltool [conf. options] workflow.cwl [--arg1 <value1> [...] | inputs.json]
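For the inputs.json form, a minimal file could look like the sketch below. The key names here are hypothetical; the actual keys are defined by each workflow's inputs section. File inputs use the standard CWL object form with a class and a path:

```json
{
  "input_fastq_file": {"class": "File", "path": "/path/to/data/sample1.fastq.gz"},
  "nthreads": 16,
  "mem": 16000
}
```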

Realistically, users will almost always prefer to store their parameters and arguments in JSON files, so that no information is lost. In the GGR-cwl project, we facilitate the creation of JSON files with a simple Python utility that takes a tab-delimited file with information about the samples and produces one or more JSON files. It uses templates for each experiment type (e.g. RNA-seq, ChIP-seq), which contain reference files that should be adapted.

Here is an example with real data associated with the GGR project:

python json-generator/run.py \
  --metadata-type chip-seq \
  --metadata-file json-generator/examples/chip_seq_Lerchter_3153_160217A3.txt \
  --data-dir /path/to/data/ \
  --mem 16000 \
  --nthreads 16 \
  --outdir out

which will produce one JSON file per pipeline subtype, containing all the necessary arguments to run cwltool:

out/chip_seq_Lerchter_3153_160217A3-se-with-control.json

For a complete list of workflows and subworkflows, see https://github.com/Duke-GCB/GGR-cwl/blob/master/README.md

Parallel processing

At the time of writing this wiki, Toil is not yet ready for production use to fully take advantage of the Slurm HPC cluster available at Duke GCB (more on this here).

A suboptimal alternative is to run multiple cwltool instances in parallel instead. This means that instead of relying on a CWL runner that can work directly with the HPC scheduler (as Toil does), multiple individual cwltool commands are run. The main drawback of this approach is that each job has to request the memory and threads needed to run the most computationally intensive tool of the workflow: regardless of the requirements of each individual tool, every job allocates the same memory and threads.
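As a sketch of this approach (the workflow path and JSON file pattern are illustrative), one cwltool instance can be launched per input JSON:

```shell
# Sketch: launch one cwltool instance per input JSON in the background.
# Note that each instance still needs the memory/threads of the heaviest step.
for j in out/*-se-with-control*.json; do
  cwltool --non-strict --outdir "out/$(basename "$j" .json)" \
    /path/to/GGR-cwl/ChIP-seq_pipeline/v1.0/pipeline-se-with-control.cwl "$j" &
done
wait  # block until all instances finish
```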

Still, it can be a good way to speed things up. Just add --separate-jsons to the json-generator/run.py command and each JSON will contain a single target sample, with a numerical suffix appended to the JSON name, which comes in handy if your HPC system supports array jobs.
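For example, reusing the command above (the resulting file names are illustrative):

```shell
python json-generator/run.py \
  --metadata-type chip-seq \
  --metadata-file json-generator/examples/chip_seq_Lerchter_3153_160217A3.txt \
  --data-dir /path/to/data/ \
  --mem 16000 \
  --nthreads 16 \
  --outdir out \
  --separate-jsons
# one JSON per sample, with a numerical suffix, e.g. (illustrative):
#   out/chip_seq_Lerchter_3153_160217A3-se-with-control-0.json
#   out/chip_seq_Lerchter_3153_160217A3-se-with-control-1.json
```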

Running cwltool

Once all the software requirements are satisfied, the environment is ready, and the input JSONs have been generated, cwltool is ready to rumble.

with Docker

Running cwltool with Docker is convenient and easy: it will automatically download images and build containers.

cwltool \
  --non-strict \
  --tmpdir-prefix tmpdirs/tmp \
  --tmp-outdir-prefix tmpdirs/tmp-out \
  --outdir out \
  /path/to/GGR-cwl/ChIP-seq_pipeline/v1.0/pipeline-se-with-control.cwl \
  out/chip_seq_Lerchter_3153_160217A3-se-with-control.json

But keep in mind that running cwltool like this puts all the computational stress on the machine running Docker. Nonetheless, it is a good way to check that your CWL files run successfully on small sample files.

no-container

The --no-container option makes cwltool ignore the Docker directives. Each of the required executables must then be provided locally. An easy way to do this is to add them to the $PATH environment variable and pass the --preserve-environment option to the execution.

export PATH="/path/to/FastQC:$PATH"
export PATH="/path/to/preseq_v2.0:$PATH"
...
export DISPLAY=:0.0

cwltool \
  --no-container \
  --preserve-environment PATH R_LIBS DISPLAY \
  --non-strict \
  --tmpdir-prefix tmpdirs/tmp \
  --tmp-outdir-prefix tmpdirs/tmp-out \
  --outdir out \
  /path/to/GGR-cwl/ChIP-seq_pipeline/pipeline-se-with-control.cwl \
  out/chip_seq_Lerchter_3153_160217A3-se-with-control.json

In the example above, the DISPLAY variable has to be set so that java -jar applications that try to use graphical displays like X11 (e.g. FastQC) do not crash.
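Before launching a long no-container run, it can save time to sanity-check that the required executables actually resolve on $PATH. A minimal sketch (the tool names are just examples from this pipeline; adjust them to your workflow):

```shell
# Report any required tool that cannot be found on $PATH
# (tool names are examples; adjust to the workflow you are running)
missing=""
for tool in fastqc samtools bedtools preseq; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "Missing from PATH:$missing" >&2
fi
```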

SLURM array job example

If the input JSON files were created with the --separate-jsons option, it is easy to create a SLURM job array to process them. Here is an example SLURM script for ChIP-seq processing on HARDAC@Duke-GCB.

#!/bin/bash
#SBATCH --job-name=cwl_chipseq
#SBATCH --output=logs/PROJECT123-se-with-control-%a.out
#SBATCH --mem=16000
#SBATCH --cpus-per-task=16
#SBATCH --array=0-100%20

# For Python based tools, we use anaconda
source activate cwltool

# Load tools in the path
export PATH="/path/to/bin:$PATH"
export PATH="/path/to/cwl/bin:$PATH"
export PATH="/path/to/rsem-1.2.21/:$PATH"
export PATH="/path/to/FastQC:$PATH"
export PATH="/path/to/preseq_v2.0:$PATH"
export PATH="/path/to/samtools-1.3/bin/:$PATH"
export PATH="/path/to/phantompeakqualtools/:$PATH"

# If you have lmod, you can use module load directives 
module load bedtools2  
module load R/3.2.0-gcb01

# For SPP, some R libraries are needed
export R_LIBS="/data/reddylab/software/R_libs"

# For Fastqc, avoid graphical displays
export DISPLAY=:0.0

cwltool \
    --no-container \
    --debug \
    --non-strict \
    --preserve-environment PATH R_LIBS DISPLAY \
    --tmpdir-prefix tmpdirs/tmp-PROJECT123-${SLURM_ARRAY_TASK_ID}- \
    --tmp-outdir-prefix tmpdirs/tmp-PROJECT123-${SLURM_ARRAY_TASK_ID}- \
    --outdir out/${SLURM_ARRAY_TASK_ID} \
    /path/to/GGR-cwl/ChIP-seq_pipeline/pipeline-se-narrow-with-control.cwl \
    jsons/chip_seq_download_metadata.se-narrow-with-control-${SLURM_ARRAY_TASK_ID}.json
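The --array=0-100%20 range above is hard-coded; a small sketch for sizing the array to however many per-sample JSONs were generated before submitting (the file pattern and submission script name are illustrative):

```shell
# Size the job array to the number of generated per-sample JSONs,
# at most 20 tasks running concurrently
# (file pattern and submission script name are illustrative)
njsons=$(ls jsons/chip_seq_download_metadata.se-narrow-with-control-*.json 2>/dev/null | wc -l)
if [ "$njsons" -gt 0 ]; then
  echo "sbatch --array=0-$((njsons - 1))%20 run_chipseq_array.sh"
fi
```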