Using cwltool
When using cwltool, workflows and command-line tools (CLTs) can be preceded by configuration options (for more details, run cwltool -h). After the workflow or CLT, users can either specify the expected arguments directly or supply a JSON file describing the input files and parameters. Conceptually:
cwltool [conf. options] workflow.cwl [--arg1 <value1> [...] | inputs.json]
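For instance, a single tool could be invoked either way (trim.cwl and its input names below are made up for illustration):
cwltool trim.cwl --input_fastq reads.fastq --nthreads 4
cwltool trim.cwl trim-inputs.json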
Realistically, users will almost always prefer to store their parameters and arguments in JSON files, so that no information is lost. In the GGR-cwl project, we facilitate the creation of JSON files with a simple Python utility that takes a tab-delimited file with information about the samples and produces the JSON file(s). It uses a template for each type of experiment (e.g. RNA-seq, ChIP-seq), which contains reference files that should be adapted.
Here is an example with real data associated with the GGR project:
python json-generator/run.py \
--metadata-type chip-seq \
--metadata-file json-generator/examples/chip_seq_Lerchter_3153_160217A3.txt \
--data-dir /path/to/data/ \
--mem 16000 \
--nthreads 16 \
--outdir out
which will produce JSON(s) with all the necessary arguments to run cwltool for each pipeline subtype:
out/chip_seq_Lerchter_3153_160217A3-se-with-control.json
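The exact keys depend on the template used, but the generated JSON pairs each workflow input with a value. The input IDs below are illustrative (the File notation is standard CWL):
{
  "input_fastq_files": [
    {"class": "File", "path": "/path/to/data/sample1.fastq.gz"}
  ],
  "nthreads": 16,
  "mem": 16000
}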
For a complete list of workflows and subworkflows, see https://github.com/Duke-GCB/GGR-cwl/blob/master/README.md
At the time of writing this wiki, Toil is not production-ready to fully take advantage of the Slurm HPC cluster available at Duke GCB (more on this here).
A suboptimal solution is to run multiple cwltool instances in parallel instead. This means that instead of relying on a CWL runner that can work directly with the HPC scheduler (as Toil does), multiple individual cwltool commands will be run. The main problem with this solution is that each job has to request the memory and threads needed by the most computationally intensive tool of the workflow: regardless of the individual requirements of each tool, every job allocates the same memory and threads.
Still, it might be a good idea to speed things up. Just add --separate-jsons to the json-generator/run.py command and each JSON will contain only one target sample, with a numerical suffix appended to each JSON name, which comes in handy if your HPC system supports array jobs.
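For the earlier example, that means re-running the generator with the extra flag (the exact suffix format may differ slightly):
python json-generator/run.py \
--metadata-type chip-seq \
--metadata-file json-generator/examples/chip_seq_Lerchter_3153_160217A3.txt \
--data-dir /path/to/data/ \
--mem 16000 \
--nthreads 16 \
--outdir out \
--separate-jsons
which would yield one JSON per sample:
out/chip_seq_Lerchter_3153_160217A3-se-with-control-0.json
out/chip_seq_Lerchter_3153_160217A3-se-with-control-1.json
...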
Once all the software requirements are satisfied, the environment is ready and the input JSONs have been generated, cwltool is ready to rumble.
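A quick sanity check before launching a full pipeline (both are standard cwltool options):
cwltool --version
cwltool --validate /path/to/GGR-cwl/ChIP-seq_pipeline/v1.0/pipeline-se-with-control.cwl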
Running cwltool with Docker is convenient and easy: it will automatically download images and build containers.
cwltool \
--non-strict \
--tmpdir-prefix tmpdirs/tmp \
--tmp-outdir-prefix tmpdirs/tmp-out \
--outdir out \
/path/to/GGR-cwl/ChIP-seq_pipeline/v1.0/pipeline-se-with-control.cwl \
out/chip_seq_Lerchter_3153_160217A3-se-with-control.json
But keep in mind that running cwltool like this will put all the computational stress on the machine running Docker. Nonetheless, it is a good way to check that your CWL files run successfully on small sample files.
The --no-container option will make cwltool ignore Docker directives. Each of the required executables must then be provided locally. An easy way to do this is to add them to the $PATH environment variable and pass the --preserve-environment option to the execution.
export PATH="/path/to/FastQC:$PATH"
export PATH="/path/to/preseq_v2.0:$PATH"
...
export DISPLAY=:0.0
cwltool \
--no-container \
--preserve-environment PATH R_LIBS DISPLAY \
--non-strict \
--tmpdir-prefix tmpdirs/tmp \
--tmp-outdir-prefix tmpdirs/tmp-out \
--outdir out \
/path/to/GGR-cwl/ChIP-seq_pipeline/pipeline-se-with-control.cwl \
out/chip_seq_Lerchter_3153_160217A3-se-with-control.json
In the example above, the DISPLAY variable has to be set so that java -jar applications that try to use graphical displays like X11 do not crash (e.g. FastQC).
If the input JSON files were created with the --separate-jsons option, it is easy to create a SLURM job array to process them. Here is an example SLURM script for ChIP-seq processing on HARDAC@Duke-GCB:
#!/bin/bash
#SBATCH --job-name=cwl_chipseq
#SBATCH --output=logs/PROJECT123-se-with-control-%a.out
#SBATCH --mem=16000
#SBATCH --cpus-per-task=16
#SBATCH --array=0-100%20
# For Python based tools, we use anaconda
source activate cwltool
# Load tools in the path
export PATH="/path/to/bin:$PATH"
export PATH="/path/to/cwl/bin:$PATH"
export PATH="/path/to/rsem-1.2.21/:$PATH"
export PATH="/path/to/FastQC:$PATH"
export PATH="/path/to/preseq_v2.0:$PATH"
export PATH="/path/to/samtools-1.3/bin/:$PATH"
export PATH="/path/to/phantompeakqualtools/:$PATH"
# If you have lmod, you can use module load directives
module load bedtools2
module load R/3.2.0-gcb01
# For SPP, some R libraries are needed
export R_LIBS="/data/reddylab/software/R_libs"
# For Fastqc, avoid graphical displays
export DISPLAY=:0.0
cwltool \
--no-container \
--debug \
--non-strict \
--preserve-environment PATH R_LIBS DISPLAY \
--tmpdir-prefix tmpdirs/tmp-PROJECT123-${SLURM_ARRAY_TASK_ID}- \
--tmp-outdir-prefix tmpdirs/tmp-PROJECT123-${SLURM_ARRAY_TASK_ID}- \
--outdir out/${SLURM_ARRAY_TASK_ID} \
/path/to/GGR-cwl/ChIP-seq_pipeline/pipeline-se-narrow-with-control.cwl \
jsons/chip_seq_download_metadata.se-narrow-with-control-${SLURM_ARRAY_TASK_ID}.json
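The log, temporary, and output directories referenced by the script must exist before submission. Assuming the script above is saved as cwl_chipseq.sh (the file name is illustrative):
mkdir -p logs tmpdirs out
sbatch cwl_chipseq.sh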
GGR-cwl pipelines developed by the Reddy Lab (www.reddylab.org) @ Duke GCB