Cromwell

Introduction

Cromwell is a workflow engine which can be used to execute containers on a variety of platforms.

https://cromwell.readthedocs.io/

In particular, we are interested in using Cromwell to run ChRIS plugins using Singularity via a SLURM scheduler.

Thoughts

In terms of functionality, Cromwell is most similar to the pfcon + pman duo.

For our convenience, we have implemented a shim for pman to dispatch requests to Cromwell. In theory, this means all platforms supported by Cromwell are now also supported by ChRIS.
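
Under the hood, the shim talks to Cromwell over its REST API. For illustration only, a workflow submission to Cromwell looks roughly like the following (the hostname, port, and file names are placeholders; the pman shim performs the equivalent request programmatically):

curl -X POST "http://cromwell.example.com:8000/api/workflows/v1" \
    -F workflowSource=@chris_plugin_job.wdl \
    -F workflowInputs=@inputs.json \
    -F labels=@labels.json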

Architecture

pfcon, pman, and cromwell run on a container host, where the cromwell container has access to a SLURM cluster via the sbatch, squeue, and scancel commands. pfcon is responsible for localization and delocalization of data (i.e. moving data to and from a filesystem mounted by the SLURM cluster).

System Account

The container user (or mapped UID) of pfcon and Cromwell must be the same, and it must be the UID of an authorized SLURM user.
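
For example, assuming the authorized SLURM account on the host is named slurmuser and the services are started with Docker, both containers could run under that account's UID and GID (the account name and image tags below are examples only, and other options such as ports, volumes, and configuration are omitted):

# run both services as the same authorized SLURM user
uid=$(id -u slurmuser)
gid=$(id -g slurmuser)
docker run --user "$uid:$gid" fnndsc/pfcon
docker run --user "$uid:$gid" broadinstitute/cromwell:85 server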

Cromwell + SLURM

Cromwell must be configured to support WDLs generated by the Jinja2 template defined in pman/cromwell/slurm/wdl.py.

include required(classpath("application"))

backend {
  default = SLURM
  providers {
    SLURM {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int timelimit = 30
        Int cpu = 1
        Int memory_mb = 4000
        Int gpu_limit = 0
        Int number_of_workers = 1
        String slurm_partition = "short"
        String slurm_account = "mylab"
        String docker
        String sharedir
        """

        submit-docker = """
        # https://cromwell.readthedocs.io/en/stable/tutorials/Containers/#job-schedulers
        # https://github.com/broadinstitute/cromwell/blob/develop/cromwell.example.backends/singularity.slurm.conf
        sbatch -J ${job_name} \
            -D ${cwd} -o ${out} -e ${err} -t ${timelimit} \
            -p ${slurm_partition} -A ${slurm_account} \
            --cpus-per-task ${cpu} \
            --mem ${memory_mb} \
            --gpus-per-task ${gpu_limit} \
            --nodes ${number_of_workers} \
            chrispl_singularity_wrapper.sh \
                ${cwd}:${docker_cwd} \
                ${docker} ${job_shell} ${docker_script} ${sharedir}
        """

        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}
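
Assuming the configuration above is saved as, say, slurm.conf, it can be passed to Cromwell at startup through the config.file Java system property (the jar file name depends on the Cromwell version you have downloaded):

java -Dconfig.file=/path/to/slurm.conf -jar cromwell-85.jar server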

Usually, the GPU nodes of a SLURM cluster are in a different partition, so it may be helpful to add something like this to the submit-docker script:

partition=${slurm_partition}
if [ "${gpu_limit}" -gt "0" ]; then
  partition=has-gpu
fi
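
Note that partition is then a shell variable rather than a Cromwell placeholder, so the sbatch call would reference it without braces, e.g.

sbatch ... -p "$partition" ...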

GPU requests can differ between SLURM clusters, so consult your cluster's documentation. For example, it is common for the --gres flag to be used instead:

sbatch ... --gres=gpu:Titan_RTX:${gpu_limit} ...

Singularity Wrapper Script

chrispl_singularity_wrapper.sh should be a wrapper script which executes ChRIS plugins using Singularity (or its successor, Apptainer). It can also have additional features, such as management of the Singularity image build cache. Here is a basic example satisfying the usage above:

#!/bin/bash -ex
# Positional arguments, as passed by the submit-docker command above:
cwd="$1"       # bind specification ${cwd}:${docker_cwd}
image="$2"     # container image of the ChRIS plugin (${docker})
shell="$3"     # ${job_shell}
script="$4"    # Cromwell-generated execution script (${docker_script})
sharedir="$5"  # data directory shared with pfcon (${sharedir})

# request GPU support from Singularity only if the job was allocated GPUs
gpu_flag=''
if [ -n "$SLURM_JOB_GPUS" ]; then
  gpu_flag='--nv'
fi

export SINGULARITY_CACHEDIR=/work/singularity-cache
module load singularity/3.8.5
exec singularity exec --containall $gpu_flag -B "$cwd" -B "$sharedir:/share" "docker://$image" "$shell" "$script"
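
For reference, the submit-docker command above invokes this wrapper with positional arguments corresponding to ${cwd}:${docker_cwd}, ${docker}, ${job_shell}, ${docker_script}, and ${sharedir}. All of the values below are illustrative placeholders:

chrispl_singularity_wrapper.sh \
    /path/to/call_dir:/path/to/call_dir \
    fnndsc/pl-simpledsapp \
    /bin/bash \
    /path/to/call_dir/execution/script \
    /some/path/on/nfs/server/key-1234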

pman, pfcon, and Data Localization

pman should be configured with the following environment variables:

SECRET_KEY=aaaaaaaa
CONTAINER_ENV=cromwell
CROMWELL_URL=http://example.com/
STORAGE_TYPE=host
STOREBASE=/some/path/on/nfs/server
TIMELIMIT_MINUTES=30

/some/path/on/nfs/server should be a filesystem which is mounted at the same path both inside the pfcon container and on the SLURM cluster.
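
As a sketch, these settings can be put in an env file and passed to the pman container; the image name, env file name, and the omission of other options (published ports, networks, etc.) are assumptions here:

docker run --rm --env-file pman.env fnndsc/pman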

Authentication with Cromwell is not currently supported.