📢 NOTICE: We are seeking volunteers to maintain this repository as the current maintainers no longer use LSF. See this issue. 📢
Snakemake profile for running jobs on an LSF cluster.
This profile is deployed using Cookiecutter. If you do not have `cookiecutter`
installed, it can easily be installed using `mamba` or `pip` by running:

```bash
pip install --user cookiecutter
# or
mamba create -n cookiecutter -c conda-forge cookiecutter
mamba activate cookiecutter
```
If neither of these methods suits you, then visit the installation documentation for other options.
Download and set up the profile on your cluster
```bash
# create configuration directory that snakemake searches for profiles
profile_dir="${HOME}/.config/snakemake"
mkdir -p "$profile_dir"
# use cookiecutter to create the profile in the config directory
template="gh:Snakemake-Profiles/lsf"
cookiecutter --output-dir "$profile_dir" "$template"
```
You will then be prompted to set some default parameters.
Default: `KB`
Valid options: `KB`, `MB`, `GB`, `TB`, `PB`, `EB`, `ZB`

This must be set to the same value as `LSF_UNIT_FOR_LIMITS` on your cluster. This value
is stored in your cluster's `lsf.conf` file. In general, this file is located at
`${LSF_ENVDIR}/lsf.conf`, so the easiest way to get this value is to run the following:

```bash
grep '^LSF_UNIT_FOR_LIMITS' ${LSF_ENVDIR}/lsf.conf
```

You should get something along the lines of `LSF_UNIT_FOR_LIMITS=MB`. If this command
doesn't work, get in touch with your cluster administrator to find out the value.
As mentioned above, this is a very important parameter. It sets the scaling units to use
for resource limits. So, if this value is `MB` on your cluster, then when setting the
memory limit with `-M 1000` the value is taken as megabytes. As snakemake allows you to
set the memory for a rule with the `resources: mem_mb` parameter, it is important for
this profile to know whether that value needs to be converted into other units when
submitting jobs. See here for further information.
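As a rough illustration (the rule, file names, and values below are made up, not taken
from the profile itself), a rule might request memory like this:

```python
# illustrative Snakefile fragment: snakemake always expresses mem_mb in megabytes
rule align:
    input: "reads.fq"
    output: "aln.bam"
    resources:
        mem_mb=4000  # i.e. 4000 megabytes, regardless of the cluster's LSF units
    shell:
        "some_aligner {input} > {output}"
```

If your cluster's `LSF_UNIT_FOR_LIMITS` is `MB`, a request like this can be passed to the
memory limit more or less as-is; if it is `KB`, the profile has to scale the value up
into kilobytes first, which is why this setting must match your cluster.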
Default: `wait`
Valid options: `wait`, `kill`

When LSF returns a job status of `UNKWN`, do you want to wait for the host the job is
running on to be contactable again - i.e. consider the job still running - or kill it as
outlined here?
Default: `ignore`
Valid options: `ignore`, `kill`

When LSF returns a job status of `ZOMBI`, do you want to ignore this (not clean it up) or
kill it as outlined here? Regardless of the option chosen, the job is considered failed.
Default: 5
This sets the default `--latency-wait`/`--output-wait`/`-w` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--latency-wait SECONDS, --output-wait SECONDS, -w SECONDS
                      Wait given seconds if an output file of a job is not
                      present after the job finished. This helps if your
                      filesystem suffers from latency (default 5).
```
Default: False
Valid options: `False`, `True`

This sets the default `--use-conda` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--use-conda           If defined in the rule, run job in a conda
                      environment. If this flag is not set, the conda
                      directive is ignored.
```
Default: False
Valid options: `False`, `True`

This sets the default `--use-singularity` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--use-singularity     If defined in the rule, run job within a singularity
                      container. If this flag is not set, the singularity
                      directive is ignored.
```
Default: 0
This sets the default `--restart-times` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--restart-times RESTART_TIMES
                      Number of times to restart failing jobs (defaults to
                      0).
```
Default: False
Valid options: `False`, `True`

This sets the default `--printshellcmds`/`-p` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--printshellcmds, -p  Print out the shell commands that will be executed.
```
Default: 500
This sets the default `--cores`/`--jobs`/`-j` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--cores [N], --jobs [N], -j [N]
                      Use at most N cores in parallel. If N is omitted or
                      'all', the limit is set to the number of available
                      cores.
```

In the context of a cluster, `-j` denotes the number of jobs submitted to the cluster at
the same time.
Default: 1024
This sets the default memory, in megabytes, for a rule being submitted to the cluster
without `mem_mb` set under `resources`. See below for how to overwrite this in a rule.
Default: "logs/cluster"
This sets the directory under which cluster log files are written. The path is relative to the working directory of the pipeline. If it does not exist, it will be created.
The log files for a given rule are organised into sub-directories. This is to avoid
having potentially thousands of files in one directory, as this can cause file system
issues.
If you want to find the log files for a rule called `foo`, with wildcards
`sample=a,ext=fq`, then the standard output would be located at
`logs/cluster/foo/sample=a,ext=fq/jobid<jobid>-<uuid>.out`, with the extension `.err`
for the standard error.
`<jobid>` is the internal jobid used by `snakemake` and is the same across multiple
attempts at running the same rule.
`<uuid>` is a random 28-digit, `-`-separated string, and is specific to each attempt at
running a rule. So if a rule fails and is restarted, the uuid will be different.
The reason for such a seemingly complex log-naming scheme is explained in Known Issues. However, you can override the name of the log files for a specific rule by following the instructions below.
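For example, to inspect the logs for the `foo` job above without knowing the jobid or
uuid in advance, a glob works (the paths shown assume the default `logs/cluster`
directory):

```bash
# list every attempt for rule foo with wildcards sample=a,ext=fq
ls logs/cluster/foo/sample=a,ext=fq/
# peek at the end of the standard output of all attempts
tail logs/cluster/foo/sample=a,ext=fq/jobid*.out
```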
Default: None
The default queue on the cluster to submit jobs to. If left unset, then the default on
your cluster will be used.
The `bsub` parameter that this controls is `-q`.
Default: None
The default project on the cluster to submit jobs with. If left unset, then the default on your cluster will be used.
The `bsub` parameter that this controls is `-P`.
Default: None
The default group on the cluster to submit jobs with. If left unset, then the default on your cluster will be used.
The `bsub` parameter that this controls is `-G`.
Default: 10
This sets the default `--max-status-checks-per-second` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND
                      Maximal number of job status checks per second,
                      default is 10, fractions allowed.
```
Default: 10
This sets the default `--max-jobs-per-second` parameter in `snakemake`.
From the `snakemake --help` menu:

```
--max-jobs-per-second MAX_JOBS_PER_SECOND
                      Maximal number of cluster/drmaa jobs per second,
                      default is 10, fractions allowed.
```
Default: 1
How many times to check the status of a job.
Default: 0.001
How many seconds to wait until checking the status of a job again (if
`max_status_checks` is greater than 1).
Default: lsf
The name to use for this profile. The directory for the profile is created with this
name, i.e. `$HOME/.config/snakemake/<profile_name>`.
This is also the value you pass to `snakemake --profile <profile_name>`.
Once setup is complete, this will allow you to run snakemake with the cluster profile
using the `--profile` flag. For example, if the profile name was `lsf`, then you can run:

```bash
snakemake --profile lsf [snakemake options]
```

and pass any other valid snakemake options.
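The profile's values act as defaults, so you would typically expect options given on the
command line to take precedence over them. For example (the options chosen here are
purely illustrative):

```bash
# run with the lsf profile, but ask for more retries and fewer concurrent jobs
snakemake --profile lsf --restart-times 2 --jobs 100
```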
The following resources can be specified within a `rule`:

- `threads: <INT>`: the number of threads needed for the job. If not specified, this
  will default to the amount you set when initialising the profile.
- `resources:`
  - `mem_mb = <INT>`: the memory required for the rule, in megabytes. If not specified,
    this will default to the amount you set when initialising the profile.
  - `time_min: <INT>`: the runtime limit required for the rule, in minutes.

NOTE: these settings will override the profile defaults.
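For instance, a rule that needs more than the profile defaults might look like the
following (the rule name, files, and numbers are illustrative only):

```python
rule heavy_job:
    input: "in.txt"
    output: "out.txt"
    threads: 8
    resources:
        mem_mb=16000,   # roughly 16 GB instead of the profile's default memory
        time_min=120    # two-hour runtime limit
    shell:
        "sort --parallel={threads} {input} > {output}"
```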
Since the deprecation of cluster configuration files, the ability to specify per-rule
cluster settings is snakemake-profile-specific.

Per-rule configuration must be placed in a file called `lsf.yaml` and must be located in
the working directory for the pipeline. If you set `workdir` manually within your
workflow, the config file has to be in there.

NOTE: these settings are only valid for this profile and are not guaranteed to be valid
on non-LSF cluster systems.

All settings are given with the `rule` name as the key, and the additional cluster
settings as a string (scalar) or list (sequence).
Snakefile:

```python
rule foo:
    input: "foo.txt"
    output: "bar.txt"
    shell:
        "grep 'bar' {input} > {output}"


rule bar:
    input: "bar.txt"
    output: "file.out"
    shell:
        "echo blah > {output}"
```
lsf.yaml:

```yaml
__default__:
  - "-P project2"
  - "-W 1:05"

foo:
  - "-P gpu"
  - "-gpu 'gpu resources'"
```
In this example, we specify a default (`__default__`) project (`-P`) and runtime limit
(`-W`) that will apply to all rules.
We then override the project and, additionally, specify GPU resources for the rule `foo`.

For those interested in the details, this will lead to a submission command for `foo`
that looks something like:

```
$ bsub [options] -P project2 -W 1:05 -P gpu -gpu 'gpu resources' ...
```
Although `-P` is provided twice, LSF uses the last instance.

```yaml
__default__: "-P project2 -W 1:05"

foo: "-P gpu -gpu 'gpu resources'"
```
The above is also a valid form of the previous example but not recommended.
Some LSF commands require multiple levels of quote-escaping.
For example, to exclude a node with non-alphabetic characters in its name from job
submission (docs): `bsub -R "select[hname!='node-name']"`.
You can specify this in `lsf.yaml` as:

```yaml
__default__:
  - "-R \"select[hname!='node-name']\""
```
When running very large `snakemake` pipelines, or when many workflow management systems
are submitting and checking jobs at the same time on the same cluster, we have seen
examples where retrieval of the job state from LSF returns an empty status. This causes
problems, as we do not know whether the job has passed or failed. In these
circumstances, the status-checker will look at the log file for the job to see if it is
complete or still running - hence the seemingly complex log file naming scheme. As the
status-checker uses `tail` to get the status, if the standard output log file of the job
is very large, then status checking will be slowed down as a result. If you run into
these problems and the `tail` solution is not feasible, the first suggestion would be to
reduce `--max-status-checks-per-second` and see if this helps.
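As a concrete, if illustrative, example of that suggestion, the rate can be lowered on
the command line without editing the profile:

```bash
# poll LSF for job status at most once every two seconds (the value is arbitrary)
snakemake --profile lsf --max-status-checks-per-second 0.5
```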
Please raise an issue if you experience this, and the log file check doesn't seem to
work.
Please refer to CONTRIBUTING.md.