Transition from SGE to SLURM
When using a shared cluster, access to the compute nodes is guarded by a resource scheduler such as SLURM, SGE, or PBS Pro. Schedulers queue jobs for execution, allow for prioritization, and handle job placement within the cluster.

SLURM is one of the most popular open-source job schedulers, and this document is intended to help end users transition from their current scheduler to SLURM. Transitioning to a new scheduler comes down to learning and practicing new commands to handle the lifecycle of a job (submit/view/cancel/...), to inspect resources like queues and compute nodes, and to adjusting the way submit scripts are written. In some cases the last step can be made easier by a compatibility mode that accepts existing submit scripts.

Son of Grid Engine (SGE) is the open-source community version that continued development of Sun Grid Engine (also referred to as SGE). It is this version (Son of Grid Engine) that has been used inside ParallelCluster. Moving from SGE to SLURM requires some adjustments: interacting with the scheduler through SLURM commands and adapting the submission scripts. Two conceptual differences stand out, illustrated in the sketch after this list:
- queue/partition: SGE uses the term queues, while SLURM calls them partitions
- node count: SGE has no concept of node counts, while SLURM does
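As a minimal illustration of both points (the queue name `all.q`, the partition name `compute`, and the resource counts are made up for this example), requesting eight cores looks like this in each system:

```bash
# SGE: request 8 slots in queue "all.q" via a parallel environment;
# SGE schedules slots, not nodes.
qsub -q all.q -pe smp 8 myjob.sh

# SLURM: request 1 node with 8 tasks in partition "compute";
# SLURM can express node counts directly.
sbatch -p compute -N 1 -n 8 myjob.sh
```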
Firstly, common commands used in SGE have an equivalent in the SLURM environment. The following table reviews the most common ones.
User Command | SGE | SLURM |
---|---|---|
Interactive login | `qlogin` | `srun --pty [-p $partition_name] [--time=hh:mm:ss] bash` |
Job submission | `qsub $script_file` | `sbatch $script_file` |
Job deletion | `qdel $job_id` | `scancel $job_id` |
Job status by job | `qstat -u \* -j $job_id` | `squeue -j $job_id` |
Job status by user | `qstat -u $user_name` | `squeue -u $user_name` |
Job hold | `qhold $job_id` | `scontrol hold $job_id` |
Job release | `qrls $job_id` | `scontrol release $job_id` |
Queue list | `qconf -sql` | `squeue` |
List nodes | `qhost` | `sinfo -N` or `scontrol show nodes` |
Cluster status | `qhost -q` | `sinfo` |
GUI | `qmon` | `sview` |
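For example, a typical job lifecycle on the SLURM side might look like the following session (the job ID and script name here are illustrative):

```bash
$ sbatch myjob.sh            # submit the job script
Submitted batch job 12345
$ squeue -j 12345            # check its status
$ scontrol hold 12345        # hold it
$ scontrol release 12345     # release it again
$ scancel 12345              # cancel it
```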
Within a job script, environment variables give access to information about the running job.
Information | SGE | SLURM |
---|---|---|
Job ID | `$JOB_ID` | `$SLURM_JOB_ID` |
Submit directory | `$SGE_O_WORKDIR` | `$SLURM_SUBMIT_DIR` |
Submit host | `$SGE_O_HOST` | `$SLURM_SUBMIT_HOST` |
Node list | `$PE_HOSTFILE` | `$SLURM_JOB_NODELIST` |
Job array index | `$SGE_TASK_ID` | `$SLURM_ARRAY_TASK_ID` |
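A short sketch of how these variables might be used inside a SLURM job script (the echoed labels are arbitrary):

```bash
#!/bin/bash
# Log where the job was submitted from and where it runs.
echo "Job $SLURM_JOB_ID submitted from $SLURM_SUBMIT_HOST"
echo "Submit directory: $SLURM_SUBMIT_DIR"
echo "Allocated nodes:  $SLURM_JOB_NODELIST"

# In an array job, each task gets its own index.
if [ -n "$SLURM_ARRAY_TASK_ID" ]; then
  echo "Array task index: $SLURM_ARRAY_TASK_ID"
fi
```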
Submission scripts can inform the scheduler about certain parameters of the submission using a script directive followed by options. The following table collects the most common specifications.
Information | SGE | SLURM |
---|---|---|
Script directive | `#$` | `#SBATCH` |
Queue/partition | `-q $queue` | `-p $partition` |
Count of nodes | N/A | `-N [min[-max]]` |
CPU count | `-pe smp <count>` or `-pe orte <count>` | `-n <count>` |
Wall clock limit | `-l h_rt=<seconds>` | `-t <min>` or `-t <days-hh:mm:ss>` |
stdout file | `-o <file_name>` | `-o <file_name>` |
stderr file | `-e <file_name>` | `-e <file_name>` |
Combine stdout/stderr | `-j yes` | use `-o` without `-e` |
Copy environment | `-V` | `--export=ALL`, `--export=NONE`, or `--export=<variables>` |
Event notification | `-m abe` | `--mail-type=<events>` |
Send notification email | `-M <address>` | `--mail-user=<address>` |
Job name | `-N <name>` | `--job-name=<name>` |
Restart job | `-r yes` or `-r no` | `--requeue` or `--no-requeue` |
Set working directory | `-wd <dir_name>` | `--chdir=<dir_name>` (formerly `--workdir`) |
Resource sharing | `-l exclusive` | `--exclusive` or `--oversubscribe` |
Memory size | `-l mem_free=<memory>` (with K, M, or G suffix) | `--mem=<memory>` or `--mem-per-cpu=<memory>` |
Charge to an account | `-A <account>` | `--account=<account>` |
Tasks per node | fixed rule in parallel environment (PE) | `--ntasks-per-node=<count>` |
Job dependency | `-hold_jid <job_id or job_name>` | `--dependency=[<state>:]<job_id>` |
Job project | `-P <name>` | `--wckey=<name>` |
Job host preferences | `-q <queue>@<node>` or `-q <queue>@@<hostgroup>` | `--nodelist=<nodes>` and/or `--exclude=<nodes>` |
Quality of service | N/A | `--qos=<name>` |
Job arrays | `-t $array_spec` | `--array=$array_spec` |
Generic resources | `-l $resource=$value` | `--gres=<resource_spec>` |
Licenses | `-l <license>=<count>` | `--licenses=<license_spec>` |
Begin time | `-a <YYMMDDhhmm>` | `--begin=<YYYY-MM-DD[THH:MM[:SS]]>` |
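Putting several of these directives together, an SGE script header and a possible SLURM translation might look like the following sketch (queue/partition names, counts, times, and the email address are made up for illustration):

```bash
# SGE header:
#$ -N myjob
#$ -q all.q
#$ -pe smp 16
#$ -l h_rt=86400
#$ -o myjob.out
#$ -j yes
#$ -V
#$ -M user@example.com
#$ -m abe

# Equivalent SLURM header:
#SBATCH --job-name=myjob
#SBATCH -p compute
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 24:00:00
#SBATCH -o myjob.out          # no -e given, so stderr is combined here
#SBATCH --export=ALL
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=BEGIN,END,FAIL
```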
Finally, some complete example scripts for common job types.

**Single-core application**

For a simple single-core job the scripts look almost identical; the main difference is the shebang line. A plain `#!/bin/bash` is typical for SGE, while SLURM scripts are usually started with `#!/bin/bash -l` (note the `-l` flag, which starts a login shell so that the usual user environment is initialized).

**MPI application without hyperthreading**
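A minimal SLURM script for such an MPI job might look like the following sketch (the program name `./myprog`, the partition, and the node/task counts are assumptions for the example):

```bash
#!/bin/bash -l
#SBATCH -o ./tjob_mpi.out.%j
#SBATCH -e ./tjob_mpi.err.%j
#SBATCH -J test_mpi
#SBATCH --partition=general
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32   # one MPI rank per physical core
#SBATCH --time=24:00:00

# Launch the MPI program; srun starts one rank per task.
srun ./myprog > prog.out
```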
**Hybrid MPI/OpenMP application without hyperthreading**

```bash
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tjob_hybrid.out.%j
#SBATCH -e ./tjob_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name:
#SBATCH -J test_slurm
# Queue (partition):
#SBATCH --partition=general
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
# For OpenMP:
#SBATCH --cpus-per-task=8
# Wall clock limit:
#SBATCH --time=24:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# For pinning threads correctly:
export OMP_PLACES=cores

# Run the program:
srun ./myprog > prog.out
```
**Chain of batch jobs with dependencies**

```bash
#!/bin/bash
# Submit a chain of batch jobs with dependencies
#
# Number of jobs to submit:
NR_OF_JOBS=6
# Batch job script:
JOB_SCRIPT=./my_batch_script

echo "Submitting job chain of ${NR_OF_JOBS}"
echo "jobs for batch script ${JOB_SCRIPT}:"
JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
echo " " ${JOBID}
I=1
while [ ${I} -lt ${NR_OF_JOBS} ]; do
    JOBID=$(sbatch --dependency=afterany:${JOBID} \
            ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
    echo " " ${JOBID}
    let I=${I}+1
done
```
**Batch job using a job array**

```bash
#!/bin/bash -l
# Specify the indexes of the job array elements:
#SBATCH --array=1-20
# Standard output, %A = job ID, %a = job array index:
#SBATCH -o job_%A_%a.out
# Standard error, %A = job ID, %a = job array index:
#SBATCH -e job_%A_%a.err
# Initial working directory:
#SBATCH -D ./
# Job name:
#SBATCH -J test_array
# Queue (partition):
#SBATCH --partition=general
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
# Wall clock limit:
#SBATCH --time=24:00:00
```
**MPI batch job using GPUs**

```bash
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tjob.out.%j
#SBATCH -e ./tjob.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name:
#SBATCH -J test_slurm
# Queue:
# If using both GPUs of a node:
#SBATCH --partition=gpu
# If using only 1 GPU of a shared node:
###SBATCH --partition=gpu:1
# Node feature:
#SBATCH --constraint="gpu"
# Specify number of GPUs to use.
# If using both GPUs of a node:
#SBATCH --gres=gpu:2
# If using only 1 GPU of a shared node:
###SBATCH --gres=gpu:1
# Memory is necessary if using only 1 GPU:
###SBATCH --mem=61000
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=1
# If using both GPUs of a node:
#SBATCH --ntasks-per-node=32
# If using only 1 GPU of a shared node:
###SBATCH --ntasks-per-node=16
# Wall clock limit:
#SBATCH --time=24:00:00
```
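Inside the job it can be useful to verify which GPUs were actually allocated; a small sketch, assuming NVIDIA GPUs with `nvidia-smi` installed on the compute nodes:

```bash
# Show the GPUs visible to this job (SLURM sets CUDA_VISIBLE_DEVICES
# when GPUs are allocated via --gres):
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
nvidia-smi
```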
To requeue a job from within the script and place it on hold (for example, to keep track of which jobs have failed), use:

```bash
scontrol requeuehold ${SLURM_JOB_ID}
```

And for a single task of an array job:

```bash
scontrol requeuehold ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
```
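A sketch of how this might be used inside a job script, assuming a hypothetical application `./myprog` whose non-zero exit code signals a retryable failure:

```bash
#!/bin/bash -l
#SBATCH -J requeue_demo
#SBATCH -o job_%j.out
#SBATCH --time=01:00:00
#SBATCH --requeue            # allow this job to be requeued

./myprog
STATUS=$?

# On failure, put the job back in the queue in a held state so it
# can be inspected and released manually with 'scontrol release'.
if [ ${STATUS} -ne 0 ]; then
    scontrol requeuehold ${SLURM_JOB_ID}
fi
```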