Merge pull request #44
* Adding documentation for slurm
fongcj authored Oct 11, 2024
1 parent 4a3a6de commit 8f4b3ff
Showing 2 changed files with 171 additions and 1 deletion.
169 changes: 169 additions & 0 deletions docs/reference/user-guide/slurm.md
@@ -0,0 +1,169 @@
# SLURM Job Scheduling: A Guide

## Introduction

SLURM (originally the Simple Linux Utility for Resource Management) is a workload manager and job scheduler for high-performance computing clusters. It allocates resources to jobs and provides commands to submit, monitor, and manage them.

This guide covers the basics of using SLURM, including submitting jobs, requesting resources, and monitoring their execution.

---

## Table of Contents
- [Submitting a Job with SLURM](#submitting-a-job-with-slurm)
- [Basic SLURM Directives](#basic-slurm-directives)
- [Example SLURM Job Script](#example-slurm-job-script)
- [Running Array Jobs](#running-array-jobs)
- [Monitoring Jobs](#monitoring-jobs)
- [Canceling Jobs](#canceling-jobs)
- [Common SLURM Commands](#common-slurm-commands)

---

## Submitting a Job with SLURM

To submit a job, create a job script containing directives that tell SLURM which resources the job needs, how long it may run, and where to write its output. Submit the script with the `sbatch` command:

```bash
sbatch job_script.slurm
```
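
When a later step needs the job ID (for example, to pass it to `squeue` or `scancel`), it can be pulled out of the confirmation line that `sbatch` prints. The sketch below assumes the usual `Submitted batch job <id>` message; `parse_job_id` is a hypothetical helper name, not part of SLURM.

```bash
#!/bin/bash
# parse_job_id (hypothetical helper): extract the numeric job ID from
# sbatch's confirmation line, assumed to look like "Submitted batch job 12345".
parse_job_id() {
    local line="$1"
    # Keep only the last whitespace-separated field.
    echo "${line##* }"
}

# Usage on a cluster (commented out so the snippet runs without SLURM):
#   job_id=$(parse_job_id "$(sbatch job_script.slurm)")
#   echo "Submitted job ${job_id}"
```

On a single-cluster setup, `sbatch --parsable` prints only the job ID, which avoids the parsing step.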

---

## Basic SLURM Directives

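In the job script, directives are lines that begin with the `#SBATCH` prefix, followed by a resource request or configuration option. They must appear before the first executable command in the script; SLURM ignores any `#SBATCH` line that comes later.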

Here are some common SLURM directives:

| Directive | Description |
|----------------------|-------------------------------------------------------|
| `--job-name=<name>` | Sets the job name for easier identification |
| `--output=<file>` | File to store standard output (use `%j` for job ID) |
| `--error=<file>` | File to store standard error (use `%j` for job ID) |
| `--ntasks=<num>` | Number of tasks (processes); each task gets one CPU core by default |
| `--mem=<size>` | Memory required for the job (e.g., 4G, 10G, etc.) |
| `--time=<time>` | Maximum run time (format: `days-hours:minutes:seconds`) |
| `--partition=<name>` | Specify the partition or queue to use |
| `--gpus=<num>` | Number of GPUs required |
| `--array=<range>` | Job array (e.g., `0-10`, creates 11 tasks) |
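
As an illustration of the `days-hours:minutes:seconds` format used by `--time`, the sketch below converts a duration in seconds into that form. `slurm_time` is a hypothetical helper name introduced here for illustration only.

```bash
#!/bin/bash
# slurm_time (hypothetical helper): convert a duration in seconds into the
# days-hours:minutes:seconds form accepted by --time.
slurm_time() {
    local total="$1"
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

slurm_time 7200     # prints 0-02:00:00 (two hours)
slurm_time 93784    # prints 1-02:03:04
```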

---

## Example SLURM Job Script

Below is a simple example of a SLURM job script.

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis # Job name
#SBATCH --output=my_analysis_%j.out # Output file (%j is replaced by job ID)
#SBATCH --error=my_analysis_%j.err # Error file
#SBATCH --ntasks=1 # Run a single task (1 CPU core)
#SBATCH --mem=8G # Memory request
#SBATCH --time=02:00:00 # Time limit (2 hours)
#SBATCH --partition=short # Partition to submit the job to

# Your executable or command goes here
srun python my_script.py --input data/input_file.csv --output results/output_file.csv
```

In this script:
- The `#SBATCH` directives configure the job's resources.
- The `srun` command launches the program, which in this case runs a Python script.
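
The example writes to `results/output_file.csv`. Whether a missing `results/` directory causes a failure depends on how `my_script.py` opens its output, so one defensive pattern is to create the directory in the job script before the `srun` line. A minimal sketch; `ensure_parent_dir` is a hypothetical helper name:

```bash
#!/bin/bash
# ensure_parent_dir (hypothetical helper): create the directory that will
# hold the given output file, if it is not already there.
ensure_parent_dir() {
    local output_path="$1"
    mkdir -p "$(dirname "$output_path")"
}

# In the job script this would sit just before the srun line:
#   ensure_parent_dir results/output_file.csv
#   srun python my_script.py --input data/input_file.csv --output results/output_file.csv
```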

---

## Running Array Jobs

Array jobs allow you to submit multiple similar jobs with one submission. You can specify an array with the `--array` directive.

```bash
#SBATCH --array=0-10 # Submits 11 tasks, with IDs ranging from 0 to 10
```

In your script, you can use the environment variable `$SLURM_ARRAY_TASK_ID` to differentiate tasks in the array.

Example:

```bash
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=0-10
#SBATCH --output=logs/job_%A_%a.out # %A is the array's master job ID, %a is the task index
# Note: the logs/ directory must exist before submission; SLURM does not create it

# Command that varies based on the array task ID
srun ./process_data.sh "input_file_${SLURM_ARRAY_TASK_ID}.txt"
```
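
When input files do not follow a numbered naming scheme, a common pattern is to list them in a text file and let each array task pick one line. The sketch below assumes a hypothetical manifest file `inputs.txt` with one path per line and task IDs starting at 0; `input_for_task` is an illustrative helper, not a SLURM command.

```bash
#!/bin/bash
# input_for_task (hypothetical helper): print the manifest line that belongs
# to a given array task. Task 0 maps to line 1, task 1 to line 2, and so on.
input_for_task() {
    local manifest="$1"
    local task_id="$2"
    sed -n "$(( task_id + 1 ))p" "$manifest"
}

# Inside an array job (commented out so the snippet runs without SLURM):
#   input_file=$(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")
#   srun ./process_data.sh "$input_file"
```

With this approach the `--array` range should match the number of lines in the manifest, for example `--array=0-2` for a three-line file.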

---

## Monitoring Jobs

To monitor your submitted jobs, you can use the following commands:

- **`squeue`**: Shows the status of jobs in the queue; the `-u` option restricts the listing to one user's jobs.
```bash
squeue -u <username>
```

- **`scontrol show job <job_id>`**: Shows detailed information about a specific job.

- **`sacct`**: Displays accounting information for your jobs, including those that have already completed.
```bash
sacct -j <job_id>
```
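
Scripts that wait for a job to finish need to tell finished states from in-progress ones. The sketch below classifies a state string as reported in the `State` column of `sacct`. `is_finished_state` is a hypothetical helper, and the list of states is a common subset rather than an exhaustive one.

```bash
#!/bin/bash
# is_finished_state (hypothetical helper): succeed if the given SLURM job
# state means the job will not run any further.
is_finished_state() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED*|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY)
            return 0 ;;
        *)
            return 1 ;;   # e.g. PENDING, RUNNING, COMPLETING
    esac
}

# Usage on a cluster (commented out so the snippet runs without SLURM):
#   state=$(sacct -j <job_id> -X -n -P --format=State)
#   if is_finished_state "$state"; then echo "job finished: $state"; fi
```

The `CANCELLED*` pattern is there because `sacct` can report the state as `CANCELLED by <uid>`.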

---

## Canceling Jobs

You can cancel a running or pending job using the `scancel` command:

```bash
scancel <job_id>
```

For a job array, passing only the job ID cancels every task in the array; append `_<task_id>` to cancel a single task:

```bash
scancel <job_id> # Cancels the entire array
scancel <job_id>_<task_id> # Cancels a specific task in the array
```

---

## Common SLURM Commands

- **`sbatch`**: Submits a job script.
```bash
sbatch my_job_script.slurm
```

- **`squeue`**: Displays information about jobs in the queue.
```bash
squeue -u <username>
```

- **`scancel`**: Cancels a job or set of jobs.
```bash
scancel <job_id>
```

- **`sinfo`**: Shows the status of partitions and nodes.
```bash
sinfo
```

- **`scontrol`**: Allows you to manage jobs and resources (e.g., show job details).
```bash
scontrol show job <job_id>
```

- **`srun`**: Launches tasks (job steps) within a SLURM job. It is optional for a single-task job, where the command can also be run directly in the script.

---

## Additional Guides

For further details and advanced usage, consult the official [SLURM documentation](https://slurm.schedmd.com/documentation.html).

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -69,7 +69,8 @@ nav:
- 'R Users': 'reference/user-guide/r-guide.md'
- 'JupyterHub Setup': 'reference/user-guide/jupyterhub.md'
- 'Minio': 'reference/user-guide/minio.md'
- 'Running GPU Jobs via HTCondor': 'reference/user-guide/htcondor.md'
- 'Slurm GPU Job Scheduler': 'reference/user-guide/slurm.md'
- 'HTCondor (Deprecated)': 'reference/user-guide/htcondor.md'
- 'Data Querying Through Dremio': 'reference/user-guide/data-query-quick-start.md'
- 'Airflow': 'reference/user-guide/airflow.md'
- 'Conda Cheatsheet': 'reference/user-guide/conda-cheatsheet.md'