Best practices for preparing your environment to run as a Slurm job. The procedure includes pulling, modifying, and converting a container from NGC using NVIDIA Enroot. A multi-node Horovod example is included.
We strongly recommend working with a container as your neutral environment and mounting your code from outside the container.
A flowchart of the process:
Our recommendation is to start developing based on an optimized container from NGC. Chances are the required packages are installed and no modification is needed. If this is not the case, you can modify the container to fit your requirements.
- Pull a relevant container from NGC using NVIDIA Enroot.
Optimized framework containers (PyTorch, TensorFlow, etc.) are available and updated monthly. To find which container fits your desired environment, visit our Optimized Frameworks Release Notes and search for the relevant container release.
Pull command:
enroot import 'docker://nvcr.io#nvidia/<framework>:<tag>'
E.g., to pull a 22.03 release TensorFlow container, run:
enroot import 'docker://nvcr.io#nvidia/tensorflow:22.03-tf1-py3'
A container will be pulled and converted to a local squash file.
- Create the container under Enroot's data path.
enroot create --name <environment_name> <squash_file>
E.g., to create the TensorFlow container, run:
enroot create --name nvidia_tf nvidia+tensorflow+22.03-tf1-py3.sqsh
To view all created containers, run:
enroot list
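For the container created above, the listing would simply contain the new container's name:
enroot list
# prints: nvidia_tf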
- Start and work on the container.
enroot start --root --rw --mount <local_folder>:<container_folder> <environment_name>
--root - enables root privileges.
--rw - enables read and write permissions (any changes inside the container will be saved).
--mount - enables mounting of a local folder (to mount your code and data).
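E.g., a minimal sketch that starts the TensorFlow container created above; the local and container paths here are placeholders, substitute your own code folder:
# $HOME/code and /workspace/code are example paths
enroot start --root --rw --mount $HOME/code:/workspace/code nvidia_tf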
More configurations are available in Enroot's start command documentation.
To exit the container, run exit.
Slurm uses squash files to run jobs. Therefore, your environment should be exported to a (new) squash file, containing all the changes you performed (if any).
- Export your current environment to a squash file.
enroot export --output <squash_file> <environment_name>
A new squash file will be locally created.
Note: move the squash file to a location accessible to Slurm.
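E.g., to export the modified TensorFlow container and move it to a (hypothetical) shared path visible to Slurm:
enroot export --output nvidia_tf_custom.sqsh nvidia_tf
# /shared/containers is an example path; use any location accessible to Slurm
mv nvidia_tf_custom.sqsh /shared/containers/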
- Optional: remove old squash files and clear Enroot's data path.
The original, unmodified squash file can be deleted. Additionally, to delete the created container under Enroot's data path, run:
enroot remove <environment_name>
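For the running example, the cleanup could look as follows (names taken from the steps above):
# delete the original, unmodified squash file pulled from NGC
rm nvidia+tensorflow+22.03-tf1-py3.sqsh
# delete the container from Enroot's data path
enroot remove nvidia_tf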
Slurm jobs can be submitted via either the srun or the sbatch command. To submit a job from the "login" node, use sbatch and prepare a designated script.
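E.g., a one-off interactive sanity check can be launched directly with srun (a sketch; the squash-file path is the hypothetical one from the export step, and the container flag assumes the same Pyxis-style plugin used by the script below):
srun --ntasks 1 --gpus 1 --container-image /shared/containers/nvidia_tf_custom.sqsh nvidia-smi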
Case A: relevant for executing multi-GPU / multi-node runs using MPI. We'll use Horovod's example for that.
Note: also relevant for single-GPU runs, but MPI is redundant.
- Clone Horovod's repository.
git clone https://github.com/horovod/horovod
- Create a Slurm script file.
Create a new file, paste the following code and save:
#!/bin/bash
#SBATCH --job-name horovod_tf
#SBATCH --output %x-%j.out
#SBATCH --error %x-%j.err
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 32
#SBATCH --gpus-per-task 16

srun --container-image $1 \
    --container-mounts $2:/code \
    --no-container-entrypoint \
    /bin/bash -c \
    "python /code/examples/tensorflow/tensorflow_synthetic_benchmark.py \
    --batch-size 256"
%x - Job name.
%j - Job ID.
Note: this script is intended to run on 16 GPUs (e.g., 2 nodes with 8 GPUs each); modify it if needed. Notice how only a single task (--ntasks 1) is needed for running with MPI.
- Submit a new Slurm job.
sbatch <script_file> <squash_file> <horovods_git_folder>
Two files will be locally created, one for the output and one for the errors.
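E.g., with the hypothetical file names used earlier (script saved as horovod_job.sh, squash file moved to shared storage, Horovod cloned into the home directory):
sbatch horovod_job.sh /shared/containers/nvidia_tf_custom.sqsh $HOME/horovod
# output goes to horovod_tf-<job_id>.out, errors to horovod_tf-<job_id>.err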
Case B: relevant for executing single-GPU / multi-GPU runs in a single / multi-threaded manner with the framework's native support.
Create a Slurm script identical to Case A, and change the following lines:
#SBATCH --ntasks <number of GPUs>
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-task 1
This will create a separate task per GPU.
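For concreteness, here is a sketch of the full Case B script for a single node with 4 GPUs; apart from the three directives above, it is identical to the Case A script:
#!/bin/bash
#SBATCH --job-name horovod_tf
#SBATCH --output %x-%j.out
#SBATCH --error %x-%j.err
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-task 1

srun --container-image $1 \
    --container-mounts $2:/code \
    --no-container-entrypoint \
    /bin/bash -c \
    "python /code/examples/tensorflow/tensorflow_synthetic_benchmark.py \
    --batch-size 256"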