At the moment, this tutorial uses Horovod for a quick and easy start. For advanced configurations, please consider other methods (such as DDP for PyTorch).
The following applications and tools are involved in the process. Each is briefly explained in relation to this tutorial.
- NVIDIA NGC is a catalog of resources, including accelerated containers of popular frameworks such as PyTorch and TensorFlow.
- Note: It is recommended to use these containers as a base "engine" for your applications.
- NVIDIA Enroot is used to pull and modify containers.
- NVIDIA Pyxis is a Slurm plugin for running containers.
- Slurm is a workload manager for allocating resources (nodes, tasks, GPUs, etc.) needed for a job (a container) to run, and provides the MPI library.
- MPI is used for providing communication between resources.
- Horovod uses MPI and the allocated resources to run the relevant application.
- Note: without Slurm, Horovod should be launched with "horovodrun" for multi-node, multi-GPU applications.
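For reference, a "horovodrun" launch without Slurm typically looks like the following minimal sketch, assuming 2 GPUs on each of two hypothetical hosts named server1 and server2 and a hypothetical train.py script:

# Launch 4 processes in total: 2 on server1 and 2 on server2.
# Host names, slot counts, and the script name are placeholders.
horovodrun -np 4 -H server1:2,server2:2 python train.py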
Pulling, modifying, and preparing a container for running were previously explained in the Technion's Pre-Slurm Tutorial.
- Note: NVIDIA NGC TensorFlow containers already come equipped with Horovod. Other containers and frameworks that lack it should be modified, and Horovod should be installed prior to running the application. To install Horovod, please follow the Horovod Installation Guide. For example, to install Horovod in a PyTorch container:
- Run the container with Enroot.
- Install Horovod with:

pip install horovod[pytorch]
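As a rough sketch, these two steps might look as follows. The container tag and instance name are assumptions, and the HOROVOD_WITH_PYTORCH=1 build flag follows the Horovod Installation Guide's convention for enabling PyTorch support:

# Pull a PyTorch container from NGC (hypothetical tag) and create a writable instance.
enroot import 'docker://nvcr.io#nvidia/pytorch:22.09-py3'
enroot create --name pytorch-hvd nvidia+pytorch+22.09-py3.sqsh
enroot start --rw pytorch-hvd

# Inside the container: install Horovod with PyTorch support enabled.
HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]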
Implementing Horovod in your code is a fairly simple process and is well documented with examples. Another very basic PyTorch neural network example, with a single forward and backward pass, is presented here:
import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd


def example():
    print("start horovod")
    hvd.init()
    print("-" * 20)
    print("total number of GPUs:", hvd.size())
    print("number of GPUs on this node:", hvd.local_size())
    print("rank:", hvd.rank())
    print("local rank:", hvd.local_rank())
    print("-" * 20)
    print("create a local model")
    torch.cuda.set_device(hvd.local_rank())
    model = nn.Linear(10, 10)
    model.cuda()
    print("define a loss function")
    loss_fn = nn.MSELoss()
    print("define an optimizer")
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
    print("distribute parameters")
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    print("run a forward pass")
    outputs = model(torch.randn(20, 10).cuda())
    labels = torch.randn(20, 10).cuda()
    print("run a backward pass")
    loss_fn(outputs, labels).backward()
    print("update parameters using the optimizer")
    optimizer.step()
    print("done")


def main():
    example()


if __name__ == "__main__":
    main()
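In a real training script, each rank should also process a different shard of the data. A minimal sketch of the pattern shown in Horovod's PyTorch documentation, using a placeholder dataset and batch size:

import torch
import torch.utils.data
import torch.utils.data.distributed
import horovod.torch as hvd

hvd.init()

# A placeholder dataset; in practice, use your real training data.
train_dataset = torch.utils.data.TensorDataset(
    torch.randn(1000, 10), torch.randn(1000, 10))

# Partition the dataset across ranks so each process sees a distinct shard.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=32, sampler=train_sampler)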
Slurm jobs should be submitted via SBATCH script files. A basic example of such a file is presented here:
#!/bin/bash
#SBATCH --job-name=<job name>
#SBATCH --output=slurm-%x-%j.out
#SBATCH --error=slurm-%x-%j.err
#SBATCH --ntasks=<number of total GPUs>
#SBATCH --gpus=<number of total GPUs>
srun --container-image=<path to container image> \
--container-mounts=<path to code>:/code \
--no-container-entrypoint \
/bin/bash -c \
"python -u <path to code Python's file>"
Note:
- SBATCH lines provide the resources; srun provides the command.
- Two files will be created, named slurm-<job name>-<job id>. The .out file provides the regular output, while the .err file provides the errors.
- In case a specific type of GPU should be used, use --gpus=<GPU type>:<number of GPUs>. E.g., --gpus=a100:2 for 2 A100 GPUs.
- Horovod needs enough tasks to use all of the GPUs. Therefore, the number of tasks provided in --ntasks should be equal to the number of GPUs. A lack of tasks will result in GPUs being allocated but not used.
- To output each task to a different file, add --output=slurm-%x-%j-%t.out to the srun command. This will create a new file for each task, named slurm-<job name>-<job id>-<task id>.out. Usually, the main output will be available in the slurm-<job name>-<job id>-0.out file.
- More information is available in Slurm's srun and sbatch documentation.
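For illustration, a hypothetical script combining these notes, requesting 2 A100 GPUs with 2 matching tasks and per-task output files (the job name is a placeholder, and the bracketed paths are left for you to fill in):

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=slurm-%x-%j.out
#SBATCH --error=slurm-%x-%j.err
#SBATCH --ntasks=2
#SBATCH --gpus=a100:2
# Per-task output files will be named slurm-my_job-<job id>-<task id>.out
srun --output=slurm-%x-%j-%t.out \
--container-image=<path to container image> \
--container-mounts=<path to code>:/code \
--no-container-entrypoint \
/bin/bash -c \
"python -u <path to code Python's file>"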
Horovod provides examples for running applications. To see a quick multi-node example in action, use the following:
- Pull an NVIDIA NGC TensorFlow container:

enroot import 'docker://nvcr.io#nvidia/tensorflow:22.09-tf1-py3'

- Clone Horovod's GitHub repository:

git clone https://github.com/horovod/horovod

- Create a new SBATCH script file and name it example.sub:

#!/bin/bash
#SBATCH --job-name=horovod_example
#SBATCH --output=horovod_example.out
#SBATCH --error=horovod_example.err
#SBATCH --ntasks=16
#SBATCH --gpus=16
srun --container-image=<path to TensorFlow container image> \
--container-mounts=<path to Horovod GitHub folder>:/code \
--no-container-entrypoint \
/bin/bash -c \
"python -u /code/examples/tensorflow/tensorflow_synthetic_benchmark.py"
- Note: this example uses 16 GPUs, which guarantees the use of at least two DGX A100 nodes. You can decrease the number of GPUs, but to guarantee the use of more than a single node, please add #SBATCH --nodes=<number of nodes>. Using 3 GPUs on 2 nodes is recommended to observe a minimal imbalanced-resources example of a multi-node run.
- Submit the job:

sbatch <path to example.sub>
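While the job is pending or running, its state can be checked with Slurm's standard squeue command, e.g.:

squeue -u $USER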
- Examine the output file:

vi horovod_example.out
Note: to view which resources were allocated to the job, run the following command:
scontrol show -d job <job id> | grep GRES