Upgrade the NVIDIA GPU driver on a Slurm cluster managed with AWS ParallelCluster
An AWS ParallelCluster release comes with a set of AMIs for the supported operating systems and EC2 platforms. Each AMI contains a software stack, including the NVIDIA Drivers, that has been validated at ParallelCluster release time.
It is likely that other versions of the NVIDIA Drivers can work with the rest of the software stack, but technical support for them will be limited.
If you wish to upgrade the NVIDIA GPU Driver on your cluster, you can follow this guide.
To upgrade the NVIDIA GPU Driver on the compute nodes, it is advised to create a new custom AMI with the new version of the driver via the `pcluster build-image` command.
In particular, the image configuration file must include the following configuration snippet:

```yaml
Build:
  InstanceType: g4dn.xlarge          # instance type with NVIDIA GPUs
  ParentImage: ami-04823729c75214919 # base AMI of your desired OS, e.g. alinux2
DevSettings:
  Cookbook:
    ExtraChefAttributes: |
      {"cluster": {"nvidia": {"enabled": true, "driver_version": "470.199.02"}}}
```
You are free to choose the version of the NVIDIA driver you wish to install (please consider the requirements for the versions of CUDA installed on the AMI — see Installing Alternate CUDA Versions on AWS ParallelCluster).
Please note that the `true` value of the `"enabled"` key must not be quoted, otherwise the build will proceed without installing the driver.
Please use an instance type with NVIDIA GPUs for the build, and start from a base AMI of an OS supported by ParallelCluster (see the ParallelCluster documentation for the `Os` setting). You can also start from a custom AMI, but please make sure the NVIDIA GPU driver is not pre-installed on that image, otherwise the build will fail.
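As a reference, assuming the snippet above is saved in a hypothetical file named `nvidia-driver-image.yaml`, the image build could be started and monitored along these lines:

```bash
# Start the AMI build from the image configuration file shown above
pcluster build-image -c nvidia-driver-image.yaml -i custom-nvidia-470-199-02

# Monitor the build; the resulting AMI id is reported once the status is BUILD_COMPLETE
pcluster describe-image -i custom-nvidia-470-199-02
```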
After having successfully built the custom AMI, you can update your compute nodes by setting the `Scheduling/SlurmQueues/Queue/Image/CustomAmi` cluster configuration parameter and launching a `pcluster update-cluster` command. Depending on the version of ParallelCluster you are using, this may require stopping the whole compute fleet or applying a queue parameter update strategy (`Scheduling/SlurmSettings/QueueUpdateStrategy`) to the cluster.
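For example, a minimal sketch of the relevant part of the cluster configuration could look as follows; the queue name, AMI id and subnet id are placeholders, and `DRAIN` is just one of the possible values of `QueueUpdateStrategy`:

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN             # alternatives: COMPUTE_FLEET_STOP (default), TERMINATE
  SlurmQueues:
    - Name: gpu-queue                      # placeholder queue name
      Image:
        CustomAmi: ami-0123456789abcdef0   # AMI produced by pcluster build-image (placeholder id)
      ComputeResources:
        - Name: g4dn
          InstanceType: g4dn.xlarge
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0       # placeholder subnet id
```

The change can then be applied with `pcluster update-cluster -n <cluster_name> -c <cluster_config_file>`.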
Once the update is applied and the compute nodes have been started with the new custom AMI, please verify that the new version of the driver is installed by launching the `nvidia-smi` command.
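For instance, assuming a queue named `gpu-queue` (a placeholder), you can let Slurm start a compute node and report the driver version with something like:

```bash
# Run nvidia-smi on one compute node of the GPU queue; the reported driver
# version should match the one baked into the custom AMI (e.g. 470.199.02)
srun -p gpu-queue -N 1 nvidia-smi
```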
If you wish to upgrade the NVIDIA GPU driver on the head node of your cluster, you cannot rely on upgrading your AMI, since as of ParallelCluster 3.6.1 the AMI of the head node cannot be changed with a `pcluster update-cluster` operation.
Please notice that upgrading the NVIDIA driver on the head node is only necessary if you use a GPU instance as the head node (for example, if you use the DCV functionality of ParallelCluster).
In this case, please follow these steps (here we are installing version 470.199.02; a consolidated sketch of the commands is shown after the list):
- Stop the compute fleet on the cluster via a `pcluster update-compute-fleet -n <cluster_name> --status STOP_REQUESTED` operation, and wait for the compute fleet to be stopped.
- Connect to the head node and download the installer file of the new version of the NVIDIA GPU Driver you wish to install via `wget https://us.download.nvidia.com/tesla/470.199.02/NVIDIA-Linux-x86_64-470.199.02.run`; then make the installer executable.
- As root, uninstall the previous version of the driver via `./NVIDIA-Linux-x86_64-470.199.02.run --uninstall`.
- After the uninstall operation has completed, reboot the head node.
- Reconnect to the head node after the reboot, and then install the new version of the driver via `./NVIDIA-Linux-x86_64-470.199.02.run --silent --dkms --disable-nouveau`.
- Once the installation has completed, reboot the head node once again.
- After the reboot, please verify that the node is using the new version of the NVIDIA driver with the `nvidia-smi` command.
- After the verification, please restart the compute fleet via a `pcluster update-compute-fleet -n <cluster_name> --status START_REQUESTED` operation.
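As a reference, the commands from the steps above can be condensed into the following sketch, to be run as root on the head node; the two reboots still have to be performed in between as described above, and the version number 470.199.02 is just the example used in this guide:

```bash
# Download the installer for the desired driver version and make it executable
wget https://us.download.nvidia.com/tesla/470.199.02/NVIDIA-Linux-x86_64-470.199.02.run
chmod +x NVIDIA-Linux-x86_64-470.199.02.run

# Uninstall the previous driver, then reboot the head node
./NVIDIA-Linux-x86_64-470.199.02.run --uninstall

# After the first reboot, install the new driver, then reboot once again
./NVIDIA-Linux-x86_64-470.199.02.run --silent --dkms --disable-nouveau

# After the second reboot, verify the driver version in use
nvidia-smi
```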