Installing Alternate CUDA Versions on AWS ParallelCluster
Last Updated: 04/06/2023
CUDA is a parallel computing platform and application programming interface (API) developed by NVIDIA for general-purpose computing on GPUs. The default AMIs that AWS ParallelCluster 3.5.1 uses for head nodes and compute instances have CUDA 11.7.1 installed, but there are several reasons why you might want to install a different version.
- Compatibility with different software: Certain software may require a specific version of CUDA to be installed to run properly. For example, you may need to install an older version of CUDA to use an older version of a machine learning framework or a simulation application.
- Development and testing: You may need to test software with different versions of CUDA to ensure that it works properly across different hardware and software configurations.
- Performance optimization: Different versions of CUDA may offer different levels of performance optimization for different GPUs. You may want to install a specific version of CUDA that yields better performance for your deep learning or scientific computing applications.
There are two main paths to install and upgrade CUDA. You can use your system’s package manager, such as apt or yum, or you can use downloadable run files provided by NVIDIA. It is important to pick one method and stick with it, otherwise your system can become unusable due to software conflicts. In the case of the AWS ParallelCluster AMI, that choice is already determined because the AWS HPC engineering team uses NVIDIA run files to install the NVIDIA driver, NVIDIA fabric manager, and the CUDA framework on the AWS ParallelCluster AMIs. Here, we will illustrate how to install alternative versions of CUDA using NVIDIA run files.
In this document, we will create a custom AMI with CUDA 11.6 installed alongside the existing 11.7 version. This is an overview of the steps we will take:
- Launch and log into an Amazon EC2 instance
- Find out the installed driver version
- Check compatibility with the installed driver
- Download the run script from NVIDIA
- Run the installer
- [Optional] Test your CUDA environment
- Create your custom AMI
You will then be able to use the new AMI with ParallelCluster.
Launch an EC2 instance with at least one NVIDIA GPU using your choice of official ParallelCluster AMIs. The process is described in detail in the AWS ParallelCluster documentation. Briefly:
- Use the pcluster list-official-images command to find the AMI ID for your preferred operating system and architecture (see the example after this list).
- Navigate to the Amazon EC2 Console, choose Images:AMIs, and search for your AMI ID.
- Choose Launch instance from AMI.
- Select an instance type with an NVIDIA GPU (such as a g4dn.2xlarge).
- Select an SSH key for accessing the instance.
- Launch the instance, then log into it via SSH once it is ready.
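If you have the ParallelCluster command line interface installed, the first step can be done from your terminal. This is a sketch; the region, operating system, and architecture values below are examples, so substitute the ones you actually need.
# List official ParallelCluster AMIs for a given OS and architecture (values shown are examples)
$ pcluster list-official-images --region us-east-1 --os alinux2 --architecture x86_64
The AMI ID for each image appears in the command's JSON output.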
The NVIDIA driver is software that enables communication between a computer's operating system and its NVIDIA graphics processing unit (GPU). CUDA is a parallel computing platform that leverages the power of the NVIDIA GPU to accelerate complex computing tasks. The NVIDIA driver is a prerequisite for CUDA to function.
Find out your NVIDIA driver version with the NVIDIA System Management Interface by running the nvidia-smi command.
$ nvidia-smi
Wed Apr 5 15:20:35 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
The value for Driver Version shows us that NVIDIA driver version 470.141.03 is installed and active. The value for CUDA Version tells us this driver shipped with CUDA 11.4. However, NVIDIA minor version compatibility means we can run other versions of CUDA with this driver as well.
The CUDA Toolkit release notes are the canonical source of information about driver/CUDA compatibility. Navigate to the release notes page and consult the table “CUDA Toolkit and Minimum Required Driver Version for CUDA Minor Version Compatibility”.
We are installing CUDA 11.6, which requires driver version 450.80.02 or higher. Since the driver installed on our instance is 470.141.03, it will be compatible with CUDA 11.6. If you find that you need to upgrade your NVIDIA driver to support the version of CUDA you need, that is a more involved process that is out of scope for this article.
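If you would rather perform this check from a script than by reading the nvidia-smi banner, the driver version can be queried directly. The minimum version 450.80.02 below comes from the release-notes table for CUDA 11.6; this is a sketch rather than an official check.
# Print just the driver version, one line per GPU
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
470.141.03
# Compare against the minimum required for CUDA 11.6 using version-aware sorting
$ DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
$ if [ "$(printf '%s\n' 450.80.02 "$DRIVER" | sort -V | head -n1)" = "450.80.02" ]; then echo "Driver is new enough for CUDA 11.6"; fi
Driver is new enough for CUDA 11.6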
We need to retrieve the URL from NVIDIA for the runnable installer, then download it. It is tricky to get the URL programmatically, so we will use an interactive form provided by NVIDIA.
Navigate to the CUDA Toolkit Archive. Find the version of CUDA you wish to install (CUDA Toolkit 11.6.1 in this case).
Follow its URL, then select the Linux operating system. Select x86_64 if your instance CPU is Intel or AMD-based, or arm64-sbsa if it is Graviton-based. Choose the CentOS distribution, version 7. Even if you are using a different operating system on your AMI, select CentOS 7 - the choice does not matter for the run file. Then, for Installer Type, choose runfile (local).
Now, copy the URL and set it as a shell variable. Then download it using curl.
$ CUDA_URL=https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda_11.6.1_510.47.03_linux.run
$ curl -fSL -# -O $CUDA_URL
This will download a file named cuda_11.6.1_510.47.03_linux.run to your working directory.
Run the installer with administrator privileges. The --silent option lets the script run unattended. The --toolkit option instructs the script to install only the CUDA toolkit, not the associated NVIDIA driver.
$ sudo sh cuda_11.6.1_510.47.03_linux.run --silent --toolkit
When the installer has completed, verify that the new CUDA version has been installed and is the default CUDA for the system.
$ ls -alth /usr/local | grep cuda
lrwxrwxrwx 1 root root 21 Apr 5 17:20 cuda -> /usr/local/cuda-11.6/
drwxr-xr-x 16 root root 4.0K Apr 5 17:05 cuda-11.6
drwxr-xr-x 17 root root 4.0K Feb 15 14:01 cuda-11.7
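The installer repoints the /usr/local/cuda symlink at the version it just installed, which is why 11.6 is now the default. If you later want a different installed toolkit to be the default (for example, to switch back to 11.7), repointing the symlink yourself is enough. The commands below are a sketch that assumes the version directories listed above.
# Make CUDA 11.7 the system default again
$ sudo ln -sfn /usr/local/cuda-11.7 /usr/local/cuda
# Switch the default back to CUDA 11.6
$ sudo ln -sfn /usr/local/cuda-11.6 /usr/local/cuda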
Now, verify that the NVIDIA CUDA Compiler Driver (nvcc) can load and tell you its version.
$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
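You can also select a toolkit per shell session or per job script without touching the system-wide symlink by prepending the toolkit's directories to your environment. This is a minimal sketch; change the version directory to whichever toolkit the job needs.
# Use CUDA 11.6 in the current shell only, leaving the system default untouched
$ export PATH=/usr/local/cuda-11.6/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:$LD_LIBRARY_PATH
$ nvcc --version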
You can put the CUDA installation through its paces with the CUDA Samples repository provided by NVIDIA.
# Navigate to a directory with at least 5 GB free space
$ cd /scratch
# Clone the repository
$ git clone https://github.com/NVIDIA/cuda-samples.git
# List tags
$ cd cuda-samples
$ git tag -l
10.1.1
10.1.2
v10.0
v10.0.1
v10.1
v10.2
v11.0
v11.1
v11.2
v11.3
v11.4
v11.4.1
v11.5
v11.6
v11.8
v12.0
v12.1
v9.2
# Check out the closest version to your CUDA installation
$ git checkout v11.6
# Build the samples repository (this will take a while)
$ make
# Run the tests (this will also take a while)
$ make test
# Clean up (this will remove build artifacts)
$ make clean
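If you want a quicker smoke test than building the full samples tree, you can compile and run a tiny kernel with the nvcc you just installed. The file name vector_add.cu, the problem size, and the block size below are arbitrary examples, not part of the NVIDIA samples.
# Write a minimal vector-addition kernel to a file
$ cat > vector_add.cu <<'EOF'
#include <cstdio>

// Each thread adds one pair of elements
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the example short
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %.1f (expected 3.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
EOF
# Compile with the newly installed toolkit and run it on the GPU
$ /usr/local/cuda/bin/nvcc vector_add.cu -o vector_add
$ ./vector_add
c[0] = 3.0 (expected 3.0)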
While still logged into your instance, clean up unneeded files and data; the commands after this list show one way to do it.
- Delete the NVIDIA runfile
- Delete the cuda-samples directory if you downloaded it
- Run the instance preparation script: sudo /usr/local/sbin/ami_cleanup.sh
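Putting those cleanup steps together as commands (the runfile path and the /scratch location below assume the locations used earlier in this walkthrough; adjust them to wherever your files actually are):
# Remove the downloaded CUDA runfile installer
$ rm -f ~/cuda_11.6.1_510.47.03_linux.run
# Remove the CUDA samples clone, if you created one
$ rm -rf /scratch/cuda-samples
# Run the instance preparation script before creating the AMI
$ sudo /usr/local/sbin/ami_cleanup.sh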
Then, in your browser, navigate to the Amazon EC2 Console. Select your running instance and choose Instance state and Stop instance. When the instance has stopped, choose Actions then Image and templates then Create image.
Give your new image a distinct name and, optionally, write a short description for it. Then, choose Create image. After some period of time, the custom AMI will be built and become available to use.
In the Amazon EC2 Console, navigate to Images. Select Owned by me to find your custom image. When its status is Available, you can use its AMI ID as a custom AMI in AWS ParallelCluster. You have three options for doing this: