(3.6.0) NVIDIA GPU nodes fail to start with custom AMI built from DLAMI

The issue

ParallelCluster version 3.6.0 introduced the installation of the NVIDIA daemon nvidia-persistenced. The daemon keeps the NVIDIA device files initialized across client runs, avoiding the latency incurred by repeated device initialization.
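On a running GPU node the daemon is managed as a systemd unit. As a minimal check, assuming shell access to an instance where the NVIDIA driver is installed:

# Verify that the persistence daemon is running
systemctl status nvidia-persistenced.service

# Persistence mode should be reported as Enabled for each GPU
nvidia-smi --query-gpu=persistence_mode --format=csv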

In the DLAMI, starting with the version published on 2022-12-20, the same functionality is deployed as a service with a different name (nvidia_gpu_settings_service.service).
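Which NVIDIA-related units an image ships can be checked with systemctl. A minimal sketch, assuming shell access to an instance launched from the DLAMI:

# List NVIDIA-related units present on the image
systemctl list-unit-files | grep -i -E 'nvidia[-_]'

# The DLAMI unit providing the persistence functionality
systemctl status nvidia_gpu_settings_service.service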

In a ParallelCluster 3.6.0 custom AMI built from the DLAMI, the two services conflict and cause an error in the bootstrap procedure of the nodes. The cluster detects the error as a compute node bootstrap failure and, after a number of consecutive nodes fail at bootstrap time (the number is defined by protected_failure_count, default 10), it enters protected status.
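The compute fleet status can be checked, and re-activated once the root cause is addressed, with the pcluster CLI. A minimal sketch, assuming ParallelCluster CLI 3.x and a hypothetical cluster name my-cluster:

# A cluster in protected status reports it in the compute fleet status
pcluster describe-compute-fleet --cluster-name my-cluster --region us-east-1

# Re-activate the fleet after fixing the bootstrap issue (e.g. after moving to 3.6.1)
pcluster update-compute-fleet --cluster-name my-cluster --region us-east-1 --status START_REQUESTED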

When the issue occurs, the nodes are not able to join the cluster, and the following lines can be found in the CloudWatch LogGroup of the cluster:

STDERR: Error: 'systemctl start nvidia-persistenced.service' failed with
'Job for nvidia-persistenced.service failed because the control process exited with error code.
See "systemctl status nvidia-persistenced.service" and "journalctl -xe" for details.'.
     ---- End output of "bash"  ----
Ran "bash"  returned 1

How to reproduce:

  1. Build a ParallelCluster custom image starting from the latest public DLAMI (version 20230524 at the time of writing). Example: Deep Learning AMI GPU CUDA 11.2.1 (Ubuntu 20.04) 20230524 (us-east-1 ami-0ef5f7e6314c26b09). A sketch of this step is shown after this list. Documentation
  2. Create a simple cluster with the newly created custom AMI and a compute resource instance type with an NVIDIA GPU (example: g5.xlarge), as sketched after this list. Documentation
  3. Run a simple job. Documentation
  4. The job cannot run because the compute nodes are not able to join the cluster.
  5. Check the LogGroup of the cluster in CloudWatch, searching for the LogStream of the compute node named {hostname}.{instance_id}.cloud-init-output. Example: ip-192-168-94-99.i-0399784055e87bdf6.cloud-init-output (see the sketch after this list). Documentation
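The first two steps can be sketched as follows, assuming the pcluster CLI 3.6.0 in us-east-1; image-config.yaml, cluster-config.yaml, dlami-custom, gpu-test, subnet-12345678 and my-key are hypothetical names and placeholders:

# Step 1: build a custom image using the public DLAMI as parent image
cat > image-config.yaml <<'EOF'
Build:
  InstanceType: g5.xlarge
  ParentImage: ami-0ef5f7e6314c26b09  # Deep Learning AMI GPU CUDA 11.2.1 (Ubuntu 20.04) 20230524, us-east-1
EOF
pcluster build-image --image-id dlami-custom --image-configuration image-config.yaml --region us-east-1

# Step 2: create a cluster that uses the built AMI for a GPU compute resource
cat > cluster-config.yaml <<'EOF'
Region: us-east-1
Image:
  Os: ubuntu2004
  CustomAmi: <ami-id-returned-by-build-image>
HeadNode:
  InstanceType: t3.medium
  Networking:
    SubnetId: subnet-12345678
  Ssh:
    KeyName: my-key
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      ComputeResources:
        - Name: g5
          InstanceType: g5.xlarge
          MinCount: 0
          MaxCount: 2
      Networking:
        SubnetIds:
          - subnet-12345678
EOF
pcluster create-cluster --cluster-name gpu-test --cluster-configuration cluster-config.yaml --region us-east-1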

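For step 5, the same log stream can also be fetched from the command line with the AWS CLI; <log-group-name> is a placeholder for the CloudWatch LogGroup of the cluster, whose actual name is visible in the CloudWatch console:

# List the log streams of the failing compute node and fetch its cloud-init output
aws logs describe-log-streams --log-group-name <log-group-name> --log-stream-name-prefix ip-192-168-94-99.i-0399784055e87bdf6
aws logs get-log-events --log-group-name <log-group-name> --log-stream-name ip-192-168-94-99.i-0399784055e87bdf6.cloud-init-output
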
Affected versions (OSes, schedulers)

ParallelCluster 3.6.0 with a custom AMI built from the DLAMI, when the compute instance type has an NVIDIA GPU.

Solution

The issue is fixed in ParallelCluster 3.6.1 and later.
