(3.0.0‐3.7.2) Cluster update rollback can fail when modifying the list of instance types declared in the Compute Resources

The issue

During a cluster update, you can change the list of instance types declared in the Slurm Compute Resources by modifying the InstanceTypes or InstanceType configuration parameter.

This change alone does not cause any issue. However, if something else goes wrong during the cluster update (e.g. the connection to the Slurm Accounting Database fails) and the whole operation fails, the cluster update rollback also fails and the stack goes into the UPDATE_ROLLBACK_FAILED state.

$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "UPDATE_ROLLBACK_FAILED",
      "clusterStatus": "UPDATE_ROLLBACK_FAILED",
      ...
    }
  ]
}
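To check for this state from a script, the JSON printed by pcluster list-clusters can be parsed directly. The following is a minimal sketch, not part of ParallelCluster: the function name is hypothetical, and it relies only on the clusterName and cloudformationStackStatus fields shown in the output above.

```python
import json


def clusters_in_rollback_failure(list_clusters_output):
    """Return the names of clusters whose stack is in UPDATE_ROLLBACK_FAILED.

    `list_clusters_output` is the raw JSON text printed by `pcluster list-clusters`.
    """
    data = json.loads(list_clusters_output)
    return [
        cluster["clusterName"]
        for cluster in data.get("clusters", [])
        if cluster.get("cloudformationStackStatus") == "UPDATE_ROLLBACK_FAILED"
    ]


# Sample input mirroring the output above (the elided fields are left out).
sample = """
{
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "UPDATE_ROLLBACK_FAILED",
      "clusterStatus": "UPDATE_ROLLBACK_FAILED"
    }
  ]
}
"""
print(clusters_in_rollback_failure(sample))  # prints ['test']
```

In practice the input text could come from subprocess.run(["pcluster", "list-clusters"], capture_output=True, text=True).stdout.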

The /var/log/chef-client.log file on the head node contains an error like: ERROR - Failed to generate slurm configurations, exception: 'p3.2xlarge'. The full output looks like this:

2023-06-30 01:07:24,378 - [root:main] - ERROR - Failed to generate slurm configurations, exception: 'p3.2xlarge'
Traceback (most recent call last):
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 453, in main
    generate_slurm_config_files(
  ...
  File "/opt/parallelcluster/scripts/slurm/templates/common/slurm_parallelcluster_utils.conf", line 10, in template
    {{ ' ' }}CPUs={{ compute_resource | vcpus }}
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 291, in _vcpus
    vcpus_count, threads_per_core = _get_min_vcpus(instance_types)
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 268, in _get_min_vcpus
    instance_type_info = instance_types_data[instance_type]
KeyError: 'p3.2xlarge'
Traceback (most recent call last):
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 471, in <module>
    main()
  ...
  File "/opt/parallelcluster/scripts/slurm/templates/common/slurm_parallelcluster_utils.conf", line 10, in template
    {{ ' ' }}CPUs={{ compute_resource | vcpus }}
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 291, in _vcpus
    vcpus_count, threads_per_core = _get_min_vcpus(instance_types)
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 268, in _get_min_vcpus
    instance_type_info = instance_types_data[instance_type]
KeyError: 'p3.2xlarge'
---- End output of /opt/parallelcluster/pyenv/versions/3.9.16/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py --output-directory /opt/slurm/etc/ --template-directory /opt/parallelcluster/scripts/slurm/templates/ --input-file /opt/parallelcluster/shared/cluster-config.yaml --instance-types-data /opt/parallelcluster/shared/instance-types-data.json --compute-node-bootstrap-timeout 1800  --realmemory-to-ec2memory-ratio 0.95 --slurmdbd-user slurm --cluster-name cfsan-hpc-aws-prod ----
Ran /opt/parallelcluster/pyenv/versions/3.9.16/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py --output-directory /opt/slurm/etc/ --template-directory /opt/parallelcluster/scripts/slurm/templates/ --input-file /opt/parallelcluster/shared/cluster-config.yaml --instance-types-data /opt/parallelcluster/shared/instance-types-data.json --compute-node-bootstrap-timeout 1800  --realmemory-to-ec2memory-ratio 0.95 --slurmdbd-user slurm --cluster-name cfsan-hpc-aws-prod returned 1
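The traceback shows the generator looking up each instance type in instance_types_data and raising KeyError when the type is missing. The sketch below reproduces that check offline so you can see which instance types a configuration references that the data file does not contain. It is an illustration under assumptions, not the generator's actual code: the function names are hypothetical, the configuration is assumed to follow the Scheduling / SlurmQueues / ComputeResources layout with either a single InstanceType or an Instances list, and the data file is assumed to be a JSON object keyed by instance type name (as the lookup in the traceback suggests).

```python
def collect_instance_types(cluster_config):
    """Gather every instance type named in the Slurm compute resources.

    Assumed layout: Scheduling -> SlurmQueues -> ComputeResources, where each
    compute resource has either `InstanceType` or a list under `Instances`.
    """
    found = set()
    queues = cluster_config.get("Scheduling", {}).get("SlurmQueues", [])
    for queue in queues:
        for resource in queue.get("ComputeResources", []):
            if "InstanceType" in resource:
                found.add(resource["InstanceType"])
            for instance in resource.get("Instances", []):
                found.add(instance["InstanceType"])
    return found


def missing_instance_types(cluster_config, instance_types_data):
    """Return instance types used in the config but absent from the data file."""
    return sorted(collect_instance_types(cluster_config) - set(instance_types_data))


# Hypothetical in-memory stand-ins for cluster-config.yaml and
# instance-types-data.json after they have been parsed.
config = {
    "Scheduling": {
        "SlurmQueues": [
            {
                "Name": "queue1",
                "ComputeResources": [
                    {"Name": "cr1", "InstanceType": "c5.xlarge"},
                    {"Name": "cr2", "Instances": [{"InstanceType": "p3.2xlarge"}]},
                ],
            }
        ]
    }
}
data = {"c5.xlarge": {}}

print(missing_instance_types(config, data))  # prints ['p3.2xlarge']
```

A non-empty result corresponds to the KeyError in the log: the configuration being applied names an instance type that the data file passed via --instance-types-data does not describe.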

Affected versions (OSes, schedulers)

ParallelCluster 3.0.0 - 3.7.1
Slurm scheduler

Mitigation

To mitigate the problem, run another cluster update using the cluster configuration from cluster creation or from the last successful update.

Tip for finding past cluster configurations: they are stored in a versioned S3 bucket. The bucket name and the directory path can be retrieved from the cluster's CloudFormation stack parameters ResourcesS3Bucket and ArtifactS3RootDirectory.

This lets the rollback complete successfully, so you can focus on fixing the issue in the update steps by checking /var/log/chef-client.log. Once you have fixed the original problem that caused the cluster update to fail, run a new cluster update to modify the instance type(s) of the compute resources.
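The lookup described above can be scripted with boto3. The following is a rough sketch under assumptions: the function names are hypothetical, the CloudFormation stack is assumed to have the same name as the cluster, and the configuration object key is assumed to end in cluster-config.yaml (adjust key_suffix if your bucket is laid out differently). The clients are passed in as arguments, so nothing here touches AWS until you supply real ones.

```python
def get_config_location(cfn_client, stack_name):
    """Read ResourcesS3Bucket and ArtifactS3RootDirectory from the stack parameters."""
    stack = cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]
    params = {p["ParameterKey"]: p["ParameterValue"] for p in stack["Parameters"]}
    return params["ResourcesS3Bucket"], params["ArtifactS3RootDirectory"]


def list_config_versions(s3_client, bucket, root_dir, key_suffix="cluster-config.yaml"):
    """List stored versions of the cluster configuration, newest first.

    Returns tuples of (LastModified, Key, VersionId).
    """
    paginator = s3_client.get_paginator("list_object_versions")
    versions = []
    for page in paginator.paginate(Bucket=bucket, Prefix=root_dir):
        for version in page.get("Versions", []):
            if version["Key"].endswith(key_suffix):
                versions.append(
                    (version["LastModified"], version["Key"], version["VersionId"])
                )
    return sorted(versions, reverse=True)


# Example usage with real clients (requires boto3 and AWS credentials):
#   import boto3
#   bucket, root_dir = get_config_location(boto3.client("cloudformation"), "test")
#   for modified, key, version_id in list_config_versions(boto3.client("s3"), bucket, root_dir):
#       print(modified, key, version_id)
```

A specific version can then be downloaded with the S3 client's get_object call by passing the Key and VersionId of the entry that matches the last successful update.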
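As a small illustration of the mitigation step, the sketch below builds the pcluster update-cluster command for a previously saved configuration. The helper name and the file name last-good-config.yaml are hypothetical; the cluster name test is taken from the example output earlier on this page.

```python
import shlex


def build_update_command(cluster_name, config_path, region=None):
    """Build the argument list for `pcluster update-cluster` with a given config file."""
    command = [
        "pcluster",
        "update-cluster",
        "--cluster-name",
        cluster_name,
        "--cluster-configuration",
        config_path,
    ]
    if region:
        command += ["--region", region]
    return command


command = build_update_command("test", "last-good-config.yaml")
print(shlex.join(command))
# prints: pcluster update-cluster --cluster-name test --cluster-configuration last-good-config.yaml
```

The list can be passed to subprocess.run(command, check=True) to execute the update. Building it as a list rather than a single string avoids quoting problems with paths that contain spaces.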
