(3.0.0‐3.7.2) Cluster update rollback can fail when modifying the list of instance types declared in the Compute Resources
During a cluster update it is possible to change the list of instance types declared in the Slurm Compute Resources by modifying the InstanceTypes or InstanceType configuration parameter.
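For reference, this is a minimal sketch of the relevant section of a cluster configuration; the queue name, compute resource name, counts, and instance type are placeholders:

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: compute-resource-1
          InstanceType: p3.2xlarge
          MinCount: 0
          MaxCount: 10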
This change alone does not cause any issue, but if a problem occurs during the cluster update (e.g. the connection to the Slurm Accounting Database fails) and the whole operation fails, then the cluster update rollback will also fail and the stack will end up in the UPDATE_ROLLBACK_FAILED state.
$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "UPDATE_ROLLBACK_FAILED",
      "clusterStatus": "UPDATE_ROLLBACK_FAILED",
      ...
    }
  ]
}
By looking at the /var/log/chef-client.log file on the head node instance, it is possible to find an error like ERROR - Failed to generate slurm configurations, exception: 'p3.2xlarge', or in more detail:
2023-06-30 01:07:24,378 - [root:main] - ERROR - Failed to generate slurm configurations, exception: 'p3.2xlarge'
Traceback (most recent call last):
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 453, in main
    generate_slurm_config_files(
  ...
  File "/opt/parallelcluster/scripts/slurm/templates/common/slurm_parallelcluster_utils.conf", line 10, in template
    {{ ' ' }}CPUs={{ compute_resource | vcpus }}
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 291, in _vcpus
    vcpus_count, threads_per_core = _get_min_vcpus(instance_types)
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 268, in _get_min_vcpus
    instance_type_info = instance_types_data[instance_type]
KeyError: 'p3.2xlarge'

Traceback (most recent call last):
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 471, in <module>
    main()
  ...
  File "/opt/parallelcluster/scripts/slurm/templates/common/slurm_parallelcluster_utils.conf", line 10, in template
    {{ ' ' }}CPUs={{ compute_resource | vcpus }}
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 291, in _vcpus
    vcpus_count, threads_per_core = _get_min_vcpus(instance_types)
  File "/opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py", line 268, in _get_min_vcpus
    instance_type_info = instance_types_data[instance_type]
KeyError: 'p3.2xlarge'
---- End output of /opt/parallelcluster/pyenv/versions/3.9.16/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py --output-directory /opt/slurm/etc/ --template-directory /opt/parallelcluster/scripts/slurm/templates/ --input-file /opt/parallelcluster/shared/cluster-config.yaml --instance-types-data /opt/parallelcluster/shared/instance-types-data.json --compute-node-bootstrap-timeout 1800 --realmemory-to-ec2memory-ratio 0.95 --slurmdbd-user slurm --cluster-name cfsan-hpc-aws-prod ----
Ran /opt/parallelcluster/pyenv/versions/3.9.16/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/slurm/pcluster_slurm_config_generator.py --output-directory /opt/slurm/etc/ --template-directory /opt/parallelcluster/scripts/slurm/templates/ --input-file /opt/parallelcluster/shared/cluster-config.yaml --instance-types-data /opt/parallelcluster/shared/instance-types-data.json --compute-node-bootstrap-timeout 1800 --realmemory-to-ec2memory-ratio 0.95 --slurmdbd-user slurm --cluster-name cfsan-hpc-aws-prod returned 1
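The KeyError likely occurs because the rollback restores the previous cluster configuration while /opt/parallelcluster/shared/instance-types-data.json still reflects the failed update, so the restored configuration references an instance type that is missing from that file.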
The issue affects:
- ParallelCluster 3.0.0 - 3.7.2
- Slurm scheduler
The mitigation is to execute another cluster update using the cluster configuration from the cluster creation or from the last successful update attempt.
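For example, assuming the last known-good configuration has been retrieved locally as previous-config.yaml (a placeholder file name; see the tips below on how to find it):

$ pcluster update-cluster --cluster-name test --cluster-configuration previous-config.yaml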
Tips to find cluster configurations: past cluster configurations are stored in a versioned S3 bucket. The name of the bucket and the path to the directory can be retrieved from the cluster CloudFormation parameters ResourcesS3Bucket and ArtifactS3RootDirectory.
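As a sketch, these parameters can be read with the AWS CLI, and a previous version of the configuration can then be downloaded from the versioned bucket. The configs/cluster-config.yaml key under the root directory and the placeholder values in angle brackets are assumptions; adjust them to what you actually find in the bucket:

$ aws cloudformation describe-stacks --stack-name test \
    --query "Stacks[0].Parameters[?ParameterKey=='ResourcesS3Bucket' || ParameterKey=='ArtifactS3RootDirectory']" \
    --output table
$ aws s3api list-object-versions --bucket <resources-s3-bucket> \
    --prefix <artifact-s3-root-directory>/configs/cluster-config.yaml
$ aws s3api get-object --bucket <resources-s3-bucket> \
    --key <artifact-s3-root-directory>/configs/cluster-config.yaml \
    --version-id <version-id> previous-config.yaml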
This will permit the rollback to run successfully, so that you can focus on fixing the issue in the update steps by checking the /var/log/chef-client.log file.
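For example, to scan for errors on the head node:

$ grep -i error /var/log/chef-client.log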
Once you have fixed the original problem that caused the cluster update to fail, you can proceed with a new cluster update to modify the instance type(s) of the compute resources.