-
Notifications
You must be signed in to change notification settings - Fork 312
(3.1.x) Termination of idle dynamic compute nodes potentially broken after performing a cluster update
Due to an issue present in the Slurm scheduler module which is responsible to initiate termination of idle nodes, ParallelCluster scale-down behavior might be affected after performing an UpdateCluster operation.
When handling a cluster update operation that requires a change to the Scheduling
settings, ParallelCluster performs a restart of the Slurm controller and a reload of the configuration with the scontrol reconfigure
Slurm command. Due to an unhandled race condition present in Slurm code, the scheduler might enter a faulty state which prevents a part or all of the configured CLOUD nodes to be terminated after being idle.
Slurm clusters created with ParallelCluster versions >= 3.1.0, <3.2.0 are potentially affected after an UpdateCluster operation is performed
The latest ParallelCluster 3.2.0 includes a mitigation that prevents the Slurm race condition to be triggered, therefore as a long term mitigation we recommend to upgrade and recreate the cluster.
As a quick mitigation for affected clusters created with older versions, please restart Slurm daemon on the head node by running sudo systemctl restart slurmctld
. Note that if an UpdateCluster is executed again the issue could reappear.
In case you want to immediately stop all compute instances, then stop the compute fleet by running pcluster update-compute-fleet --status STOP_REQUESTED --cluster-name NAME
.
In order to patch your running cluster and make it properly handle future cluster-updates
, it is possible to copy the content below in a script and execute it as root on the headnode.
fix-31.sh
#!/bin/bash
set -x
grep "aws-parallelcluster-cookbook-3.1." /opt/parallelcluster/.bootstrapped || exit 1
FILE="aws-parallelcluster-slurm/recipes/update_head_node.rb"
VENDORED_FILE="/etc/chef/cookbooks/$FILE"
CACHED_FILE="/etc/chef/local-mode-cache/cache/cookbooks/$FILE"
for CURRENT_FILE in $CACHED_FILE $VENDORED_FILE
do
grep "chef_sleep '5'" $CURRENT_FILE || sed -i "s/execute 'reload config for running nodes' do/chef_sleep '5'\n\nexecute 'reload config for running nodes' do/g" $CURRENT_FILE && echo "sleep already present in file $CURRENT_FILE"
done
systemctl restart slurmctld
then make the script executable and execute it as root
chmod +x ./fix-31.sh
sudo ./fix-31.sh
To verify if a cluster is affected by the described issue you can retrieve the idle time of running compute nodes and compare it with the configured scale-down idletime.
To verify what timeout is configured for node termination on inactivity you can run the following command on the cluster head node:
[ec2-user@ip-10-0-0-95 ~]$ scontrol show config | grep "SuspendTime "
SuspendTime = 600 sec
To check when was the last time a running node was busy you can run the following command and compare the output with the current UTC time:
[ec2-user@ip-10-0-0-95 ~]$ scontrol show nodes | grep -E "(NodeName|LastBusyTime)"
NodeName=queue1-dy-compute-resource1-1 Arch=x86_64 CoresPerSocket=1
LastBusyTime=2022-08-23T14:43:50
...
[ec2-user@ip-10-0-0-95 ~]$ date
Tue Aug 23 14:48:42 UTC 2022