(3.1.x) Termination of idle dynamic compute nodes potentially broken after performing a cluster update

The issue

Due to an issue in the Slurm scheduler module responsible for initiating the termination of idle nodes, ParallelCluster scale-down behavior might be affected after an UpdateCluster operation.

When handling a cluster update that requires a change to the Scheduling settings, ParallelCluster restarts the Slurm controller and reloads the configuration with the scontrol reconfigure Slurm command. Due to an unhandled race condition in the Slurm code, the scheduler might enter a faulty state that prevents some or all of the configured CLOUD nodes from being terminated after they become idle.

Affected versions (OSes, schedulers)

Slurm clusters created with ParallelCluster versions >= 3.1.0, < 3.2.0 are potentially affected after an UpdateCluster operation is performed.

Mitigation

ParallelCluster 3.2.0 includes a mitigation that prevents the Slurm race condition from being triggered. As a long-term mitigation, we therefore recommend upgrading and recreating the cluster.

As a quick mitigation for affected clusters created with older versions, restart the Slurm controller daemon on the head node by running sudo systemctl restart slurmctld. Note that the issue could reappear if an UpdateCluster operation is executed again.

If you want to stop all compute instances immediately, stop the compute fleet by running pcluster update-compute-fleet --status STOP_REQUESTED --cluster-name NAME.

To patch a running cluster so that it properly handles future cluster updates, copy the content below into a script and execute it as root on the head node.

fix-31.sh

#!/bin/bash
set -x

grep "aws-parallelcluster-cookbook-3.1." /opt/parallelcluster/.bootstrapped || exit 1

FILE="aws-parallelcluster-slurm/recipes/update_head_node.rb"
VENDORED_FILE="/etc/chef/cookbooks/$FILE"
CACHED_FILE="/etc/chef/local-mode-cache/cache/cookbooks/$FILE"

for CURRENT_FILE in "$CACHED_FILE" "$VENDORED_FILE"
do
      # Insert a 5-second sleep before the config reload, unless it is already there.
      if grep "chef_sleep '5'" "$CURRENT_FILE"; then
            echo "sleep already present in file $CURRENT_FILE"
      else
            sed -i "s/execute 'reload config for running nodes' do/chef_sleep '5'\n\nexecute 'reload config for running nodes' do/g" "$CURRENT_FILE"
      fi
done

systemctl restart slurmctld

Then make the script executable and run it as root:

chmod +x ./fix-31.sh
sudo ./fix-31.sh
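
To confirm that the patch landed, one option is to check that both recipe files now contain the chef_sleep line. The snippet below is a minimal sketch and not part of the official fix; check_patched is a hypothetical helper name, and it only looks for the same string that fix-31.sh inserts.

```shell
#!/bin/bash
# Hypothetical helper: report whether each given recipe file contains
# the chef_sleep line that fix-31.sh inserts.
check_patched() {
    local status=0 recipe
    for recipe in "$@"; do
        if grep -q "chef_sleep '5'" "$recipe" 2>/dev/null; then
            echo "patched: $recipe"
        else
            echo "NOT patched: $recipe"
            status=1
        fi
    done
    # Non-zero exit status if at least one file is missing the line.
    return "$status"
}
```

Run it as root on the head node, passing the two paths used by fix-31.sh: /etc/chef/cookbooks/aws-parallelcluster-slurm/recipes/update_head_node.rb and /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-slurm/recipes/update_head_node.rb.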

Error detail

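To verify whether a cluster is affected by this issue, retrieve the idle time of the running compute nodes and compare it with the configured scale-down idle time. If a node has been idle for longer than the configured idle time and is still running, the cluster is likely affected.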

To check which timeout is configured for terminating inactive nodes, run the following command on the cluster head node:

[ec2-user@ip-10-0-0-95 ~]$ scontrol show config | grep "SuspendTime "
SuspendTime             = 600 sec

To check when a running node was last busy, run the following command and compare the output with the current UTC time:

[ec2-user@ip-10-0-0-95 ~]$ scontrol show nodes | grep -E "(NodeName|LastBusyTime)"
NodeName=queue1-dy-compute-resource1-1 Arch=x86_64 CoresPerSocket=1
   LastBusyTime=2022-08-23T14:43:50
...
[ec2-user@ip-10-0-0-95 ~]$ date
Tue Aug 23 14:48:42 UTC 2022
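
The comparison above can also be scripted. The following is a minimal sketch, not an official tool: overdue_nodes is a hypothetical helper that reads scontrol show nodes output on standard input and prints every node whose idle time exceeds the given SuspendTime. It assumes GNU date and that LastBusyTime is reported in UTC, as on the head node shown above. It does not look at the node state, so a node that is currently running a job may need to be excluded by hand.

```shell
#!/bin/bash
# Hypothetical helper: print nodes that have been idle longer than SuspendTime.
# Usage: scontrol show nodes | overdue_nodes SUSPEND_SECONDS [NOW_EPOCH]
overdue_nodes() {
    local suspend_seconds="$1"
    local now_epoch="${2:-$(date -u +%s)}"
    local node="" token busy_epoch idle

    # Split the scontrol output into one KEY=VALUE token per line.
    tr -s ' \t' '\n' | while read -r token; do
        case "$token" in
            NodeName=*)
                node="${token#NodeName=}"
                ;;
            LastBusyTime=*)
                # Assumption: timestamps are UTC. Values that cannot be parsed are skipped.
                busy_epoch=$(date -u -d "${token#LastBusyTime=}" +%s 2>/dev/null) || continue
                idle=$(( now_epoch - busy_epoch ))
                if [ "$idle" -gt "$suspend_seconds" ]; then
                    echo "$node idle for ${idle}s (SuspendTime ${suspend_seconds}s)"
                fi
                ;;
        esac
    done
}
```

For example, scontrol show nodes | overdue_nodes 600 uses the SuspendTime value from the output above. With the sample values shown (last busy at 14:43:50, current time 14:48:42), the node has been idle for 292 seconds, which is below the 600 second limit, so nothing is printed.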
