(3.3.0-3.4.0) Slurm cluster NodeName and NodeAddr mismatch after cluster scaling

Issue Summary

Slurm versions 22.05.5 through 22.05.7 contain an issue in the scontrol update NodeName=A,B,... NodeAddr=A,B,... command that can mismatch pairs of NodeName and NodeAddr depending on the order of the node list. ParallelCluster uses this functionality during scaling operations and can therefore reach an inconsistent state, resulting in nodes not being accessible or being backed by the wrong EC2 instance type.
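For illustration, the affected call takes roughly the form below (a minimal sketch: the queue, compute resource, and IP values are placeholders, not taken from a real cluster). With the affected Slurm versions, the NodeAddr values can end up paired with the wrong NodeName entries when the supplied node list is not in Slurm's sorted order:

# Hypothetical example of the kind of batched update ParallelCluster issues during scaling.
scontrol update NodeName=queue1-dy-cr2-1,queue1-dy-cr1-1 NodeAddr=10.0.0.12,10.0.0.11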

Affected versions (OSes, schedulers)

Slurm clusters created with ParallelCluster versions 3.3.0 through 3.4.0 are affected when the configuration contains multiple dynamic compute resources (in one or many queues).
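If it is unclear which ParallelCluster version created a cluster, the version is included in the output of the describe call (a sketch; <cluster-name> is a placeholder and the version appears in the returned JSON):

pcluster describe-cluster --cluster-name <cluster-name>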

Mitigation

To fix this issue on a running cluster, follow these steps in the order prescribed:

  1. Stop the compute fleet with the command: pcluster update-compute-fleet --cluster-name <cluster-name> --status STOP_REQUESTED
  2. Once the compute fleet has stopped, run the following script on the HeadNode:
#! /bin/bash
set -ex

cat <<PATCH >/tmp/node.patch
--- slurm_commands.py
+++ slurm_commands.py
@@ -104,17 +104,49 @@
         update_cmd += f" state={state}"
     if reason:
         update_cmd += f' reason="{reason}"'
-    for nodenames, addrs, hostnames in batched_node_info:
-        node_info = f"nodename={nodenames}"
+    for nodenames_, addrs_, hostnames_ in batched_node_info:
+        if addrs_ or hostnames_:
+            # Sorting is only necessary if we set nodeaddrs or nodehostnames
+            nodenames, addrs, hostnames = _sort_nodes_attributes(nodenames_, addrs_, hostnames_)
+        else:
+            nodenames = nodenames_
+            addrs = hostnames = None
+        node_info = f"nodename={','.join(nodenames)}"
         if addrs:
-            node_info += f" nodeaddr={addrs}"
+            node_info += f" nodeaddr={','.join(addrs)}"
         if hostnames:
-            node_info += f" nodehostname={hostnames}"
+            node_info += f" nodehostname={','.join(hostnames)}"
         run_command(  # nosec
             f"{update_cmd} {node_info}", raise_on_error=raise_on_error, timeout=command_timeout, shell=True
         )


+def _sort_nodes_attributes(nodes, nodeaddrs, nodehostnames):
+    nodes_str = ",".join(nodes)
+    sorted_node_names_str = check_command_output(  # nosec
+        f"{SCONTROL} show hostlistsorted {nodes_str} | xargs {SCONTROL} show hostnames", shell=True
+    )
+    sorted_node_names = sorted_node_names_str.strip().split("\n")
+    if nodes == nodeaddrs and nodeaddrs == nodehostnames:
+        # Path from reset_nodes
+        nodes = nodeaddrs = nodehostnames = sorted_node_names
+    else:
+        # Path from _update_slurm_node_addrs
+        order = {k: i for i, k in enumerate(sorted_node_names)}
+
+        def sorting_key(x):
+            return order[x[0]]
+
+        if nodeaddrs and nodehostnames:
+            _, nodeaddrs, nodehostnames = zip(*sorted(zip(nodes, nodeaddrs, nodehostnames), key=sorting_key))
+        elif nodeaddrs:
+            _, nodeaddrs = zip(*sorted(zip(nodes, nodeaddrs), key=sorting_key))
+        elif nodehostnames:
+            _, nodehostnames = zip(*sorted(zip(nodes, nodehostnames), key=sorting_key))
+        nodes = sorted_node_names
+    return nodes, nodeaddrs, nodehostnames
+
+
 def update_partitions(partitions, state):
     succeeded_partitions = []
     for partition in partitions:
@@ -156,7 +188,7 @@
     if expected_length and len(attribute) != expected_length:
         raise ValueError

-    return [",".join(batch) for batch in grouper(attribute, batch_size)]
+    return [batch for batch in grouper(attribute, batch_size)]


 def _batch_node_info(nodenames, nodeaddrs, nodehostnames, batch_size):
PATCH
cd /opt/parallelcluster/pyenv/versions/node_virtualenv/lib/python3.9/site-packages/common/schedulers
sudo patch -p0 < /tmp/node.patch
sudo /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/supervisorctl restart clustermgtd
  3. Start the compute fleet with the command: pcluster update-compute-fleet --cluster-name <cluster-name> --status START_REQUESTED
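Before and after restarting the fleet, it can help to confirm that clustermgtd has come back up cleanly under supervisord and that the fleet status changes as expected (a sketch using the same virtualenv path as the script above; <cluster-name> is a placeholder):

# Check that clustermgtd is running again after the restart in step 2.
sudo /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/supervisorctl status clustermgtd
# Check the current compute fleet status.
pcluster describe-compute-fleet --cluster-name <cluster-name>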

Error details

The problem is caused by a mismatch between NodeName and NodeAddr in the result of the scontrol update command. For example, scontrol update NodeName=B,A NodeAddr=B,A results in node A being updated with NodeAddr B, and node B being updated with NodeAddr A.
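One way to check an individual node is to compare the NodeAddr reported by scontrol against the node's name (a sketch; the node name is a placeholder taken from the log example below):

# Show the node record and look at the line containing NodeAddr/NodeHostName.
scontrol show node t2medium-dy-t2medium-1 | grep -i nodeaddr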

The problem can be identified by checking clustermgtd logs for the text “Resetting powering down nodes:” where the NodeName does not match the NodeAddr. For example, in the following log entry:

[slurm_plugin.clustermgtd:_handle_powering_down_nodes] - INFO - Resetting powering down nodes: (x2) ['t2medium-dy-t2medium-1(t2small-dy-t2small-1)', 't2small-dy-t2small-1(t2medium-dy-t2medium-1)']

the NodeName t2medium-dy-t2medium-1 should be matched with the NodeAddr t2medium-dy-t2medium-1, but it is instead matched with t2small-dy-t2small-1.
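To scan for affected entries, the clustermgtd log on the HeadNode can be searched directly (a sketch; /var/log/parallelcluster/clustermgtd is the usual log location in ParallelCluster 3.x, but verify the path on your installation):

grep "Resetting powering down nodes" /var/log/parallelcluster/clustermgtd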
