
(3.6.0‐3.6.1) Slurm NodeHostName and NodeAddr mismatch for MultiNIC instance when managed DNS is disabled and EC2 Hostnames are used

Eddy Mwiti edited this page Sep 1, 2023 · 1 revision

Bug description

When Slurm compute nodes are backed by an instance type with multiple network cards (e.g. p4d.24xlarge, hpc6id.32xlarge), the Slurm node NodeHostName attribute may not match the NodeAddr attribute if cluster managed DNS is disabled and EC2 hostnames are used.

The mismatch between NodeHostName and NodeAddr is caused by the random order of the network interfaces in the EC2 DescribeInstances API output, which ParallelCluster uses to enumerate those interfaces. The mismatch can break jobs that rely on knowing node hostnames in order to run, such as MPI jobs.

The issue can be identified in the instance launch log, either /var/log/parallelcluster/slurm_resume.log (for dynamic nodes) or /var/log/parallelcluster/clustermgtd.log (for static nodes): in the affected entries, the hostname attribute doesn’t correspond to the value of private_ip, as in the following example:

2023-08-25 09:44:39,979 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('q1-dy-c1-1', EC2Instance(id='i-03faa591c09e638cc', private_ip='192.168.90.6', hostname='ip-192-168-93-90', launch_time=datetime.datetime(2023, 8, 25, 9, 44, 34, tzinfo=tzlocal()), slurm_node=None))"]

where private_ip='192.168.90.6' doesn’t match the hostname='ip-192-168-93-90'.
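
The check described above can be automated. The following is a minimal sketch, not part of ParallelCluster: it assumes IP-based EC2 hostnames of the form ip-a-b-c-d (other hostname styles are skipped), and the helper names `hostname_to_ip` and `find_mismatches` are hypothetical. It scans log text for `private_ip`/`hostname` pairs and reports those where the hostname encodes a different address.

```python
import ipaddress
import re

# Matches the private_ip and hostname fields of an EC2Instance(...) log entry.
_ENTRY = re.compile(r"private_ip='(?P<ip>[^']+)', hostname='(?P<hostname>[^']+)'")


def hostname_to_ip(hostname):
    """Convert an IP-based EC2 hostname such as 'ip-192-168-93-90' to '192.168.93.90'.

    Returns None when the hostname does not follow that pattern (assumption:
    only IP-based hostnames can be checked this way).
    """
    match = re.fullmatch(r"ip-(\d+)-(\d+)-(\d+)-(\d+)", hostname)
    if not match:
        return None
    candidate = ".".join(match.groups())
    try:
        ipaddress.IPv4Address(candidate)
    except ValueError:
        return None
    return candidate


def find_mismatches(log_text):
    """Return (private_ip, hostname) pairs whose hostname encodes a different IP."""
    mismatches = []
    for match in _ENTRY.finditer(log_text):
        ip, hostname = match["ip"], match["hostname"]
        encoded_ip = hostname_to_ip(hostname)
        if encoded_ip is not None and encoded_ip != ip:
            mismatches.append((ip, hostname))
    return mismatches


# Fragment of the example log entry shown above.
sample = (
    "EC2Instance(id='i-03faa591c09e638cc', private_ip='192.168.90.6', "
    "hostname='ip-192-168-93-90', launch_time=...)"
)
print(find_mismatches(sample))
```

To check a real cluster, the contents of the relevant log file could be read and passed to `find_mismatches` in place of `sample`; an empty list means no mismatching entries were found.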

Affected versions (OSes, schedulers)

  • ParallelCluster 3.6.0 - 3.6.1
  • Slurm Scheduler
  • cluster managed DNS disabled, via SlurmSettings/Dns/DisableManagedDns=true
  • EC2 hostnames enabled, via SlurmSettings/Dns/UseEc2Hostnames=true
  • multi-NIC instance types, e.g. p4d.24xlarge, hpc6id.32xlarge

Mitigation

The following mitigation has been tested on ParallelCluster version 3.6.1.

  1. Save the following text as /tmp/pcluster.patch on your head node:
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py
index 9c21a48..5a42772 100644
--- a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py
+++ b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/common/ec2_utils.py
@@ -27,3 +27,23 @@ def get_private_ip_address(instance_info):
             private_ip = network_interface["PrivateIpAddress"]
             break
     return private_ip
+
+
+def get_private_ip_address_and_dns_name(instance_info):
+    """
+    Return the PrivateIpAddress and PrivateDnsName of the EC2 instance.
+
+    The PrivateIpAddress and PrivateDnsName are considered to be the ones for the
+    network interface with DeviceIndex = NetworkCardIndex = 0.
+    :param instance_info: the dictionary returned by a EC2:DescribeInstances call.
+    :return: the PrivateIpAddress and PrivateDnsName of the instance.
+    """
+    private_ip = instance_info["PrivateIpAddress"]
+    private_dns_name = instance_info["PrivateDnsName"]
+    for network_interface in instance_info["NetworkInterfaces"]:
+        attachment = network_interface["Attachment"]
+        if attachment.get("DeviceIndex", -1) == 0 and attachment.get("NetworkCardIndex", -1) == 0:
+            private_ip = network_interface["PrivateIpAddress"]
+            private_dns_name = network_interface["PrivateDnsName"]
+            break
+    return private_ip, private_dns_name
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py
index 4bdd291..c757ce5 100644
--- a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py
+++ b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/fleet_manager.py
@@ -15,7 +15,7 @@ from abc import ABC, abstractmethod
 
 import boto3
 from botocore.exceptions import ClientError
-from common.ec2_utils import get_private_ip_address
+from common.ec2_utils import get_private_ip_address, get_private_ip_address_and_dns_name
 
 logger = logging.getLogger(__name__)
 
@@ -48,10 +48,11 @@ class EC2Instance:
     @staticmethod
     def from_describe_instance_data(instance_info):
         try:
+            private_ip, private_dns_name = get_private_ip_address_and_dns_name(instance_info)
             return EC2Instance(
                 instance_info["InstanceId"],
-                get_private_ip_address(instance_info),
-                instance_info["PrivateDnsName"].split(".")[0],
+                private_ip,
+                private_dns_name.split(".")[0],
                 instance_info["LaunchTime"],
             )
         except KeyError as e:
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py
index 7ec9bc8..646287f 100644
--- a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py
+++ b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/instance_manager.py
@@ -21,7 +21,7 @@ from typing import Iterable
 import boto3
 from botocore.config import Config
 from botocore.exceptions import ClientError
-from common.ec2_utils import get_private_ip_address
+from common.ec2_utils import get_private_ip_address, get_private_ip_address_and_dns_name
 from common.schedulers.slurm_commands import update_nodes
 from common.utils import grouper
 from slurm_plugin.common import ComputeInstanceDescriptor, log_exception, print_with_count
@@ -349,11 +349,12 @@ class InstanceManager:
         instances = []
         for instance_info in filtered_iterator:
             try:
+                private_ip, private_dns_name = get_private_ip_address_and_dns_name(instance_info)
                 instances.append(
                     EC2Instance(
                         instance_info["InstanceId"],
-                        get_private_ip_address(instance_info),
-                        instance_info["PrivateDnsName"].split(".")[0],
+                        private_ip,
+                        private_dns_name.split(".")[0],
                         instance_info["LaunchTime"],
                     )
                 )
  2. Create and run the following script on the head node as the root user:
#!/bin/bash
set -e

# The patch must be applied from the root path (/)
pushd /
# Apply the patch, saving a backup of each modified file as *.orig
patch -p1 -b < /tmp/pcluster.patch

# Restart clustermgtd
/opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/supervisorctl restart clustermgtd

popd
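
To illustrate what the patch changes, the selection logic it adds can be exercised in isolation. The sketch below copies the body of `get_private_ip_address_and_dns_name` from the patch and runs it on a hand-written dictionary. The instance data is hypothetical: the IP addresses are borrowed from the log example, the `ec2.internal` domain is an assumption, and `make_instance_info` is a helper introduced only for this example. Because both values are read from the same network interface (DeviceIndex = NetworkCardIndex = 0), the result should not depend on the order in which the interfaces are listed.

```python
def get_private_ip_address_and_dns_name(instance_info):
    """Same logic as the function added to common/ec2_utils.py by the patch."""
    private_ip = instance_info["PrivateIpAddress"]
    private_dns_name = instance_info["PrivateDnsName"]
    for network_interface in instance_info["NetworkInterfaces"]:
        attachment = network_interface["Attachment"]
        if attachment.get("DeviceIndex", -1) == 0 and attachment.get("NetworkCardIndex", -1) == 0:
            private_ip = network_interface["PrivateIpAddress"]
            private_dns_name = network_interface["PrivateDnsName"]
            break
    return private_ip, private_dns_name


# Hypothetical network interfaces of a two-NIC instance (trimmed to the keys used).
primary = {
    "Attachment": {"DeviceIndex": 0, "NetworkCardIndex": 0},
    "PrivateIpAddress": "192.168.90.6",
    "PrivateDnsName": "ip-192-168-90-6.ec2.internal",
}
secondary = {
    "Attachment": {"DeviceIndex": 1, "NetworkCardIndex": 1},
    "PrivateIpAddress": "192.168.93.90",
    "PrivateDnsName": "ip-192-168-93-90.ec2.internal",
}


def make_instance_info(interfaces):
    """Build a hypothetical DescribeInstances entry whose top-level fields
    point at the secondary interface, mimicking the mismatch scenario."""
    return {
        "InstanceId": "i-03faa591c09e638cc",
        "PrivateIpAddress": secondary["PrivateIpAddress"],
        "PrivateDnsName": secondary["PrivateDnsName"],
        "NetworkInterfaces": interfaces,
    }


results = []
for interfaces in ([primary, secondary], [secondary, primary]):
    private_ip, private_dns_name = get_private_ip_address_and_dns_name(
        make_instance_info(interfaces)
    )
    # The patched callers keep only the short hostname.
    results.append((private_ip, private_dns_name.split(".")[0]))

for private_ip, hostname in results:
    print(private_ip, hostname)
```

With both interface orderings the sketch yields the same pair, `192.168.90.6` and `ip-192-168-90-6`, which is the consistency the patched `EC2Instance` construction relies on.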
