
(3.0.0‐3.8.0) Interactive job submission through srun can fail after increasing the number of compute nodes in the cluster


The issue

ParallelCluster allows you to extend the size of a cluster without requiring you to stop the compute fleet. Extending the size of a cluster includes adding new queues to the scheduler, adding new compute resources within a queue, or increasing the MaxCount of a compute resource.
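As a reference, the kind of change that triggers the behavior described here is, for example, raising MinCount or MaxCount of an existing compute resource in the cluster configuration and applying it with pcluster update-cluster. The snippet below is an abridged sketch of such a configuration (queue and compute resource names are illustrative, and required sections such as Networking are omitted):

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: q1
      ComputeResources:
        - Name: ttm
          InstanceType: t2.micro
          MinCount: 2
          MaxCount: 4   # raised, for example, from 2: new node definitions are added to slurm.conf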

According to the Slurm documentation reported below, adding or removing nodes from a cluster requires restarting both the slurmctld on the head node and the slurmd on all the compute nodes.

From https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION:

The configuration of nodes (or machines) to be managed by Slurm is also specified in /etc/slurm.conf. Changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons. All slurmd daemons must know each node in the system to forward messages in support of hierarchical communications.

The slurmctld on the head node is restarted during a cluster update operation, but the slurmd daemons running on the compute nodes are not. While there is no impact on job submissions through sbatch (nor on calls to srun made within a batch job submitted via sbatch), not restarting slurmd on the compute fleet when adding new nodes to the cluster may affect direct srun interactive job submissions (see the example after the list below):

  • srun jobs involving both new and old nodes, with communications from old to new nodes, are affected: if at any point an old node must propagate a Slurm RPC to a new node, the propagation fails, causing the whole srun to fail;
  • srun jobs running only on new nodes are not impacted, because those nodes all started together and therefore know about each other;
  • srun jobs running only on old nodes are not impacted, because those nodes already know about each other;
  • single-node srun jobs are not affected.
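As an illustration, assume nodes q1-st-ttm-[1-2] existed before the update and nodes q1-st-ttm-[3-4] were added afterwards by raising the counts of the compute resource (the node names and counts here are hypothetical):

# May fail: the allocation spans old and new nodes, so an old slurmd
# may have to forward a Slurm RPC to a node it does not know about yet
srun -N 4 --nodelist="q1-st-ttm-[1-4]" hostname

# Not affected: the allocation only involves nodes that existed before the update
srun -N 2 --nodelist="q1-st-ttm-[1-2]" hostname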

The error message

The error message shown when facing this issue is

srun: error: fwd_tree_thread: can't find address for host <hostname>, check slurm.conf

Affected versions (OSes, schedulers)

  • ParallelCluster 3.0.0 - latest
  • Slurm scheduler

Mitigation

To avoid any possible issue with srun job submissions, the simplest mitigation is to stop and then start the compute fleet:

pcluster update-compute-fleet -r <region> -n <cluster-name> --status STOP_REQUESTED

# Wait until all the compute nodes are DOWN

pcluster update-compute-fleet -r <region> -n <cluster-name> --status START_REQUESTED
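To confirm that the stop completed before requesting the start, you can, for example, poll the compute fleet status and wait until it reports STOPPED (same placeholders as above):

pcluster describe-compute-fleet -r <region> -n <cluster-name>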

Alternatively, you can follow the SchedMD guideline and restart the slurmd daemon on the active compute nodes.

You can retrieve the list of active nodes by issuing the sinfo command and filtering for nodes that are responding and not powered down (powered-down nodes carry a trailing "~" in their state):

[ec2-user@ip-172-31-29-2 ~]$ sinfo -t idle,alloc,allocated -h | grep -v "~" | tr -s " " | cut -d' ' -f6 > nodes.txt
[ec2-user@ip-172-31-29-2 ~]$ cat nodes.txt 
q1-st-ttm-[1-2]
q2-st-tts-[1-2]

then, using a parallel shell tool like ClusterShell as in the example below, you can restart the slurmd daemon on each host:

[ec2-user@ip-172-31-29-2 ~]$ clush --hostfile ./nodes.txt -f 4 'sudo systemctl restart slurmd && echo "slurmd restarted on host $(hostname)"'
q1-st-ttm-1: slurmd restarted on host q1-st-ttm-1
q1-st-ttm-2: slurmd restarted on host q1-st-ttm-2
q2-st-tts-2: slurmd restarted on host q2-st-tts-2
q2-st-tts-1: slurmd restarted on host q2-st-tts-1

where -f N is the fanout, i.e. the maximum number of hosts contacted in parallel.
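If ClusterShell is not available on the head node, a simple loop over the expanded host list achieves the same result. This is a sketch assuming the cluster user can SSH from the head node to the compute nodes (ParallelCluster normally sets this up) and has passwordless sudo there:

# Expand the Slurm hostlist expressions contained in nodes.txt into
# individual host names, then restart slurmd on each of them over SSH
for host in $(scontrol show hostnames "$(paste -sd, nodes.txt)"); do
    ssh "$host" 'sudo systemctl restart slurmd && echo "slurmd restarted on host $(hostname)"'
done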
