Skip to content

(3.3.0 3.5.1) Cluster updates can break Slurm accounting functionality

mauri-melato edited this page Apr 4, 2023 · 3 revisions

The issue

In ParallelCluster versions 3.3.0 up to 3.5.1, if you enabled the Slurm accounting feature, a cluster-update operation involving any change to settings under the Scheduling section, will break Slurm accounting configuration on the system.

Slurm accounting functionality will continue running unless a restart of the slurmdbd or, more in general, a reboot of the head node is executed.

Affected versions

  • ParallelCluster 3.3.0 - 3.5.1, if Slurm Accounting functionality is enabled.
  • All OSes are affected.

How to detect if you are affected

Any Slurm command trying to query the Slurm database will fail with a Connection refused error:

$ sacct 
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:ip-10-0-0-213:6819: Connection refused

And the slurmdbd service will be in failed state.

You can double check that, as a consequence of a update-cluster, you have a default dummy value for the StoragePass parameter in the slurm_parallelcluster_slurmdbd.conf file:

$ sudo grep StoragePass /opt/slurm/etc/slurm_parallelcluster_slurmdbd.conf
StoragePass=dummy

Mitigation

You can restore the original password by running the following command from the head node:

sudo /opt/parallelcluster/scripts/slurm/update_slurm_database_password.sh

Check slurmdbd service status in the head node

sudo systemctl status slurmdbd

Only if the the service is in a failed state, restart it

sudo systemctl restart slurmdbd

Verify that the Slurm accounting functionality is back

$ sacct
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
...

Other info

You can find more details about Slurm accounting in ParallelCluster in the official documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/slurm-accounting-v3.html#slurm-accounting-considerations-v3

Clone this wiki locally