
(3.9.0-latest) SSH bootstrap cannot launch processes on remote host when using Intel MPI with Slurm 23.11


The problem

In ParallelCluster 3.9.0, Slurm has been upgraded to 23.11.4 (from 23.02.7). Slurm supports mpirun from Intel MPI by default and allows different [I_MPI_HYDRA_BOOTSTRAP](https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-9/hydra-environment-variables.html) mechanisms to be used.

Slurm 23.11 changed the behaviour of mpirun when I_MPI_HYDRA_BOOTSTRAP=slurm (the default) is used: it injects two environment variables and passes the --external-launcher option to the launcher command.

The documentation explains that a different bootstrap mechanism can be used by explicitly setting the I_MPI_HYDRA_BOOTSTRAP environment variable prior to submitting the job with sbatch or salloc.
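
For example (job.sh below is only a placeholder for a job script that runs mpirun), exporting the variable in the submission shell is enough, because sbatch propagates the submission environment to the job by default:

export I_MPI_HYDRA_BOOTSTRAP=ssh
sbatch job.sh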

As a consequence, if the application (e.g. Ansys Fluent) or the job submission script requests a different bootstrap launcher without setting the I_MPI_HYDRA_BOOTSTRAP variable, the job submission will fail with the following message:

[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
[mpiexec@ip-10-0-0-193] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on queue1-dy-t2-1 (pid 8914, exit code 65280)
[mpiexec@ip-10-0-0-193] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@ip-10-0-0-193] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@ip-10-0-0-193] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1065): error waiting for event
[mpiexec@ip-10-0-0-193] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@ip-10-0-0-193] Possible reasons:
[mpiexec@ip-10-0-0-193] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@ip-10-0-0-193] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@ip-10-0-0-193] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ip-10-0-0-193] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@ip-10-0-0-193]    You may try using -bootstrap option to select alternative launcher.

The -launcher=ssh flag corresponds to the undocumented -rsh=ssh flag; with either of them you will receive the same error.

Affected versions (OSes, schedulers)

  • ParallelCluster >= 3.9.0
  • Slurm >= 23.11

Solution

The solution, as stated in the documentation, is to set the I_MPI_HYDRA_BOOTSTRAP environment variable prior to submitting the job with sbatch or salloc. Example with an interactive salloc session (a batch-script sketch follows):

[ec2-user@ip-10-0-0-193 ~]$ export I_MPI_HYDRA_BOOTSTRAP=ssh
[ec2-user@ip-10-0-0-193 ~]$ salloc -n1
salloc: Granted job allocation 5
[ec2-user@ip-10-0-0-193 ~]$ module load intelmpi
Loading intelmpi version 2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ env | grep MPI
OMPI_MCA_plm_slurm_args=--external-launcher
I_MPI_HYDRA_BOOTSTRAP=ssh
I_MPI_ROOT=/opt/intel/mpi/2021.9.0

[ec2-user@ip-10-0-0-193 ~]$ mpirun -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -rsh=ssh -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
queue1-dy-t2-1
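
The same workaround can be used with sbatch. A minimal sketch, where run_hostname.sbatch and its contents are only illustrative (adapt the node/task counts and the application command to your cluster):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1

# mpirun inherits I_MPI_HYDRA_BOOTSTRAP=ssh from the submission environment
# (exported before calling sbatch, as in the interactive example above)
# and therefore uses the SSH bootstrap instead of the Slurm one.
module load intelmpi
mpirun -np ${SLURM_NTASKS} hostname

Submit it with sbatch run_hostname.sbatch after exporting I_MPI_HYDRA_BOOTSTRAP=ssh in the submission shell, as shown above.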

Details of the changes

With Slurm 23.11, I_MPI_HYDRA_BOOTSTRAP=slurm is the default bootstrap mechanism, and this is why the --external-launcher parameter is added:

[ec2-user@ip-10-0-0-193 ~]$ salloc -n1
salloc: Granted job allocation 4
[ec2-user@ip-10-0-0-193 ~]$ env | grep MPI
I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS=--external-launcher
OMPI_MCA_plm_slurm_args=--external-launcher
I_MPI_HYDRA_BOOTSTRAP=slurm

When a job is submitted with the default bootstrap (slurm), it works as expected:

[ec2-user@ip-10-0-0-193 ~]$ module load intelmpi
Loading intelmpi version 2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ mpirun -np 1 hostname
queue1-dy-t2-1

If the application launches mpirun with the -rsh=ssh or -launcher=ssh flag, it is asking for the bootstrap launcher to be ssh rather than slurm. If the application does not also set the I_MPI_HYDRA_BOOTSTRAP variable, it will fail with an error like the following:

[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
[mpiexec@ip-10-0-0-193] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on queue1-dy-t2-1 (pid 8914, exit code 65280)
[mpiexec@ip-10-0-0-193] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@ip-10-0-0-193] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@ip-10-0-0-193] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1065): error waiting for event
[mpiexec@ip-10-0-0-193] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@ip-10-0-0-193] Possible reasons:
[mpiexec@ip-10-0-0-193] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@ip-10-0-0-193] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@ip-10-0-0-193] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ip-10-0-0-193] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@ip-10-0-0-193]    You may try using -bootstrap option to select alternative launcher.
[ec2-user@ip-10-0-0-193 ~]$ mpirun -rsh=ssh -np 1 hostname
[mpiexec@ip-10-0-0-193] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on queue1-dy-t2-1 (pid 7653, exit code 65280)
[mpiexec@ip-10-0-0-193] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@ip-10-0-0-193] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@ip-10-0-0-193] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1065): error waiting for event
[mpiexec@ip-10-0-0-193] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@ip-10-0-0-193] Possible reasons:
[mpiexec@ip-10-0-0-193] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@ip-10-0-0-193] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@ip-10-0-0-193] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ip-10-0-0-193] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@ip-10-0-0-193]    You may try using -bootstrap option to select alternative launcher.

To fix the issue, export the documented environment variable before submitting the job, and everything will work as expected:

[ec2-user@ip-10-0-0-193 ~]$ export I_MPI_HYDRA_BOOTSTRAP=ssh
[ec2-user@ip-10-0-0-193 ~]$ salloc -n1
salloc: Granted job allocation 5
[ec2-user@ip-10-0-0-193 ~]$ env | grep MPI
OMPI_MCA_plm_slurm_args=--external-launcher
I_MPI_HYDRA_BOOTSTRAP=ssh
I_MPI_ROOT=/opt/intel/mpi/2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ module load intelmpi
Loading intelmpi version 2021.9.0

[ec2-user@ip-10-0-0-193 ~]$ mpirun -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -rsh=ssh -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
queue1-dy-t2-1