MPI jobs fail with intel toolchains after upgrade of EL8 Linux from 8.5 to 8.6 #15651

Closed
OleHolmNielsen opened this issue Jun 9, 2022 · 44 comments

@OleHolmNielsen
Contributor

OleHolmNielsen commented Jun 9, 2022

I'm testing the upgrade of our compute nodes from Almalinux 8.5 to 8.6 (the RHEL 8 clone similar to Rocky Linux).

We have found that all MPI codes built with either of the Intel toolchains intel/2020b or intel/2021b fail after the 8.5 to 8.6 upgrade. The codes also fail on login nodes, so the Slurm queue system is not involved.
The FOSS toolchains foss/2020b and foss/2021b work perfectly on EL 8.6, however.

My simple test uses the attached trivial MPI Hello World code running on a single node:

$ module load intel/2021b
$ mpicc mpi_hello_world.c
$ mpirun ./a.out

Now the mpirun command enters an infinite loop (running many minutes) and we see these processes with "ps":

/bin/sh /home/modules/software/impi/2021.4.0-intel-compilers-2021.4.0/mpi/2021.4.0/bin/mpirun ./a.out
mpiexec.hydra ./a.out

The mpiexec.hydra process doesn't respond to 15/SIGTERM and I have to kill it with 9/SIGKILL. I've tried to enable debugging output with

export I_MPI_HYDRA_DEBUG=1
export I_MPI_DEBUG=5

but nothing gets printed from this.

Question: Has anyone tried an EL 8.6 Linux with the Intel toolchain and mpiexec.hydra? Can you suggest how I may debug this issue?

OS information:

$ cat /etc/redhat-release
AlmaLinux release 8.6 (Sky Tiger)
$ uname -r
4.18.0-372.9.1.el8.x86_64
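
Since the hang produces no output, one way to turn "it hangs" into something scriptable is to run the command under a timeout. The following is a minimal sketch, not part of the original report; the helper name `run_with_timeout` is hypothetical, and it kills the child with SIGKILL because mpiexec.hydra was observed to ignore SIGTERM.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False (and returncode None) when the command did not
    exit within timeout_s seconds and had to be killed.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return True, proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; SIGTERM is reportedly ignored
        proc.wait()  # reap the child so no zombie is left behind
        return False, None


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. On a compute node one
    # would pass e.g. ["mpiexec.hydra", "--version"] instead.
    print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```

A result of `(False, None)` for `["mpiexec.hydra", "--version"]` would indicate the hang described above.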
@ocaisa
Member

ocaisa commented Jun 9, 2022

Quoting some discussion we've had on this in Slack:

No, it's not a glibc issue afaics. If you boot a RHEL 8.5 kernel (with an otherwise up-to-date RHEL 8.6 system), Intel MPI works.

@boegel boegel added this to the next release (4.5.6?) milestone Jun 9, 2022
@boegel
Member

boegel commented Jun 9, 2022

Yikes...

@OleHolmNielsen Have you been in touch with Intel support on this?

@rscohn2 Any thoughts on this?

@OleHolmNielsen
Contributor Author

I didn't know that this issue is related to the updated RHEL 8.6 kernel, so I didn't contact Intel support yet. I've never been in touch with Intel compiler/libraries support before, so if someone else knows how to do that, could you kindly open an issue with them?
Thanks,
Ole

@boegel
Member

boegel commented Jun 9, 2022

We ran into a silent hang issue several years ago too, details in hpcugent/vsc-mympirun#74

Any luck w.r.t. getting output when using mpirun -d?

@daRecall

daRecall commented Jun 9, 2022

It seems that the NUMA information exposed by the kernel has changed, although the kernel release notes don't mention anything.
Intel MPI versions before 2021.6.0 get stuck.
Using pstack, one can see that the processes hang in an infinite loop somewhere around ipl_detect_machine_topology.
That happens even before mpiexec.hydra does anything with the binary to be launched (be it a.out or hostname).

@daRecall

daRecall commented Jun 9, 2022

@boegel mpiexec.hydra does not know the -d parameter:

$> mpirun -d -np 2 hostname
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument d
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] Similar arguments:
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          membind
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          debug
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          dac
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          disable-x
[mpiexec@nrm095.hpc.itc.rwth-aachen.de]          demux
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1356): error parsing input array
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1749): error parsing parameters

It does know --debug, but the only thing that shows is the launch command:

$> mpiexec.hydra --debug -np 2 hostname
[mpiexec@nrm095.hpc.itc.rwth-aachen.de] Launch arguments: /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/bin//hydra_bstrap_proxy --upstream-host nrm095.hpc.itc.rwth-aachen.de --upstream-port 44829 --pgid 0 --launcher ssh --launcher-number 0 --base-path /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9

@ocaisa
Member

ocaisa commented Jun 9, 2022

Looking at https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/bug-mpiexec-segmentation-fault/m-p/1183364, you can influence this with

I_MPI_HYDRA_TOPOLIB=ipl

(Ha, look who is posting the last comment in that link)

@daRecall

daRecall commented Jun 9, 2022

using impi 2021.6.0, everything is working:

$> mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

$> mpiexec.hydra -np 2 hostname
nrm095.hpc.itc.rwth-aachen.de
nrm095.hpc.itc.rwth-aachen.de

@daRecall

daRecall commented Jun 9, 2022

@ocaisa that doesn't change anything, looping around the same function

@daRecall

daRecall commented Jun 9, 2022

ahh, yes, I forgot, we have had an issue open with intel (case# 05472393) regarding the problem. Their first comment was as usual "are you trying the newest version?"
I did not visit ISC this year, but some of my colleagues did and they talked to some intel guys directly. Outcome was, with RHEL 8.6 and newer, old intelmpi is not working anymore.

@OleHolmNielsen
Contributor Author

Is there any chance that Red Hat will accept a bug report for the older IntelMPI versions not working?
This would require a deeper understanding of what changed in the new kernel that breaks Intel MPI, so documenting the bug might be a challenge...

@stdweird
Contributor

stdweird commented Jun 9, 2022

@OleHolmNielsen kernel updates that break userspace are frowned upon, so you can try to open a bug report with Red Hat. At some point they will ask you for what they need, or point you to the release notes that say what changed to break this. They will probably blame Intel (and it sounds like Intel already fixed it, but doesn't want to backport the fix).

@OleHolmNielsen
Contributor Author

@stdweird Yes, but how do we get any error messages from mpiexec.hydra which can be reported to Red Hat?

@stdweird
Contributor

stdweird commented Jun 9, 2022

@OleHolmNielsen the error you need to report is that an application is hanging since an upgrade to RHEL8.6 was done. you can already add what was said here (ie it works on 8.5, pstack points to the ipl thingie so they can have some idea in what direction to look).

@OleHolmNielsen
Contributor Author

@stdweird Thanks for the info. I have made this test:

$ module load iimpi/2021b
$ module list
Currently Loaded Modules:

  1) GCCcore/11.2.0                 5) numactl/2.0.14-GCCcore-11.2.0
  2) zlib/1.2.11-GCCcore-11.2.0     6) UCX/1.11.2-GCCcore-11.2.0
  3) binutils/2.37-GCCcore-11.2.0   7) impi/2021.4.0-intel-compilers-2021.4.0
  4) intel-compilers/2021.4.0       8) iimpi/2021b

$ which mpiexec
/home/modules/software/impi/2021.4.0-intel-compilers-2021.4.0/mpi/2021.4.0/bin/mpiexec
$ mpiexec.hydra --version

The mpiexec.hydra process hangs, so I can run pstack on its PID:

$ pstack 717906
#0 0x000000000045009a in ipl_get_exclude_mask (str=, mask=, maxcpu=) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:529
#1 IPL_init_numa_nodes (ncpu=26320320, n_avail_cpu=1) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:715
#2 0x000000000044b347 in ipl_detect_machine_topology () at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1614
#3 0x0000000000449bc8 in ipl_processor_info (info=0x1919dc0, pid=0x1, detect_platform_only=26320320) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1908
#4 0x000000000044c282 in ipl_entrance (detect_platform_only=26320320) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_main.c:38
#5 0x000000000041b958 in i_set_core_and_thread_count () at ../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec_params.h:284
#6 mpiexec_get_parameters (t_argv=0x1919dc0) at ../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1266
#7 0x00000000004049fb in main (argc=26320320, argv=0x1) at ../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1763

Do we agree that this is the issue which I should report to Red Hat?

Thanks,
Ole
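
For a bug report it can help to show that repeated pstack samples keep landing in the same functions, which supports the "spinning in user space" reading. Below is a rough sketch under the assumption that frames look like the pstack output above; `frame_functions` and `stable_frames` are hypothetical helper names, and the regular expression only covers the two frame shapes seen in this thread.

```python
import re

# Matches "#0 0x... in func (" as well as "#1 func (" (no address shown).
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)\s*\("
)


def frame_functions(pstack_text):
    """Return the function names of one pstack sample, innermost first."""
    funcs = []
    for line in pstack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            funcs.append(match.group(2))
    return funcs


def stable_frames(samples):
    """Return functions present in every sample, in first-sample order."""
    if not samples:
        return []
    common = set(samples[0]).intersection(*map(set, samples[1:]))
    return [func for func in samples[0] if func in common]
```

Feeding several samples taken a few seconds apart into `stable_frames` should, for the hang described here, keep returning the ipl_* functions from the trace above.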

@stdweird
Contributor

stdweird commented Jun 9, 2022

@OleHolmNielsen the issue to report to Red Hat is that your application hangs after an upgrade. Red Hat has no knowledge of Intel MPI itself (and they will most likely not provide a solution, only an explanation).

@OleHolmNielsen
Contributor Author

I have created an issue in the Red Hat Bugzilla:
[Bug 2095281] New: Intel MPI mpiexec.hydra hangs after upgrade to RHEL 8.6
This bug is unfortunately not accessible to others because it relates to the kernel.

@truatpasteurdotfr

AFAIK, you can add anyone (with their email) to the report, so that they can also read it...

@OleHolmNielsen
Contributor Author

If anyone would like their e-mail address added to Red Hat bug 2095281, just ask me.

@bgoglin

bgoglin commented Jun 10, 2022

If there's a regression in the RHEL kernel topology information, you may want to compare the output of lstopo before and after the upgrade.

@OleHolmNielsen
Contributor Author

@bgoglin I took an EL85 node and copied the output of lstopo to a file. Then I upgraded the node to EL86 and rebooted. The EL86 lstopo output is 100% identical to that of EL85.
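
The before/after comparison can also be scripted so it is repeatable across many nodes. This is a small sketch, assuming the two lstopo outputs were saved as text; the function name `topology_diff` and the file labels are made up for illustration.

```python
import difflib


def topology_diff(before_text, after_text):
    """Return unified-diff lines between two saved lstopo outputs.

    An empty list means the two outputs are identical.
    """
    return list(
        difflib.unified_diff(
            before_text.splitlines(),
            after_text.splitlines(),
            fromfile="lstopo-el85.txt",  # hypothetical label
            tofile="lstopo-el86.txt",    # hypothetical label
            lineterm="",
        )
    )
```

An empty result corresponds to the "100% identical" observation above.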

@OleHolmNielsen
Contributor Author

The Intel MPI Release Notes at https://www.intel.com/content/www/us/en/developer/articles/release-notes/mpi-library-release-notes-linux.html don't mention any bugs related to mpiexec.hydra, there's only a terse "Bug fixes" line.
I have not been able to locate the mentioned intel case# 05472393.
It would seem that going forward with EL8.6, we can no longer use the older Intel MPI libraries prior to 2021.6. So much for all the EasyBuild modules based on intel toolchains which we have already installed :-(

@OleHolmNielsen
Contributor Author

I received a response in Red Hat bug 2095281:

I agree that it looks like the kernel should be blamed too, but
this is not necessarily true.

Finally. In any case the application is buggy. It should not spin in the
infinite loop anyway. According to pstack it doesn't hang in syscall. And
this is what we need to investigate first, imo. Until then it is absolutely
unclear how can we find the root of the problem, if _if_ the kernel is wrong.

In short. IMO, this is user-space bug no matter what.

So the conclusion is that Intel MPI prior to 2021.6 is buggy. We cannot use older Intel MPI versions on EL 8.6 kernels then :-(

If no workaround is found, it seems that all EB modules iimpi/* prior to 2021.6 have to be discarded after we upgrade from EL 8.5 to 8.6.

@boegel
Member

boegel commented Jun 17, 2022

Or the impi in the installed iimpi and intel toolchains is updated in place to 2021.6 (not happy with that workaround, but I see no better alternative).

@akesandgren
Contributor

Should only be done on a per-site initiative I think.

@OleHolmNielsen
Contributor Author

For the record: When I load the module iimpi/2021b on an EL 8.6 node running kernel 4.18.0-372.9.1.el8.x86_64, the mpiexec.hydra enters an infinite loop while reading /sys/devices/system/node/node0/cpulist as seen by strace:

$ strace -f -e file mpiexec.hydra --version
(many lines deleted)
openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(-1, "/sys/devices/system/cpu/possible", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node0/cpulist", O_RDONLY) = 3
(Now I type Ctrl-C)
^C--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
strace: Process 4960 detached

After rebooting the node with the EL 8.5 kernel 4.18.0-348.23.1.el8_5.x86_64 the mpiexec.hydra works correctly.

I've now built the EB module iimpi/2022.05 which contains the latest Intel MPI module:

$ ml
Currently Loaded Modules:

  1) GCCcore/11.3.0                 5) numactl/2.0.14-GCCcore-11.3.0
  2) zlib/1.2.12-GCCcore-11.3.0     6) UCX/1.12.1-GCCcore-11.3.0
  3) binutils/2.38-GCCcore-11.3.0   7) impi/2021.6.0-intel-compilers-2022.1.0
  4) intel-compilers/2022.1.0       8) iimpi/2022.05

Running this module on the EL 8.6 node running kernel 4.18.0-372.9.1.el8.x86_64 the mpiexec.hydra works correctly (as observed by others):

$ mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.
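
For context on what is being read when the hang occurs: the cpulist file holds a comma-separated list of CPU ids and inclusive ranges, such as "0-3,8-11". The sketch below parses that format; it is an illustration of the file format only, not a reconstruction of what Intel MPI does internally, and `parse_cpulist` is a hypothetical name.

```python
def parse_cpulist(text):
    """Expand a sysfs cpulist string such as "0-3,8-11" into CPU ids.

    An empty string (for example a NUMA node without CPUs) gives [].
    """
    cpus = []
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))  # range is inclusive
        else:
            cpus.append(int(part))
    return cpus
```

For example, `parse_cpulist("0-3,8-11\n")` yields the eight ids 0 to 3 and 8 to 11.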

@OleHolmNielsen
Contributor Author

One additional piece of information concerns the Intel MKL library: I've built the latest EB module imkl/2022.1.0 which includes an HPL benchmark executable .../modules/software/imkl/2022.1.0/mkl/2022.1.0/benchmarks/linpack/xlinpack_xeon64

Running the MKL2022.1.0 xlinpack_xeon64 executable also results in multiple copies of mpiexec.hydra in infinite loops, just like with Intel MPI prior to 2021.6.

I think there exists a newer MKL 2022.2.0 but I don't know how to make an EB module with it for testing - can anyone help?

@branfosj
Member

branfosj commented Jun 21, 2022

I think there exists a newer MKL 2022.2.0 but I don't know how to make an EB module with it for testing - can anyone help?

I see 2022.1.0 on https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html#inpage-nav-9-7

This is the easyconfig that you've tested: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/i/imkl/imkl-2022.1.0.eb - to update it you would change:

source_urls = ['https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/']
sources = ['l_onemkl_p_%(version)s.223_offline.sh']

with the relevant source url and source for the offline Linux installer.

@OleHolmNielsen
Contributor Author

OleHolmNielsen commented Jul 11, 2022

I have built the intel/2022a toolchain with EB 4.6.0, and I can confirm that with the new module impi/2021.6.0-intel-compilers-2022.1.0 the above issue with all previous Intel MPI versions has been resolved:

$ module load impi/2021.6.0-intel-compilers-2022.1.0
$ mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

Of course, we still face an issue with all software modules that use the Intel MPI module prior to 2021.6.0 being broken on EL8 systems running the latest kernel.

@daRecall

daRecall commented Aug 8, 2022

We got some feedback from intel:

The issue was analyzed and the root cause was found. In RHEL8.6 and other OS with recent kernel versions, system files are reported to have 0 bytes size. In previous kernel versions ftell was reporting size == blocksize != 0.

Using size==0 lead to a memory leak with the known consequences.

I have written a small workaround library that can be used with LD_PRELOAD. This lib will use an "adapted" version of ftell for the startup of IMPI. Once the program is started there should be no issue. It is also possible to switch off LD_PRELOAD for the user mpi program.

If this form of workaround is acceptable and you are willing to test it I can attach it to this issue.

Preferred methodology is, however, to use the newest version of IMPI.
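
Intel's explanation points at code that sizes a read from the file size the OS reports. Intel's actual code is not shown in this thread, so the following only illustrates the general pitfall with two hypothetical helpers: one that trusts a size hint and one that simply reads until end of file, which does not depend on the reported size.

```python
def read_with_size_hint(path, size_hint):
    """Fragile pattern: read exactly as many bytes as the size hint says.

    With size_hint == 0 (what the affected kernels report for cpulist,
    per Intel's analysis) this returns no data at all.
    """
    with open(path, "rb") as handle:
        return handle.read(size_hint)


def read_until_eof(path, chunk_size=4096):
    """Robust pattern: keep reading until read() returns no more data."""
    data = bytearray()
    with open(path, "rb") as handle:
        while True:
            block = handle.read(chunk_size)
            if not block:  # empty bytes object means end of file
                break
            data += block
    return bytes(data)
```

Code that then loops or allocates based on an empty buffer needs its own guard; the second helper merely avoids depending on the reported size in the first place.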

@boegel
Member

boegel commented Aug 9, 2022

We got some feedback from intel:

The issue was analyzed and the root cause was found. In RHEL8.6 and other OS with recent kernel versions, system files are reported to have 0 bytes size. In previous kernel versions ftell was reporting size == blocksize != 0.

@daRecall That's... interesting. Is that a deliberate change, perhaps related to security or something?

Using size==0 lead to a memory leak with the known consequences.

I have written a small workaround library that can be used with LD_PRELOAD. This lib will use an "adapted" version of ftell for the startup of IMPI. Once the program is started there should be no issue. It is also possible to switch off LD_PRELOAD for the user mpi program.

If this form of workaround is acceptable and you are willing to test it I can attach it to this issue.

I would certainly like to see this, if only to learn more about the underlying issue... Is this library available publicly somewhere?

Preferred methodology is, however, to use the newest version of IMPI.

I consider both "mangling" the existing intel toolchains to use a more recent Intel MPI and using an $LD_PRELOAD library to be dirty workarounds, but I don't see another option here...

@michaellass
Contributor

michaellass commented Aug 9, 2022

The issue was analyzed and the root cause was found. In RHEL8.6 and other OS with recent kernel versions, system files are reported to have 0 bytes size. In previous kernel versions ftell was reporting size == blocksize != 0.

Using size==0 lead to a memory leak with the known consequences.

Is this related to the size of the cpulist and cpumap files? If so, there is a kernel fix available: torvalds/linux@7ee951a

Ah, I see that cpulist was mentioned in #15651 (comment). So testing this kernel patch seems worth a try.

@OleHolmNielsen
Contributor Author

Red Hat has issued a Knowledgebase article Intel MPI version 2019 hangs while reading cpulist about this:
https://access.redhat.com/solutions/6972452

Issue: Running even a simple mpirun on Intel's MPI version 2019 hangs after reading /sys/devices/system/node/node0/cpulist
Resolution: Kernel-side patch is in the process of being merged into RHEL kernel via BZ#2089715. Alternatively, updating Intel MPI to version 2021 or later reportedly also has fixed behavior, omitting the problem.

The issue may be fixed in rhel-8 with kernel-4.18.0-414.el8 (not yet available). The latest available kernel is kernel-4.18.0-372.19.1.el8_6.x86_64.

@stdweird
Contributor

stdweird commented Sep 5, 2022

@OleHolmNielsen that is probably 8.7 kernel. is there any indication they will backport it to 8.6? (probably a separate BZ ticket will be created for that; but the original BZ you mentioned is not accessible, so i can't check)

@OleHolmNielsen
Contributor Author

@stdweird You are very likely right about the 8.7 kernel. I have asked once again in https://bugzilla.redhat.com/show_bug.cgi?id=2095281 plus https://bugzilla.redhat.com/show_bug.cgi?id=2089715 if the fix will become available in an 8.6 kernel.

@stdweird
Contributor

stdweird commented Sep 5, 2022

@OleHolmNielsen thanks a lot for tracking this!

@OleHolmNielsen
Contributor Author

I received a reply from Red Hat in BZ case https://bugzilla.redhat.com/show_bug.cgi?id=2089715 as follows:

the fix for RHEL-8.6 is handled in bz#2112030 (private) and will be released
soon. If you need more details, please reach out to Red Hat by opening a
support case.

@stdweird
Contributor

stdweird commented Sep 5, 2022

@OleHolmNielsen excellent news

@bartoldeman
Contributor

@daRecall can you still attach the LD_PRELOAD workaround? We're mainly using Open MPI, and people programming MPI can easily use the newest version, but it's tougher with commercial packages such as Ansys and Star CCM+ that ship with particular older Intel MPI versions and are very intertwined with it.

@stdweird
Contributor

@OleHolmNielsen kernel-4.18.0-372.26.1.el8_6.x86_64 is out, containing the fix * Intel MPI 2019.0 - mpirun stuck on latest kernel (BZ#2112030)

@OleHolmNielsen
Contributor Author

Yes, the kernel fix for RHEL 8.6 is out, see https://access.redhat.com/errata/RHSA-2022:6460
This kernel is not yet out for AlmaLinux and RockyLinux, but that should be imminent.

To verify whether the patch has been applied or not, list this file:

$ ls -l /sys/devices/system/node/node0/cpulist
-r--r--r--. 1 root root 0 Sep 13 16:50 /sys/devices/system/node/node0/cpulist

On a patched kernel the file size must be >0. A size of 0, as in the listing above, means the fix is not installed yet.
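
The same check can be scripted, for instance in a node health check. This is a minimal sketch; the function name `reports_nonzero_size` is hypothetical, and the default path is the one listed above.

```python
import os

CPULIST = "/sys/devices/system/node/node0/cpulist"


def reports_nonzero_size(path=CPULIST):
    """Return True if the file's reported size is greater than zero.

    Equivalent to looking at the size column of "ls -l" for that path.
    """
    return os.stat(path).st_size > 0
```

On a node, `reports_nonzero_size()` returning False would correspond to the unpatched state.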

@OleHolmNielsen
Contributor Author

The EL 8.6 kernel kernel-4.18.0-372.26.1.el8_6.x86_64.rpm is available in both AlmaLinux and RockyLinux since last night! RHEL 8.6 was updated a couple of days ago.

I've upgraded an EL 8.6 server and the cpulist file now has size>0 as expected:

$ uname -r
4.18.0-372.26.1.el8_6.x86_64
$ ls -l /sys/devices/system/node/node0/cpulist
-r--r--r--. 1 root root 28672 Sep 15 07:42 /sys/devices/system/node/node0/cpulist

and tested all our Intel toolchains on this system:

$ ml intel/2020b
$ mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923 (id: abd58e492)
Copyright 2003-2020, Intel Corporation.
$ ml purge
$ ml intel/2021b
$ mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.4 Build 20210831 (id: 758087adf)
Copyright 2003-2021, Intel Corporation.
$ ml purge
$ ml intel/2022a
$ mpiexec.hydra --version
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Copyright 2003-2022, Intel Corporation.

As you can see, the Intel MPI is now working correctly again :-)) It was OK on EL 8.5, but broken on EL 8.6 until the above listed kernel was released.
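
The kernel versions mentioned in this thread can be turned into a quick classification of "uname -r" strings. The sketch below rests on an assumption that goes beyond what was tested here: that every EL 8.6 kernel (the 4.18.0-372 series) older than 372.26.1 is affected. Only 372.9.1 was actually observed to hang, and 372.19.1 was reported as not yet containing the fix. The names `el8_build` and `likely_affected` are hypothetical.

```python
import re

# First EL 8.6 kernel containing the fix, per this thread.
FIXED_EL86 = (372, 26, 1)


def el8_build(release):
    """Map "4.18.0-372.9.1.el8.x86_64" to (372, 9, 1).

    Returns None for strings that are not EL8 4.18.0 kernel releases.
    """
    match = re.match(r"^4\.18\.0-(\d+(?:\.\d+)*)\.el8", release)
    if not match:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def likely_affected(release):
    """Return True/False for EL8 kernels, None when the kernel is unknown.

    ASSUMPTION: all 372-series builds before 372.26.1 are affected.
    """
    build = el8_build(release)
    if build is None:
        return None
    return build[0] == 372 and build < FIXED_EL86
```

On a node, `likely_affected(platform.release())` (with `import platform`) applies the check to the running kernel.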

@boegel
Member

boegel commented Sep 17, 2022

@OleHolmNielsen Thanks a lot for the update, very happy to see that this problem has been resolved properly...

I guess we can close this issue then, since i) the issue is resolved by updating to a sufficiently recent kernel, ii) there's nothing to do on the EasyBuild side for this?

@boegel boegel modified the milestones: next release (4.6.2?), 4.x Sep 17, 2022
@OleHolmNielsen
Contributor Author

@boegel: I agree with you that the issue has been resolved by Red Hat delivering an RHEL 8.6 kernel update with an appropriate fix.
