impi/2021.9.0-intel-compilers-2023.1.0 sanity check fails when using RPATH due to missing libfabric #20295

Status: Open
cgross95 opened this issue Apr 4, 2024 · 13 comments
Labels: bug report, EESSI (Related to EESSI project)
Milestone: 4.x

Comments

@cgross95 (Contributor) commented Apr 4, 2024

I'm trying to build impi/2021.9.0-intel-compilers-2023.1.0 and it fails during the sanity check, I believe due to building with RPATH enabled and libfabric not being found.

The sanity check builds a small test with mpicc -cc=icx ... -o mpi_test. The compilation succeeds, but running it with mpirun -n 8 .../mpi_test fails without much to go on:

== 2024-04-04 16:41:21,155 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: sanity check command mpirun -n 8 /tmp/eessi-build.gkHxuLQPcO/easybuild/build/impi/2021.9.0/intel-compilers-2023.1.0/mpi_test exited with code 143 (output: Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1546)..............: 
MPIDI_OFI_mpi_init_hook(1480): 
(unknown)(): Other MPI error
...
Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1546)..............: 
MPIDI_OFI_mpi_init_hook(1480): 
(unknown)(): Other MPI error
) (at easybuild/framework/easyblock.py:3661 in _sanity_check_step)

Some digging (reproducing the install environment and running with I_MPI_DEBUG=30 mpirun -v -n 2 .../mpi_test) shows me:

[1] MPI startup(): failed to load libfabric: libfabric.so.1: cannot open shared object file: No such file or directory

I originally thought there might be a problem with mpicc -cc=icx not using icx with an RPATH wrapper, since readelf -d .../mpi_test shows

Dynamic section at offset 0x2dc8 contains 28 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libmpifort.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib]
...

I tried forcing a compilation of a new copy with the RPATH wrapper, which gave me

Dynamic section at offset 0x2dc8 contains 28 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libmpifort.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/lib:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/lib64:$ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/UCX/1.14.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/tbb/2021.9.0/lib/intel64/gcc4.8:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/compiler/lib/intel64_lin:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/lib/x64:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/lib]
...

but had the same problem (even though the runpath includes the path to the libfabric libs).

Eventually running with

LD_LIBRARY_PATH=/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib mpirun -n 2 ./mpi_test

succeeds with no errors.

So I'm not sure why the executable is not picking up the libfabric libraries when compiled with RPATH. Any help would be greatly appreciated! As a side note, this is being done on top of EESSI, so if there's anything relevant there that I can share, please let me know.
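For reference, a minimal reproducer along these lines (the exact source used by the sanity check isn't shown in the log, so this hello-world stand-in is just an assumption):

cat > mpi_test.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* MPI_Init is where the PMPI_Init abort above happens */
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world: rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -cc=icx mpi_test.c -o mpi_test   # same compiler driver as the sanity check
mpirun -n 2 ./mpi_test                 # fails unless libfabric can be found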

@yuke-li1 commented Sep 7, 2024

I encountered the same issue. Installing libfabric/1.18.0-GCCcore-12.3.0 resolved the problem. I then added export I_MPI_OFI_LIBRARY_INTERNAL=0 to my .bashrc file and ran the command:

mpiexec.hydra -env I_MPI_DEBUG=1 -np 8 ./a.out

The result was as follows:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.9 Build 20230307 (id: d82b3071db)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
Hello world: rank 0 of 8 running on
....
Hello world: rank 7 of 8 running on

boegel added this to the 4.x milestone on Sep 11, 2024

@boegel (Member) commented Sep 11, 2024

@bedroge @ocaisa Any ideas here?

boegel added the EESSI (Related to EESSI project) label on Sep 11, 2024

@boegel (Member) commented Sep 11, 2024

easybuilders/easybuild-easyblocks#2910 looks relevant here, perhaps?

@bedroge (Contributor) commented Sep 12, 2024

Edited and removed most parts of this message, as most of it was not relevant / correct.

I suspect the real issue is that we're trying to compile something in the sanity check, but the RPATH wrappers are not available there (anymore)? Adding an echo $PATH to the command shows that the temporary dir with the wrappers is no longer listed, and then it fails to find the bundled libfabric.
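
A quick way to check whether the wrapper directory is still on $PATH at that point (exact directory names will vary per build):

echo "$PATH" | tr ':' '\n'   # look for the temporary RPATH wrapper directory
type -a icx                  # shows whether a wrapper script shadows the real icx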

@bedroge (Contributor) commented Sep 12, 2024

Alternatively, I guess we could adjust the easyblock and add the libfabric lib directory to LD_RUN_PATH when we're running the test executable. We could even do that only when $LD_LIBRARY_PATH is being filtered (and RPATH is being used?).

@ocaisa (Member) commented Sep 12, 2024

So the Alliance put compiler configurations in place beside the Intel compilers; the same problem is being discussed on Slack (https://easybuild.slack.com/archives/C34UA1HT7/p1726141669066889).

@ocaisa (Member) commented Sep 12, 2024

The configurations are only enough to use the compat layer though (so not relevant here). The only way to address the particular problem is to do ELF header modification of the libraries (or indeed to use our compiler wrappers when compiling).

@ocaisa (Member) commented Sep 12, 2024

From EESSI/docs#175 (comment), a hack would indeed be to set LD_RUN_PATH during the compilation step... but that doesn't really help your end users.
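
As a sketch (libfabric path taken from the output above; keep in mind that GNU ld only consults LD_RUN_PATH if no explicit -rpath is passed on the link line):

LD_RUN_PATH=/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib \
    mpicc -cc=icx mpi_test.c -o mpi_test
readelf -d mpi_test | grep -E 'RPATH|RUNPATH'   # verify the libfabric dir was baked in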

@ocaisa (Member) commented Sep 30, 2024

I think this could be fixed in the easyblock for impi. The problem is that it is not using our RPATH wrappers for mpicc (mpicc is not part of the toolchain, which is what we create wrappers for), but the sanity check step could be configured to make them available before compiling during the sanity check.

@ocaisa (Member) commented Sep 30, 2024

Hmm, having the underlying compilers wrapped should actually be enough. The problem is that impi ships its own libfabric and is not RPATH-ed against it. I'm flooding this thread with poorly thought out ideas.

Without patchelfing, the solution would be to force the setting of LD_LIBRARY_PATH for impi (and print a warning that this needs to be done, and why)
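
As a user-side sketch of that workaround (EBROOTIMPI is the root variable EasyBuild normally sets for the impi module; the subdirectory follows the install layout shown earlier in this issue):

export LD_LIBRARY_PATH=$EBROOTIMPI/mpi/2021.9.0/libfabric/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
mpirun -n 2 ./mpi_test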

@ocaisa (Member) commented Sep 30, 2024

Wait a minute, I don't see libfabric as a dependency in https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/i/impi/impi-2021.8.0-intel-compilers-2023.0.0.eb (or other easyconfigs). If it is added and impi is rebuilt, then this problem goes away, provided we also include #20295 (comment) (export I_MPI_OFI_LIBRARY_INTERNAL=0).

@yuke-li1 commented Oct 6, 2024

The setting I_MPI_DEBUG=1 can be used to debug issues with mpirun.
I executed the command to test libfabric:

mpiexec.hydra -env I_MPI_DEBUG=1 -np 8 ./a.out

[0] MPI startup(): Intel(R) MPI Library, Version 2021.9 Build 20230307 (id: d82b3071db)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
Hello world: rank 0 of 8 running on ...
...
Hello world: rank 7 of 8 running on ...

It is evident that libfabric is being utilized. When no additional libfabric module is provided, the Intel MPI Library defaults to using its own embedded libfabric. This appears to be the source of the error.

I recommend adding the libfabric/1.18.0-GCCcore-12.3.0 module as a dependency in the impi easyconfig file.

@bartoldeman (Contributor)

@yuke-li filed an issue in our internal ticket system, so I had another look.

The core issue in the first comment by @cgross95 is the use of RUNPATH instead of RPATH. RUNPATH, unlike RPATH, isn't transitive, so it does not apply to the dlopen call made by Intel's libmpi.so.12. If you run

patchelf --force-rpath --set-rpath '/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/lib:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/lib64:$ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/UCX/1.14.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/tbb/2021.9.0/lib/intel64/gcc4.8:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/compiler/lib/intel64_lin:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/lib/x64:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/lib' mpi_test

it'll work fine! Alternatively, you can patchelf libmpi.so.12 itself to add an RPATH there pointing to libfabric (we do that in our intelmpi modules). RUNPATH will work there as well, if you like.
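
For example, something along these lines (paths as in this thread; adjust if libmpi.so.12 lives in a different subdirectory, and note that --print-rpath returns an empty string if no run path is set yet):

LIBFABRIC_LIB=/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib
LIBMPI=/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release/libmpi.so.12
# keep whatever run path is already there and append the libfabric dir, as old-style RPATH
patchelf --force-rpath --set-rpath "$(patchelf --print-rpath "$LIBMPI"):$LIBFABRIC_LIB" "$LIBMPI"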

Now I don't know where the RUNPATH comes from here, as the EB RPATH wrappers specifically use --disable-new-dtags, so that's still a mystery to me.
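
To check which of the two a binary actually got, and to force old-style RPATH when linking by hand (assuming the underlying linker is GNU ld or lld):

readelf -d mpi_test | grep -E 'RPATH|RUNPATH'                 # RUNPATH means new dtags were used
mpicc -cc=icx -Wl,--disable-new-dtags mpi_test.c -o mpi_test  # emits RPATH instead of RUNPATH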

The external vs. internal libfabric question is orthogonal to this issue. The advantage of the internal libfabric is that it can use UCX and benefit from better performance on Mellanox hardware. The external libfabric has included a UCX provider (fi_ucx) again since version 1.18.0, but it is not yet enabled in the easyconfigs for libfabric, and I haven't tested or benchmarked it yet.
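
If an external libfabric is used, the fi_info utility that ships with it can show which providers were built in (whether fi_ucx shows up depends on how that libfabric was configured):

fi_info -l            # list the providers this libfabric build includes
fi_info -p ucx | head # details for the ucx provider, if it is present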
