
IMPI v2019.6: MLX provider in libfabric not working #10213

Open
lexming opened this issue Mar 22, 2020 · 24 comments
@lexming
Contributor

lexming commented Mar 22, 2020

Intel has introduced a new MLX provider for libfabric in IMPI v2019.6, the version used in the intel/2020a toolchain. More info: https://software.intel.com/en-us/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband

Issue: currently, all executables fail to initialize MPI with intel/2020a on our nodes with Mellanox cards.

Steps to reproduce:

  1. Use a system with a Mellanox card. Check that version 1.4 or higher of UCX is installed
$  ucx_info -v
# UCT version=1.5.1 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check
  2. Install and load intel-2020a.eb (I'll use the full toolchain for simplicity)

  3. Check if the provider of libfabric is listed as mlx. This can be done with the fi_info tool from IMPI v2019.6 in intel/2020a.

$ fi_info
provider: mlx
    fabric: mlx
    domain: mlx
    version: 1.5
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_MLX
provider: mlx;ofi_rxm
    fabric: mlx
    domain: mlx
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
  4. Compile and execute the minimal test program from IMPI v2019.6
$ mpicc $EBROOTIMPI/test/test.c -o test
$ FI_LOG_LEVEL=debug ./test

Result:
Output on our systems with Mellanox can be found at https://gist.github.com/lexming/fa6cd07bdb8e4d35be873b501935bb61

Solution/workaround:
I have not found a solution to the failing MLX provider. Moreover, the official libfabric project has removed the mlx provider altogether since version 1.9 due to lack of maintenance (ofiwg/libfabric@d8c8a2b). IMPI v2019.6 uses its own fork labelled 1.9.0a1-impi.

The workaround is to switch to a different provider by setting the FI_PROVIDER environment variable. On a system with a Mellanox card this can be set to tcp or verbs. Even though this works, the performance impact of this change is unclear, and it defeats the purpose of having a framework that can automatically detect the best transport layer.

@boegel boegel added 2020a before 2020a is released problem report labels Mar 22, 2020
@boegel boegel added this to the 2020a milestone Mar 22, 2020
@lexming
Contributor Author

lexming commented Mar 22, 2020

One additional note: it seems that downgrading OFED to v4.5 would fix the broken MLX provider, based on the report in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/842957

However, this is hardly a solution as that version is quite old. For instance, CentOS 7 is on OFED v4.7 already.

@boegel
Member

boegel commented Mar 22, 2020

I'm basically seeing the same issue during the h5py sanity check (for #10160):

== 2020-03-22 20:10:38,425 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): Sanity check failed: command "/software/Python/3.8.2-GCCcore-9.3.0/bin/python -c "import h5py"" failed; output:
Abort(2140047) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1334):
MPIDU_bc_table_create(444)...: (at easybuild/framework/easyblock.py:2634 in _sanity_check_step)

More info:

$ ucx_info -v
# UCT version=1.5.1 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check
$ fi_info
provider: mlx
    fabric: mlx
    domain: mlx
    version: 1.5
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_MLX
provider: mlx;ofi_rxm
    fabric: mlx
    domain: mlx
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM

@lexming
Contributor Author

lexming commented Mar 23, 2020

General-purpose workaround: FI_PROVIDER=verbs,tcp
This setting for the libfabric provider will use IB if it is available, or fall back to TCP otherwise.
It should be possible to combine any number of providers, as described in https://software.intel.com/en-us/mpi-developer-guide-linux-ofi-providers-support
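For EasyBuild, this could eventually be applied at the module level via the standard modextravars parameter in the impi easyconfig. A minimal, untested sketch:

# Hypothetical addition to the impi easyconfig: export FI_PROVIDER on module load,
# so libfabric tries the verbs provider first and falls back to tcp when no IB device is found
modextravars = {
    'FI_PROVIDER': 'verbs,tcp',
}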

@boegel
Member

boegel commented Mar 23, 2020

@lexming But that's only advisable for systems with Infiniband though?

@lexming
Contributor Author

lexming commented Mar 23, 2020

@boegel with the setting FI_PROVIDER=verbs,tcp, systems without IB will just fall back to TCP seamlessly. No error.
The setting that only works on systems with IB is FI_PROVIDER=verbs.

@lexming
Contributor Author

lexming commented Mar 23, 2020

It is also possible to use IMPI with an external libfabric by setting I_MPI_OFI_LIBRARY_INTERNAL=0.

  • A test with IMPI 2019.6 and libfabric-1.8.1, which is the last upstream release that still includes the mlx provider, does not work. The error is different, but the mlx provider still fails. It must be noted that the mlx code in libfabric-1.8.1 was released two years ago for UCX v1.3, while we have UCX v1.5 on our systems, with several changes and deprecated functions. So, in this case mlx probably fails for different reasons than in the standard IMPI v2019.6. The point is that it fails and hence is not a usable alternative.
$ module load intel/2020a
$ module load libfabric/1.8.1-GCCcore-9.3.0
$ I_MPI_OFI_LIBRARY_INTERNAL=0 FI_PROVIDER=mlx ./test                      
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........: 
MPID_Init(958)...............: 
MPIDI_OFI_mpi_init_hook(1060): OFI fi_open domain failed (ofi_init.c:1060:MPIDI_OFI_mpi_init_hook:No data available)
  • A test with IMPI 2019.6 and libfabric-1.9.1, which is the latest upstream release, does work. However, the mlx provider is no longer available; this works because all providers compatible with the host hardware (e.g. verbs, tcp) are automatically enabled.
$ module load intel/2020a
$ module load libfabric/1.9.1-GCCcore-9.3.0
$ FI_PROVIDER_PATH='' I_MPI_OFI_LIBRARY_INTERNAL=0 ./test
Hello world: rank 0 of 1 running on login2.cerberus.os

Therefore, this second test does not use mlx and is similar to forcing IMPI 2019.6 with bundled libfabric-1.9.0a1-impi to use verbs or tcp by setting FI_PROVIDER=verbs,tcp.

In conclusion, we have two workaround solutions at our disposal:

  • Force IMPI v2019.6 with bundled libfabric-1.9.0a1-impi to use other providers, such as verbs or tcp. Requirements:

    1. add modextravars with FI_PROVIDER=verbs,tcp to impi-2019.6.166-iccifort-2020.0.166.eb
  • Use IMPI v2019.6 with an external libfabric-1.9.1, which by default enables all compatible providers (see the sketch after this list). Requirements:

    1. add libfabric-1.9.1-GCCcore-9.3.0 as dependency of impi-2019.6.166-iccifort-2020.0.166.eb
    2. add modextravars to disable FI_PROVIDER_PATH
    3. add modextravars with I_MPI_OFI_LIBRARY_INTERNAL=0
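A rough easyconfig sketch of the second option, following the requirements above (the values are illustrative; I have not tested this exact change):

# Hypothetical changes to impi-2019.6.166-iccifort-2020.0.166.eb for the second workaround:
# use the external libfabric instead of the one bundled with IMPI
dependencies = [
    ('libfabric', '1.9.1'),
]

modextravars = {
    # empty provider path, so the providers built into the external libfabric are used
    'FI_PROVIDER_PATH': '',
    # tell Intel MPI not to use its bundled libfabric
    'I_MPI_OFI_LIBRARY_INTERNAL': '0',
}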

@boegel
Member

boegel commented Mar 23, 2020

@lexming Thanks a lot for digging into this!

My preference goes to using the external libfabric 1.9.1, since that avoids "hardcoding" stuff to Infiniband via $FI_PROVIDER.

Can you clarify why setting $FI_PROVIDER_PATH is needed?

Also, should we reach out to Intel support on this, and try to get some feedback on the best way forward (and maybe also ask how they managed to overlook this issue)?

@lexming
Contributor Author

lexming commented Mar 23, 2020

Regarding $FI_PROVIDER_PATH: the impi easyblock sets that variable to the bundled providers shipped with IMPI in $EBROOTIMPI/intel64/libfabric/lib/prov (as it should).
In this case, $FI_PROVIDER_PATH has to be unset to use the providers of the external libfabric-1.9.1. It is not necessary to set any other path because the external libfabric builds its providers into the libfabric library itself.

On our side, we will contact Intel about this issue. The real solution requires fixing IMPI v2019.6 as far as I can tell.

@bartoldeman
Contributor

bartoldeman commented Mar 24, 2020

I tested on 2 clusters:

  1. Béluga with UCX 1.7.0, ConnectX-5, CentOS-7.7, no MOFED.
provider: mlx
    fabric: mlx
    domain: mlx
    version: 1.5
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_MLX

The test works OK with srun -n 2 ./test and with mpirun -n 2 ./test (if I set UCX_IB_MLX5_DEVX=no, but I need that for Open MPI as well).
  2. Graham with UCX 1.7.0, ConnectX-4, CentOS-7.5, no MOFED

provider: mlx
    fabric: mlx
    domain: mlx
    version: 1.5
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_MLX
provider: mlx;ofi_rxm
    fabric: mlx
    domain: mlx
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM

This works OK too! Note that the fi_info list is longer in the second case.

Also note that the easyblock for Intel MPI has an ofi_internal parameter, which you can set to False to disable the internal libfabric without needing to play with modextravars.
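A minimal sketch of what that would look like in the impi easyconfig (assuming the parameter behaves as described above, i.e. it is the easyblock-level equivalent of I_MPI_OFI_LIBRARY_INTERNAL=0):

# Hypothetical: let the impi easyblock disable the bundled libfabric,
# instead of exporting I_MPI_OFI_LIBRARY_INTERNAL=0 via modextravars
ofi_internal = False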

@bartoldeman
Contributor

Hmm, I get errors if I run ./test without srun/mpirun, but if I use one of those it's ok.

@lexming
Contributor Author

lexming commented Mar 24, 2020

@bartoldeman thank you for the feedback. This is very interesting: on my system, executing the test with mpirun ./test does indeed work. This is good news, as it means that mlx does work for inter-node jobs.

The reason I never tried mpirun is that the origin of this issue is failed sanity checks of Python modules linking with MPI (e.g. h5py and TensorFlow). In those cases, importing the respective module in Python initializes MPI without any distributed execution (so no mpirun), and that fails with the aforementioned errors.

If mlx is working as intended, this is a change of behaviour compared to other providers such as tcp, which can be used without mpirun, equivalent to running with mpirun -n 1.

@lexming
Contributor Author

lexming commented Mar 24, 2020

@boegel the sanity check command of h5py does work if called with mpirun. Hence, the best solution seems to be to change the sanity check command of any Python module using MPI to

mpirun -n 1 python -c "import module_name"
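In easyconfig terms, that would look roughly like this for h5py (a sketch only, using the generic sanity_check_commands parameter; the actual easyconfigs may implement the check differently):

# Hypothetical sanity check for an MPI-enabled Python package such as h5py:
# run the import under mpirun so that MPI initialization goes through the launcher
sanity_check_commands = ['mpirun -n 1 python -c "import h5py"']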

@boegel
Member

boegel commented Mar 24, 2020

@lexming So there's actually no real problem: as long as mpirun is used (which should always be used anyway, I think), Intel MPI 2019 update 6 works just fine?

@boegel
Member

boegel commented Mar 24, 2020

@lexming h5py import check fixed in #10246

@lexming
Contributor Author

lexming commented Mar 24, 2020

@boegel yeah, with mpirun it seems to work just fine, but I have not done any extensive testing yet. On the EasyBuild side, all that needs to be done is to make sure that sanity checks of packages with MPI are run with mpirun.

Keep in mind that using those packages (such as h5py in intel/2020a) will now require using mpirun at all times, which might break some users' workflows. But that is not an issue for EasyBuild in my opinion.

@bartoldeman
Contributor

There is some guidance about singleton MPI in the MPI standard:
https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node254.htm#Node254
but it has been problematic in my experience: sometimes it works, sometimes it does not. For example, Open MPI on QLogic/Intel PSM InfiniPath needed an environment variable.

@boegel
Member

boegel commented Mar 24, 2020

I'm hitting a serious issue with mpiexec using Intel MPI 2019 update 6 when running in Slurm jobs:

$ mpiexec -np 1 /bin/ls
Segmentation fault

See details in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/807359#comment-1955057 .

@bartoldeman
Contributor

@akesandgren has something in his hooks for this. I've borrowed the following from them for our local easyconfigs:

postinstallcmds = [
    # Fix mpirun from IntelMPI to explicitly unset I_MPI_PMI_LIBRARY
    # it can only be used with srun.
    "sed -i 's@\\(#!/bin/sh.*\\)$@\\1\\nunset I_MPI_PMI_LIBRARY@' %(installdir)s/intel64/bin/mpirun",
]

@lexming
Contributor Author

lexming commented Mar 30, 2020

I got feedback from Intel on this specific issue (precisely, the crashes of executables linked to Intel MPI 2019.6 when executed without mpiexec/mpirun with the mlx provider).
The Intel support team has been able to reproduce the issue and acknowledges it. They will escalate it and it should be fixed at some point.

@lexming
Contributor Author

lexming commented Apr 14, 2020

We got a new reply from Intel regarding this issue:

Our engineering team is planning to have this resolved in 2019 Update 8.

@boegel
Member

boegel commented Apr 15, 2020

@lexming Can you ask them when they expect update 8 to be released?

Feel free to tell them that this is holding us back from going forward with intel/2020a in EasyBuild, I'm sure that'll convince them to get their act together... ;)

@lexming
Contributor Author

lexming commented Apr 15, 2020

@boegel done, I'll update this issue as soon as I get a reply.

@maxim-masterov
Collaborator

FYI, IMPI v2019.8.254 has been released. I've tested a simple MPI code and it seems that the new release resolves the issue.

@lexming
Contributor Author

lexming commented Sep 25, 2020

I confirm that the new update release IMPI v2019.8.254 fixes this issue. This can be tested with the easyconfig in #11337.
