
MPI inter-node issues with Intel MPI v2019 on Mellanox IB #10314

Closed
lexming opened this issue Apr 2, 2020 · 22 comments · Fixed by #10280
Labels
2020a (before 2020a is released), problem report
Milestone

Comments

@lexming
Contributor

lexming commented Apr 2, 2020

I tested the execution of a simple inter-node job between two nodes over our Infiniband network with updates 5, 6 and 7 of Intel MPI v2019 and I found very different results for each release. All tests were carried out with iccifort/2020.1.217 as base of the toolchain.

Characteristics of the testing system

  • CPU: 2x Intel(R) Xeon(R) Gold 6126
  • Adapter: Mellanox Technologies MT27700 Family [ConnectX-4]
  • Operative System: Cent OS 7.7
  • Related system libraries: UCX v1.5.1, OFED v4.7-3.2.9
  • ICC: v2020.1 (from Easybuild)
  • Resource manager: Torque

Steps to reproduce (see the job script sketch after this list):

  1. Start a job on two nodes
  2. Load impi
  3. mpicc ${EBROOTIMPI}/test/test.c -o test
  4. mpirun ./test
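For reference, a minimal Torque job script covering these steps could look like the sketch below; the resource request and walltime are placeholders, the impi module is one of the versions tested here, and test.c is the hello-world example shipped with Intel MPI:

#!/bin/bash
#PBS -l nodes=2:ppn=1
#PBS -l walltime=00:10:00

# Load the Intel MPI module to test (any of the 2019 updates discussed here)
module load impi/2019.7.217-iccifort-2020.1.217
cd $PBS_O_WORKDIR

# Build and run the hello-world test shipped with Intel MPI
mpicc ${EBROOTIMPI}/test/test.c -o test
mpirun ./test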

Intel MPI v2019 update 5: works out of the box

$ module load impi/2019.5.281-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.7.2a
libfabric: 1.7.2a
libfabric api: 1.7
$ fi_info | grep provider
provider: verbs;ofi_rxm
provider: verbs;ofi_rxd
provider: verbs
provider: verbs
provider: verbs
$ mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os

Intel MPI v2019 update 6: does NOT work out of the box, but can be fixed

$ module load impi/2019.6.166-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.9.0a1
libfabric: 1.9.0a1-impi
libfabric api: 1.8
$ fi_info | grep provider
provider: mlx
provider: mlx;ofi_rxm
$ mpirun ./test
[1585832682.960816] [node357:302190:0]         select.c:406  UCX  ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy, mm/sysv - Destination is unreachable, mm/posix - Destination is unreachable, cma/cma - no am bcopy
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........: 
MPID_Init(958)...............: 
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
  • Solution 1: use verbs or tcp libfabric providers instead of mlx
$ module load impi/2019.6.166-iccifort-2020.1.217
$ FI_PROVIDER=verbs,tcp mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
  • Solution 2: use an external, more recent UCX (v1.7.0 from EasyBuild) so that the mlx provider works
$ module load impi/2019.6.166-iccifort-2020.1.217
$ module load UCX/1.7.0-GCCcore-9.3.0
$ ucx_info
# UCT version=1.7.0 revision 
# configured with: --prefix=/user/brussel/101/vsc10122/.local/easybuild/software/UCX/1.7.0-GCCcore-9.3.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc
$ FI_PROVIDER=mlx mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
  • Solution 3: use external libfabric v1.9.1. Upstream libfabric dropped mlx with version 1.9.0
$ module load impi/2019.6.166-iccifort-2020.1.217
$ module load libfabric/1.9.1-GCCcore-9.3.0
$ export FI_PROVIDER_PATH=
$ fi_info --version
fi_info: 1.9.1
libfabric: 1.9.1
libfabric api: 1.9
$ mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os

Intel MPI v2019 update 7: does NOT work at all

$ module load impi/2019.7.217-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.10.0a1
libfabric: 1.10.0a1-impi
libfabric api: 1.9
$ fi_info | grep provider
provider: verbs;ofi_rxm
[...]
provider: tcp;ofi_rxm
[...]
provider: verbs
[...]
provider: tcp
[...]
provider: sockets
[...]
$ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun ./test
[mpiexec@node357.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 40969 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[mpiexec@node357.hydra.os] Launch arguments: /usr/bin/ssh -q -x node356.hydra.brussel.vsc /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 40969 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[proxy:0:0@node357.hydra.os] Warning - oversubscription detected: 1 processes will be placed on 0 cores
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:1@node356.hydra.os] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:1@node356.hydra.os] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:1@node356.hydra.os] PMI response: cmd=appnum appnum=0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:1@node356.hydra.os] PMI response: cmd=my_kvsname kvsname=kvs_309778_0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get kvsname=kvs_309778_0 key=PMI_process_mapping
[proxy:0:1@node356.hydra.os] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=barrier_in

(the execution does not stop, it just hangs at this point)

The system log of the node shows the following entry

traps: hydra_pmi_proxy[549] trap divide error ip:4436ed sp:7ffed012ef50 error:0 in hydra_pmi_proxy[400000+ab000]

This error with IMPI v2019.7 happens well before libfabric is initialized, so it does not depend on the libfabric provider or the UCX version; it happens every time.

Update

@lexming lexming added problem report 2020a before 2020a is released labels Apr 2, 2020
@lexming lexming added this to the 2020a milestone Apr 2, 2020
@lexming lexming linked a pull request Apr 2, 2020 that will close this issue
@lexming
Contributor Author

lexming commented Apr 2, 2020

Regarding the specific issue with Intel MPI 2019 update 7, I already opened a support ticket in the Intel community forums: https://software.intel.com/en-us/forums/intel-oneapi-hpc-toolkit/topic/851724

@akesandgren
Contributor

For update 7, have you tried using PMIx's PMI1/2 backport interface?

@lexming
Contributor Author

lexming commented Apr 2, 2020

I also carried out a few quick benchmarks with OSU-Micro-Benchmarks for the different solutions for IMPI v2019.6. Both the external-UCX and the external-libfabric solutions offer better bandwidth and latency than intel/2019b with its default settings, and they are on par with foss/2020a.

@lexming
Contributor Author

lexming commented Apr 2, 2020

@akesandgren nope, I have not. I'll check it out

@lexming
Contributor Author

lexming commented Apr 2, 2020

@akesandgren we are not using PMIx on our system and Intel MPI is not configured to use it: I_MPI_PMI and I_MPI_PMI_LIBRARY are not defined. So that can be ruled out.

@akesandgren
Contributor

Well, impi is using PMI version 1 according to the error above.

So it may still be worthwhile to test. I can't quite remember how to do it when using mpirun; we only use it for srun-started jobs.

@lexming
Contributor Author

lexming commented Apr 2, 2020

@akesandgren OK, you meant specifically using an external PMIx (I understood the opposite). Unfortunately, I cannot do that because we use the Hydra process manager and it does not allow it. However, while checking the PMI support in IMPI, I just came across the following in the changelog of IMPI 2019.7:

Added PMI2 support

Does this mean that Hydra now supports PMI2? I have no idea how it is enabled though. I played with I_MPI_PMI and I_MPI_PMI2 but nothing changes.

@akesandgren
Contributor

Yeah, I've never gotten it to do what I thought it would do either.

But have you tested the behavior when using srun to start it? And then possibly with an external PMIx, using its PMI1/2 interface?

@lexming
Contributor Author

lexming commented Apr 3, 2020

@akesandgren I forgot to mention that we do not use Slurm at our site 🙂

@mj-harvey

This is the set of configuration variables we've had to use to get 2019u7 working on our 100G RoCE network (ConnectX-5, Mellanox OFED 5.0.2, CentOS 8):

export FI_PROVIDER=mlx
export I_MPI_OFI_EXPERIMENTAL=1
export I_MPI_TUNING_BIN=/apps/mpi/intel/2019.7.217/etc/tuning_generic_shm-ofi_mlx_hcoll.dat
export I_MPI_OFI_LIBRARY_INTERNAL=1

@lexming
Contributor Author

lexming commented Apr 16, 2020

@mj-harvey thank you very much for sharing your setup. Unfortunately, none of those settings prevent the crashes of update 7 on our systems with ConnectX-4. At this point I suspect that our issues with update 7 are more related to PMI than anything else: the crash happens very early in hydra_pmi_proxy (before libfabric or UCX kick in), and the changelog only states that PMI2 support was added in this update, without any further detail. So my guess is that they broke something there.

@mj-harvey

mj-harvey commented Apr 17, 2020

Do you have any pmi library installed on the system? Do you only see this within the context of a Torque job, or also when outside of a job?

Have you tried getting any insight by strace-ing mpiexec.hydra and hydra_pmi_proxy and also attaching gdb to see if you can get a stack trace from the point at which it faults?

How does a single-node job behave, using just shm, or just ofi?

@lexming
Contributor Author

lexming commented Apr 17, 2020

Do you have any pmi library installed on the system? Do you only see this within the context of a Torque job, or also when outside of a job?

There aren't any external PMI libs in use. The crash of hydra_pmi_proxy happens within a Torque job, be it a single-node or a multi-node job. However, outside of a job on a single node it does not happen, and something like mpirun -n 2 ./test actually works flawlessly. So Torque is definitely involved in this issue and might be hindering the communication between those processes... The question now is: what has changed in update 7 that is not compatible with our production Torque environment? In any case, thanks for raising that question, this is a new path to explore.

Have you tried getting any insight by strace-ing mpiexec.hydra and hydra_pmi_proxy and also attaching gdb to see if you can get a stack trace from the point at which it faults?

I have not gone that far. Since we pay Intel for support, I'm leaving that work to them 🙂

How is a single node job, using just shm, or just ofi?

Single-node jobs do not work with update 7 either and show a similar behaviour to multi-node jobs: hydra_pmi_proxy crashes immediately with the same trap divide error. However, the error output of a single-node job is different from what I showed before for multi-node jobs:

$ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpi
[mpiexec@node377.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild-skylake/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node377.hydra.brussel.vsc --upstream-port 46384 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild-skylake/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild-skylake/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@node377.hydra.os] Warning - oversubscription detected: 2 processes will be placed on 0 cores
[mpiexec@node377.hydra.os] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:532): downstream from host node377.hydra.brussel.vsc was killed by signal 8 (Floating point exception)
[mpiexec@node377.hydra.os] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2114): assert (exitcodes != NULL) failed

It seems that Intel MPI is not capable of using the allocated cores in this case, which is very weird. I'll update Intel support with this information. Thanks for the questions.

@mj-harvey

It's interesting that the failure is confined to a Torque job. My guess is that it's something related to using pbs_tmrsh? (We use PBS Pro, so there's some guesswork about how similar Torque still is.)
We have our own mpiexec wrapper, rather than using Intel's. It kicks off hydra with:

$MPI_HOME/bin/mpiexec.hydra  -print-rank-map  \
    -rmk pbs  \
    -bootstrap ssh  \
    -bootstrap-exec "$BOOTSTRAP_EXEC" \
    -n "$NALLOC" \
    -wdir "$PWD" \
    -machinefile $TEMPFILE  \
    "$PROGRAM" "$@"

where BOOTSTRAP_EXEC is pbs_tmrsh if within a job context, or ssh if without.

Maybe try forcing ssh use?
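For example, something along these lines (just a sketch reusing the flags from the wrapper above, with a 2-process run and ssh forced as the bootstrap launcher):

$MPI_HOME/bin/mpiexec.hydra -rmk pbs \
    -bootstrap ssh \
    -bootstrap-exec /usr/bin/ssh \
    -n 2 \
    ./test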

@lexming
Contributor Author

lexming commented Apr 17, 2020

@mj-harvey thanks for the suggestion! Unfortunately, playing with either -bootstrap or -launcher in mpiexec.hydra does not make a difference. However, I found that there is a legacy mpiexec.hydra under $MPI_HOME/bin/legacy/mpiexec.hydra, and that one does work. So at least we now have a fallback solution.

@bartoldeman
Contributor

@lexming @akesandgren

Here's my experience with Intel MPI 2019.7.217 with slurm:

It works with srun if I point I_MPI_PMI_LIBRARY to libpmi2.so (no other environment variables are necessary), or, for some reason, if I don't set I_MPI_PMI_LIBRARY at all.
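A minimal sketch of that srun path (the location of libpmi2.so is an assumption; it depends on where the Slurm installation puts it):

module load impi/2019.7.217-iccifort-2020.1.217
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so   # assumed path to Slurm's PMI2 library
srun -n 2 ./test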

With mpirun I get a floating point exception (SIGFPE) if multiple nodes are used. With ulimit -c unlimited this produces core files, and a backtrace of mpiexec.hydra then gives:

#0  0x00000000004436ed in ipl_create_domains (pi=0x0, scale=4786482)
    at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2240
2240    ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c: No such file or directory.
(gdb) bt
#0  0x00000000004436ed in ipl_create_domains (pi=0x0, scale=4786482)
    at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2240
#1  0x0000000000000001 in ?? ()
#2  0x312c383637320030 in ?? ()
#3  0x322c363735383430 in ?? ()
#4  0x0030343835333838 in ?? ()
#5  0x00002b18dfa7a791 in tsearch ()
   from /cvmfs/soft.computecanada.ca/gentoo/2019/lib64/libc.so.6
#6  0x00002b18df9b464e in __add_to_environ ()
   from /cvmfs/soft.computecanada.ca/gentoo/2019/lib64/libc.so.6
#7  0x000000000040b0d4 in cleanup_cb (fd=0, events=2354, userp=0x0)
    at ../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:221
#8  0x0000000000000000 in ?? ()

The "legacy" mpiexec.hydra runs without issue though.

@bartoldeman
Contributor

I can avoid the SIGFPE by using export I_MPI_HYDRA_TOPOLIB=ipl (like HPC UGent in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/807359).
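A minimal sketch of that workaround for an mpirun-launched job (the process count is arbitrary):

export I_MPI_HYDRA_TOPOLIB=ipl   # switch to the native ipl topology detection instead of the internal hwloc
mpirun -n 2 ./test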

@bartoldeman
Contributor

There's another issue with 2019.7 for me: the "mlx" provider does not work. I get:

$ FI_LOG_LEVEL=debug FI_PROVIDER=mlx fi_info
libfabric:157032:core:core:fi_getinfo_():1095<warn> Provider mlx fi_version 1.8 < requested 1.9
fi_getinfo: -61

(the default provider works, but it is not optimal in that case).

@lexming
Contributor Author

lexming commented Apr 20, 2020

@bartoldeman thanks a lot for all the information! Switching to the native topology detection with I_MPI_HYDRA_TOPOLIB=ipl does indeed work on my side with update 7. It's unfortunate that the internal hwloc cannot be replaced with an external one.

Regarding your error with the MLX provider, I have the same error messages as you from fi_info. The MLX provider is not even listed as available. However, it still gets picked up by mpirun (don't ask me why).

One additional note: to make mpirun work with I_MPI_HYDRA_TOPOLIB=ipl I had to use at least UCX v1.7, which I loaded from EasyBuild. It then seems to work well; even enabling multi-EP with I_MPI_THREAD_SPLIT=1 works.

I put the full output of $ I_MPI_THREAD_SPLIT=1 I_MPI_HYDRA_TOPOLIB=ipl FI_LOG_LEVEL=debug mpirun ./test in https://gist.github.com/lexming/35100920d38bf2a846da06abeebd9399

[...]
libfabric:80486:mlx:core:mlx_getinfo():226<info> Loaded MLX version 1.7.0
libfabric:80486:mlx:core:fi_param_get_():280<info> variable enable_spawn=<not set>
libfabric:80486:mlx:core:mlx_getinfo():264<warn> MLX: spawn support 0 
libfabric:80486:core:core:ofi_layering_ok():945<info> Need core provider, skipping ofi_rxm
libfabric:80486:core:core:fi_getinfo_():1082<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:80486:core:core:fi_getinfo_():1082<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:80486:core:core:fi_getinfo_():1082<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:80486:core:core:fi_getinfo_():1082<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:80486:core:core:fi_getinfo_():1082<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:80486:core:core:fi_getinfo_():1082<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:80486:mlx:core:mlx_fabric_open():69<info> 
libfabric:80486:core:core:fi_fabric_():1316<info> Opened fabric: mlx
libfabric:80486:core:core:fi_fabric_():1316<info> Opened fabric: mlx
[...]
Hello world: rank 0 of 2 running on node377.hydra.os
Hello world: rank 1 of 2 running on node375.hydra.os

The MLX provider is properly selected and loaded, hence tcp and verbs are skipped, even though MLX is never listed by fi_info.

@lexming
Contributor Author

lexming commented Apr 20, 2020

In conclusion it seems that we have two different solutions for update 7 depending on the resource manager:

  • Slurm: use PMI2 by pointing I_MPI_PMI_LIBRARY to libpmi2.so
  • Torque: use the native topology detection by setting I_MPI_HYDRA_TOPOLIB=ipl and use an up-to-date UCX

The question now is, can/should Easybuild do anything to automagically fix this issue? Or would it be preferable to add this information as a comment in the easyconfig of Intel MPI and let each site handle it on their own?

@akesandgren
Copy link
Contributor

This should not be auto-handled by EB; an explanatory comment is what we need. This is one of the things hooks are good for: site-specific tweaking of generic easyconfigs.

bartoldeman added a commit to ComputeCanada/easybuild-easyconfigs that referenced this issue Apr 21, 2020
* if the system UCX is too old
* if you want to make Intel MPI work with srun via PMI2
* or if you need to avoid a floating point exception

The UCX dep could perhaps be unconditional once UCX 1.8.0 is merged.

Fixes easybuilders#10314
@Micket Micket modified the milestones: 2020a, 4.x May 26, 2020
@lexming
Contributor Author

lexming commented Jul 2, 2020

Closing as this is already fixed in impi-2019.7.217-iccifort-2020.1.217.eb and by extension intel/2020a.

@lexming lexming closed this as completed Jul 2, 2020