MPI inter-node issues with Intel MPI v2019 on Mellanox IB #10314
Comments
Regarding the specific issue with Intel MPI 2019 update 7, I already opened a support ticket in the Intel community forums: https://software.intel.com/en-us/forums/intel-oneapi-hpc-toolkit/topic/851724
For update 7, have you tried using PMIx's PMI1/2 backport interface?
I also carried out a few quick benchmarks with
@akesandgren nope, I have not. I'll check it out.
@akesandgren we are not using PMIx in our system and Intel MPI is not configured to use it.
Well, impi is using PMI version 1 according to the error above, so it may still be worthwhile to test. I can't quite remember how to do it when using mpirun; we only use it for srun-started jobs.
@akesandgren ok, you meant to specifically use an external PMIx (I understood the opposite). Unfortunately, I cannot do that because we use the Hydra process manager and it does not allow it. However, while checking the PMI support in IMPI, I just saw the following in the changelog of IMPI-2019.7:
Does this mean that Hydra now supports PMI2? I have no idea how it is enabled though. I played with
Yeah, I never got it to do what I thought it would do either. But have you tested the behavior when using srun to start it? And then possibly with an external PMIx, using the PMI1/2 interface of PMIx?
@akesandgren I forgot to mention that we do not use Slurm at our site 🙂
This is the set of configuration variables we've had to use to get 2019u7 working on our 100G RoCE setup (ConnectX-5, Mellanox OFED 5.0.2, CentOS 8):
export FI_PROVIDER=mlx
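The full variable set from this comment is truncated above, so here is only a minimal sketch of how such provider settings are typically applied. The variables below are standard Intel MPI 2019 / libfabric knobs shown as an illustration, not the exact site configuration:

```bash
# Select the UCX-based 'mlx' libfabric provider and make Intel MPI use OFI between nodes
export FI_PROVIDER=mlx        # libfabric provider selection
export I_MPI_FABRICS=shm:ofi  # shared memory within a node, OFI/libfabric across nodes
export I_MPI_DEBUG=5          # prints the selected fabric/provider at MPI_Init time
mpirun -np 2 ./test
```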
@mj-harvey thank you very much for sharing your setup. Unfortunately, none of those settings change the crashes of update 7 on our systems with ConnectX-4. Currently I suspect that our issues with update 7 are more related to PMI than anything else. The crash happens very early in
Do you have any PMI library installed on the system? Do you only see this within the context of a Torque job, or also outside of a job? Have you tried getting any insight by strace-ing mpiexec.hydra and hydra_pmi_proxy, and also attaching gdb to see if you can get a stack trace from the point at which it faults? How does a single-node job behave, using just shm or just ofi?
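A sketch of the kind of tracing being suggested here; the binary names come from the comment above, while the output path and options are only illustrative:

```bash
# Trace the launcher and everything it forks (one trace file per process)
strace -ff -o /tmp/impi-trace mpiexec.hydra -np 2 ./test

# Run the launcher under gdb; when it crashes, 'bt' prints the stack trace
gdb --args mpiexec.hydra -np 2 ./test
```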
There aren't any external PMI libs in use. The crash of
I have not gone that far. Since we pay Intel for support, I'm leaving that work to them 🙂
Single-node jobs do not work with update 7 and show a similar behaviour to multi-node jobs.
It seems that Intel MPI is not capable of using the allocated cores in this case, which is very weird. I'll update Intel support with this information. Thanks for the questions.
It's interesting that the failure is confined to a Torque job. My guess is it's something related to using pbs_tmrsh? (We're on PBS Pro, so there's some guesswork about how similar Torque still is.)
where BOOTSTRAP_EXEC is pbs_tmrsh if within a job context, or ssh if not. Maybe try forcing ssh use?
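A sketch of forcing the ssh bootstrap with Hydra, using the standard Intel MPI variable and flag rather than the exact (elided) command from this comment:

```bash
# Make Hydra start its remote proxies over ssh instead of the resource manager's rsh shim
export I_MPI_HYDRA_BOOTSTRAP=ssh
mpirun -np 2 ./test

# or equivalently on the command line
mpirun -bootstrap ssh -np 2 ./test
```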
@mj-harvey thanks for the suggestion! Unfortunately playing with either
Here's my experience with Intel MPI 2019.7.217 with Slurm: it works with srun if I point I_MPI_PMI_LIBRARY to libpmi2.so (no other env vars are necessary to make it work with libpmi2.so), or, for some reason, if I don't set I_MPI_PMI_LIBRARY at all. With mpirun I get a floating point exception (SIGFPE) if multiple nodes are used. With
The "legacy" mpiexec.hydra runs without issue though.
I can avoid the SIGFPE using
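A sketch of the working srun path described above, assuming Slurm's PMI-2 plugin is installed and libpmi2.so sits in the usual system library directory (the path is site-specific and only illustrative):

```bash
# Tell Intel MPI to use Slurm's PMI-2 library; adjust the path for your installation
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 -N 2 --ntasks-per-node=1 ./test
```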
There's another issue with 2019.7 for me: the "mlx" provider does not work, I get:
(the default provider works, but it's not optimal then).
@bartoldeman thanks a lot for all the information! Switching to the native topology detection with
Regarding your error with the MLX provider, I have the same error messages as you from
One additional note: to make
I put the full output of
The MLX provider is properly selected and loaded, hence tcp and verbs are skipped. Even though MLX is never listed by
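For reference, a sketch of how the available libfabric providers and the underlying UCX build can be inspected with the standard command-line utilities (these may or may not be the exact commands whose output was attached above):

```bash
# List all providers the installed libfabric exposes (mlx, verbs, tcp, ...)
fi_info -l

# Show the UCX version and build configuration the mlx provider relies on
ucx_info -v
```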
In conclusion, it seems that we have two different solutions for update 7 depending on the resource manager:
The question now is: can/should EasyBuild do anything to automagically fix this issue? Or would it be preferable to add this information as a comment in the easyconfig of Intel MPI and let each site handle it on their own?
This should not be handled automatically by EB; an explanatory comment is what we need. This is one of the things hooks are good for: site tweaking of generic easyconfigs.
* if the system UCX is too old
* if you want to make Intel MPI work with srun via PMI2
* or if you need to avoid a floating point exception

The UCX dep could perhaps be unconditional once UCX 1.8.0 is merged.

Fixes easybuilders#10314
Closing as this is already fixed in |
I tested the execution of a simple inter-node job between two nodes over our InfiniBand network with updates 5, 6 and 7 of Intel MPI v2019, and I found very different results for each release. All tests were carried out with iccifort/2020.1.217 as the base of the toolchain.

Characteristics of the testing system
Steps to reproduce:
1. Load the impi module
2. mpicc ${EBROOTIMPI}/test/test.c -o test
3. mpirun ./test
Intel MPI v2019 update 5: works out of the box.

Intel MPI v2019 update 6: does NOT work out of the box, but can be fixed by using the verbs or tcp libfabric providers instead of mlx. mlx works for us only with UCX v1.7 (available in EasyBuild); mlx with version 1.9.0

Intel MPI v2019 update 7: does NOT work at all
(the execution does not stop, it just hangs at this point)
The system log of the node shows the following entry:
This error with IMPI v2019.7 happens way before initializing libfabric. Therefore, it does not depend on the provider or the version of UCX. It happens all the time.
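For update 6, by contrast, a sketch of the provider override described in the results above (FI_PROVIDER is the standard libfabric selector; the values come from this report):

```bash
# Work around the update 6 failure by forcing a non-mlx libfabric provider
export FI_PROVIDER=verbs   # or: export FI_PROVIDER=tcp
mpirun ./test
```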
Update