IMPI v2019.6: MLX provider in libfabric not working #10213
One additional note: it seems that downgrading OFED to v4.5 would fix the broken MLX provider, based on the report in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/842957. However, this is hardly a solution as that version is quite old; for instance, CentOS 7 is on OFED v4.7 already.
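As a side note, a quick way to check which OFED stack a node is actually running (a sketch; ofed_info is only available when Mellanox OFED is installed, and package names vary per distribution):

```bash
# short version string of the installed Mellanox OFED, e.g. MLNX_OFED_LINUX-4.7-...
ofed_info -s

# fallback: check the user-space RDMA packages on distributions using rdma-core
rpm -q rdma-core libibverbs
```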
I'm basically seeing the same issue during the
More info:
General purpose workaround:

@lexming But that's only advisable for systems with Infiniband though?

@boegel with the setting
It is also possible to use IMPI with an external libfabric. Therefore, this second test does not use the bundled libfabric or its mlx provider.

In conclusion, we have two workaround solutions at our disposal (see the sketch below):

1. force a working provider via the FI_PROVIDER environment variable (e.g. verbs), or
2. use an external libfabric (e.g. 1.9.1) instead of the fork bundled with IMPI.
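A rough shell sketch of both options, assuming an intel/2020a module environment; the libfabric/1.9.1 module name and the ./mpi_test binary are placeholders, not something taken from this thread:

```bash
# Option 1: pin libfabric to a provider that is known to work on this fabric
module load intel/2020a
export FI_PROVIDER=verbs            # or tcp as a slow fallback
mpirun -np 2 ./mpi_test

# Option 2: let IMPI use an external libfabric instead of its bundled 1.9.0a1-impi fork
module load intel/2020a libfabric/1.9.1    # hypothetical external libfabric module
export I_MPI_OFI_LIBRARY_INTERNAL=0        # IMPI 2019 switch to not use its own libfabric
unset FI_PROVIDER_PATH                     # drop the path to the bundled providers, if it was set
mpirun -np 2 ./mpi_test
```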
@lexming Thanks a lot for digging into this! My preference goes to using the external libfabric 1.9.1, since that avoids "hardcoding" stuff to Infiniband via FI_PROVIDER=verbs.

Can you clarify why setting … ?

Also, should we reach out to Intel support on this, and try to get some feedback on the best way forward (and maybe also ask how they managed to overlook this issue)?
Regarding Intel support: on our side, we will contact Intel about this issue. The real solution requires fixing IMPI v2019.6 as far as I can tell.
I tested on 2 clusters:

- the test works ok with …
- … works ok too!

Note that the fi_info list for the second case is longer (see the sketch below). Also note that the easyblock for Intel MPI has a parameter for this.
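A quick way to see the difference in available providers between the bundled and an external libfabric (a sketch; the libfabric/1.9.1 module name is an assumption):

```bash
# providers reported by the libfabric bundled with IMPI 2019.6
module load intel/2020a
fi_info -l | tee providers_bundled.txt

# providers reported by an external libfabric build (hypothetical module name)
module purge
module load libfabric/1.9.1
fi_info -l | tee providers_external.txt

diff providers_bundled.txt providers_external.txt
```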
Hmm, I get errors if I run
@bartoldeman thank you for the feedback. This is very interesting: on my system, executing the test with …

The reason I never tried …

If …
@boegel the sanity check command of
@lexming So, there's actually no real problem, as long as
@boegel yeah, with …

Keep in mind that using those packages (such as …)
There is some guidance about singleton MPI in the MPI standard (see the sketch below).
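For context, "singleton" here means starting an MPI binary directly, without mpirun/mpiexec or srun, so that MPI_Init has to bring up a single-rank MPI_COMM_WORLD on its own. A minimal sketch of the two launch modes (hello_mpi.c is a placeholder MPI program, not part of this thread):

```bash
# build a trivial MPI program with the Intel MPI compiler wrapper
mpiicc hello_mpi.c -o hello_mpi

# regular launch through the process manager
mpirun -np 2 ./hello_mpi

# singleton launch: no launcher involved; this is the mode that was reported
# to crash with IMPI v2019.6
./hello_mpi
```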
I'm hitting a serious issue with
See details in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/807359#comment-1955057.
@akesandgren has something in his hooks for this. I've borrowed this from his hooks in our local easyconfigs.
I got feedback from Intel on this specific issue (specifically, the crashes of executables linked against Intel MPI 2019.6 when executed directly, without mpirun or srun).
We got a new reply from Intel regarding this issue.
@lexming Can you ask them when they expect update 8 to be released? Feel free to tell them that this is holding us back from going forward with intel/2020a.
@boegel done, I'll update this issue as soon as I get any reply
FYI, IMPI v2019.8.254 has been released. I've tested a simple MPI program and it seems that the new release resolves the issue.
I confirm that the new update release IMPI v2019.8.254 fixes this issue. This can be tested with the easyconfig in #11337.
Intel has introduced a new MLX provider for libfabric in IMPI v2019.6, the one used in the intel/2020a toolchain. More info: https://software.intel.com/en-us/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband

Issue: currently, all executables fail to initialize MPI with intel/2020a on our nodes with Mellanox cards.

Steps to reproduce:
1. Install and load intel-2020a.eb (I'll use the full toolchain for simplicity).
2. Check if the provider of libfabric is listed as mlx. This can be done with the fi_info tool from IMPI v2019.6 in intel/2020a, as sketched below.
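For example (a sketch; the module name follows the easyconfig above and the exact output differs per system):

```bash
module load intel/2020a

# list all libfabric providers that the bundled libfabric knows about
fi_info -l

# query only the mlx provider; on the affected nodes this is the provider that misbehaves
fi_info -p mlx
```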
Result:
Output on our systems with Mellanox can be found at https://gist.github.com/lexming/fa6cd07bdb8e4d35be873b501935bb61
Solution/workaround:
I have not found a solution to the failing MLX provider. Moreover, the official libfabric project has removed the mlx provider altogether since version 1.9 due to lack of maintenance (ofiwg/libfabric@d8c8a2b). IMPI v2019.6 uses its own fork labelled 1.9.0a1-impi.

The workaround is to switch to a different provider by setting the FI_PROVIDER environment variable. On a system with a Mellanox card this can be set to tcp or verbs. Even though this works, the performance impact of this change is not clear, and it defeats the purpose of having a framework that can automatically detect the best transport layer.
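As a concrete sketch of this workaround (./mpi_test stands in for any MPI binary built with the toolchain; the exact debug output varies between IMPI updates):

```bash
module load intel/2020a

# skip the broken mlx provider and fall back to verbs (or tcp) explicitly
export FI_PROVIDER=verbs

# I_MPI_DEBUG makes Intel MPI print the libfabric version and provider it selected
# at startup, which confirms that mlx is no longer being picked
I_MPI_DEBUG=5 mpirun -np 2 ./mpi_test
```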