Segmentation fault with FFTW/3.3.9/gompi-2021.04 on x86 #12978
Comments
I suspect that this is a bug in OpenMPI 4.1. Tried (with the same, already compiled FFTW):
Bug found: open-mpi/ompi#9018
Fixed in #12983
EasyBlock enhanced to detect this: easybuilders/easybuild-easyblocks#2444
@Flamefire What's puzzling here is why others haven't reported this, even though FFTW was tested extensively on top of OpenMPI 4.1.1 via #12867. It's also surprising that a bug like this didn't surface at all during the pre-release testing of OpenMPI 4.1.1... Is there something special about your setup that explains why only you (so far) have been running into this?
I can tell with certainty that this bug in OMPI affects all users, because it is a real bug (it overwrites stack memory). The result can be anything; it depends, for example, on the number of processes mpirun is called with. In my case (1 process) it overwrites a vital table, which then leads to the crash. When more processes are used, it may not crash immediately and may even appear to work, or it may silently use wrong values for something. So the only "special" thing I can imagine is that it is run with 1 process. Maybe the others were just "lucky" that the value written to the (wrong) memory was the same as what was there before, so it didn't actually change anything. This depends on the system memory allocator.
I understand that this type of problem can only surface under specific circumstances, but I'm still a bit surprised it's so easy to trigger for you, while others haven't seen it. I can't seem to trigger the issue at all on CentOS 7.9, even when using a single process.
IIRC it writes a zero byte into some variable which holds a memory address. I guess it is "just" likely that this byte is zero already. In any case, I wouldn't recommend shipping a 2021a without this patch.
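To illustrate why a stray zero byte written into a stored address can go unnoticed, here is a minimal Python sketch. It is only a simulation, not the actual OpenMPI code path: the address value, the helper name `zero_byte`, and the byte positions are all hypothetical, and which byte the real bug touches is not stated in this thread. It relies on one general fact: on x86-64 Linux with 48-bit virtual addresses, the two most significant bytes of a user-space address are already zero.

```python
import struct


def zero_byte(address: int, byte_index: int) -> int:
    """Return `address` with one of its 8 little-endian bytes forced to zero.

    Simulates a stray one-byte write landing inside a 64-bit pointer variable.
    """
    raw = bytearray(struct.pack("<Q", address))  # 8 bytes, least significant first
    raw[byte_index] = 0                          # the stray write
    return struct.unpack("<Q", bytes(raw))[0]


# Hypothetical user-space address, chosen only for illustration.
address = 0x00007F3A1C2B4D10

for index in range(8):
    corrupted = zero_byte(address, index)
    status = "unchanged" if corrupted == address else "CORRUPTED"
    print(f"byte {index}: {corrupted:#018x} {status}")
```

In this sketch, a write to byte 6 or 7 leaves the address intact because those bytes were zero to begin with, while a write to any other byte yields a different address that may crash on use or silently point at the wrong data. Whether a given run is affected then comes down to where the write lands and what was stored there before.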
I'm unable to install FFTW-3.3.9-gompi-2021.04.eb on a Haswell node due to a segfault during the tests:
This is an E5-2680 v3 CPU on a node running RHEL 7.9. Rebuilding OpenMPI didn't help, and the error always happens.