
Segmentation fault with FFTW/3.3.9/gompi-2021.04 on x86 #12978

Closed
Flamefire opened this issue May 28, 2021 · 6 comments · Fixed by #12983

@Flamefire
Contributor

I'm unable to install FFTW-3.3.9-gompi-2021.04.eb on a haswell node due to a segfault during the tests:

make[3]: Entering directory `/tmp/s3248973-EasyBuild/FFTW/3.3.9/gompi-2021.04/fftw-3.3.9/mpi'
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 1 `pwd`/mpi-bench"
Executing "mpirun -np 1 /tmp/s3248973-EasyBuild/FFTW/3.3.9/gompi-2021.04/fftw-3.3.9/mpi/mpi-bench --verbose=1   --verify 'ok]4bx11hx10e00' --verify 'ik]4bx11hx10e00' --verify 'obr[48x18' --verify 'ibr[48x18' --verify 'obc[48x18' --verify 'ibc[48x18' --verify 'ofc[48x18' --verify 'ifc[48x18' --verify 'okd]5e11x2hx11e00x5o10v2' --verify 'ikd]5e11x2hx11e00x5o10v2' --verify 'okd11bx7e01x6h' --verify 'ikd11bx7e01x6h' --verify 'ofr]10x5x9x11' --verify 'ifr]10x5x9x11' --verify 'obc]10x5x9x11' --verify 'ibc]10x5x9x11' --verify 'ofc]10x5x9x11' --verify 'ifc]10x5x9x11' --verify 'obr[12x8x6x2' --verify 'ibr[12x8x6x2' --verify 'obc[12x8x6x2' --verify 'ibc[12x8x6x2' --verify 'ofc[12x8x6x2' --verify 'ifc[12x8x6x2' --verify 'ok]45hx16o00' --verify 'ik]45hx16o00' --verify 'obr2x7v2' --verify 'ibr2x7v2' --verify 'ofr2x7v2' --verify 'ifr2x7v2' --verify 'obc2x7v2' --verify 'ibc2x7v2' --verify 'ofc2x7v2' --verify 'ifc2x7v2' --verify 'obcd660' --verify 'ibcd660' --verify 'ofcd660' --verify 'ifcd660' --verify 'obr[6x12' --verify 'ibr[6x12' --verify 'obc[6x12' --verify 'ibc[6x12' --verify 'ofc[6x12' --verify 'ifc[6x12'"
[taurusa5:04118] *** Process received signal ***
[taurusa5:04118] Signal: Segmentation fault (11)
[taurusa5:04118] Signal code:  (128)
[taurusa5:04118] Failing at address: (nil)
[taurusa5:04118] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x2b72d76f15f0]
[taurusa5:04118] [ 1] /scratch/ws/1/s3248973-EasyBuild/easybuild-haswell/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmca_common_ofi.so.10(opal_mca_common_ofi_select_provider+0x1f0)[0x2b72d9982dd0]
[taurusa5:04118] [ 2] /scratch/ws/1/s3248973-EasyBuild/easybuild-haswell/software/OpenMPI/4.1.1-GCC-10.3.0/lib/openmpi/mca_btl_ofi.so(+0x3469)[0x2b72d9979469]
[taurusa5:04118] [ 3] /scratch/ws/1/s3248973-EasyBuild/easybuild-haswell/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libopen-pal.so.40(mca_btl_base_select+0x109)[0x2b72d7dfb019]
[taurusa5:04118] [ 4] /scratch/ws/1/s3248973-EasyBuild/easybuild-haswell/software/OpenMPI/4.1.1-GCC-10.3.0/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x14)[0x2b72d9948764]
[taurusa5:04118] [ 5] /scratch/ws/1/s3248973-EasyBuild/easybuild-haswell/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(mca_bml_base_init+0x81)[0x2b72d6fa7161]
[taurusa5:04118] [ 6] /scratch/ws/1/s3248973-EasyBuild/easybuild-haswell/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_mpi_init+0x67b)[0x2b72d6fea71b]
[taurusa5:04118] [ 7] /scratch/ws/1/s3248973-EasyBuild/easybuild-haswell/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(PMPI_Init_thread+0x57)[0x2b72d6f8b737]
[taurusa5:04118] [ 8] /tmp/s3248973-EasyBuild/FFTW/3.3.9/gompi-2021.04/fftw-3.3.9/mpi/.libs/mpi-bench[0x405815]
[taurusa5:04118] [ 9] /tmp/s3248973-EasyBuild/FFTW/3.3.9/gompi-2021.04/fftw-3.3.9/mpi/.libs/mpi-bench[0x406c14]
[taurusa5:04118] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b72d7920505]
[taurusa5:04118] [11] /tmp/s3248973-EasyBuild/FFTW/3.3.9/gompi-2021.04/fftw-3.3.9/mpi/.libs/mpi-bench[0x402d3e]
[taurusa5:04118] *** End of error message ***

This is an E5-2680 v3 CPU on a node running RHEL 7.9. Rebuilding OpenMPI didn't help, and the error occurs every time.

@Flamefire
Contributor Author

Flamefire commented May 28, 2021

I suspect that this is a bug in OpenMPI 4.1.

Tried the following (with the same, already-compiled FFTW):

  • OpenMPI/4.1.0-GCC-10.2.0: Broken
  • OpenMPI/4.1.1-GCC-10.3.0: Broken
  • OpenMPI/4.0.5-GCC-10.2.0: Works
  • OpenMPI/4.0.5-GCC-10.3.0: Works

@Flamefire
Contributor Author

Bug found: open-mpi/ompi#9018

Fixed in #12983

EasyBlock enhanced to detect this: easybuilders/easybuild-easyblocks#2444

@boegel
Member

boegel commented May 29, 2021

@Flamefire What's puzzling here is why others haven't reported this, even though FFTW was tested extensively on top of OpenMPI 4.1.1 via #12867.

It's also surprising that a bug like this didn't surface at all during the pre-release testing of OpenMPI 4.1.1...

Is there something special about your setup that would explain why only you (so far) have run into this?

@Flamefire
Contributor Author

I can say with certainty that this OMPI bug affects all users, because it is a real bug: it overwrites stack memory.

The result of that can be anything; it depends, for example, on the number of processes mpirun is called with. In my case (1 process) it overwrites a vital table, which then leads to the crash. With more processes it may not crash immediately, it may even appear to work, or it may silently use wrong values for something.

So the only "special" thing I can imagine is that the test is run with 1 process. Maybe others were just "lucky" in that the value written into the (wrong) memory was the same as what was already there, so nothing actually changed. That also depends on the system memory allocator.
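
For illustration, here is a minimal, hypothetical C sketch of that failure mode. It is not the actual Open MPI code; the struct, field names, and sizes are invented. It shows how an out-of-bounds write can clobber an adjacent pointer and cause a segfault, or change nothing visible if the clobbered bytes already held the written value:

/*
 * Hypothetical sketch, NOT the actual Open MPI code: it only illustrates
 * the failure mode described above (open-mpi/ompi#9018), where an
 * out-of-bounds write clobbers adjacent memory, so the visible outcome
 * depends on what that memory held before.
 */
#include <stdio.h>
#include <string.h>

struct state {
    char  name[8];   /* undersized buffer (invented for the example)     */
    char *table;     /* a "vital" pointer stored right behind the buffer */
};

int main(void) {
    static char data[] = "important";
    struct state s = { "ofi", data };

    /* Zeroing 16 bytes into an 8-byte field also zeroes s.table.
     * Had s.table already been zero, or had the stray bytes matched its
     * previous contents, nothing visible would change, which is why
     * other systems may not notice the bug at all. */
    memset(s.name, 0, 16);        /* out-of-bounds write, on purpose */

    printf("%c\n", s.table[0]);   /* dereferences NULL here: segfault */
    return 0;
}

Whether such a program crashes, prints garbage, or appears to work comes down to what happens to sit next to the overrun buffer, which varies with the allocator, process count, and build settings.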

@boegel
Member

boegel commented May 29, 2021

I understand that this type of problem can only surface under specific circumstances, but I'm still a bit surprised it's so easy to trigger for you, while others haven't seen it.

I can't seem to trigger the issue at all on CentOS 7.9, even when using a single process.

@Flamefire
Contributor Author

IIRC it writes a zero byte into a variable that holds a memory address. I guess it is "just" likely that this byte is already zero, in which case the overwrite changes nothing.

In any case, I wouldn't recommend shipping a 2021a toolchain without this patch.
