Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/psm3: illegal instruction #10589

Open
nmorey opened this issue Nov 28, 2024 · 7 comments
Open

prov/psm3: illegal instruction #10589

nmorey opened this issue Nov 28, 2024 · 7 comments
Assignees

Comments

@nmorey
Copy link
Contributor

nmorey commented Nov 28, 2024

Similar issue to #8933

Running libfabric 1.22, I still hit the same issue (probably only now because we had a GCC update):

 # gdb --args ./simple
GNU gdb (GDB; SUSE Linux Enterprise 15) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./simple...
(gdb) r
Starting program: /root/simple 
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.38-150600.14.14.2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 3597]
[New Thread 0x7ffff79636c0 (LWP 3601)]
[New Thread 0x7ffff6f4c6c0 (LWP 3602)]

Thread 1 "simple" received signal SIGILL, Illegal instruction.
0x00007ffff59bdee3 in psm3_getenv_range () from /usr/lib64/libfabric.so.1
(gdb) bt
#0  0x00007ffff59bdee3 in psm3_getenv_range () from /usr/lib64/libfabric.so.1
#1  0x00007ffff59c0fad in psm3_getenv () from /usr/lib64/libfabric.so.1
#2  0x00007ffff59797c2 in psm3_getenv_bool () from /usr/lib64/libfabric.so.1
#3  0x00007ffff59518a1 in psmx3_param_get_bool () from /usr/lib64/libfabric.so.1
#4  0x00007ffff5952161 in fi_psm3_ini () from /usr/lib64/libfabric.so.1
#5  0x00007ffff5641765 in fi_ini () from /usr/lib64/libfabric.so.1
#6  0x00007ffff5641d1b in fi_getinfo () from /usr/lib64/libfabric.so.1
#7  0x00007ffff66eae27 in mca_btl_ofi_component_init (num_btl_modules=0x7fffffffd914, enable_progress_threads=<optimized out>, enable_mpi_threads=<optimized out>) at btl_ofi_component.c:357
#8  0x00007ffff7ac438f in mca_btl_base_select (enable_progress_threads=true, enable_mpi_threads=false) at base/btl_base_select.c:110
#9  0x00007ffff66f6512 in mca_bml_r2_component_init (priority=0x7fffffffd994, enable_progress_threads=<optimized out>, enable_mpi_threads=<optimized out>) at bml_r2_component.c:86
#10 0x00007ffff7f243f4 in mca_bml_base_init (enable_progress_threads=enable_progress_threads@entry=true, enable_mpi_threads=false) at base/bml_base_init.c:74
#11 0x00007ffff7f73976 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=0, provided=provided@entry=0x7fffffffdac4, reinit_ok=reinit_ok@entry=false) at runtime/ompi_mpi_init.c:613
#12 0x00007ffff7f0700e in PMPI_Init (argc=0x7fffffffdb1c, argv=0x7fffffffdb10) at pinit.c:67
#13 0x0000000000400909 in main ()
(gdb) frame 4
#4  0x00007ffff5952161 in fi_psm3_ini () from /usr/lib64/libfabric.so.1
(gdb) q
A debugging session is active.

	Inferior 1 [process 3594] will be killed.

Quit anyway? (y or n) y

My best guess is that GCC added some unsupported vector instructions in some of the code called through PSMX3_INFO
I've tried a quick ad not too clean patch to check it out:

diff --git a/prov/psm3/src/psmx3_init.c b/prov/psm3/src/psmx3_init.c
index 29359d3ea348..cc259a1b6301 100644
--- a/prov/psm3/src/psmx3_init.c
+++ b/prov/psm3/src/psmx3_init.c
@@ -680,18 +680,6 @@ static int psmx3_getinfo(uint32_t api_version, const char *node,
 
    PSMX3_INFO(&psmx3_prov, FI_LOG_CORE,"\n");
 
-   __builtin_cpu_init();
-   if (!__builtin_cpu_supports(PSM3_MARCH)) {
-       PSMX3_INFO(&psmx3_prov, FI_LOG_CORE,
-           "CPU does not support '%s'.\n", PSM3_MARCH);
-       OFI_INFO_STR(&psmx3_prov,
-           (__builtin_cpu_supports("avx2") ? "AVX2" :
-               (__builtin_cpu_supports("avx") ? "AVX" :
-                   (__builtin_cpu_supports("sse4.2") ? "SSE4.2" : "unknown"))),
-           PSM3_MARCH, "CPU Supports", "PSM3 Built With");
-       goto err_out;
-   }
-
    if (psmx3_init_prov_info(hints, &prov_info))
        goto err_out;
 
@@ -895,6 +883,19 @@ struct fi_provider psmx3_prov = {
 
 PROVIDER_INI
 {
+   __builtin_cpu_init();
+   if (!__builtin_cpu_supports(PSM3_MARCH)) {
+       FI_INFO(&core_prov, FI_LOG_CORE,
+           "PSM3: CPU does not support '%s'.\n", PSM3_MARCH);
+       OFI_INFO_STR(&core_prov,
+           (__builtin_cpu_supports("avx2") ? "AVX2" :
+               (__builtin_cpu_supports("avx") ? "AVX" :
+                   (__builtin_cpu_supports("sse4.2") ? "SSE4.2" : "unknown"))),
+           PSM3_MARCH, "CPU Supports", "PSM3 Built With");
+       return NULL;
+   }
+
+
    psmx3_prov.version = get_psm3_provider_version();
 
    PSMX3_INFO(&psmx3_prov, FI_LOG_CORE, "build options: VERSION=%u.%u=%u.%u.%u.%u, "

And this works.

I haven't checked v2.0.0 but I'm guessing it is broken there as well

@nmorey nmorey added the bug label Nov 28, 2024
@nmorey
Copy link
Contributor Author

nmorey commented Dec 2, 2024

This only partially solved the issue. Because of all the attribute ((constructor)) used in prov/psm3, this test is not enough !

Program received signal SIGILL, Illegal instruction.
0x00007ffff7644341 in init_picos_per_cycle () from /usr/lib64/libfabric.so.1
Missing separate debuginfos, use: zypper install libefa1-debuginfo-54.0-1.1.x86_64 libibverbs1-debuginfo-54.0-1.1.x86_64 libnl3-200-debuginfo-3.11.0-1.1.x86_64 libnuma1-debuginfo-2.0.18.10.g6c14bd5-1.1.x86_64 libpsm2-2-debuginfo-12.0.1-2.2.x86_64 librdmacm1-debuginfo-54.0-1.1.x86_64 libucm0-debuginfo-1.17.0-3.1.x86_64 libucp0-debuginfo-1.17.0-3.1.x86_64 libuct0-debuginfo-1.17.0-3.1.x86_64 libuuid1-debuginfo-2.40.2-2.1.x86_64
(gdb) bt
#0  0x00007ffff7644341 in init_picos_per_cycle () from /usr/lib64/libfabric.so.1
#1  0x00007ffff7fc969e in call_init () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7fc979c in _dl_init () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7fc65fe in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7fd06ce in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7fc6571 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7fd0b2c in _dl_open () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7c93a3c in dlopen_doit () from /lib64/libc.so.6
#8  0x00007ffff7fc6571 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7fc66a3 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00007ffff7c934e7 in _dlerror_run () from /lib64/libc.so.6
#11 0x00007ffff7c93b01 in dlopen@GLIBC_2.2.5 () from /lib64/libc.so.6
#12 0x00005555555551d5 in ?? ()
#13 0x00007ffff7c2a2ae in __libc_start_call_main () from /lib64/libc.so.6
#14 0x00007ffff7c2a379 in __libc_start_main_impl () from /lib64/libc.so.6
#15 0x00005555555550b5 in ?? ()
Dump of assembler code for function init_picos_per_cycle:
   0x00007ffff7644340 <+0>:	push   %rbp
=> 0x00007ffff7644341 <+1>:	vpxor  %xmm0,%xmm0,%xmm0

@j-xiong
Copy link
Contributor

j-xiong commented Dec 4, 2024

One way is to put the check in the "_init" function which is called before all the "constructor" functions.

@nmorey
Copy link
Contributor Author

nmorey commented Dec 4, 2024

I'm not sure you can prevent constructor from running.
In my last logs, they are dlopened so we might be able to check something before that, but if libfabric had been linked, constructors would have been run before main() anyway. So if they contain unsupported instruction, it all breaks whether we do try to use PSM3 or not

@raffenet
Copy link
Contributor

raffenet commented Dec 4, 2024

The use of constructors also causes issues when statically linking libfabric into your app or library #8979. It would be great if the code could be delayed until after provider selection.

@nmorey
Copy link
Contributor Author

nmorey commented Dec 5, 2024

The most annoying thing in there is that the only reason to force AVX down everyone's throat is (see #8933):

We require AVX/AVX2 to compile because there is a significant performance improvement observed in most environments (when compiled with). We also do not technically support/validate on older CPUs (AVX was added ~10 years ago).

At least the only one I've seen.

SSE4.2 really does bring some support for CRC computation (although some legacy code for really old HW would be easy to add and handle at runtime). The rest is compiler optimisation...

And that should be the call of each user/packager/distro to build how they want for their use case: most portable, most optimized, or anything in between.

For now, I've had to completely disable psm3 AVX[12] support in openSUSE/SLES builds because that's the only way to avoid breaking libfabric (whteher using psm3 or not) on all non AVX supporting systems.

@acgoldma
Copy link
Contributor

acgoldma commented Dec 9, 2024

Hi @nmorey, A summary of the issue is that avx instruction set 'required' by psm3 is causing issues as init code (including constructors) are hitting these instructions before the check in psm3 code.

Originally, it was written that only specific objects in psm3 would have this flag applied when compiling so that it would not pollute the other processes. It seems this logic is broken. I will work with my team to fix this.

Hi @raffenet, As for the constructors, I will discuss with my team and see if we can remove them and merge the code into the init calls where possible.

@nmorey
Copy link
Contributor Author

nmorey commented Dec 9, 2024

I'll repeat what I said in both issues and in an email to you:
There should NOT be a dependency on AVX if the only justification for it is:
there is a significant performance improvement

Half the software out there (HPC, video encoding, pretty much anything computational) can get better performances with the right mtune/march. However very very few have hard requirements on an instruction set.
Either:

  • They have IS specific function that are picked at runtime to use the most optimized version (while still supporting old 'dumb' i386)
  • They let packager, distro, vendors, users and/or dev make a choice on whether they want greater portability or greater performances.

Intel has made the choice for us here and it has proven yet again to cause issues.
Can't we drop this requirement? And just add an entry in the doc that says: use march=native for better performance if your CPU has AVX support.
I'm pretty sure this does apply to the whole libfabric and not just psm3 anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants