Dynamic detection of SSE42 #75

Merged: 1 commit merged into openresty:v2.1-agentzh on Oct 27, 2020
Conversation

siddhesh (Collaborator)

This series of patches moves the CRC32-based lj_str_hash implementation into the main sources and builds it with -msse4.2. We also move the CPU feature detection into a DSO constructor and use it to patch in the optimised lj_str_hash if we are running on a CPU with SSE4.2, making the binaries more portable across x86 platforms.

This should obsolete #20 and also give a good framework to implement #21, making it a simple matter of adding the conditional compilation (and feature check) for crc32 and defining the right crc32 macros.
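
To show the shape of the approach, here is a minimal, self-contained sketch (illustrative only, not the actual patch; the real code reuses LuaJIT's own CPUID detection and the existing lj_str_hash sources, and every name below is made up for the example):

/* Minimal sketch of the runtime-dispatch idea, not the PR's actual code.
 * A constructor runs before main(), checks the CPU once, and swaps in the
 * SSE4.2 hash via a function pointer; the portable hash stays the default. */
#include <stdint.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h>   /* _mm_crc32_u8 */
#endif

/* Portable fallback hash (FNV-1a, just as a stand-in). */
static uint32_t hash_portable(const char *s, size_t n)
{
  uint32_t h = 2166136261u;
  while (n--) h = (h ^ (uint8_t)*s++) * 16777619u;
  return h;
}

#if defined(__x86_64__) || defined(__i386__)
/* Only this function is compiled for SSE4.2; the rest of the binary is not. */
__attribute__((target("sse4.2")))
static uint32_t hash_sse42(const char *s, size_t n)
{
  uint32_t h = 0;
  while (n--) h = _mm_crc32_u8(h, (uint8_t)*s++);
  return h;
}
#endif

/* The active implementation, defaulting to the portable one. */
static uint32_t (*str_hash)(const char *, size_t) = hash_portable;

__attribute__((constructor))
static void init_cpuflags(void)
{
#if defined(__x86_64__) || defined(__i386__)
  __builtin_cpu_init();
  if (__builtin_cpu_supports("sse4.2"))
    str_hash = hash_sse42;   /* only reached on SSE4.2-capable CPUs */
#endif
}

In the patch itself the same effect comes from compiling lj_str_hash.c into its own object with -msse4.2 and selecting it from a constructor in lj_init.c, rather than from a per-function target attribute.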

@siddhesh (Collaborator, Author)

FYI, the test failure is due to the fix for LuaJIT/LuaJIT#494 and not due to this patch.

@siddhesh (Collaborator, Author)

Added some more patches to fix building with make amalg. I think we need to drop make amalg in favour of LTO though, since LTO has come quite a long way over the last few years.

@capr commented Oct 31, 2019

Sorry if this sounds stupid, but would this patch change #60 in any way? Would it make that problem worse/better/same?

@siddhesh (Collaborator, Author)

It would remain the same.

@siddhesh (Collaborator, Author)

Reviving this at the request of @agentzh. The approach is slightly different, keeping in mind that luajit2 wants to remain compatible with LuaJIT/LuaJIT. That is, I've duplicated the relevant portion of the CPU flag check instead of moving the lj_cpudetect function, so that future merges are a wee bit easier.

It still won't protect luajit2 from the possibility of Mike changing all of string hashing from under them ;)

@siddhesh (Collaborator, Author)

Updated to credit @fsfod in the commit log for Windows support.

@siddhesh (Collaborator, Author)

The test failure should be fixed with openresty/luajit2-test-suite#9.

@agentzh (Member) commented Oct 27, 2020

@siddhesh I'm seeing warnings when cross-compiling for aarch64 and armv7:

lj_init.c: In function 'lj_init_cpuflags':
lj_init.c:68:3: warning: implicit declaration of function 'str_hash_init' [-Wimplicit-function-declaration]
   68 |   str_hash_init (flags);
      |   ^~~~~~~~~~~~~
lj_init.c: In function 'lj_init_cpuflags':
lj_init.c:68:3: warning: implicit declaration of function 'str_hash_init' [-Wimplicit-function-declaration]
   68 |   str_hash_init (flags);
      |   ^~~~~~~~~~~~~

@agentzh (Member) commented Oct 27, 2020

@siddhesh maybe we should add this extra patch?

diff --git a/src/lj_init.c b/src/lj_init.c
index 5c20f15c5d..e963763013 100644
--- a/src/lj_init.c
+++ b/src/lj_init.c
@@ -46,7 +46,10 @@ static void str_hash_init(uint32_t flags)
    convenient. */
 LJ_INITIALIZER(lj_init_cpuflags)
 {
+#ifdef LJ_HAS_OPTIMISED_HASH
   uint32_t flags = 0;
+#endif
+
 #if LJ_TARGET_X86ORX64

   uint32_t vendor[4];
@@ -64,6 +67,8 @@ LJ_INITIALIZER(lj_init_cpuflags)

 #endif

+#ifdef LJ_HAS_OPTIMISED_HASH
   /* The reason why we initialized early: select our string hash functions.  */
   str_hash_init (flags);
+#endif
 }

@siddhesh (Collaborator, Author)

Oops, the LJ_HAS_OPTIMISED_HASH ifdef should encompass both the str_hash_init and lj_init_cpuflags functions; lj_init_cpuflags is unnecessary if LJ_HAS_OPTIMISED_HASH is not defined.
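
Roughly, the arrangement would look like this (an illustrative sketch, not the final patch; LJ_INITIALIZER here is a local stand-in approximating LuaJIT's constructor macro on GCC/Clang):

#include <stdint.h>

#ifdef LJ_HAS_OPTIMISED_HASH

/* Stand-in for LuaJIT's initializer macro (GCC/Clang constructor). */
#define LJ_INITIALIZER(name) \
  static void name(void) __attribute__((constructor)); \
  static void name(void)

static void str_hash_init(uint32_t flags)
{
  /* Select the SSE4.2 hash functions if the relevant flag bit is set;
     otherwise keep the portable defaults. */
  (void)flags;
}

LJ_INITIALIZER(lj_init_cpuflags)
{
  uint32_t flags = 0;
  /* ... CPUID-based feature detection (x86/x64 only) ... */
  str_hash_init(flags);
}

#endif  /* LJ_HAS_OPTIMISED_HASH */

With that guard in place, targets that define no optimised hash compile neither function, so the cross-compile warnings above and the dead initializer both go away.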

@agentzh (Member) commented Oct 27, 2020

@siddhesh I just tried this on a modern Intel CPU (Core i9-9900K) and built this branch with the following command:

make CC=gcc Q= PREFIX=$PREFIX CCDEBUG=-g -j8 XCFLAGS='-DLUAJIT_NUMMODE=2 -DLUAJIT_ENABLE_LUA52COMPAT -O2'

But gdb shows that it never invokes the SSE 4.2 primitives:

Reading symbols from /opt/luajit21/bin/luajit...
(gdb) b lj_str_new
Breakpoint 1 at 0x406930: file lj_str.c, line 340.
(gdb) r
Starting program: /opt/luajit21/bin/luajit a.lua

Breakpoint 1, lj_str_new (L=0x7ffff7fd8380, str=0x466fd8 "__index__newindex__gc__mode__eq__len__lt__le__concat__call__add__sub__mul__div__mod__pow__unm__metatable__tostring__new__pairs__ipairs", lenx=7) at lj_str.c:340
340	  global_State *g = G(L);
(gdb) n
341	  if (lenx-1 < LJ_MAX_STR-1) {
(gdb) n
342	    MSize len = (MSize)lenx;
(gdb) n
343	    StrHash hash = hash_sparse(g->str.seed, str, len);
(gdb) s
hash_sparse_def (seed=15926837650161584487, str=0x466fd8 "__index__newindex__gc__mode__eq__len__lt__le__concat__call__add__sub__mul__div__mod__pow__unm__metatable__tostring__new__pairs__ipairs", len=7) at lj_str.c:103
103	  StrHash a, b, h = len ^ (StrHash)seed;
(gdb) n
104	  if (len >= 4) {  /* Caveat: unaligned access! */
(gdb) n
107	    b = lj_getu32(str+(len>>1)-2);
(gdb) s
lj_getu32 (p=<optimized out>) at lj_def.h:248
248	  return ((const Unaligned32 *)p)->u;
(gdb) n
hash_sparse_def (seed=<optimized out>, str=0x466fd8 "__index__newindex__gc__mode__eq__len__lt__le__concat__call__add__sub__mul__div__mod__pow__unm__metatable__tostring__new__pairs__ipairs", len=<optimized out>) at lj_str.c:108
108	    h ^= b; h -= lj_rol(b, 14);
(gdb) s
116	  a ^= h; a -= lj_rol(h, 11);
(gdb) n
117	  b ^= a; b -= lj_rol(a, 25);
(gdb) n
118	  h ^= b; h -= lj_rol(b, 16);

I do see that only two .c files are compiled with the -msse4.2 flag:

gcc -fPIC -g -O2 -fomit-frame-pointer -Wall -DLUAJIT_NUMMODE=2 -DLUAJIT_ENABLE_LUA52COMPAT -O2  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -U_FORTIFY_SOURCE  -DLUA_ROOT=\"/opt/luajit21\" -DLUA_MULTILIB=\"lib\" -fno-stack-protector   -msse4.2 -c -o lj_str_hash_dyn.o lj_str_hash.c
gcc -g -O2 -fomit-frame-pointer -Wall -DLUAJIT_NUMMODE=2 -DLUAJIT_ENABLE_LUA52COMPAT -O2  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -U_FORTIFY_SOURCE  -DLUA_ROOT=\"/opt/luajit21\" -DLUA_MULTILIB=\"lib\" -fno-stack-protector   -msse4.2 -c -o lj_str_hash.o lj_str_hash.c

Am I missing anything here? The current v2.1-agentzh branch works fine when specifying -msse4.2 unconditionally.

@agentzh (Member) commented Oct 27, 2020

For comparison, when I specify -msse4.2 unconditionally to build the v2.1-agentzh branch and run the same Lua script on the same machine, I get the desired result:

Reading symbols from /opt/luajit21/bin/luajit...
(gdb) b lj_str_new
Breakpoint 1 at 0x406bc0: file lj_str.c, line 329.
(gdb) r
Starting program: /opt/luajit21/bin/luajit a.lua

Breakpoint 1, lj_str_new (L=0x7ffff7fd8380, str=0x467298 "__index__newindex__gc__mode__eq__len__lt__le__concat__call__add__sub__mul__div__mod__pow__unm__metatable__tostring__new__pairs__ipairs", lenx=7) at lj_str.c:329
329	  global_State *g = G(L);
(gdb) n
330	  if (lenx-1 < LJ_MAX_STR-1) {
(gdb) n
331	    MSize len = (MSize)lenx;
(gdb) n
332	    StrHash hash = hash_sparse(g->str.seed, str, len);
(gdb) s
hash_sparse (seed=9592180116011563192, str=0x467298 "__index__newindex__gc__mode__eq__len__lt__le__concat__call__add__sub__mul__div__mod__pow__unm__metatable__tostring__new__pairs__ipairs", len=7) at x64/src/lj_str_hash_x64.h:254
254	  if (len < 4 || len >= 128)
(gdb) n
257	  if (len >= 16) /* [16, 128) */
(gdb) n
252	static uint32_t hash_sparse(uint64_t seed, const char* str, size_t len)
(gdb) s
hash_sparse (len=7, str=0x467298 "__index__newindex__gc__mode__eq__len__lt__le__concat__call__add__sub__mul__div__mod__pow__unm__metatable__tostring__new__pairs__ipairs", seed=9592180116011563192) at x64/src/lj_str_hash_x64.h:252
252	static uint32_t hash_sparse(uint64_t seed, const char* str, size_t len)
(gdb) s
hash_sparse_4_16 (len=7, str=0x467298 "__index__newindex__gc__mode__eq__len__lt__le__concat__call__add__sub__mul__div__mod__pow__unm__metatable__tostring__new__pairs__ipairs", seed=9592180116011563192) at x64/src/lj_str_hash_x64.h:261
261	  return hash_sparse_4_16(seed, str, len);
(gdb) s
79	    v1 = *cast_uint32p(str);
(gdb) n
80	    v2 = *cast_uint32p(str + len - 4);
(gdb) n
841	  return __builtin_ia32_crc32si (__C, __V);

BTW, I'm using this simple Lua script for the tests above:

local a = string.rep("a", 20)
local b = a .. "c"
print("ok")

@agentzh (Member) commented Oct 27, 2020

My CPU info:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
stepping    : 12
microcode   : 0xae
cpu MHz     : 800.365
cache size  : 16384 KB
physical id : 0
siblings    : 16
core id     : 0
cpu cores   : 8
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
bugs        : spectre_v1 spectre_v2 spec_store_bypass mds
bogomips    : 7200.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

@agentzh (Member) commented Oct 27, 2020

@siddhesh maybe the CPUID detection is buggy?

@agentzh (Member) commented Oct 27, 2020

@siddhesh Seems like the constructor routine lj_init_cpuflags is never called. The linker optimizes it away since no other CUs directly reference any symbols in this lj_init.o CU.

@agentzh (Member) commented Oct 27, 2020

@siddhesh The following (hacky) patch seems to fix it. Do you know a better way?

diff --git a/src/lj_init.c b/src/lj_init.c
index e963763013..4774a6e2c2 100644
--- a/src/lj_init.c
+++ b/src/lj_init.c
@@ -4,6 +4,10 @@
 #include "lj_vm.h"
 #include "lj_str.h"

+/* meant to be referenced by the lj_state.c CU so that the linker won't exclude
+ * this CU. */
+int lj_init_used = 0;
+
 #if LJ_TARGET_ARM && LJ_TARGET_LINUX
 #include <sys/utsname.h>
 #endif
diff --git a/src/lj_state.c b/src/lj_state.c
index a2e5fdcb5a..83e1d738d1 100644
--- a/src/lj_state.c
+++ b/src/lj_state.c
@@ -30,6 +30,8 @@
 #include "lj_alloc.h"
 #include "luajit.h"

+extern int lj_init_used;  /* defined in lj_init.c */
+
 /* -- Stack handling ------------------------------------------------------ */

 /* Stack sizes. */
@@ -201,6 +203,9 @@ LUA_API lua_State *lua_newstate(lua_Alloc allocf, void *allocd)
   GG_State *GG;
   lua_State *L;
   global_State *g;
+
+  lj_init_used = 1;  /* enforce the linker to include the lj_init.c CU */
+
   /* We need the PRNG for the memory allocator, so initialize this first. */
   if (!lj_prng_seed_secure(&prng)) {
     lj_assertX(0, "secure PRNG seeding failed");

@siddhesh (Collaborator, Author)

Sorry, I think I messed up the build when I shuffled the functions around from moonjit. I'll take a closer look at it and submit an update.

@agentzh (Member) commented Oct 27, 2020

@siddhesh OK, thanks

This is a port of the dynamic SSE4.2 detection feature from moonjit.
This makes luajit2 builds portable since SSE4.2 string hash functions
are now built separately and chosen at runtime based on whether the
CPU supports it.

This patch also includes work by Thomas Fransham in moonjit to support
Windows builds.
@agentzh (Member) commented Oct 27, 2020

@siddhesh I wonder if there's any easy way to test it. I found my AMD CPUs already support SSE4.2 since they use the latest Zen 2 architecture...Alas.

@agentzh (Member) commented Oct 27, 2020

OK, never mind, seems like qemu is the best way to test an x64 CPU without SSE 4.2 :)

@siddhesh (Collaborator, Author)

> @siddhesh I wonder if there's any easy way to test it. I found my AMD CPUs already support SSE4.2 since they use the latest Zen 2 architecture... Alas.

I'm afraid my machines are all new too, so you'll need a volunteer to test the hash_sparse_def paths; maybe someone who complained about the sse4.2 requirement in the past could help :)

That last push should hopefully address all the issues. There was an implicit dependency (the LJ_CPU_FLAGS variable) that kept everything in place in moonjit, and it got lost when I got rid of that variable in luajit2. That is also why your hacky patch worked.

@agentzh (Member) commented Oct 27, 2020

Just for the record, I'm using the following command to test on an "old" x86_64 CPU without SSE 4.1/4.2:

exec qemu-x86_64 -cpu qemu64,+ssse3,-sse4.1,-sse4.2 `which luajit-2.1.0-beta3` "$@"

And I can indeed get the following error when trying to run an old LuaJIT built with -msse4.2 unconditionally:

qemu: uncaught target signal 4 (Illegal instruction) - core dumped

And now the new version in this PR works fine.

@agentzh merged commit 34b63ba into openresty:v2.1-agentzh on Oct 27, 2020
@agentzh (Member) commented Oct 27, 2020

Merged. Thanks!

@agentzh mentioned this pull request on Oct 27, 2020
@agentzh (Member) commented Oct 27, 2020

Looking forward to #21 :)
