Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

luajit Infinite loop and cpu 100% #42

Closed
zengjinji opened this issue Feb 26, 2019 · 12 comments
Closed

luajit Infinite loop and cpu 100% #42

zengjinji opened this issue Feb 26, 2019 · 12 comments

Comments

@zengjinji
Copy link

zengjinji commented Feb 26, 2019

Centos6.5, kernel 2.6.32-754.2.1.el6.x86_64, nginx 1.14.2
luajit2-2.1-20190221, lua-resty-core 0.1.16

behavior:
nginx cpu 100% and luajit Fall into the loop

gbd result:

#0  ctype_raw (L=0x416691f8, cd=0x40fef868) at lj_ctype.h:420
#1  lj_ccall_func (L=0x416691f8, cd=0x40fef868) at lj_ccall.c:1148
#2  0x00000000004d1076 in lj_cf_ffi_meta___call (L=0x416691f8) at lib_ffi.c:230
#3  0x00000000004d2fd7 in lj_BC_FUNCC ()
#4  0x000000000049bfc6 in ngx_http_lua_run_thread (L=0x41654378, r=0x1cc8290, ctx=0x1cc8eb8, nrets=0)
    at /tmp/luajit-bug-report/lua-nginx-module-0.10.14/src/ngx_http_lua_util.c:1084
#5  0x000000000049d968 in ngx_http_lua_content_by_chunk (L=0x41654378, r=0x1cc8290) at /tmp/luajit-bug-report/lua-nginx-module-0.10.14/src/ngx_http_lua_contentby.c:122
#6  0x000000000049da4e in ngx_http_lua_content_handler_inline (r=0x1cc8290) at /tmp/luajit-bug-report/lua-nginx-module-0.10.14/src/ngx_http_lua_contentby.c:312
#7  0x000000000049dd14 in ngx_http_lua_content_handler (r=0x1cc8290) at /tmp/luajit-bug-report/lua-nginx-module-0.10.14/src/ngx_http_lua_contentby.c:224
#8  0x0000000000442323 in ngx_http_core_content_phase (r=0x1cc8290, ph=0x1ce02a8) at src/http/ngx_http_core_module.c:1169
#9  0x000000000043ccd3 in ngx_http_core_run_phases (r=0x1cc8290) at src/http/ngx_http_core_module.c:858
#10 0x000000000043cde6 in ngx_http_handler (r=<value optimized out>) at src/http/ngx_http_core_module.c:841
#11 0x0000000000444cdf in ngx_http_process_request (r=0x1cc8290) at src/http/ngx_http_request.c:1952
#12 0x000000000044684c in ngx_http_process_request_headers (rev=<value optimized out>) at src/http/ngx_http_request.c:1379
#13 0x0000000000446cce in ngx_http_process_request_line (rev=0x1ce23a0) at src/http/ngx_http_request.c:1052
#14 0x0000000000446fb2 in ngx_http_keepalive_handler (rev=0x1ce23a0) at src/http/ngx_http_request.c:3238
#15 0x0000000000439cde in ngx_epoll_process_events (cycle=<value optimized out>, timer=<value optimized out>, flags=<value optimized out>)
    at src/event/modules/ngx_epoll_module.c:902
#16 0x00000000004319d7 in ngx_process_events_and_timers (cycle=0x1cc4070) at src/event/ngx_event.c:242
#17 0x000000000043824c in ngx_worker_process_cycle (cycle=0x1cc4070, data=<value optimized out>) at src/os/unix/ngx_process_cycle.c:750
#18 0x0000000000436a38 in ngx_spawn_process (cycle=0x1cc4070, proc=0x4381db <ngx_worker_process_cycle>, data=0x1, name=0x515be8 "worker process", respawn=-3)
    at src/os/unix/ngx_process.c:199
#19 0x00000000004376f8 in ngx_start_worker_processes (cycle=0x1cc4070, n=2, type=-3) at src/os/unix/ngx_process_cycle.c:359
#20 0x000000000043879d in ngx_master_process_cycle (cycle=0x1cc4070) at src/os/unix/ngx_process_cycle.c:131
#21 0x0000000000414c4a in main (argc=<value optimized out>, argv=<value optimized out>) at src/core/nginx.c:382
(gdb) f 0
#0  ctype_raw (L=0x416691f8, cd=0x40fef868) at lj_ctype.h:420
420   while (ctype_isattrib(ct->info)) ct = ctype_child(cts, ct);

LuaJIT:

    make install CCDEBUG=-g XCFLAGS='-msse4.2' PREFIX=/opt/luajit

nginx install:

    cd /tmp
    wget http://nginx.org/download/nginx-1.14.2.tar.gz;tar -zxvf nginx-1.14.2.tar.gz
    wget https://github.com/openresty/lua-nginx-module/archive/v0.10.14.tar.gz;tar -zxvf v0.10.14.tar.gz
    wget https://codeload.github.com/simplresty/ngx_devel_kit/tar.gz/v0.3.1rc1;tar -zxvf v0.3.1rc1
    cd /tmp/nginx-1.14.2/
    ./configure  --prefix=/usr/local/nginx --add-module=/tmp/ngx_devel_kit-0.3.1rc1 --add-module=/tmp/lua-nginx-module-0.10.14
    make install

Steps:

    git clone https://github.com/zx827882285/luajit-bug-report.git
    cp luajit-bug-report/nginx.conf /usr/local/nginx/conf
    cp luajit-bug-report/lua /usr/local/nginx -rf
    /usr/local/nginx/sbin/nginx
    curl  http://localhost/set;./wrk -t4 -c120 -d10  http://localhost/
    top

instructions:
Sorry, it is difficult to occur, if I delete some code or function, it may not occur.
a.lua and b.lua belong to company, I delete some function. until behavior can't be occur when i delete code

@spacewander
Copy link
Member

@zx827882285
Could you use lbt to get the Lua stack?

@zengjinji
Copy link
Author

yes
(gdb) lbt
builtin#176
@/usr/local/nginx/lua/resty/core/shdict.lua:333
@/usr/local/nginx/lua/test.lua:15
=content_by_lua(nginx.conf:37):5

@agentzh
Copy link
Member

agentzh commented Feb 26, 2019

@zx827882285 Will you show us your /usr/local/nginx/lua/test.lua file?

@agentzh
Copy link
Member

agentzh commented Feb 26, 2019

Okay, never mind, I saw it is in the zx827882285/luajit-bug-report github repo.

@agentzh
Copy link
Member

agentzh commented Feb 26, 2019

@zx827882285 I just tried your luajit-bug-report repo and the steps on my side with OpenResty 1.15.8.1 RC0 on Fedora 28 x86_64 and I failed to reproduce any infinite looping no matter how hard I try:

https://gist.github.com/agentzh/85bd2c83d1c7a8badf5851268e695743

The CPU usage of the nginx or openresty worker processes dropped to 0 immediately after the wrk runs finished.

I tried both the GC64 and non-GC64 modes of LuaJIT.

I only applied the following patch to your nginx.conf so that I don't need to run the nginx server using root nor need I pollute my system's /usr/local/ directory:

diff --git a/nginx.conf b/nginx.conf
index 60214ef..cda2f53 100644
--- a/nginx.conf
+++ b/nginx.conf
@@ -8,8 +8,8 @@ events {

 http {

-    lua_package_path  "/usr/local/nginx/clua/?.lua;/usr/local/nginx/lua/?.lua;/usr/local/nginx/api/?.lua;;";
-    lua_package_cpath  "/usr/local/nginx/lua/?.so;;";
+    lua_package_path  "$prefix/clua/?.lua;$prefix/lua/?.lua;$prefix/api/?.lua;;";
+    lua_package_cpath  "$prefix/lua/?.so;;";

     lua_shared_dict a_shm 1M;
     lua_shared_dict b_shm 1M;
@@ -21,7 +21,7 @@ http {

     server {
         server_name localhost;
-        listen 80;
+        listen 8083;

         location /set {
             content_by_lua_block {

You should do the same when preparing a minimal example for others to run. BTW, you know about the -p PREFIX option of the nginx command line so that you can specify your own prefix. See this tutorial for a full example: https://openresty.org/en/getting-started.html

BTW, will you try the following OpenResty 1.15.8.1 RC0 version? To see if you can still reproduce it. It's much easier for us to reproduce things when using OpenResty. And you can also save a lot of setup commands to explain as in this issue.

https://openresty.org/download/openresty-1.15.8.1rc0.tar.gz

Thanks!

@agentzh
Copy link
Member

agentzh commented Feb 26, 2019

@zx827882285 BTW, you should try fetching both the C and Lua backtraces at different times for the same nginx worker process that spins at 100% CPU usage forever.

So does your nginx worker processes spin at 100% CPU usage forever? Even after all wrk processes and any other traffic generators quit?

The Lua and C backtraces do not look like inside any C or Lua loops at all, which is weird.

@zengjinji
Copy link
Author

zengjinji commented Feb 27, 2019

@agentzh
yes, i start two nginx worker, after wrk processes quit, only one nginx worker process spins at 100% CPU usage forever.
i try the following OpenResty 1.15.8.1 RC0 version, but can't reproduce it

In the morning, I install fedroa28 on VMware Workstation and reproduce it;
fedroa28 iso: http://mirrors.163.com/fedora/releases/28/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-28-1.1.iso

For convenience to reproduce it, i commit install.sh
could you try this in you fedroa system, see can reproduce it.
This needn't to run the nginx server using root and needn't pollute your system's /usr/local/ directory

$ cd /tmp
$ git clone https://github.com/zx827882285/luajit-bug-report.git
$ cd luajit-bug-report
$ sh install.sh
$ export LD_LIBRARY_PATH=/tmp/luajit-bug-report/luajit/lib
$ ./nginx/sbin/nginx -p . -c nginx.conf

$ cd /path/wrk;
$ curl  http://localhost:8083/set;./wrk -t4 -c120 -d10  http://localhost:8083/
$ top

thanks!

@agentzh
Copy link
Member

agentzh commented Feb 27, 2019

OK, I can reproduce it using your install.sh script. BTW, it's considered bad practice to provide binary executable files like cjson.so in your minimal example.

You cannot reproduce it with OpenResty 1.15.8.1 RC0 just because it enables the GC64 mode in LuaJIT on x86_64 by default. If you enable that in your install.sh, the problem will also disappear.

If you enable internal assertions in your LuaJIT build, you will see that it hits an internal assertion inside LuaJIT first, before entering the infinite loop:

nginx: lj_obj.h:926: copyTV: Assertion `!((((o1)->it) - ((~4u)+1)) > ((~13u) - ((~4u)+1))) || ((~((o1)->it) == (((GCobj *)(uintptr_t)((o1)->gcr).gcptr32))->gch.gct) && !(((((GCobj *)(uintptr_t)((o1)->gcr).gcptr32)))->gch.marked & ((((global_State *)(void *)(uintptr_t)(L->glref).ptr32))->gc.currentwhite ^ (0x01 | 0x02)) & (0x01 | 0x02)))' failed.

Using openresty-valgrind (with no-pool patch for nginx and the system allocator for luajit) and valgrind fails to catch any memory problems.

Disabling the JIT compiler in LuaJIT does not make the problem go away, so it should not be related to the JIT compiler in at all.

Enforcing full GC cycle upon every Lua code line also makes the issue disappear.

Replacing C.free() calls with ffi.C.free() calls in the resty.core.shdict also make the issue go away.

These are my initial experiments and findings. Need to dig deeper to find out why the GC state of the C.free TValue is incorrect.

@zengjinji
Copy link
Author

sorry for provide binary executable files in my minimal example.
it confused me that it can't reproduce when change nginx.conf, like it.

diff --git a/nginx.conf b/nginx.conf
index cda2f53..f99e936 100644
--- a/nginx.conf
+++ b/nginx.conf
@@ -30,7 +30,7 @@ http {
         }
         location / {
             content_by_lua_block {
-                require("a");
+                --require("a");
                 require("b");
                 require("resty.lock");
                 require("test"):shm_get();

Although I think there is no infulence.

thie problem may related to another problem, segfault in luajit.
i can't reproduce it, so i post it to openresty geogle group.
https://groups.google.com/forum/#!topic/openresty/SKnI8kd_rdY

thank you for your experiments and findings, it was helpful!
may be i need try enables the GC64 mode in LuaJIT, to see can it reproduce in production environment.

@agentzh
Copy link
Member

agentzh commented Feb 28, 2019

@zx827882285 I think I've already fixed it. It's a bug in LuaJIT's FFI library when handling GC. Please try the following LuaJIT patch on your side:

diff --git a/src/lj_clib.c b/src/lj_clib.c
index f016b06b96..a867205247 100644
--- a/src/lj_clib.c
+++ b/src/lj_clib.c
@@ -384,6 +384,7 @@ TValue *lj_clib_index(lua_State *L, CLibrary *cl, GCstr *name)
       cd = lj_cdata_new(cts, id, CTSIZE_PTR);
       *(void **)cdataptr(cd) = p;
       setcdataV(L, tv, cd);
+      lj_gc_anybarriert(L, cl->cache);
     }
   }
   return tv;

@zengjinji
Copy link
Author

@agentzh
thanks!
i will take this patch and try it in production environment.
it may need a few day to watch.

@agentzh
Copy link
Member

agentzh commented Feb 28, 2019

@zx827882285 BTW, this is a use-after-free kind of bug so it may lead to any weird things from infinite loop to core dumps. And that's why we hit the internal LuaJIT assertion failure before entering the infinite loop. The infinite loop happens because the ctype ID becomes zero, which is not possible for normal ctype objects at all in the first place.

Thanks a lot for that minimal and standalone example, otherwise debugging this would be much much more difficult (if possible at all).

This bug also affects GC64 builds and any other builds as far as I can see. The issue was not reproducible using your test case just because the GC progresses quite differently in the GC64 mode (for one thing, Lua tables in GC64 builds use table-level free list for recycling nodes while the X64 mode uses bucket-level free lists, thus it is less demanding on GC in GC64 mode). But it's still possible to hit at least in theory.

Small changes in your nginx.conf or Lua code may have big impact on reproducing the issue. This is because that we need perfect timing for just the right GC activities to trigger this bug. And incremental GCs are notoriously known for their inherent nondeterminism.

In a nutshell, what happens here in your test case is like this:

  1. resty/core/shdict.lua:142: this Lua line creates a nil entry for the key "free" in the Lua table in the ffi.C (on the C level as cl->cache) userdata: if not pcall(function () return C.free end) then
  2. the tri-color mark-and-sweek GC progresses and its mark phase marks ffi.C's cache table as black (or "live").
  3. resty/core/shdict.lua:333: this Lua source line invokes C.free() which looks up the free key in C's cache table via the metamethod. It loads the free function cdata into the cache table under C. The cache table is already black while the new entry reuses the one created in step 1 but the value object is a white object. This breaks the tri-color GC invariant: "no black objects can hold references to white ones". So shortly after, once a GC cycle is completed, the "free" value in the cache table is collected prematurely (because it is "white" and thus considered "dead") even it is still to be used.
  4. The "free" object looked up from the table on step 3 gets used and leads to memory corruptions.

The key to reproduce this is to have a completed GC cycle in the middle of step 3 and step 4 above. The time window is very small and we have to be lucky enough to hit this. That's why you need wrk and special layout of the test case.

To fix this problem, we just need to preserve the tri-color GC's invariant in step 3 by adding a write barrier there and turning the cache table under ffi.C to gray. That's essentially what my patch above does via the lj_gc_anybarrier() function call.

Just for the record, below is my debugging session using Mozilla rr and gdb python tools generated by our ylang compiler:

https://gist.github.com/agentzh/534aabb3a5bc75ff62b8fd25e3d371e0

For more background info on LuaJIT's GC algorithm, we can look at Mike Pall's explanation here:

http://wiki.luajit.org/New-Garbage-Collector#gc-algorithms_tri-color-incremental-mark-sweep

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants