
Worker Segmentation Fault (v0.44.0) #6896

Closed
ReillyTevera opened this issue Feb 22, 2021 · 118 comments · Fixed by #7709
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.

@ReillyTevera

NGINX Ingress controller version:
0.44.0

Kubernetes version (use kubectl version):
1.18.3
Environment:

  • AWS
  • Debian 10.8
  • Linux 5.4.96
  • Weave-net 2.6.5

What happened:
We encountered a major production outage a few days ago that was traced back to ingress-nginx. ingress-nginx pod logs were filled with messages like the following:

2021/02/22 22:30:18 [alert] 27#27: worker process 38 exited on signal 11 (core dumped)

We discovered that the following message was being printed to the system log as well (timed with the worker exits):

Feb 22 22:30:18 ip-10-153-47-170 kernel: traps: nginx[24701] general protection fault ip:7f03278adc59 sp:7ffcdba0aa10 error:0 in ld-musl-x86_64.so.1[7f032789f000+48000]

I ultimately identified that the crashes were linked to the version of ingress-nginx we were using, and reverted production to 0.43.0 until we could identify the underlying issue.

We have a few other lower-load ingress-nginx deployments that have remained on 0.44.0, and we have observed apparently random worker crashes there as well; however, there are always enough running workers, and the crashes are infrequent enough, that things remain stable.

I was able to get a worker coredump from one of those infrequent crashes and the backtrace is as follows:

Core was generated by `nginx: worker process is shutting down              '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  a_crash () at ./arch/x86_64/atomic_arch.h:108
108	./arch/x86_64/atomic_arch.h: No such file or directory.
[Current thread is 1 (LWP 38)]
(gdb) backtrace
#0  a_crash () at ./arch/x86_64/atomic_arch.h:108
#1  get_nominal_size (end=0x7f03273ee90c "", p=0x7f03273ec4d0 "") at src/malloc/mallocng/meta.h:169
#2  __libc_free (p=0x7f03273ec4d0) at src/malloc/mallocng/free.c:110
#3  0x00007f0327811f7b in lj_vm_ffi_call () from /usr/local/lib/libluajit-5.1.so.2
#4  0x00007f0327858077 in lj_ccall_func (L=<optimized out>, cd=<optimized out>) at lj_ccall.c:1382
#5  0x00007f032786e38d in lj_cf_ffi_meta___call (L=0x7f032368ee58) at lib_ffi.c:230
#6  0x00007f032780fb45 in lj_BC_FUNCC () from /usr/local/lib/libluajit-5.1.so.2
#7  0x000056357cdf639a in ngx_http_lua_run_thread (L=L@entry=0x7f03236a5380, r=r@entry=0x7f0327403030, ctx=ctx@entry=0x7f0323459bb0, nrets=<optimized out>, nrets@entry=1)
    at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_util.c:1167
#8  0x000056357ce1530a in ngx_http_lua_timer_handler (ev=<optimized out>) at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_timer.c:650
#9  0x000056357ce13b03 in ngx_http_lua_abort_pending_timers (ev=0x7f032740d790) at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_timer.c:899
#10 0x000056357cd1de6d in ngx_close_idle_connections (cycle=cycle@entry=0x7f0327407570) at src/core/ngx_connection.c:1352
#11 0x000056357cd3981c in ngx_worker_process_cycle (cycle=0x7f0327407570, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:791
#12 0x000056357cd376f1 in ngx_spawn_process (cycle=cycle@entry=0x7f0327407570, proc=proc@entry=0x56357cd396af <ngx_worker_process_cycle>, data=data@entry=0x0, name=name@entry=0x56357ce5ef4f "worker process", respawn=respawn@entry=-4)
    at src/os/unix/ngx_process.c:199
#13 0x000056357cd38310 in ngx_start_worker_processes (cycle=cycle@entry=0x7f0327407570, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:378
#14 0x000056357cd3a359 in ngx_master_process_cycle (cycle=0x7f0327407570, cycle@entry=0x7f032740c210) at src/os/unix/ngx_process_cycle.c:234
#15 0x000056357cd0cad9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

One of the major differences between 0.43.0 and 0.44.0 is the update to Alpine 3.13. Perhaps the version of musl in use is the issue, in which case it would be appropriate to revert that change until Alpine has released a fixed version?
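For anyone who wants to capture a similar backtrace, here is a rough sketch of the steps (assumptions: core dumps are written somewhere readable inside the container, which depends on the node's kernel.core_pattern, and the pod name and core path below are placeholders):

# shell into the running controller pod
$ kubectl exec -n ingress-nginx -it <controller-pod> -- sh

# install gdb and the musl debug symbols so the libc frames resolve
$ apk add gdb musl-dbg

# open the dump against the nginx binary and print the backtrace
$ gdb /sbin/nginx /tmp/core.<worker-pid>
(gdb) backtrace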

/kind bug

@ReillyTevera ReillyTevera added the kind/bug Categorizes issue or PR as related to a bug. label Feb 22, 2021
@yuha0
Contributor

yuha0 commented Mar 1, 2021

Not sure if it was caused by the base image upgrade (Alpine 3.13), but we have seen the same issue in our environment after upgrading to 0.44.0. It happened four times in two days before we downgraded to 0.43.0, and we haven't seen a single instance since the downgrade.

All four incidents started with an nginx config reload.

First we saw very high CPU loads on nodes hosting Nginx pods. Those nodes have 128 cores and the loads got up to 1k-2k:
[CPU load graph]

According to the nginx logs, every one of those load spikes was followed by a config reload:

I0225 17:59:31.825591       6 controller.go:146] "Configuration changes detected, backend reload required"
I0225 17:59:32.429276       6 controller.go:163] "Backend successfully reloaded"
I0225 17:59:32.429503       6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-68974744d7-7v54w", UID:"668e9e5b-37e7-4493-ba37-51b10522c7b0", APIVersion:"v1", ResourceVersion:"1342977486", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
2021/02/25 17:59:36 [alert] 63#63: worker process 35497 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34639 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34938 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34249 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 35369 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34430 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 35181 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34138 exited on signal 11 (core dumped)

At this point, readiness/liveness probes became hit or miss. If the pod was lucky enough to hit the liveness failure threshold, the kubelet would restart it and the system load would recover back to normal. Otherwise, it could affect other pods on the same node, or even knock down the whole node.
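If you want to check whether your own crashes line up with config reloads, a quick log filter is enough (the pod name below is a placeholder):

$ kubectl logs -n ingress-nginx <controller-pod> \
    | grep -E 'Backend successfully reloaded|exited on signal 11'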

@Stono
Contributor

Stono commented Mar 2, 2021

FWIW, whilst we haven't seen segfaults, we have noted quite a CPU jump from 0.42 -> 0.44:

You can see it here with our deployment at 11:10:
[Screenshot: 2021-03-02 at 12:52:19]

Interestingly, our lower-traffic qualifying environments observed no increase in CPU, so this only seems to be an "under load" thing.

Will switch to 0.43 and report back.

@Stono
Contributor

Stono commented Mar 2, 2021

Yup can confirm 0.43 is back to expected levels.

[Screenshot: 2021-03-02 at 13:10:09]

@alfianabdi

alfianabdi commented Mar 15, 2021

I also have a similar issue. Recently several nodes got NodePressure status and pods were getting Evicted, because the ingress controller generates so many core dumps. So many that the total is almost 200GB.

Using gdb, here is what I got:

bash-5.1# gdb /sbin/nginx core.904
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/nginx...

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file /[aio] (deleted) during file-backed mapping note processing
[New LWP 904]
[New LWP 913]
[New LWP 936]
[New LWP 907]
[New LWP 963]
[New LWP 915]
[New LWP 947]
[New LWP 934]
[New LWP 917]
[New LWP 941]
[New LWP 928]
[New LWP 951]
[New LWP 932]
[New LWP 953]
[New LWP 939]
[New LWP 955]
[New LWP 943]
[New LWP 909]
[New LWP 945]
[New LWP 961]
[New LWP 949]
[New LWP 957]
[New LWP 938]
[New LWP 959]
[New LWP 964]
[New LWP 930]
[New LWP 926]
[New LWP 925]
[New LWP 919]
[New LWP 921]
[New LWP 923]
[New LWP 905]
[New LWP 911]
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `nginx: master process /usr/local/nginx/sbin/nginx -c'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fa2ed778c59 in ?? () from /lib/ld-musl-x86_64.so.1
[Current thread is 1 (LWP 904)]
(gdb) backtrace
#0  0x00007fa2ed778c59 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00000000000000a0 in ?? ()
#2  0x00007ffd3bb46230 in ?? ()
#3  0x00007ffd3bb461c0 in ?? ()
#4  0x0000000000000000 in ?? ()

After cleaning up, it takes only a few hours until the node gets DiskPressure again.
Could you help? How can I troubleshoot further to find the cause?

@ReillyTevera
Author

@alfianabdi Well, first of all the ?? in your backtrace indicates that you don't have all of the debugging symbols installed.

Run the following to install the musl debug symbols prior to running that gdb command:

apk add musl-dbg
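With the symbols installed, the libc frames should resolve to real function names (for example __libc_free) instead of ??. A minimal sketch of the full session, reusing the core file name from your output:

$ apk add musl-dbg
$ gdb /sbin/nginx core.904
(gdb) backtrace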

@alfianabdi

@ReillyTevera

Thanks, I finally got it; somehow it does not work on the arm64 node.
Here is what I got from the backtrace:

#0  a_crash () at ./arch/x86_64/atomic_arch.h:108
#1  get_nominal_size (end=0x7f2d94edb90c "", p=0x7f2d94eda790 "") at src/malloc/mallocng/meta.h:169
#2  __libc_free (p=0x7f2d94eda790) at src/malloc/mallocng/free.c:110
#3  0x00007f2d991acf7b in lj_vm_ffi_call () from /usr/local/lib/libluajit-5.1.so.2
#4  0x00007f2d991f3077 in lj_ccall_func (L=<optimized out>, cd=<optimized out>) at lj_ccall.c:1382
#5  0x00007f2d9920938d in lj_cf_ffi_meta___call (L=0x7f2d951c9380) at lib_ffi.c:230
#6  0x00007f2d991aab45 in lj_BC_FUNCC () from /usr/local/lib/libluajit-5.1.so.2
#7  0x00007f2d991bd8ff in lua_pcall (L=L@entry=0x7f2d951c9380, nargs=nargs@entry=0, nresults=nresults@entry=0, errfunc=errfunc@entry=10) at lj_api.c:1140
#8  0x00005587e51718aa in ngx_http_lua_do_call (log=log@entry=0x7f2d98da2568, L=L@entry=0x7f2d951c9380)
    at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_util.c:4233
#9  0x00005587e51888ce in ngx_http_lua_init_worker_by_inline (log=0x7f2d98da2568, lmcf=<optimized out>, L=0x7f2d951c9380)
    at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_initworkerby.c:323
#10 0x00005587e5188786 in ngx_http_lua_init_worker (cycle=0x7f2d98da2550)
    at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_initworkerby.c:296
#11 0x00005587e50b20ab in ngx_worker_process_init (cycle=cycle@entry=0x7f2d98da2550, worker=<optimized out>) at src/os/unix/ngx_process_cycle.c:955
#12 0x00005587e50b26d0 in ngx_worker_process_cycle (cycle=0x7f2d98da2550, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:759
#13 0x00005587e50b06f1 in ngx_spawn_process (cycle=cycle@entry=0x7f2d98da2550, proc=0x5587e50b26af <ngx_worker_process_cycle>, data=0x4,
    name=0x5587e51d7f4f "worker process", respawn=respawn@entry=5) at src/os/unix/ngx_process.c:199
#14 0x00005587e50b1885 in ngx_reap_children (cycle=cycle@entry=0x7f2d98da2550) at src/os/unix/ngx_process_cycle.c:641
#15 0x00005587e50b34fe in ngx_master_process_cycle (cycle=0x7f2d98da2550, cycle@entry=0x7f2d98da7210) at src/os/unix/ngx_process_cycle.c:174
#16 0x00005587e5085ad9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

@ReillyTevera
Author

@alfianabdi That's the same stacktrace I posted in my original comment and unfortunately it doesn't give us (or the ingress-nginx maintainers) any new information to help debug the issue. It does confirm that you are experiencing the same issue as I was at least (and presumably the same as the rest of the people in this issue even though they haven't posted backtraces to confirm).

I just checked the latest changelogs for Alpine 3.13.x, musl, and the newest nginx version and nothing in them looks like it could be helpful. I would not expect this to be resolved with an upcoming ingress-nginx image (unless the issue was caused by something transient in the build).

@ReillyTevera
Author

Pinging the following (mentioned in the owners file) for visibility.

@ghouscht

Same issue here with 0.44.0. More heavily loaded clusters are affected more often. Is this resolved with 0.45.0 (I don't think so, according to the changelog)?

@Stono
Contributor

Stono commented Apr 16, 2021

Same issue here with 0.44.0. More heavily loaded clusters are affected more often. Is this resolved with 0.45.0 (I don't think so, according to the changelog)?

Yeah, we're quite loaded; each instance is doing around 400 ops/sec. We never saw the segfault, but we observed almost double the CPU load for the same ops until we rolled back.

@yuha0
Contributor

yuha0 commented Apr 16, 2021

Maybe slightly off topic, as I don't know if the CPU spikes we saw were 100% caused by something in Alpine, but would it be worth providing a Debian-based image as an option? From what I gathered from #6527, the motivations were:

  1. Smaller image size
  2. Minimal CVE exposures

But would the same goals still be achievable with a trimmed-down version of Debian, like distroless?

Although I still love Alpine for a lot of things, I have also moved away from it for many projects due to some well-known issues, like the performance hit (mainly because of musl libc, I think) and networking/DNS problems (e.g. https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues).

@ReillyBrogan

I haven't tried 0.45.0, but I just compared its image with the 0.44.0 one, and the musl library, the nginx binary, and the libluajit library all hash to the same values as the 0.44.0 ones. I would be very surprised if the issue were resolved, given that it is most likely in one of those.

Unfortunately, though, I don't think I can be of much more assistance in helping debug this issue. We ended up switching to Traefik for our ingress controller because of this issue (and also because Traefik doesn't close TCP connections when it reloads its config). We no longer have any ingress-nginx deployments running at all and have no plans to switch back even if this issue is fixed.

@longwuyuan
Contributor

Can the coredump be reproduced at will? (For example, by sending traffic to the controller on a kinD or minikube cluster on a laptop.)

@LuckySB

LuckySB commented Apr 21, 2021

Follow-up: got a coredump with both v0.44.0 and v0.45.0.
centos 7
docker 19.03
kernel 4.20.13 from elrepo

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v0.45.0
  Build:         7365e9eeb2f4961ef94e4ce5eb2b6e1bdb55ce5c
  Repository:    https://github.com/kubernetes/ingress-nginx
I0420 15:56:09.765547       6 flags.go:208] "Watching for Ingress" class="nginx"
W0420 15:56:09.765622       6 flags.go:213] Ingresses with an empty class will also be processed by this Ingress controller
  nginx version: nginx/1.19.6
 
-------------------------------------------------------------------------------
I0420 15:56:15.300180       6 controller.go:163] "Backend successfully reloaded"
I0420 15:56:15.300272       6 controller.go:174] "Initial sync, sleeping for 1 second"
I0420 15:56:15.300626       6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-2g4fc", UID:"80460d24-5a6c-4524-8c21-ab08140e6efb", APIVersion:"v1", ResourceVersion:"19639245", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
2021/04/20 15:56:17 [alert] 26#26: worker process 39 exited on signal 11 (core dumped)
2021/04/20 15:56:17 [alert] 26#26: worker process 40 exited on signal 11 (core dumped)
2021/04/20 15:56:17 [alert] 26#26: worker process 38 exited on signal 11 (core dumped)
2021/04/20 15:56:18 [alert] 26#26: worker process 139 exited on signal 11 (core dumped)

and so on...

@ReillyBrogan

@longwuyuan I don't know that anyone has reproduced this in kinD or minikube. If this issue is happening for someone, though, it is fairly consistent. We have multiple k8s clusters that are fairly identical, and the issue was present in all of them (the segfaults just happen at a reduced rate on clusters that don't process as much traffic).

@LuckySB Can I ask why you're using that kernel? 4.20 has been end-of-life since March 2019; it is very insecure to be using it now. I see that 5.4 is in elrepo; if I were you, I'd just use that, as it's an LTS kernel and is supported until Dec 2025.

@longwuyuan
Contributor

longwuyuan commented Apr 22, 2021

Any chance of anyone posting a fairly reasonable step-by-step process to reproduce this problem? I'm particularly hoping for the OS, networking, and similar information and specifications to be nailed down for reproducing, because I am not able to.

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Apr 22, 2021
@longwuyuan
Contributor

/triage needs-information
/triage not-reproducible

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. triage/not-reproducible Indicates an issue can not be reproduced as described. labels Apr 22, 2021
@ReillyBrogan

@longwuyuan Multiple people have provided stacktraces, and additionally have full nginx worker coredumps that they can provide to the ingress-nginx core developers (obviously they are sensitive files). I suppose I'm curious as to why that is not sufficient?

@bowei
Member

bowei commented Apr 22, 2021

Hi @ReillyBrogan, sorry to hear that you are no longer using ingress-nginx.

There is a topic in the upcoming sig-network meeting on figuring out how this project can get the appropriate amount of bandwidth from the community.

@longwuyuan
Contributor

@ReillyBrogan I am not able to reproduce. The available info hints at a combination of the node's kernel version and a certain volume of traffic, so I am guessing available CPU/memory, etc.

@ghouscht

I think the issue is related to the amount of traffic combined with frequent nginx reloads, not necessarily to the available CPU/memory.

The cluster where we saw the issue very frequently (~50 restarts in 3 days across all 4 ingress pods) was a dev cluster with very frequent config changes and quite a lot of traffic through the ingress controller. Our kubelets in this cluster (55 in total) all have 24 CPUs with 64GB of memory and an average usage of around 60% (CPU/memory). The OS is RHEL 7.9 with kernel 5.10.15-1.el7.elrepo.x86_64.

@longwuyuan
Contributor

Do you have metrics for inodes, file handles, conntrack, and similar resources on the node where the pod was running at the time of the segfault?

@Stono
Contributor

Stono commented Apr 27, 2021

Hey; as the segfaults are relatively infrequent and difficult to reproduce, shouldn't we be working with data that's more readily accessible? As I demonstrated above, we're able to observe roughly double the CPU usage between 0.43 and 0.44. It's not a huge leap to say that whatever is causing that additional load is only going to exacerbate config reloads (already a high-CPU event).

The CPU increase should be relatively trivial to reproduce. In the above example, we're running 6 pods with 500m CPU (no limits), with each pod doing around 250-300 ops/sec.
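A minimal sketch of driving comparable load against a test deployment; hey is just one option, and the hostname is a placeholder (50 workers at 6 queries/sec each gives roughly 300 req/s for 10 minutes):

$ hey -z 10m -c 50 -q 6 https://test.example.com/

# watch controller CPU while the load runs
$ kubectl top pods -n ingress-nginx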

@ramijebara

I can confirm we saw the same issue on AKS with version 0.45.0. Issue went away when we downgraded to 0.43.0.

@merlindorin

merlindorin commented May 27, 2021

I can confirm we saw the same issue on GCP/GKE with version 0.45.0. Issue also went away with 0.43.0.

From our Compute engine, we also found that:

[39968.569424] traps: nginx[1006038] general protection fault ip:7f6a54085c59 sp:7ffc2d3b6230 error:0 in ld-musl-x86_64.so.1[7f6a54077000+48000]

On this cluster, we have a lot of Ingress objects (~200). We didn't see this issue on a similar cluster with a quite similar Ingress volume.

@shiraz-zscaler

Just learned about this issue the hard way! I confirm the issue is present on 0.46.0 as well. Planning a downgrade till the issue is fixed.

@rikatz
Contributor

rikatz commented Sep 27, 2021

We have released v0.49.2 and v1.0.2 with the fix.

Can you all please test and give us some feedback?

Thank you so much, especially @doujiang24 and @tao12345666333, for the help!

@sepich
Contributor

sepich commented Sep 27, 2021

Thank you for the prompt fix! I've tested k8s.gcr.io/ingress-nginx/controller:v0.49.2@sha256:84e351228337bb7b09f0e90e6b6f5e2f8f4cf7d618c1ddc1e966f23902d273db and unfortunately worker process 71 exited on signal 11 continues to happen, but now about half as often.
Also, it does not seem to be Lua-related now:

Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_meta (p=p@entry=0x7f5bfc59b550 "") at src/malloc/mallocng/meta.h:141
141	src/malloc/mallocng/meta.h: No such file or directory.
[Current thread is 1 (LWP 71)]
(gdb) backtrace
#0  get_meta (p=p@entry=0x7f5bfc59b550 "") at src/malloc/mallocng/meta.h:141
#1  0x00007f5bfdf36c2f in __libc_free (p=0x7f5bfc59b550) at src/malloc/mallocng/free.c:105
#2  0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#3  0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#4  0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
#5  0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1
#6  0x000055a5517ca678 in ngx_ssl_handshake (c=c@entry=0x7f5bfd670728) at src/event/ngx_event_openssl.c:1720
#7  0x000055a5517caa91 in ngx_ssl_handshake_handler (ev=0x7f5bfd4d7e90) at src/event/ngx_event_openssl.c:2091
#8  0x000055a5517c5103 in ngx_epoll_process_events (cycle=0x7f5bfbf927d0, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#9  0x000055a5517b8140 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f5bfbf927d0) at src/event/ngx_event.c:257
#10 0x000055a5517c27c8 in ngx_worker_process_cycle (cycle=0x7f5bfbf927d0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#11 0x000055a5517c0850 in ngx_spawn_process (cycle=cycle@entry=0x7f5bfbf927d0, proc=proc@entry=0x55a5517c26bf <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0x55a5518e7f97 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#12 0x000055a5517c149e in ngx_start_worker_processes (cycle=cycle@entry=0x7f5bfbf927d0, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#13 0x000055a5517c3369 in ngx_master_process_cycle (cycle=0x7f5bfbf927d0, cycle@entry=0x7f5bfda95200) at src/os/unix/ngx_process_cycle.c:234
#14 0x000055a551795af9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

This is without libmimalloc.so

@rikatz
Contributor

rikatz commented Sep 27, 2021

Hi @sepich , question:

Are you using TLS/SSL for mutual auth, or just for a secure endpoint?

What's your current load, and how many Ingress objects do you have?

Thanks!

@sepich
Contributor

sepich commented Sep 27, 2021

We do not verify client certs, so it is only TLS for server certs (provided by cert-manager).
It is just 215 Ingress objects / 80 rps, 3 ingress-nginx pods with 5% CPU load each.
I've re-checked with LD_PRELOAD=/usr/local/lib/libmimalloc.so; it still dies and the backtrace is the same:
#1 0x00007fdca3720c2f in __libc_free (p=0x7fdca187c9e0) at src/malloc/mallocng/free.c:105
Do you have any suggestions on what else to try? Is there a branch with nginx v1.19.9 (and patches) on Debian?
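For reference, one way to repeat the libmimalloc test on a stock install is to set the variable on the controller container (the deployment name and namespace below assume a default install):

$ kubectl set env deployment/ingress-nginx-controller -n ingress-nginx \
    LD_PRELOAD=/usr/local/lib/libmimalloc.so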

@doujiang24

@sepich
It is likely to be another bug, maybe on the OpenSSL side.
Is it possible to send me a core file? This one seems harder to debug than the previous one, so a core file would be very much appreciated. Thanks!

Also, has anyone else who got the segfault tried the new release? Feedback welcome, thanks!

@ghouscht

@sepich
It is likely to be another bug, maybe on the OpenSSL side.
Is it possible to send me a core file? This one seems harder to debug than the previous one, so a core file would be very much appreciated. Thanks!

Also, has anyone else who got the segfault tried the new release? Feedback welcome, thanks!

Running 4 replicas of v0.49.2 for 24h so far without a crash! There is also one replica of an old v0.44.0 image running in the same cluster for 24h, which so far has also had no crash. The strange thing is that back in April we had something like a dozen crashes per day with v0.44.X and v0.45.X, and currently it seems to be not reproducible.

The cluster is quite large, with ~1400 Ingress objects and around 200 rps. From an infrastructure point of view the only significant change was the switch from RHEL 7 (with Docker) to Debian 10 (with containerd), but I don't think that is related to the problem disappearing. I think it's probably related to some user who previously used a special Ingress configuration (e.g. some special annotations), but that's also just a guess...

@doujiang24

I think it's probably related to some user who previously used a special Ingress configuration

I also think it's reasonable. Thanks for your feedback!

@rikatz rikatz modified the milestones: v1.0.2, v1.0.3 Sep 28, 2021
@doujiang24

Hi @sepich , I need some help from you, thanks!

  1. There is some key information missing in the backtrace; I guess the debug symbols for OpenSSL are not working properly.
# key information missing in the backtrace: the function arguments and two function names are missing.
#2  0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#3  0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#4  0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
#5  0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1

Could you please confirm the openssl-dbg package is installed and is loaded properly by gdb?
Please check whether there were messages like the following. If there were, it means gdb hasn't loaded the OpenSSL debug symbols properly:

Reading symbols from /lib/libssl.so.1.1...
(No debugging symbols found in /lib/libssl.so.1.1)
Reading symbols from /lib/libcrypto.so.1.1...
(No debugging symbols found in /lib/libcrypto.so.1.1)
  2. Please follow these steps to make gdb load openssl-dbg properly:
# install openssl-dbg
$ apk add openssl-dbg

# load symbol files manually when gdb still not load it automatically
(gdb) add-symbol-file /usr/lib/debug/usr/lib/libcrypto.so.1.1.debug
(gdb) add-symbol-file /usr/lib/debug/usr/lib/libssl.so.1.1.debug

# bt and bt full
(gdb) bt
(gdb) bt full
  3. If the key information is still missing even when gdb has loaded the debug symbols properly, we have to get it manually, with steps like the following:
# change to the frame 2
(gdb) frame 2

# get the second argument for OPENSSL_LH_doall_arg
(gdb) p/x $r13
(gdb) info symbol $r13

# get the first argument for OPENSSL_LH_doall_arg
(gdb) p/x $r12

Again, a core file would be very much appreciated; it's painful to debug such a bug without one. Thanks very much!

@sepich
Contributor

sepich commented Sep 29, 2021

Sorry for the delay; I was discussing sharing the coredump internally and unfortunately I am not allowed to. We can arrange a call with an interactive gdb session if you wish.
Thank you for the detailed instructions; here is the full output (this is v0.49.2):

bash-5.1# gdb /sbin/nginx /tmp/core
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/nginx...

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file /[aio] (deleted) during file-backed mapping note processing
[New LWP 71]
[New LWP 77]
[New LWP 78]
[New LWP 79]
[New LWP 80]
[New LWP 81]
[New LWP 82]
[New LWP 83]
[New LWP 84]
[New LWP 76]
[New LWP 85]
[New LWP 75]
[New LWP 86]
[New LWP 74]
[New LWP 87]
[New LWP 88]
[New LWP 94]
[New LWP 93]
[New LWP 95]
[New LWP 92]
[New LWP 96]
[New LWP 91]
[New LWP 97]
[New LWP 90]
[New LWP 98]
[New LWP 89]
[New LWP 99]
[New LWP 104]
[New LWP 103]
[New LWP 105]
[New LWP 102]
[New LWP 101]
[New LWP 100]
Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_meta (p=p@entry=0x7f5bfc59b550 "") at src/malloc/mallocng/meta.h:141
141	src/malloc/mallocng/meta.h: No such file or directory.
[Current thread is 1 (LWP 71)]
(gdb) add-symbol-file /usr/lib/debug/usr/lib/libcrypto.so.1.1.debug
add symbol table from file "/usr/lib/debug/usr/lib/libcrypto.so.1.1.debug"
(y or n) y
Reading symbols from /usr/lib/debug/usr/lib/libcrypto.so.1.1.debug...
(gdb) add-symbol-file /usr/lib/debug/usr/lib/libssl.so.1.1.debug
add symbol table from file "/usr/lib/debug/usr/lib/libssl.so.1.1.debug"
(y or n) y
Reading symbols from /usr/lib/debug/usr/lib/libssl.so.1.1.debug...
(gdb) bt
#0  get_meta (p=p@entry=0x7f5bfc59b550 "") at src/malloc/mallocng/meta.h:141
#1  0x00007f5bfdf36c2f in __libc_free (p=0x7f5bfc59b550) at src/malloc/mallocng/free.c:105
#2  0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#3  0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#4  0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
#5  0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1
#6  0x000055a5517ca678 in ngx_ssl_handshake (c=c@entry=0x7f5bfd670728) at src/event/ngx_event_openssl.c:1720
#7  0x000055a5517caa91 in ngx_ssl_handshake_handler (ev=0x7f5bfd4d7e90) at src/event/ngx_event_openssl.c:2091
#8  0x000055a5517c5103 in ngx_epoll_process_events (cycle=0x7f5bfbf927d0, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#9  0x000055a5517b8140 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f5bfbf927d0) at src/event/ngx_event.c:257
#10 0x000055a5517c27c8 in ngx_worker_process_cycle (cycle=0x7f5bfbf927d0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#11 0x000055a5517c0850 in ngx_spawn_process (cycle=cycle@entry=0x7f5bfbf927d0, proc=proc@entry=0x55a5517c26bf <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0x55a5518e7f97 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#12 0x000055a5517c149e in ngx_start_worker_processes (cycle=cycle@entry=0x7f5bfbf927d0, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#13 0x000055a5517c3369 in ngx_master_process_cycle (cycle=0x7f5bfbf927d0, cycle@entry=0x7f5bfda95200) at src/os/unix/ngx_process_cycle.c:234
#14 0x000055a551795af9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
(gdb) frame 2
#2  0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
(gdb) p/x $r13
$1 = 0x7f5bfddeabfa
(gdb) info symbol $r13
No symbol matches $r13.
(gdb) p/x $r12
$2 = 0x7f5bfb478b50
(gdb) bt full
#0  get_meta (p=p@entry=0x7f5bfc59b550 "") at src/malloc/mallocng/meta.h:141
        offset = 0
        index = 31
        base = 0x7f5bfc59b540
        meta = 0x0
        area = <optimized out>
#1  0x00007f5bfdf36c2f in __libc_free (p=0x7f5bfc59b550) at src/malloc/mallocng/free.c:105
        g = <optimized out>
        idx = <optimized out>
        stride = <optimized out>
        start = <optimized out>
        end = <optimized out>
        self = <optimized out>
        all = <optimized out>
        mi = {base = <optimized out>, len = <optimized out>}
#2  0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
No symbol table info available.
#3  0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
No symbol table info available.
#4  0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
No symbol table info available.
#5  0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1
No symbol table info available.
#6  0x000055a5517ca678 in ngx_ssl_handshake (c=c@entry=0x7f5bfd670728) at src/event/ngx_event_openssl.c:1720
        n = <optimized out>
        sslerr = <optimized out>
        err = <optimized out>
        rc = <optimized out>
#7  0x000055a5517caa91 in ngx_ssl_handshake_handler (ev=0x7f5bfd4d7e90) at src/event/ngx_event_openssl.c:2091
        c = 0x7f5bfd670728
...

Another thing I've tried is to build a Debian-based image with nginx v1.20.1 from #7593, with lua-resty-balancer v0.04 and the controller from image v0.49.2.
It still fails, and there is the same issue with libssl symbols in bullseye (maybe it's better to consider Ubuntu as a base):

# find-dbgsym-packages /usr/lib/x86_64-linux-gnu/libssl.so.1.1
W: No dbg package for source 'openssl'
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (08aa8c462037f6809d07ab8b9baaeb9100ece88e)
W: No dbg package for source 'openssl'
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libssl.so.1.1 (847f26496373ebc96c8b34eae3e12390d6c128ca)
I: All needed dbgsym packages are already installed.

# gdb /sbin/nginx /tmp/core.nginx.1056
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/nginx...

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing

warning: Can't open file /[aio] (deleted) during file-backed mapping note processing
[New LWP 1056]
[New LWP 1058]
[New LWP 1059]
[New LWP 1060]
[New LWP 1062]
[New LWP 1061]
[New LWP 1063]
[New LWP 1064]
[New LWP 1065]
[New LWP 1067]
[New LWP 1066]
[New LWP 1068]
[New LWP 1070]
[New LWP 1072]
[New LWP 1069]
[New LWP 1073]
[New LWP 1071]
[New LWP 1074]
[New LWP 1077]
[New LWP 1075]
[New LWP 1078]
[New LWP 1085]
[New LWP 1087]
[New LWP 1079]
[New LWP 1076]
[New LWP 1089]
[New LWP 1082]
[New LWP 1081]
[New LWP 1083]
[New LWP 1080]
[New LWP 1084]
[New LWP 1088]
[New LWP 1086]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f74ca5ed740 (LWP 1056))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f74ca775537 in __GI_abort () at abort.c:79
#2  0x00007f74ca7ce768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f74ca8dce2d "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f74ca7d5a5a in malloc_printerr (str=str@entry=0x7f74ca8df230 "double free or corruption (!prev)") at malloc.c:5347
#4  0x00007f74ca7d709c in _int_free (av=0x7f74ca90eb80 <main_arena>, p=0x5569c8b9eb20, have_lock=<optimized out>) at malloc.c:4317
#5  0x00007f74cab0365f in OPENSSL_LH_doall_arg () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007f74caca89c7 in SSL_CTX_flush_sessions () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007f74cacc29f4 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007f74cacb486d in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00005569c5ddfd27 in ngx_ssl_handshake (c=c@entry=0x5569cb690440) at src/event/ngx_event_openssl.c:1720
#10 0x00005569c5de0140 in ngx_ssl_handshake_handler (ev=0x5569c957d440) at src/event/ngx_event_openssl.c:2091
#11 0x00005569c5dda889 in ngx_epoll_process_events (cycle=0x5569c933af60, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#12 0x00005569c5dcdaaa in ngx_process_events_and_timers (cycle=cycle@entry=0x5569c933af60) at src/event/ngx_event.c:257
#13 0x00005569c5dd8099 in ngx_worker_process_cycle (cycle=0x5569c933af60, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#14 0x00005569c5dd5fb3 in ngx_spawn_process (cycle=cycle@entry=0x5569c933af60, proc=proc@entry=0x5569c5dd7f90 <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0x5569c5efaf47 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#15 0x00005569c5dd6c3a in ngx_start_worker_processes (cycle=cycle@entry=0x5569c933af60, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#16 0x00005569c5dd8af2 in ngx_master_process_cycle (cycle=0x5569c933af60, cycle@entry=0x5569c7c3e400) at src/os/unix/ngx_process_cycle.c:234
#17 0x00005569c5daba83 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

Please drop me an email if you have time for the call, or provide some instructions I can help you with. Thank you.

@ghouscht

@sepich
It is likely to be another bug, maybe on the OpenSSL side.
Is it possible to send me a core file? This one seems harder to debug than the previous one, so a core file would be very much appreciated. Thanks!
Also, has anyone else who got the segfault tried the new release? Feedback welcome, thanks!

Running 4 replicas of v0.49.2 for 24h so far without a crash! There is also one replica of an old v0.44.0 image running in the same cluster for 24h, which so far has also had no crash. The strange thing is that back in April we had something like a dozen crashes per day with v0.44.X and v0.45.X, and currently it seems to be not reproducible.

The cluster is quite large, with ~1400 Ingress objects and around 200 rps. From an infrastructure point of view the only significant change was the switch from RHEL 7 (with Docker) to Debian 10 (with containerd), but I don't think that is related to the problem disappearing. I think it's probably related to some user who previously used a special Ingress configuration (e.g. some special annotations), but that's also just a guess...

A bit more than 48h into testing v0.49.2 (4 replicas), no SIGSEGV so far. In the same time there were 361 SIGSEGVs on v0.44.0 (1 replica), so at least for me the issue seems to be gone and the fix in the lua-resty-balancer module helped. I took a look at nearly two dozen coredumps and all of them crashed in Lua code, not a single one in SSL-related code, so I think I can't help any further at the moment.

@doujiang24

Hi @ghouscht, thanks very much. The resty.balancer-related bug should be fixed, then.

@doujiang24

Hi @sepich , thanks for your help.

(gdb) p/x $r13
$1 = 0x7f5bfddeabfa
(gdb) info symbol $r13
No symbol matches $r13.

Oh no, it seems I got something wrong somewhere. I will take a deeper look.

unfortunately I am not allowed to.

Oh, that's bad news, but I also understand it.

We can arrange a call with an interactive gdb session if you wish.

Sure, that will be more helpful.
Also, I have to clarify first that I'm not an expert on OpenSSL, just somewhat familiar with it, so I'm not sure that I can figure it out in the end.
My current plan is to get more information first; we may still need to report to OpenSSL if we get more evidence.

I will contact you by email when I have a further plan and more free time, maybe in one or two days. Thanks for your patience.

@rikatz
Contributor

rikatz commented Sep 30, 2021

@doujiang24 a possibly related issue:
#7647

@rikatz
Contributor

rikatz commented Sep 30, 2021

And #7080

@rikatz
Contributor

rikatz commented Sep 30, 2021

OK, @doujiang24 @sepich, can we move the OpenSSL issue to another issue (like #7647) and keep it there? This way we can say that the specific issue of this case is solved, and start digging more into OpenSSL.

Thanks!

@doujiang24

Hi @rikatz , totally agreed.

@sepich
Contributor

sepich commented Oct 1, 2021

Sure, let's keep each issue to the point. But I believe #7080 is more suitable for our case, as we do not see the 4x memory increase for the worker process (as in #7647).

@rikatz
Contributor

rikatz commented Oct 1, 2021

/close

This one has been solved, thank you very much @doujiang24

I'm moving the discussion about ssl endless coredumps to #7080

@k8s-ci-robot
Contributor

@rikatz: Closing this issue.

In response to this:

/close

This one has been solved, thank you very much @doujiang24

I'm moving the discussion about ssl endless coredumps to #7080

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lboix
Contributor

lboix commented Dec 20, 2021

Upgrading to 0.49.3 helped here too for those symptoms (no more segfaults, no more restarts).
Our volume at the time was:

  • Kubernetes 1.19 cluster with around 2000 Ingress resources
  • 6 Ingress controller pods receiving the traffic of almost 1000 websites

@gerardocorea

Was there a way to replicate this issue on 0.45.0 by modifying Ingresses? We're observing this issue when Ingresses are removed from a namespace, but when I tried replicating the same behavior on the same version, coredumps were not logged even when the backend reloaded.
