Stuck mutex causing resource leak #36
Comments
Thank you for the report! Let me look into this. I'll look for code paths that may unintentionally try to acquire the mutex while it is already held. Do the block and mutex profiles have anything in them?
They currently report zero events; I think this is because we need to patch in calls to the runtime functions that enable those profiles.
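For reference, block and mutex profiles are off by default in Go, so a custom build would need to opt in with the standard runtime calls, roughly like this (a minimal sketch; where exactly to hook it into a Caddy build is an assumption):

```go
package main

import "runtime"

func init() {
	// Record every mutex contention event and every blocking event.
	// A rate of 1 is the most verbose setting; production builds may
	// prefer a sparser sampling rate.
	runtime.SetMutexProfileFraction(1)
	runtime.SetBlockProfileRate(1)
}
```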
Ah, of course. Well, I have been thoroughly reviewing all uses of the mutex here (line 27 at commit 89a7fec), and I cannot find any place that would not reliably unlock; every critical section of this mutex appears to release it. My next idea was to examine the profile above and see if I could find anything suspicious. I'd be looking for a stack that includes the ratelimit package -- specifically a code path that holds that mutex -- that might be the culprit for blocking the other 20k+ goroutines... but the stack traces I can see there don't seem to be capable of creating a deadlock... I also looked through your caddy-ratelimit fork and don't see anything suspicious... nothing stands out to me. Granted, I did this all while supervising a newborn, so it's possible I missed something (I'm still getting used to this life, heh) BUT I definitely want to track down where that lock is not being released.
Just as I submitted my last reply, I had another realization: the rate limiter is pooled, so I wonder... is it possible that one is going back into the pool before it is done being used / before it is unlocked? I will need to look at that tomorrow (unless you beat me to it).
Giving the source code a close read, I've reached similar conclusions as you; I also can't see an angle where the lock would be left held. I think my next step is to try to reproduce this in a development environment with some artificial load and a custom caddy build with mutex profiling enabled. (I'll leave debugging in production as a last resort :D)
Glad I'm not the only one to reach that conclusion! Another option would be to eliminate the pool and just allocate every time (as a test). If the issue goes away, then it's almost certainly the pool, I bet.
We haven't had much luck reproducing it with artificial load, so we're moving on to Testing in Production ™️, with the custom profiling build. If that proves unenlightening I'll also make a build with the pool removed.
Dang, okay sounds good. Keep me posted, and I'll try to revisit this soon!
Caddy was OOM Killed for the first time this week, with no obvious cause. I use ratelimit and found this issue while looking for suspects. Should it occur again, what steps do I need to take to be helpful in troubleshooting the problem?
@DzigaV Thanks for chiming in. So yeah, if you'd be able to capture a profile when the memory use starts going up, that'd be helpful -- either that, or modify the code a little bit to disable the pool as described above. If it doesn't recur without the pool, then that's a good indication the pool is the bug.
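For concreteness, the pool-disabling experiment could be as small as swapping the pool lookup for a fresh allocation in the handler's get-or-store step. This is a sketch based on the handler snippet quoted later in this thread, not an actual patch:

```go
// Allocate a fresh limiter each time instead of drawing from ringBufPool.
limiter := new(ringBufferRateLimiter)
val, loaded := rl.limiters.LoadOrStore(key, limiter)
if loaded {
	// Our fresh allocation wasn't needed; let the GC reclaim it
	// (no ringBufPool.Put here, since the pool is out of the picture).
	limiter = val.(*ringBufferRateLimiter)
} else {
	limiter.initialize(rl.MaxEvents, time.Duration(rl.Window))
}
```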
Sorry for taking a while to get back to this -- vacations and competing priorities and whatnot. We were able to reproduce the issue once about a week ago, but unfortunately weren't able to get profiling out of it. Otherwise we haven't been able to reproduce it.

What's interesting is that we've not been able to reproduce it since making some bug fixes to our upstream server. One in particular involved fixing a panic bug, i.e. a situation where the upstream suddenly drops a connection. That might suggest doing some fuzzing/fault-injection tests locally. I don't think I'll have time for this, though -- as long as Caddy is now behaving.

As for troubleshooting, @DzigaV: if you're able to, make a build of caddy with mutex and block profiling turned up. Note this increases the overall (stable) memory usage of caddy. Then collect mutex, block, and goroutine (stack) profiles periodically:

```sh
for profile in mutex goroutine block; do
    curl -sL localhost:2019/debug/pprof/${profile}?debug=1 >${profile}.txt
done
```
For profiling, you might benefit from this article: Profiling Caddy. If you're using Grafana, you can point the Pyroscope scraper at Caddy for continuous profiling collection.
Huzzah! We were able to capture profiles from an affected replica. Haven't looked into these yet though, looking now.
Here's a healthy replica, for comparison.
I think this stack trace from a goroutine profile is what we're looking for. (The number of goroutines has been going up in tandem with open file descriptors when this gets stuck, and this profile contains a total of 4960 goroutines.)
Alright, I have a plausible theory. TL;DR: there's a data race between storing a `ringBufferRateLimiter` in the map and initializing it.

Let's look at https://github.com/mholt/caddy-ratelimit/blob/master/handler.go#L174. For clarity, I've destructured the relevant code:

```go
limiter := ringBufPool.Get().(*ringBufferRateLimiter)
val, loaded := rl.limiters.LoadOrStore(key, limiter)
if loaded {
	ringBufPool.Put(limiter) // didn't use; save for next time
	limiter = val.(*ringBufferRateLimiter)
} else {
	// as another side-effect of sync.Map's bad API, avoid all the
	// work of initializing the ring buffer unless we have to
	limiter.initialize(rl.MaxEvents, time.Duration(rl.Window))
}
```

Suppose we've just begun the program. Goroutine A runs `LoadOrStore`, which stores a fresh limiter from the pool into the map, but A has not yet reached `limiter.initialize()`, so the limiter's ring buffer still has length zero.

Goroutine B comes along and services a similar request. B runs `LoadOrStore` with the same key, gets `loaded == true`, and picks up that same not-yet-initialized limiter.

Suppose the scheduler is giving broad latitude to goroutine B. B proceeds to:

```go
if err := h.distributedRateLimiting(w, repl, limiter, key, rl.zoneName); err != nil {
	return err
}
```

Which then invokes https://github.com/mholt/caddy-ratelimit/blob/master/distributed.go#L226:

```go
limiter.mu.Lock()
count, oldestLocalEvent := limiter.countUnsynced(now())
```

Notice it's taking the rate limiter lock. Let's consider what happens when we run `countUnsynced()` on an uninitialized limiter:

```go
func (r *ringBufferRateLimiter) countUnsynced(ref time.Time) (int, time.Time) {
	var zeroTime time.Time
	beginningOfWindow := ref.Add(-r.window)
	// ...
	for eventsInWindow := 0; eventsInWindow < len(r.ring); eventsInWindow++ {
		// ... doesn't matter since len(r.ring) == 0 and this loop is never run.
	}
	// if we looped the entire ring, all events are within the window
	return len(r.ring), r.ring[r.cursor]
}
```

We get an out-of-bounds panic on `r.ring[r.cursor]`, since `len(r.ring) == 0`. The panic unwinds past the `limiter.mu.Lock()` with no deferred unlock, so the mutex is left locked forever. This explains why there's one goroutine stuck holding the lock while every other goroutine touching that limiter blocks behind it.

Unfortunately I can only validate this theory by reading the code so far; I haven't reproduced it.
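To make the failure mode concrete, here is a standalone toy program (not the plugin's code; the types and names are invented for illustration) showing how a panic raised while a mutex is held, and then recovered higher up the stack, leaves that mutex locked for every later caller:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter stands in for an uninitialized ringBufferRateLimiter:
// its ring slice has length zero.
type limiter struct {
	mu   sync.Mutex
	ring []time.Time
}

func main() {
	l := &limiter{}

	go func() {
		// An HTTP server would recover the panic from the handler...
		defer func() { recover() }()
		l.mu.Lock()   // ...but nothing ever unlocks the mutex,
		_ = l.ring[0] // because this index-out-of-range panic skips the unlock.
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("second caller trying to lock...")
	// Never succeeds. In this toy program Go's deadlock detector aborts the
	// run; in a real server, goroutines simply pile up here forever.
	l.mu.Lock()
	fmt.Println("unreachable")
}
```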
Nice -- that could possibly be it. I was skeptical of that pool, but didn't have time to dive into it further. If that is what's happening, I wonder if we need to initialize it before putting it into the UsagePool, or use LoadOrNew instead of LoadOrStore. That might allow us to ensure initialization always happens. (Perhaps the performance penalty will be minimal since we're reusing the rate limiters anyway.)
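One reading of the "initialize before storing" idea, using the names from the snippet quoted above (this is not the fix that actually landed):

```go
limiter := ringBufPool.Get().(*ringBufferRateLimiter)
// Make the limiter fully usable before any other goroutine can see it.
limiter.initialize(rl.MaxEvents, time.Duration(rl.Window))
val, loaded := rl.limiters.LoadOrStore(key, limiter)
if loaded {
	ringBufPool.Put(limiter) // ours wasn't needed; return it to the pool
	limiter = val.(*ringBufferRateLimiter)
}
```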
I'm working through a patch now. One thing I've noticed while iterating: I think a panic in the rate limiter handler isn't logged at a high enough level -- it seems to be logged at DEBUG. I ran into an example of this while running my build locally.

This would explain why I haven't seen panics in the logs. This seems like a bug to me, as I think panics should be logged at something louder than DEBUG.
@inahga Ah, I see the difficulty here. A while back (big-O of "years") we made logs from the standard library's HTTP server go to the DEBUG level. I didn't consider that the std lib would recover a panic, and that it would then also be hidden behind a DEBUG log 😅 Wow. So yes, I agree panics should be louder, but ideally without opening the firehose from the rest of the std lib's server logs. Amazing work figuring that out!!
I suspect a recover() wrapper around handler invocation could log panics at a more appropriate level. This would also allow for a mode where panics abort the program, which is perhaps generally useful for debugging and development. Either way this would have to be done in Caddy itself, so I'll take this to a different issue/PR.
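For illustration, a recover-wrapping middleware along those lines might look like the sketch below. This is not Caddy's actual API; the handler signature and logger wiring are assumptions.

```go
package middleware

import (
	"net/http"
	"runtime/debug"

	"go.uber.org/zap"
)

// RecoverPanics wraps next so that a panic in a downstream handler is
// logged at ERROR (instead of disappearing into a DEBUG log) and turned
// into a 500 response for the client.
func RecoverPanics(next http.Handler, logger *zap.Logger) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				logger.Error("handler panicked",
					zap.Any("panic", v),
					zap.ByteString("stack", debug.Stack()))
				w.WriteHeader(http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}
```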
@inahga FWIW, we set the ErrorLogger on the HTTP server to a zap log at DEBUG level. I've done recover wrapping like that before -- if that in fact works for this situation, I'd be open to a PR!
This closes #36. It fixes a race condition between `ringBufferRateLimiter` creation and its insertion into the map, by locking the entire map when we get or insert a `ringBufferRateLimiter`.

I have replaced the use of `sync.Map` with a plain `map[string]*ringBufferRateLimiter` and a `sync.Mutex`, passed around together in a `rateLimitersMap` struct. I've factored the logic out into methods on `rateLimitersMap`, which enables some careful use of `defer rlm.mu.Unlock()` to avoid leaving a lock held open on `panic()`.

We didn't see a need for `sync.Map`: its documentation (https://pkg.go.dev/sync#Map) recommends a plain map with separate locking for better type safety, and none of its suggested use cases apply here. Let me know if I'm misunderstanding the use case (very possible!).

I've also removed the `sync.Pool` for now. Since `ringBufferRateLimiter` creation and insertion are fully synchronized, I didn't see a need for it.

Note that some of the defensive refactoring is not strictly required -- I have a change that preserves the existing data structures (divviup@65ad951) -- but I think the suggested changeset is an overall improvement in maintainability. Some discussion of the performance impact and profiles is in divviup#1; TL;DR, no meaningful impact to CPU, memory, or latency. This implementation could be optimized by replacing the plain mutex with an `RWMutex`, but that would be a marginal improvement (if any) in exchange for much more complicated locking semantics.
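A minimal sketch of the structure described above, assuming the method name and exact fields are illustrative rather than the patch's actual code (`ringBufferRateLimiter` and its `initialize` method are the plugin's existing type):

```go
package ratelimit // illustrative package name

import (
	"sync"
	"time"
)

// rateLimitersMap guards a plain map with a mutex, so creation,
// initialization, and insertion of a limiter all happen atomically.
type rateLimitersMap struct {
	mu       sync.Mutex
	limiters map[string]*ringBufferRateLimiter
}

// getOrInsert returns the limiter for key, creating and fully
// initializing it before any other goroutine can observe it.
func (rlm *rateLimitersMap) getOrInsert(key string, maxEvents int, window time.Duration) *ringBufferRateLimiter {
	rlm.mu.Lock()
	defer rlm.mu.Unlock() // deferred, so a panic cannot leave the map locked

	if limiter, ok := rlm.limiters[key]; ok {
		return limiter
	}
	limiter := new(ringBufferRateLimiter)
	limiter.initialize(maxEvents, window)
	rlm.limiters[key] = limiter
	return limiter
}
```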
We're running into an intermittent problem where one of our caddy replicas has runaway memory usage: it grows from a ~50MB baseline to over 1GB, before eventually failing to service liveness probe requests and being killed by our process orchestrator.
It appears to be a resource leak: the output of `lsof` on an affected replica reveals several thousand open sockets. Some inspection of the `pprof` output reveals what looks to be a stuck mutex blocking a bunch of other goroutines. Goroutine stack dump:
This manifests suddenly (ostensibly randomly) in the course of normal operation:
I don't see any pertinent log messages related to this, only normal access logs and a warning when the process is killed.
Caddy is built like so:
(Apologies, we are using forks to get some additional functionality/fixes. I didn't notice any pertinent changes in upstream that would affect this issue).
The OS is:
Our caddy config: