Stuck mutex causing resource leak #36
Comments
Thank you for the report! Let me look into this. I'll look for code paths that may unintentionally try to acquire the mutex while it is already held. Do the block and mutex profiles have anything in them?
They currently report zero events; I think this is because we need to patch in calls to the runtime functions that enable those profiles.
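For reference, block and mutex profiles are off by default in Go, so a custom build would need to opt in with the standard runtime calls, roughly like this (a minimal sketch; where exactly to hook it into a Caddy build is an assumption):

```go
package main

import "runtime"

func init() {
	// Record every mutex contention event and every blocking event.
	// A rate of 1 is the most verbose setting; production builds may
	// prefer a sparser sampling rate.
	runtime.SetMutexProfileFraction(1)
	runtime.SetBlockProfileRate(1)
}
```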
Ah, of course. Well, I have been thoroughly reviewing all uses of the mutex here (line 27 at commit 89a7fec), and I cannot find any place that would not reliably unlock; every critical section of this mutex appears to release it. My next idea was to examine the profile above and see if I could find anything suspicious. I'd be looking for a stack that includes the ratelimit package -- specifically a code path that holds that mutex -- that might be the culprit for blocking the other 20k+ goroutines... but the stack traces I can see there don't seem to be capable of creating a deadlock... I also looked through your caddy-ratelimit fork and don't see anything suspicious... nothing stands out to me. Granted, I did this all while supervising a newborn, so it's possible I missed something (I'm still getting used to this life, heh) BUT I definitely want to track down where that lock is not being released.
Just as I submitted my last reply, I had another realization: the rate limiter is pooled, so I wonder... is it possible that one is going back into the pool before it is done being used / before it is unlocked? I will need to look at that tomorrow (unless you beat me to it).
Giving the source code a close read, I've reached similar conclusions as you; I also can't see an angle where the lock would be left held. I think my next step is to try to reproduce this in a development environment with some artificial load and a custom caddy build with mutex profiling enabled. (I'll leave debugging in production as a last resort :D)
Glad I'm not the only one to reach that conclusion! Another option would be to eliminate the pool and just allocate every time (as a test). If the issue goes away, then it's almost certainly the pool, I bet.
We haven't had much luck reproducing it with artificial load, so we're moving on to Testing in Production ™️, with the custom profiling build. If that proves unenlightening I'll also make a build with the pool removed.
Dang, okay sounds good. Keep me posted, and I'll try to revisit this soon!
Caddy was OOM Killed for the first time this week, with no obvious cause. I use ratelimit and found this issue while looking for suspects. Should it occur again, what steps do I need to take to be helpful in troubleshooting the problem?
@DzigaV Thanks for chiming in. So yeah, if you'd be able to capture a profile when the memory use starts going up, that'd be helpful -- either that, or modify the code a little bit to disable the pool as described above. If it doesn't recur without the pool, then that's a good indication the pool is the bug.
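For concreteness, the pool-disabling experiment could be as small as swapping the pool lookup for a fresh allocation in the handler's get-or-store step. This is a sketch based on the handler snippet quoted later in this thread, not an actual patch:

```go
// Allocate a fresh limiter each time instead of drawing from ringBufPool.
limiter := new(ringBufferRateLimiter)
val, loaded := rl.limiters.LoadOrStore(key, limiter)
if loaded {
	// Our fresh allocation wasn't needed; let the GC reclaim it
	// (no ringBufPool.Put here, since the pool is out of the picture).
	limiter = val.(*ringBufferRateLimiter)
} else {
	limiter.initialize(rl.MaxEvents, time.Duration(rl.Window))
}
```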
Sorry for taking a while to get back to this -- vacations and competing priorities and whatnot. We were able to reproduce the issue once about a week ago, but unfortunately weren't able to get profiling out of it. Otherwise we haven't been able to reproduce it.

What's interesting is that we've not been able to reproduce it since making some bug fixes to our upstream server. One in particular involved fixing a panic bug, i.e. a situation where the upstream suddenly drops a connection. That might suggest doing some fuzzing/fault-injection tests locally. I don't think I'll have time for this, though -- as long as Caddy is now behaving.

As for troubleshooting, @DzigaV: if you're able to, make a build of caddy with mutex and block profiling turned up. Note this increases the overall (stable) memory usage of caddy. Then collect mutex, block, and goroutine (stack) profiles periodically:

```sh
for profile in mutex goroutine block; do
    curl -sL localhost:2019/debug/pprof/${profile}?debug=1 >${profile}.txt
done
```
For profiling, you might benefit from this article: Profiling Caddy. If you're using Grafana, you can point the Pyroscope scraper at Caddy for continuous profiling collection.
Huzzah! We were able to capture profiles from an affected replica. Haven't looked into these yet though, looking now.
Here's a healthy replica, for comparison.
I think this stack trace from a goroutine profile is what we're looking for. (The number of goroutines has been going up in tandem with open file descriptors when this gets stuck, and this profile contains a total of 4960 goroutines.)
Alright, I have a plausible theory. TL;DR: there's a data race between storing a `ringBufferRateLimiter` in the map and initializing it.

Let's look at https://github.com/mholt/caddy-ratelimit/blob/master/handler.go#L174. For clarity, I've destructured the relevant code:

```go
limiter := ringBufPool.Get().(*ringBufferRateLimiter)
val, loaded := rl.limiters.LoadOrStore(key, limiter)
if loaded {
	ringBufPool.Put(limiter) // didn't use; save for next time
	limiter = val.(*ringBufferRateLimiter)
} else {
	// as another side-effect of sync.Map's bad API, avoid all the
	// work of initializing the ring buffer unless we have to
	limiter.initialize(rl.MaxEvents, time.Duration(rl.Window))
}
```

Suppose we've just begun the program. Goroutine A runs `LoadOrStore`, which stores a fresh limiter from the pool into the map, but A has not yet reached `limiter.initialize()`, so the limiter's ring buffer still has length zero.

Goroutine B comes along and services a similar request. B runs `LoadOrStore` with the same key, gets `loaded == true`, and picks up that same not-yet-initialized limiter.

Suppose the scheduler is giving broad latitude to goroutine B. B proceeds to:

```go
if err := h.distributedRateLimiting(w, repl, limiter, key, rl.zoneName); err != nil {
	return err
}
```

Which then invokes https://github.com/mholt/caddy-ratelimit/blob/master/distributed.go#L226:

```go
limiter.mu.Lock()
count, oldestLocalEvent := limiter.countUnsynced(now())
```

Notice it's taking the rate limiter lock. Let's consider what happens when we run `countUnsynced()` on an uninitialized limiter:

```go
func (r *ringBufferRateLimiter) countUnsynced(ref time.Time) (int, time.Time) {
	var zeroTime time.Time
	beginningOfWindow := ref.Add(-r.window)
	// ...
	for eventsInWindow := 0; eventsInWindow < len(r.ring); eventsInWindow++ {
		// ... doesn't matter since len(r.ring) == 0 and this loop is never run.
	}
	// if we looped the entire ring, all events are within the window
	return len(r.ring), r.ring[r.cursor]
}
```

We get an out-of-bounds panic on `r.ring[r.cursor]`, since `len(r.ring) == 0`. The panic unwinds past the `limiter.mu.Lock()` with no deferred unlock, so the mutex is left locked forever. This explains why there's one goroutine stuck holding the lock while every other goroutine touching that limiter blocks behind it.

Unfortunately I can only validate this theory by reading the code so far; I haven't reproduced it.
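To make the failure mode concrete, here is a standalone toy program (not the plugin's code; the types and names are invented for illustration) showing how a panic raised while a mutex is held, and then recovered higher up the stack, leaves that mutex locked for every later caller:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter stands in for an uninitialized ringBufferRateLimiter:
// its ring slice has length zero.
type limiter struct {
	mu   sync.Mutex
	ring []time.Time
}

func main() {
	l := &limiter{}

	go func() {
		// An HTTP server would recover the panic from the handler...
		defer func() { recover() }()
		l.mu.Lock()   // ...but nothing ever unlocks the mutex,
		_ = l.ring[0] // because this index-out-of-range panic skips the unlock.
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("second caller trying to lock...")
	// Never succeeds. In this toy program Go's deadlock detector aborts the
	// run; in a real server, goroutines simply pile up here forever.
	l.mu.Lock()
	fmt.Println("unreachable")
}
```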
Nice -- that could possibly be it. I was skeptical of that pool, but didn't have time to dive into it further. If that is what's happening, I wonder if we need to initialize it before putting it into the UsagePool, or use LoadOrNew instead of LoadOrStore. That might allow us to ensure initialization always happens. (Perhaps the performance penalty will be minimal since we're reusing the rate limiters anyway.)
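One reading of the "initialize before storing" idea, using the names from the snippet quoted above (this is not the fix that actually landed):

```go
limiter := ringBufPool.Get().(*ringBufferRateLimiter)
// Make the limiter fully usable before any other goroutine can see it.
limiter.initialize(rl.MaxEvents, time.Duration(rl.Window))
val, loaded := rl.limiters.LoadOrStore(key, limiter)
if loaded {
	ringBufPool.Put(limiter) // ours wasn't needed; return it to the pool
	limiter = val.(*ringBufferRateLimiter)
}
```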
I'm working through a patch now. One thing I've noticed while iterating: I think a panic in the rate limiter handler isn't logged at a high enough level -- it seems to be logged at DEBUG. I ran into an example of this while running my build locally.

This would explain why I haven't seen panics in the logs. This seems like a bug to me, as I think panics should be logged at something louder than DEBUG.
@inahga Ah, I see the difficulty here. A while back (big-O of "years") we made logs from the standard library's HTTP server go to the DEBUG level. I didn't consider that the std lib would recover a panic, and that it would then also be hidden behind a DEBUG log 😅 Wow. So yes, I agree panics should be louder, but ideally without opening the firehose from the rest of the std lib's server logs. Amazing work figuring that out!!
I suspect a recover() wrapper around handler invocation could log panics at a more appropriate level. This would also allow for a mode where panics abort the program, which is perhaps generally useful for debugging and development. Either way this would have to be done in Caddy itself, so I'll take this to a different issue/PR.
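For illustration, a recover-wrapping middleware along those lines might look like the sketch below. This is not Caddy's actual API; the handler signature and logger wiring are assumptions.

```go
package middleware

import (
	"net/http"
	"runtime/debug"

	"go.uber.org/zap"
)

// RecoverPanics wraps next so that a panic in a downstream handler is
// logged at ERROR (instead of disappearing into a DEBUG log) and turned
// into a 500 response for the client.
func RecoverPanics(next http.Handler, logger *zap.Logger) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				logger.Error("handler panicked",
					zap.Any("panic", v),
					zap.ByteString("stack", debug.Stack()))
				w.WriteHeader(http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}
```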
@inahga FWIW, we set the ErrorLogger on the HTTP server to a zap log at DEBUG level. I've done recover wrapping like that before -- if that in fact works for this situation, I'd be open to a PR!
This closes #36. It fixes a race condition between `ringBufferRateLimiter` creation and its insertion into the map, by locking the entire map when we get or insert a `ringBufferRateLimiter`.

I have replaced the use of `sync.Map` with a plain `map[string]*ringBufferRateLimiter` and a `sync.Mutex`, passed around together in a `rateLimitersMap` struct. I've factored the logic out into methods on `rateLimitersMap`, which enables some careful use of `defer rlm.mu.Unlock()` to avoid leaving a lock held open on `panic()`.

We didn't see a need for `sync.Map`: its documentation (https://pkg.go.dev/sync#Map) recommends a plain map with separate locking for better type safety, and none of its suggested use cases apply here. Let me know if I'm misunderstanding the use case (very possible!).

I've also removed the `sync.Pool` for now. Since `ringBufferRateLimiter` creation and insertion are fully synchronized, I didn't see a need for it.

Note that some of the defensive refactoring is not strictly required -- I have a change that preserves the existing data structures (divviup@65ad951) -- but I think the suggested changeset is an overall improvement in maintainability. Some discussion of the performance impact and profiles is in divviup#1; TL;DR, no meaningful impact to CPU, memory, or latency. This implementation could be optimized by replacing the plain mutex with an `RWMutex`, but that would be a marginal improvement (if any) in exchange for much more complicated locking semantics.
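A minimal sketch of the structure described above, assuming the method name and exact fields are illustrative rather than the patch's actual code (`ringBufferRateLimiter` and its `initialize` method are the plugin's existing type):

```go
package ratelimit // illustrative package name

import (
	"sync"
	"time"
)

// rateLimitersMap guards a plain map with a mutex, so creation,
// initialization, and insertion of a limiter all happen atomically.
type rateLimitersMap struct {
	mu       sync.Mutex
	limiters map[string]*ringBufferRateLimiter
}

// getOrInsert returns the limiter for key, creating and fully
// initializing it before any other goroutine can observe it.
func (rlm *rateLimitersMap) getOrInsert(key string, maxEvents int, window time.Duration) *ringBufferRateLimiter {
	rlm.mu.Lock()
	defer rlm.mu.Unlock() // deferred, so a panic cannot leave the map locked

	if limiter, ok := rlm.limiters[key]; ok {
		return limiter
	}
	limiter := new(ringBufferRateLimiter)
	limiter.initialize(maxEvents, window)
	rlm.limiters[key] = limiter
	return limiter
}
```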
We're running into an intermittent problem where one of our caddy replicas has runaway memory usage: it grows from a ~50MB baseline to over 1GB, before eventually failing to service liveness probe requests and being killed by our process orchestrator.
It appears to be a resource leak: the output of `lsof` on an affected replica reveals several thousand open sockets. Some inspection of the `pprof` output reveals what looks to be a stuck mutex blocking a bunch of other goroutines. Goroutine stack dump:
This manifests suddenly (ostensibly randomly) in the course of normal operation:
I don't see any pertinent log messages related to this, only normal access logs and a warning when the process is killed.
Caddy is built like so:
(Apologies, we are using forks to get some additional functionality/fixes. I didn't notice any pertinent changes in upstream that would affect this issue).
The OS is:
Our caddy config: