-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 prevent controller.New and manager's errChan from leaking goroutines #651
🐛 prevent controller.New and manager's errChan from leaking goroutines #651
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: DirectXMan12 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
cc @droot |
This quashes a goroutine leak caused by calling controller.New repeatedly without calling Start. controller.New was creating a new workqueue, which was starting goroutines and then expecting to be shut down (by the shutdown method, which is only called at the end of Start). To solve that, we move workqueue initialization to controller.Start. This means that we also move watch starting to controller.Start, but this seems pretty sensible anyway.
55ca61c
to
d98ce09
Compare
FYI, there's a subtle change in behavior here. I think it's the correct one and the old one was a bug, but we used to send nil results from controllers on the error channel, meaning a returned controller is treated as an error. This should never happen except in graceful shutdown, but it's still a behavior change. |
type errSignaler struct { | ||
// errSignal indicates that an error occurred, when closed. It shouldn't | ||
// be written to. | ||
errSignal chan struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why it's chan struct{}
instead of chan error
?
nil
can be treated as error
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is being used purely for signaling that some error has occurred, it is fine.
errSignal chan struct{} | ||
|
||
// err is the received error | ||
err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to change it to []error
?
In SignalError()
, we append errors.
In Error()
, we aggregate errors into one error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm...since mgr bails out on seeing first error, converting to slice of errors is not going to be of much help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. One minor comment.
type errSignaler struct { | ||
// errSignal indicates that an error occurred, when closed. It shouldn't | ||
// be written to. | ||
errSignal chan struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is being used purely for signaling that some error has occurred, it is fine.
errSignal chan struct{} | ||
|
||
// err is the received error | ||
err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm...since mgr bails out on seeing first error, converting to slice of errors is not going to be of much help.
49a24a1
to
5c7efdd
Compare
Instead of using an error channel in the manager, we just use a mutex and store one error. The "waiting" is done by blocking on a channel which gets closed to signal errors. This achives the same effect (only return one error) without having the chance of blocking goroutines; with the old code, goroutines could block trying to return their results (especially since a few tried to return nil results), and since runnables can be added after start, there's no way to appropriately size the channel to avoid this happening (plus no point, since we only report the first error anyway). We also only report errors when the occurred, never signaling for errors with a nil error value.
Since we changed the behavior so that non-erroring runnable completion doesn't stop the manager, this logs a log line to make it more obvious that this has occurred, just in case.
5c7efdd
to
4d81992
Compare
This prevents a couple of goroutine leaks related to the controller and manager logic.
The first is in controller:
controller.New
previously created a new workqueue, which launched goroutines expecting to be stopped withShutdown
. This is semi-unexpected in CR, since we generally state that things don't happen until Start is called (and, indeed, we don't call shutdown unless start is called). This means that callingcontroller.New
without ever starting the controller will cause a goroutine leak. We fix this by delaying workqueue initialization (and consequently source starting) until the controller's Start is called.The second is in the manager's handling of errors -- we used to use an error channel to report errors from internal goroutines and runnables. However, we only ever read the first error from the channel, and it was a non-buffered channel. This meant that if multiple things tried to return results on the channel, all but one would block forever. This was exacerbated by the fact that the goroutines that launched controllers tried to return all results (nil values) on the channel, not just errors. We fix this by using a mutex to save a single error, using a channel's closing to signal shutdown only, and by only ever returning non-nil errors via the error signaler. This means that the only blocking is ever on a short-duration mutex.
Fixes #649