-
Notifications
You must be signed in to change notification settings - Fork 730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for delaying webhook shutdown #3170
feat: Add support for delaying webhook shutdown #3170
Conversation
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
c37453b
to
76718fa
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3170 +/- ##
==========================================
- Coverage 53.81% 53.72% -0.10%
==========================================
Files 136 136
Lines 12198 12198
==========================================
- Hits 6564 6553 -11
- Misses 5134 5141 +7
- Partials 500 504 +4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
55c2900
to
c7fe1fb
Compare
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
main.go
Outdated
// Cancel the context once the wait period expires but avoid blocking if | ||
// a second signal is received | ||
go func() { | ||
<-time.After(waitPeriod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my reading of the PR, it seems that you intercept the signal and then wait 10 seconds to cancel the context. The issue is that the context being cancelled is what tells all the goroutines to shutdown gracefully. I don't know if this is 100% the case in this codebase, but as a rule of thumb, that is how it's done.
In my opinion, the ideal way to handle this is to let the context be cancelled as is currently done, but use a WaitGroup and have each goroutine call Done
on the WaitGroup
after they stop. You would then block in innerMain
until the WaitGroup
has completed or your timeout of 10 seconds. That being said, Kubernetes can already force kill the process after a timeout, so you could choose to rely on that instead of custom logic.
Additionally, you could create a new context with context.WithCancel
from the controller-runtime context and pass that to all the goroutines so that when they exit with an error, they cancel that context thus signaling everything to shutdown gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @mprahl, I might be misunderstanding some of the above, but delaying graceful shutdown was the intention here because as soon as graceful shutdown is triggered all webhook server listeners are closed and any TCP connections that are attempted by the API server just after this point will be rejected: https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/webhook/server.go#L248-L253
As a consumer of controller-runtime, the only thing we have control over when it comes to webhook server shutdown is the context and when it gets cancelled.
If I understand correctly, you are talking about configuring the manager to wait until all Go routines are shutdown gracefully, which is already happening (defaults to 30 seconds):
- https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/manager/manager.go#L255-L259
- https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/manager/internal.go#L55
Perhaps this is something that should supported properly in controller-runtime instead of manipulating the context here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dippynark, I'm not quite understanding the intent of this change because cancelling the context is the only way to signal to controller-runtime to shutdown.
Are you saying that there are other components within Gatekeeper that intercept the OS signal directly as well instead of relying on that parent context associated with the OS signal and it's child contexts? If not, then I read this code as sleeping for 10 seconds after the exit signal is received and then it instructs controller-runtime to shutdown, which just delays the existing behavior by 10 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mprahl The intent is to wait for some time after the TERM signal is received to get the following behaviour:
- TERM signal is sent to the Gatekeeper process by the kubelet
- API server stops sending new connections to Gatekeeper webhooks
- Gatekeeper shuts down webhook server listeners
Currently step 2 can come after step 3 if we are unlucky and some connections get rejected (i.e. there is a race condition) so this change is to try to ensure that step 3 always comes after step 2 by delaying shutdown for a short amount of time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. That makes sense to me now.
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
8509e10
to
40a3247
Compare
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting! I like the idea, though I'm wondering if the best route forward is to submit a PR to controller-runtime so all projects can benefit?
// setupSignalHandler registers a signal handler for SIGTERM and SIGINT and | ||
// returns a context which is canceled when one of these signals is caught after | ||
// waiting for the specified duration. If a second signal is caught then we | ||
// terminate immediately with exit code 1. The implementation has been stolen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid misinterpretation, let's avoid "stolen". "This augments controller-runtime's model to support the delay".
Also, have you considered submitting this to controller-runtime? Generally it's better to avoid forks when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxsmythe Thanks for the review, good point! I have created a controller-runtime PR here: kubernetes-sigs/controller-runtime#2601
Happy for this PR to be closed for now and I will create a new one if/when the controller-runtime PR is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for contributing!
Closing in favor of kubernetes-sigs/controller-runtime#2601 |
What this PR does / why we need it:
Currently, as far as I can tell, Gatekeeper shuts down immediately after a TERM signal is received. If a Gatekeeper Pod happens to shutdown just as the API server is setting up a connection for a webhook request (but before it realises the Pod has been terminated) then the request might fail trying to connect to a server that is no longer accepting connections.
To fix this, this PR adds a
--shutdown-delay-duration
flag with a default of 10 seconds. This matches with the similar API Server flagSpecial notes for your reviewer: