-
Notifications
You must be signed in to change notification settings - Fork 744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for delaying webhook shutdown #3170
Changes from 4 commits
76718fa
f73f60c
c7fe1fb
7cdefb4
b2e1593
1c2ad5b
40a3247
f95e90b
327ac16
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,7 +26,9 @@ import ( | |
"net/http" | ||
_ "net/http/pprof" | ||
"os" | ||
"os/signal" | ||
"path/filepath" | ||
"syscall" | ||
"time" | ||
|
||
"github.com/go-logr/zapr" | ||
|
@@ -115,6 +117,7 @@ var ( | |
disabledBuiltins = util.NewFlagSet() | ||
enableK8sCel = flag.Bool("experimental-enable-k8s-native-validation", false, "PROTOTYPE (not stable): enable the validating admission policy driver") | ||
externaldataProviderResponseCacheTTL = flag.Duration("external-data-provider-response-cache-ttl", 3*time.Minute, "TTL for the external data provider response cache. Specify the duration in 'h', 'm', or 's' for hours, minutes, or seconds respectively. Defaults to 3 minutes if unspecified. Setting the TTL to 0 disables the cache.") | ||
shutdownWaitPeriod = flag.Duration("shutdown-wait-period", 10*time.Second, "The amount of time to wait before shutting down") | ||
) | ||
|
||
func init() { | ||
|
@@ -309,7 +312,7 @@ func innerMain() int { | |
|
||
// Setup controllers asynchronously, they will block for certificate generation if needed. | ||
setupErr := make(chan error) | ||
ctx := ctrl.SetupSignalHandler() | ||
ctx := setupSignalHandler(*shutdownWaitPeriod) | ||
go func() { | ||
setupErr <- setupControllers(ctx, mgr, sw, tracker, setupFinished) | ||
}() | ||
|
@@ -579,3 +582,29 @@ func setLoggerForProduction(encoder zapcore.LevelEncoder, dest io.Writer) { | |
ctrl.SetLogger(newlogger) | ||
klog.SetLogger(newlogger) | ||
} | ||
|
||
// setupSignalHandler registers a signal handler for SIGTERM and SIGINT and | ||
// returns a context which is canceled when one of these signals is caught after | ||
// waiting for the specified period. If a second signal is caught then we | ||
// terminate immediately with exit code 1. The implementation has been stolen | ||
// from controller-runtime and then extended with wait period support: | ||
// https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/manager/signals/signal.go#L27-L45 | ||
func setupSignalHandler(waitPeriod time.Duration) context.Context { | ||
ctx, cancel := context.WithCancel(context.Background()) | ||
|
||
c := make(chan os.Signal, 2) | ||
signal.Notify(c, os.Interrupt, syscall.SIGTERM) | ||
go func() { | ||
<-c | ||
// Cancel the context once the wait period expires but avoid blocking if | ||
// a second signal is received | ||
go func() { | ||
<-time.After(waitPeriod) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From my reading of the PR, it seems that you intercept the signal and then wait 10 seconds to cancel the context. The issue is that the context being cancelled is what tells all the goroutines to shutdown gracefully. I don't know if this is 100% the case in this codebase, but as a rule of thumb, that is how it's done. In my opinion, the ideal way to handle this is to let the context be cancelled as is currently done, but use a WaitGroup and have each goroutine call Additionally, you could create a new context with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey @mprahl, I might be misunderstanding some of the above, but delaying graceful shutdown was the intention here because as soon as graceful shutdown is triggered all webhook server listeners are closed and any TCP connections that are attempted by the API server just after this point will be rejected: https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/webhook/server.go#L248-L253 As a consumer of controller-runtime, the only thing we have control over when it comes to webhook server shutdown is the context and when it gets cancelled. If I understand correctly, you are talking about configuring the manager to wait until all Go routines are shutdown gracefully, which is already happening (defaults to 30 seconds):
Perhaps this is something that should supported properly in controller-runtime instead of manipulating the context here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dippynark, I'm not quite understanding the intent of this change because cancelling the context is the only way to signal to controller-runtime to shutdown. Are you saying that there are other components within Gatekeeper that intercept the OS signal directly as well instead of relying on that parent context associated with the OS signal and it's child contexts? If not, then I read this code as sleeping for 10 seconds after the exit signal is received and then it instructs controller-runtime to shutdown, which just delays the existing behavior by 10 seconds. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mprahl The intent is to wait for some time after the TERM signal is received to get the following behaviour:
Currently step 2 can come after step 3 if we are unlucky and some connections get rejected (i.e. there is a race condition) so this change is to try to ensure that step 3 always comes after step 2 by delaying shutdown for a short amount of time There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it. That makes sense to me now. |
||
cancel() | ||
}() | ||
<-c | ||
os.Exit(1) // Second signal received, exit directly... | ||
}() | ||
|
||
return ctx | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid misinterpretation, let's avoid "stolen". "This augments controller-runtime's model to support the delay".
Also, have you considered submitting this to controller-runtime? Generally it's better to avoid forks when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxsmythe Thanks for the review, good point! I have created a controller-runtime PR here: kubernetes-sigs/controller-runtime#2601
Happy for this PR to be closed for now and I will create a new one if/when the controller-runtime PR is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for contributing!