Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for delaying webhook shutdown #3170

Closed
31 changes: 30 additions & 1 deletion main.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,9 @@ import (
"net/http"
_ "net/http/pprof"
"os"
"os/signal"
"path/filepath"
"syscall"
"time"

"github.com/go-logr/zapr"
Expand Down Expand Up @@ -115,6 +117,7 @@ var (
disabledBuiltins = util.NewFlagSet()
enableK8sCel = flag.Bool("experimental-enable-k8s-native-validation", false, "PROTOTYPE (not stable): enable the validating admission policy driver")
externaldataProviderResponseCacheTTL = flag.Duration("external-data-provider-response-cache-ttl", 3*time.Minute, "TTL for the external data provider response cache. Specify the duration in 'h', 'm', or 's' for hours, minutes, or seconds respectively. Defaults to 3 minutes if unspecified. Setting the TTL to 0 disables the cache.")
shutdownWaitPeriod = flag.Duration("shutdown-wait-period", 10*time.Second, "The amount of time to wait before shutting down")
)

func init() {
Expand Down Expand Up @@ -309,7 +312,7 @@ func innerMain() int {

// Setup controllers asynchronously, they will block for certificate generation if needed.
setupErr := make(chan error)
ctx := ctrl.SetupSignalHandler()
ctx := setupSignalHandler(*shutdownWaitPeriod)
go func() {
setupErr <- setupControllers(ctx, mgr, sw, tracker, setupFinished)
}()
Expand Down Expand Up @@ -579,3 +582,29 @@ func setLoggerForProduction(encoder zapcore.LevelEncoder, dest io.Writer) {
ctrl.SetLogger(newlogger)
klog.SetLogger(newlogger)
}

// setupSignalHandler registers a signal handler for SIGTERM and SIGINT and
// returns a context which is canceled when one of these signals is caught after
// waiting for the specified period. If a second signal is caught then we
// terminate immediately with exit code 1. The implementation has been stolen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid misinterpretation, let's avoid "stolen". "This augments controller-runtime's model to support the delay".

Also, have you considered submitting this to controller-runtime? Generally it's better to avoid forks when possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxsmythe Thanks for the review, good point! I have created a controller-runtime PR here: kubernetes-sigs/controller-runtime#2601

Happy for this PR to be closed for now and I will create a new one if/when the controller-runtime PR is merged

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contributing!

// from controller-runtime and then extended with wait period support:
// https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/manager/signals/signal.go#L27-L45
func setupSignalHandler(waitPeriod time.Duration) context.Context {
ctx, cancel := context.WithCancel(context.Background())

c := make(chan os.Signal, 2)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
go func() {
<-c
// Cancel the context once the wait period expires but avoid blocking if
// a second signal is received
go func() {
<-time.After(waitPeriod)
Copy link

@mprahl mprahl Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my reading of the PR, it seems that you intercept the signal and then wait 10 seconds to cancel the context. The issue is that the context being cancelled is what tells all the goroutines to shutdown gracefully. I don't know if this is 100% the case in this codebase, but as a rule of thumb, that is how it's done.

In my opinion, the ideal way to handle this is to let the context be cancelled as is currently done, but use a WaitGroup and have each goroutine call Done on the WaitGroup after they stop. You would then block in innerMain until the WaitGroup has completed or your timeout of 10 seconds. That being said, Kubernetes can already force kill the process after a timeout, so you could choose to rely on that instead of custom logic.

Additionally, you could create a new context with context.WithCancel from the controller-runtime context and pass that to all the goroutines so that when they exit with an error, they cancel that context thus signaling everything to shutdown gracefully.

Copy link
Contributor Author

@dippynark dippynark Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mprahl, I might be misunderstanding some of the above, but delaying graceful shutdown was the intention here because as soon as graceful shutdown is triggered all webhook server listeners are closed and any TCP connections that are attempted by the API server just after this point will be rejected: https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/webhook/server.go#L248-L253

As a consumer of controller-runtime, the only thing we have control over when it comes to webhook server shutdown is the context and when it gets cancelled.

If I understand correctly, you are talking about configuring the manager to wait until all Go routines are shutdown gracefully, which is already happening (defaults to 30 seconds):

Perhaps this is something that should supported properly in controller-runtime instead of manipulating the context here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dippynark, I'm not quite understanding the intent of this change because cancelling the context is the only way to signal to controller-runtime to shutdown.

Are you saying that there are other components within Gatekeeper that intercept the OS signal directly as well instead of relying on that parent context associated with the OS signal and it's child contexts? If not, then I read this code as sleeping for 10 seconds after the exit signal is received and then it instructs controller-runtime to shutdown, which just delays the existing behavior by 10 seconds.

Copy link
Contributor Author

@dippynark dippynark Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mprahl The intent is to wait for some time after the TERM signal is received to get the following behaviour:

  1. TERM signal is sent to the Gatekeeper process by the kubelet
  2. API server stops sending new connections to Gatekeeper webhooks
  3. Gatekeeper shuts down webhook server listeners

Currently step 2 can come after step 3 if we are unlucky and some connections get rejected (i.e. there is a race condition) so this change is to try to ensure that step 3 always comes after step 2 by delaying shutdown for a short amount of time

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. That makes sense to me now.

cancel()
}()
<-c
os.Exit(1) // Second signal received, exit directly...
}()

return ctx
}
Loading