Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add support for controller-manager webhook shutdown delay #2601

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
13 changes: 13 additions & 0 deletions pkg/webhook/server.go
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,10 @@ type Options struct {

// WebhookMux is the multiplexer that handles different webhooks.
WebhookMux *http.ServeMux

// ShutdownDelay delays server shutdown to wait for clients to stop opening
// new connections before closing server listeners. Defaults to 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question:
Would it make sense to set a > 0 default, eg. 15s which is less than the 30 seconds default graceful shutdown delay?

Copy link
Author

@dippynark dippynark Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tricky one for sure, but because it's so specific to the Kubernetes cluster you're running on it doesn't feel right to align to any particular cluster or cloud provider (e.g. I'm not sure what value EKS and AKS use for example).

IMO we should keep it at 0 for backwards compatibility and to ensure that connection draining 'doesn't work' consistently across clusters by default. I think this is also more likely to encourage applications that care a lot about this (e.g. Gatekeeper) to expose a flag to change this to make it easier to reconfigure for the cluster you're running on (i.e. push the decision onto users of controller-runtime since an optimal default isn't clear).

No problems with changing to 15 seconds by default though if you were leaning towards that (that would be better for us anyway because we're using GKE 😄). We could alternatively align to the kube-proxy default of 1 second? But on GKE at least this won't help much.

ShutdownDelay time.Duration
}

// NewServer constructs a new webhook.Server from the provided options.
Expand Down Expand Up @@ -246,6 +250,15 @@ func (s *DefaultServer) Start(ctx context.Context) error {
idleConnsClosed := make(chan struct{})
go func() {
<-ctx.Done()

// Disable HTTP keep-alives to close persistent connections after next
// HTTP request. Clients may reconnect until routes are updated and
// server listeners are closed however this should start gradually
// migrating clients to server instances that are not about to shutdown
srv.SetKeepAlivesEnabled(false)
// Wait before shutting down webhook server
<-time.After(s.Options.ShutdownDelay)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<-time.After(s.Options.ShutdownDelay)
time.Sleep(s.Options.ShutdownDelay)

We don't need to create a channel, I think.

Copy link
Author

@dippynark dippynark Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, fixed


log.Info("Shutting down webhook server with timeout of 1 minute")

ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
Expand Down