Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for delaying webhook shutdown #3170

Closed

Conversation

dippynark
Copy link
Contributor

@dippynark dippynark commented Nov 23, 2023

What this PR does / why we need it:

Currently, as far as I can tell, Gatekeeper shuts down immediately after a TERM signal is received. If a Gatekeeper Pod happens to shutdown just as the API server is setting up a connection for a webhook request (but before it realises the Pod has been terminated) then the request might fail trying to connect to a server that is no longer accepting connections.

To fix this, this PR adds a --shutdown-delay-duration flag with a default of 10 seconds. This matches with the similar API Server flag

Special notes for your reviewer:

  • I wasn't sure how best to unit test this given it interacts with signals
  • It reimplements the controller-runtime signal handler but does not differentiate between Windows and Unix
  • Very happy to completely reimplement this or remove it depending on the opinions of the reviewer, but just thought I'd give it a go first as it's a small change

@dippynark dippynark requested a review from a team as a code owner November 23, 2023 22:17
@dippynark dippynark changed the title feat: Add shutdown grace period flag feat: Add support for graceful shutdown Nov 23, 2023
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
@codecov-commenter
Copy link

codecov-commenter commented Nov 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2370c50) 53.81% compared to head (f95e90b) 53.72%.

❗ Current head f95e90b differs from pull request most recent head 327ac16. Consider uploading reports for the commit 327ac16 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3170      +/-   ##
==========================================
- Coverage   53.81%   53.72%   -0.10%     
==========================================
  Files         136      136              
  Lines       12198    12198              
==========================================
- Hits         6564     6553      -11     
- Misses       5134     5141       +7     
- Partials      500      504       +4     
Flag Coverage Δ
unittests 53.72% <ø> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
main.go Outdated
// Cancel the context once the wait period expires but avoid blocking if
// a second signal is received
go func() {
<-time.After(waitPeriod)
Copy link

@mprahl mprahl Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my reading of the PR, it seems that you intercept the signal and then wait 10 seconds to cancel the context. The issue is that the context being cancelled is what tells all the goroutines to shutdown gracefully. I don't know if this is 100% the case in this codebase, but as a rule of thumb, that is how it's done.

In my opinion, the ideal way to handle this is to let the context be cancelled as is currently done, but use a WaitGroup and have each goroutine call Done on the WaitGroup after they stop. You would then block in innerMain until the WaitGroup has completed or your timeout of 10 seconds. That being said, Kubernetes can already force kill the process after a timeout, so you could choose to rely on that instead of custom logic.

Additionally, you could create a new context with context.WithCancel from the controller-runtime context and pass that to all the goroutines so that when they exit with an error, they cancel that context thus signaling everything to shutdown gracefully.

Copy link
Contributor Author

@dippynark dippynark Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mprahl, I might be misunderstanding some of the above, but delaying graceful shutdown was the intention here because as soon as graceful shutdown is triggered all webhook server listeners are closed and any TCP connections that are attempted by the API server just after this point will be rejected: https://github.com/kubernetes-sigs/controller-runtime/blob/2154ffbc22e26ffd8c3b713927f0df2fa40841f2/pkg/webhook/server.go#L248-L253

As a consumer of controller-runtime, the only thing we have control over when it comes to webhook server shutdown is the context and when it gets cancelled.

If I understand correctly, you are talking about configuring the manager to wait until all Go routines are shutdown gracefully, which is already happening (defaults to 30 seconds):

Perhaps this is something that should supported properly in controller-runtime instead of manipulating the context here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dippynark, I'm not quite understanding the intent of this change because cancelling the context is the only way to signal to controller-runtime to shutdown.

Are you saying that there are other components within Gatekeeper that intercept the OS signal directly as well instead of relying on that parent context associated with the OS signal and it's child contexts? If not, then I read this code as sleeping for 10 seconds after the exit signal is received and then it instructs controller-runtime to shutdown, which just delays the existing behavior by 10 seconds.

Copy link
Contributor Author

@dippynark dippynark Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mprahl The intent is to wait for some time after the TERM signal is received to get the following behaviour:

  1. TERM signal is sent to the Gatekeeper process by the kubelet
  2. API server stops sending new connections to Gatekeeper webhooks
  3. Gatekeeper shuts down webhook server listeners

Currently step 2 can come after step 3 if we are unlucky and some connections get rejected (i.e. there is a race condition) so this change is to try to ensure that step 3 always comes after step 2 by delaying shutdown for a short amount of time

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. That makes sense to me now.

@dippynark dippynark changed the title feat: Add support for graceful shutdown feat: Add support for delaying webhook shutdown Nov 29, 2023
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
Signed-off-by: Luke Addison <lukeaddison785@gmail.com>
Copy link
Contributor

@maxsmythe maxsmythe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting! I like the idea, though I'm wondering if the best route forward is to submit a PR to controller-runtime so all projects can benefit?

// setupSignalHandler registers a signal handler for SIGTERM and SIGINT and
// returns a context which is canceled when one of these signals is caught after
// waiting for the specified duration. If a second signal is caught then we
// terminate immediately with exit code 1. The implementation has been stolen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid misinterpretation, let's avoid "stolen". "This augments controller-runtime's model to support the delay".

Also, have you considered submitting this to controller-runtime? Generally it's better to avoid forks when possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxsmythe Thanks for the review, good point! I have created a controller-runtime PR here: kubernetes-sigs/controller-runtime#2601

Happy for this PR to be closed for now and I will create a new one if/when the controller-runtime PR is merged

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contributing!

@ritazh
Copy link
Member

ritazh commented Dec 6, 2023

Closing in favor of kubernetes-sigs/controller-runtime#2601

#3170 (comment)

@ritazh ritazh closed this Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants