
✨ Implement graceful shutdown #967

Merged (1 commit, Jul 21, 2020)

Conversation

@alvaroaleman (Member) commented May 26, 2020

This PR implements graceful shutdown.

It is a rebased version of #664 plus a set of "Verschlimmbesserungen" (roughly: "improvements" that risk making things worse):

  • There was a deadlock when the cache start failed: we required the lock to close(cm.internalStopper), but the cache held it while waiting for sync, so shutting down before sync completed deadlocked
  • Godocs: https://github.com/kubernetes-sigs/controller-runtime/pull/664/files#r355773233
  • Make the things we start ourselves Runnables as well, so they are part of the graceful shutdown: https://github.com/kubernetes-sigs/controller-runtime/pull/664/files#r355783939
  • Use a defer to do the graceful shutdown, to make sure it always happens (a rough sketch of the overall wait-with-timeout pattern follows this description)
  • Return both the graceful shutdown error and the run error, rather than only logging the former
  • Rename runnableTearDownTimeout to gracefulShutdownTimeout; IMHO the latter makes the idea clearer, as it's not using terms bound to concepts that only exist within c-r
  • Exit 1 on leader election loss. This is important because otherwise controllers might continue to run after the leader lease was given up. Maybe we should only cancel the leader election context after graceful shutdown has finished?

Supersedes and thereby closes #664
Fixes #764
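
Conceptually, the change boils down to waiting for all runnables to stop, but only up to a bounded grace period, and returning an error instead of hanging when that period is exceeded. The following is a minimal, self-contained sketch of that pattern, not the actual controller-runtime code; names such as waitForRunnables and the WaitGroup wiring are made up for illustration:

package main

import (
	"fmt"
	"sync"
	"time"
)

// waitForRunnables is a hypothetical helper: block until every runnable's
// goroutine has returned, or give up after the configured grace period and
// report an error instead of hanging forever.
func waitForRunnables(wg *sync.WaitGroup, gracefulShutdownTimeout time.Duration) error {
	allStopped := make(chan struct{})
	go func() {
		wg.Wait() // returns once every runnable has called wg.Done()
		close(allStopped)
	}()

	select {
	case <-allStopped:
		return nil
	case <-time.After(gracefulShutdownTimeout):
		return fmt.Errorf("not all runnables have stopped within the grace period of %s", gracefulShutdownTimeout)
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(50 * time.Millisecond) // a runnable that stops quickly
	}()
	fmt.Println(waitForRunnables(&wg, time.Second)) // prints <nil>
}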

/assign @DirectXMan12 @alexeldeib

@k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 26, 2020
@k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 26, 2020
@alvaroaleman force-pushed the wait-for-stop branch 4 times, most recently from 01d7292 to 551da56 on May 26, 2020 22:21
@DirectXMan12 (Contributor) commented Jun 11, 2020

"Verschlimmbesserungen"

wooo! compound German words for things that are weird/hard/take many words to express in English!

Sorry for the delay in reviewing this, gonna take a peek today/tomorrow

@vincepri (Member) commented

Last changes look good to me

@DirectXMan12 (Contributor) left a comment


Comments inline. The most important one is probably around goroutine leaks caused by multiple errors occurring simultaneously.

case <-allStopped:
return nil
case <-gracefulShutdownTimer.C:
return fmt.Errorf("not all runnables have stopped within the grace period of %s", cm.gracefulShutdownTimeout.String())
Review comment (Contributor):

make this a typed error (see the note above about detecting deadline-exceeded errors)

Review comment (Contributor):

(also, no need for the call to .String if the thing is already stringable)

@alvaroaleman (Member, Author):

Done
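
For context, one possible shape of such a typed error (an assumption about the form, not necessarily what the PR ended up adding) could be:

package manager

import (
	"fmt"
	"time"
)

// gracefulShutdownTimeoutError is a hypothetical typed error: callers can
// detect a blown grace period with errors.As instead of matching the
// message string.
type gracefulShutdownTimeoutError struct {
	Timeout time.Duration
}

func (e *gracefulShutdownTimeoutError) Error() string {
	return fmt.Sprintf("not all runnables have stopped within the grace period of %s", e.Timeout)
}

A caller could then branch on this error with errors.As rather than inspecting the error text.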

// else that keeps the binary running might allow controllers that
// need leader election to run after we lost the lease.
log.Info("leader lease lost, exiting")
os.Exit(0)
Review comment (Contributor):

Raise a panic in case we've got defers and other stuff folks want to finish? Nice for cleanup of temporary files and such, especially when testing.

@alvaroaleman (Member, Author):

The problem is that entering this is part of a normal shutdown of a Manager that has leader election enabled, so panicking here would mean that all controller-managers with LE enabled panic on shutdown. For tests, the whole func can be overridden.

I am not really sure what is best to do here. The os.Exit is not nice, and we also always exit with the same code, because there are different scenarios this has to cover that we cannot distinguish:

  • We lost the leader lock unexpectedly
  • We just ended normally without any error
  • We had an error during graceful shutdown (we will just swallow it, because it's returned from the manager, but by the time that happens we will already have exited the binary here)

@alvaroaleman (Member, Author):

Maybe push the problem down by documenting "You must exit the binary immediately after the manager ends when leader election is enabled" (and removing the os.Exit())?

Review comment (Contributor):

Alternatively, provide a way for external users to configure this and default it to os.Exit? The immediate Exit is just annoying when dealing with stuff like defers.
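
A rough sketch of that suggestion, with hypothetical field and constructor names (this is not controller-runtime API), might look like:

package manager

import "os"

type controllerManager struct {
	// onLeaderLost runs when the leader lease is lost. Defaulting to an
	// immediate exit keeps the "controllers must not outlive the lease"
	// guarantee, while tests could override it so their defers still run.
	onLeaderLost func()
}

func newControllerManager() *controllerManager {
	return &controllerManager{
		onLeaderLost: func() { os.Exit(1) },
	}
}

func (cm *controllerManager) handleLeaderLoss() {
	cm.onLeaderLost()
}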

@alvaroaleman (Member, Author):

Okay, after thinking about this some more, I still think what we do today is incorrect and unsafe, but it's not trivial to improve, so I would like to have a dedicated discussion and PR for that.

I have updated this PR to disable graceful shutdown when the leader election is lost.

@alvaroaleman (Member, Author) commented

@DirectXMan12 @michaelgugino @vincepri updated, PTAL

// from the errChan and waits for them to end. It must not be called more than once.
func (cm *controllerManager) engageStopProcedure(stopComplete chan struct{}) error {
var cancel context.CancelFunc
if cm.gracefulShutdownTimeout > 0 {

Review comment:

This probably should be >= 0 to be in line with the comments.

@alvaroaleman (Member, Author):

This doesn't really matter: we return if it's zero, and then close the context via the defer cancel().
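
To illustrate the pattern being discussed (a sketch under the assumption of a helper named shutdownContext, not the code in this PR): a deadline is only attached when the timeout is positive, and the caller's defer cancel() releases the context in either case.

package manager

import (
	"context"
	"time"
)

type controllerManager struct {
	gracefulShutdownTimeout time.Duration
}

// shutdownContext derives the context used for the stop procedure.
func (cm *controllerManager) shutdownContext() (context.Context, context.CancelFunc) {
	if cm.gracefulShutdownTimeout > 0 {
		return context.WithTimeout(context.Background(), cm.gracefulShutdownTimeout)
	}
	// Zero means "no grace period": no deadline is needed because the stop
	// procedure returns right away and the deferred cancel cleans up.
	return context.WithCancel(context.Background())
}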

@michaelgugino left a comment

I checked out this patch locally, and I can confirm that startNonLeaderElectionRunnables now waits for webhooks as it should with these changes and the default grace period.

However, if I set the grace period to a short amount (one second or less; I have tested), my tests hang indefinitely.

https://gist.github.com/michaelgugino/69757f35817a6ce7899874ccc8b24ba1

Also, I'm not sure I want an option that stops waiting (GracePeriod); I want an option that waits for X seconds and then force-terminates the child processes. Ignoring them is fruitless for a parent process that does not terminate (such as the unit tests I'm working with). I want to hard-close the webhook processes; I don't care whether they finish or not.

Here's the PR where I'm attempting to utilize this. I've tried without my patch, with my patch, and now with this patch of controller-runtime.

https://github.com/openshift/machine-api-operator/pull/617/files

Anyway, even if we decide to use GracePeriod as we're doing in this patch, we should be clearer about goroutines left running in the background, and I think we should wait indefinitely by default (least surprise: I expect the manager to block until it's actually finished unless I specify otherwise). In any case, we need to investigate why these short waiting periods lock us up, as that's the exact opposite of the desired effect.

@alvaroaleman (Member, Author) commented

However, if I set the grace period to a short amount (one second or less; I have tested), my tests hang indefinitely.

I tried to reproduce this here: alvaroaleman/machine-api-operator@b8ba348. However, the tests pass. Furthermore, in the stacktrace I cannot find any reference to controller-runtime. Can you provide a repro for this?
In this PR there are multiple tests that use a sub-second gracefulShutdownPeriod, and they pass and don't hang.

Also, I'm not sure I want an option that stops waiting (GracePeriod); I want an option that waits for X seconds and then force-terminates the child processes. Ignoring them is fruitless for a parent process that does not terminate (such as the unit tests I'm working with). I want to hard-close the webhook processes; I don't care whether they finish or not.

Yup, I agree with that sentiment, but to my knowledge it's not possible to terminate goroutines.

Anyway, even if we decide to use GracePeriod as we're doing in this patch, we should be clearer about goroutines left running in the background,

How would you make this clearer? Extend the godocs?

and I think we should wait indefinitely by default (least surprise: I expect the manager to block until it's actually finished unless I specify otherwise).

I disagree with this, because with a Runnable that doesn't end the manager would just hang indefinitely without doing anything, which is a hard-to-debug situation. Returning an error stating it didn't finish shows where the problem is.

@michaelgugino commented

@alvaroaleman here's the reproducer: https://github.com/openshift/machine-api-operator/pull/620/files

Make sure you have kubebuilder installed

$ ls /usr/local/kubebuilder/bin/
etcd  kube-apiserver  kubebuilder  kubectl

and you can run the tests with

go test ./pkg/apis/machine/v1beta1/

So, I think the essence of the problem here is that the kube-apiserver is holding connections open to the webhook goroutine for one reason or another. There aren't really any fundamental differences between my patch and yours, other than that I'm running additional webhook handlers.

Yup, I agree with that sentiment, but to my knowledge it's not possible to terminate goroutines.

I was thinking there must be some way to signal the Go http.Server to stop immediately via a context or some method, or possibly some configurable on http.Server or one of the related components.
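
For the plain net/http part at least, the standard library does distinguish these two modes: Shutdown(ctx) stops accepting new connections and waits for in-flight requests up to the context deadline, while Close() tears down listeners and connections immediately. A small sketch, independent of the controller-runtime webhook server:

package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// stopServer tries a graceful shutdown bounded by the grace period and then
// hard-closes whatever is left. Names and timings are illustrative.
func stopServer(srv *http.Server, grace time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()

	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed (%v), closing immediately", err)
		_ = srv.Close() // drop remaining listeners and connections right away
	}
}

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() { _ = srv.ListenAndServe() }()
	time.Sleep(100 * time.Millisecond)
	stopServer(srv, time.Second)
}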

How would you make this clearer? Extend the godocs?

Yeah, in the documentation, be explicit that the goroutines may be (will be, if we hit the timeout?) left running in the background. It's not really what I think most would expect from the name of the option. Maybe something like 'orphanGoRoutinesAfter' or similar to indicate that we're doing something bad. Personally, I would rather block forever than leak goroutines. GracePeriod implies that we're going to wait for a connection to finish/clean up, but if that time expires, we close the connection immediately.

I'll see if I can create another reproducer to try to eliminate my changes specifically.

@alvaroaleman (Member, Author) commented

@michaelgugino those are just bugs in your tests; you are waiting for channels that are never closed when the manager errors: alvaroaleman/machine-api-operator@4da76a6

I was thinking there must be some way to signal the Go http.Server to stop immediately via a context or some method, or possibly some configurable on http.Server or one of the related components.

That might be possible for the http server but is not possible for other Runnables.

Personally, I would rather block forever than leak goroutines. GracePeriod implies that we're going to wait for a connection to finish/clean up, but if that time expires, we close the connection immediately.

Rename it to WaitPeriod, maybe? As seen in the issues in your PR, blocking forever just makes debugging harder.

Generally it is expected that the binary will end when the manager ends, and doing so is a mandatory requirement when LeaderElection is enabled. Hence returning means the binary ends, which means all goroutines stop.
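
On the "channels that are never closed" point above, a minimal sketch of test wiring that avoids the hang: forward Start's return value on an error channel instead of only waiting for a happy-path signal. startFunc is a stand-in so the example compiles on its own; it is not controller-runtime code.

package main

import (
	"errors"
	"fmt"
)

// startFunc stands in for the manager's Start method of that era
// (Start(stop <-chan struct{}) error).
type startFunc func(stop <-chan struct{}) error

// runManager forwards Start's return value. A test that only waits on a
// separate "started" signal hangs forever when Start errors out early;
// selecting on errCh as well surfaces the failure.
func runManager(start startFunc) (chan struct{}, chan error) {
	stop := make(chan struct{})
	errCh := make(chan error, 1)
	go func() {
		errCh <- start(stop)
	}()
	return stop, errCh
}

func main() {
	stop, errCh := runManager(func(<-chan struct{}) error {
		return errors.New("not all runnables have stopped within the grace period of 1s")
	})
	defer close(stop)
	fmt.Println(<-errCh)
}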

@michaelgugino commented

@michaelgugino those are just bugs in your tests; you are waiting for channels that are never closed when the manager errors: alvaroaleman/machine-api-operator@4da76a6

You're right. We weren't hitting this before as the manager never produced an error, and we never expected it to, but this new patch set returns an error.

I'm kind of on the fence as to what the right thing is. On the one hand, the manager never really returned an error after the stop channel was closed, and it ended quickly, pretty much instantly. For webhooks, this is probably okay, as we probably just want to tear them down as quickly as possible; nobody is going to be too upset if a request goes unfulfilled. On the other hand, the existing behavior is definitely a bug; we should wait for the webhooks to stop.

Let's consider the case where we wait indefinitely by default. The manager will block forever, and that might be unexpected and hard to diagnose. This might impact things that are otherwise working today, and a simple rebase onto this patch set is not going to signal any change to the end user, since we're not breaking the interface. However, it would be pretty straightforward to see that the manager is blocking, though it might be hard to determine what it's waiting on. The flip side of this coin is that if anyone is eagerly restarting a manager without terminating the container (e.g. they have the manager in some while loop and are catching a signal or something to break that loop), we're going to leak goroutines. I can't think of a great reason to do this in reality, but some users might be doing it. IMO, leaking goroutines is definitely worse than blocking forever, as it's going to happen silently and eventually your container is probably going to crash.

My preference is to block indefinitely by default, make the timeout opt-in, and specify in some reasonable way that the timeout is going to orphan goroutines, and that those goroutines could stick around forever unless the process is stopped.

Personally, since I'm aware of this patch, I can live with either scenario. I think we should just consider the potential impact on other users who might not be aware of the change. I'm not super firm on this position, but I do want to highlight it for consideration.

Otherwise, this patch works well for me.

@alvaroaleman (Member, Author) commented

I'm kind of on the fence as to what the right thing is

For the record, leaking goroutines is what we already do today and not something this PR introduces. This PR just makes that visible.

In my opinion, if a runnable doesn't stop within 30 seconds it's unlikely to stop later on. We cannot kill goroutines, and just hanging is guaranteed to result in a lot of bug reports and questions, since it looks like a bug in c-r. Returning an error, OTOH, gives users a clue as to how to debug a problem we cannot solve for them (getting all goroutines to stop).

Co-authored-by: david.benque <david.benque@datadoghq.com>
@vincepri (Member) left a comment

/approve

Gave another review today, the code changes look good and this should be ready to go.

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2020
@vincepri (Member) commented

/lgtm

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2020
@alvaroaleman (Member, Author) commented

/retest

@vincepri (Member) commented

/test pull-controller-runtime-test-master

Successfully merging this pull request may close these issues.

UMBRELLA: design and refactor graceful termination