
✨ Implement graceful shutdown #967

Merged (1 commit, Jul 21, 2020)

Conversation

@alvaroaleman (Member) commented May 26, 2020

This PR implements graceful shutdown.

It is a rebased version of #664 plus a set of "Verschlimmbesserungen" (roughly: "improvements" that risk making things worse):

  • There was a deadlock when the cache start failed: we required the lock to close(cm.internalStopper), but the cache held it while waiting for sync, so shutting down before sync completed deadlocked
  • Godocs: https://github.com/kubernetes-sigs/controller-runtime/pull/664/files#r355773233
  • Make the things we start ourselves Runnables as well, so they are part of the graceful shutdown: https://github.com/kubernetes-sigs/controller-runtime/pull/664/files#r355783939
  • Use a defer to do the graceful shutdown, to make sure it always happens (a rough sketch of the overall wait-with-timeout pattern follows this description)
  • Return both the graceful shutdown error and the run error, rather than only logging the former
  • Rename runnableTearDownTimeout to gracefulShutdownTimeout; IMHO the latter makes the idea clearer, as it's not using terms bound to concepts that only exist within c-r
  • Exit 1 on leader election loss. This is important because otherwise controllers might continue to run after the leader lease was given up. Maybe we should only cancel the leader election context after graceful shutdown has finished?

Supersedes and thereby closes #664
Fixes #764
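
Conceptually, the change boils down to waiting for all runnables to stop, but only up to a bounded grace period, and returning an error instead of hanging when that period is exceeded. The following is a minimal, self-contained sketch of that pattern, not the actual controller-runtime code; names such as waitForRunnables and the WaitGroup wiring are made up for illustration:

package main

import (
	"fmt"
	"sync"
	"time"
)

// waitForRunnables is a hypothetical helper: block until every runnable's
// goroutine has returned, or give up after the configured grace period and
// report an error instead of hanging forever.
func waitForRunnables(wg *sync.WaitGroup, gracefulShutdownTimeout time.Duration) error {
	allStopped := make(chan struct{})
	go func() {
		wg.Wait() // returns once every runnable has called wg.Done()
		close(allStopped)
	}()

	select {
	case <-allStopped:
		return nil
	case <-time.After(gracefulShutdownTimeout):
		return fmt.Errorf("not all runnables have stopped within the grace period of %s", gracefulShutdownTimeout)
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(50 * time.Millisecond) // a runnable that stops quickly
	}()
	fmt.Println(waitForRunnables(&wg, time.Second)) // prints <nil>
}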

/assign @DirectXMan12 @alexeldeib

@k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 26, 2020
@k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 26, 2020
@alvaroaleman force-pushed the wait-for-stop branch 4 times, most recently from 01d7292 to 551da56 on May 26, 2020 22:21
@DirectXMan12 (Contributor) commented Jun 11, 2020

"Verschlimmbesserungen"

wooo! compound German words for things that are weird/hard/take many words to express in English!

Sorry for the delay in reviewing this, gonna take a peek today/tomorrow

@vincepri (Member) commented

Last changes look good to me

@DirectXMan12 (Contributor) left a comment


Comments inline. The most important one is probably around goroutine leaks caused by multiple errors occurring simultaneously.

case <-allStopped:
return nil
case <-gracefulShutdownTimer.C:
return fmt.Errorf("not all runnables have stopped within the grace period of %s", cm.gracefulShutdownTimeout.String())
Review comment (Contributor):

make this a typed error (see the note above about detecting deadline-exceeded errors)

Review comment (Contributor):

(also, no need for the call to .String if the thing is already stringable)

@alvaroaleman (Member, Author):

Done
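
For context, one possible shape of such a typed error (an assumption about the form, not necessarily what the PR ended up adding) could be:

package manager

import (
	"fmt"
	"time"
)

// gracefulShutdownTimeoutError is a hypothetical typed error: callers can
// detect a blown grace period with errors.As instead of matching the
// message string.
type gracefulShutdownTimeoutError struct {
	Timeout time.Duration
}

func (e *gracefulShutdownTimeoutError) Error() string {
	return fmt.Sprintf("not all runnables have stopped within the grace period of %s", e.Timeout)
}

A caller could then branch on this error with errors.As rather than inspecting the error text.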

// else that keeps the binary running might allow controllers that
// need leader election to run after we lost the lease.
log.Info("leader lease lost, exiting")
os.Exit(0)
Review comment (Contributor):

Raise a panic in case we've got defers and other stuff folks want to finish? Nice for cleanup of temporary files and such, especially when testing.

@alvaroaleman (Member, Author):

The problem is that entering this is part of a normal shutdown of a Manager that has leader election enabled, so panicking here would mean that all controller-managers with LE enabled panic on shutdown. For tests, the whole func can be overridden.

I am not really sure what is best to do here. The os.Exit is not nice, and we also always exit with the same code, because there are different scenarios this has to cover that we cannot distinguish:

  • We lost the leader lock unexpectedly
  • We just ended normally without any error
  • We had an error during graceful shutdown (we will just swallow it, because it's returned from the manager, but by the time that happens we will already have exited the binary here)

@alvaroaleman (Member, Author):

Maybe push the problem down by documenting "You must exit the binary immediately after the manager ends when leader election is enabled" (and removing the os.Exit())?

Review comment (Contributor):

Alternatively, provide a way for external users to configure this and default it to os.Exit? The immediate Exit is just annoying when dealing with stuff like defers.
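
A rough sketch of that suggestion, with hypothetical field and constructor names (this is not controller-runtime API), might look like:

package manager

import "os"

type controllerManager struct {
	// onLeaderLost runs when the leader lease is lost. Defaulting to an
	// immediate exit keeps the "controllers must not outlive the lease"
	// guarantee, while tests could override it so their defers still run.
	onLeaderLost func()
}

func newControllerManager() *controllerManager {
	return &controllerManager{
		onLeaderLost: func() { os.Exit(1) },
	}
}

func (cm *controllerManager) handleLeaderLoss() {
	cm.onLeaderLost()
}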

@alvaroaleman (Member, Author):

Okay, after thinking about this some more, I still think what we do today is incorrect and unsafe, but it's not trivial to improve, so I would like to have a dedicated discussion and PR for that.

I have updated this PR to disable graceful shutdown when the leader election is lost.

@alvaroaleman (Member, Author) commented

@DirectXMan12 @michaelgugino @vincepri updated, PTAL

// from the errChan and waits for them to end. It must not be called more than once.
func (cm *controllerManager) engageStopProcedure(stopComplete chan struct{}) error {
var cancel context.CancelFunc
if cm.gracefulShutdownTimeout > 0 {

Review comment:

This probably should be >= 0 to be in line with the comments.

@alvaroaleman (Member, Author):

This doesn't really matter: we return if it's zero, and then close the context via the defer cancel().
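
To illustrate the pattern being discussed (a sketch under the assumption of a helper named shutdownContext, not the code in this PR): a deadline is only attached when the timeout is positive, and the caller's defer cancel() releases the context in either case.

package manager

import (
	"context"
	"time"
)

type controllerManager struct {
	gracefulShutdownTimeout time.Duration
}

// shutdownContext derives the context used for the stop procedure.
func (cm *controllerManager) shutdownContext() (context.Context, context.CancelFunc) {
	if cm.gracefulShutdownTimeout > 0 {
		return context.WithTimeout(context.Background(), cm.gracefulShutdownTimeout)
	}
	// Zero means "no grace period": no deadline is needed because the stop
	// procedure returns right away and the deferred cancel cleans up.
	return context.WithCancel(context.Background())
}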

@michaelgugino left a comment

I checked out this patch locally, and I can confirm that startNonLeaderElectionRunnables now waits for webhooks as it should with these changes and the default grace period.

However, if I set the grace period to a short amount (one second or less; I have tested), my tests hang indefinitely.

https://gist.github.com/michaelgugino/69757f35817a6ce7899874ccc8b24ba1

Also, I'm not sure I want an option that stops waiting (GracePeriod); I want an option that waits for X seconds and then force-terminates the child processes. Ignoring them is fruitless for a parent process that does not terminate (such as the unit tests I'm working with). I want to hard-close the webhook processes; I don't care whether they finish or not.

Here's the PR where I'm attempting to utilize this. I've tried without my patch, with my patch, and now with this patch of controller-runtime.

https://github.com/openshift/machine-api-operator/pull/617/files

Anyway, even if we decide to use GracePeriod as we're doing in this patch, we should be clearer about goroutines left running in the background, and I think we should wait indefinitely by default (least surprise: I expect the manager to block until it's actually finished unless I specify otherwise). In any case, we need to investigate why these short waiting periods lock us up, as that's the exact opposite of the desired effect.

@alvaroaleman (Member, Author) commented

However, if I set the grace period to a short amount (one second or less; I have tested), my tests hang indefinitely.

I tried to reproduce this here: alvaroaleman/machine-api-operator@b8ba348. However, the tests pass. Furthermore, in the stacktrace I cannot find any reference to controller-runtime. Can you provide a repro for this?
In this PR there are multiple tests that use a sub-second gracefulShutdownPeriod, and they pass and don't hang.

Also, I'm not sure I want an option that stops waiting (GracePeriod); I want an option that waits for X seconds and then force-terminates the child processes. Ignoring them is fruitless for a parent process that does not terminate (such as the unit tests I'm working with). I want to hard-close the webhook processes; I don't care whether they finish or not.

Yup, I agree with that sentiment, but to my knowledge it's not possible to terminate goroutines.

Anyway, even if we decide to use GracePeriod as we're doing in this patch, we should be clearer about goroutines left running in the background,

How would you make this clearer? Extend the godocs?

and I think we should wait indefinitely by default (least surprise: I expect the manager to block until it's actually finished unless I specify otherwise).

I disagree with this, because with a Runnable that doesn't end the manager would just hang indefinitely without doing anything, which is a hard-to-debug situation. Returning an error stating it didn't finish shows where the problem is.

@michaelgugino commented

@alvaroaleman here's the reproducer: https://github.com/openshift/machine-api-operator/pull/620/files

Make sure you have kubebuilder installed

$ ls /usr/local/kubebuilder/bin/
etcd  kube-apiserver  kubebuilder  kubectl

and you can run the tests with

go test ./pkg/apis/machine/v1beta1/

So, I think the essence of the problem here is that the kube-apiserver is holding connections open to the webhook goroutine for one reason or another. There aren't really any fundamental differences between my patch and yours, other than that I'm running additional webhook handlers.

Yup, I agree with that sentiment, but to my knowledge it's not possible to terminate goroutines.

I was thinking there must be some way to signal the Go http.Server to stop immediately via a context or some method, or possibly some configurable on http.Server or one of the related components.
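
For the plain net/http part at least, the standard library does distinguish these two modes: Shutdown(ctx) stops accepting new connections and waits for in-flight requests up to the context deadline, while Close() tears down listeners and connections immediately. A small sketch, independent of the controller-runtime webhook server:

package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// stopServer tries a graceful shutdown bounded by the grace period and then
// hard-closes whatever is left. Names and timings are illustrative.
func stopServer(srv *http.Server, grace time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()

	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed (%v), closing immediately", err)
		_ = srv.Close() // drop remaining listeners and connections right away
	}
}

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() { _ = srv.ListenAndServe() }()
	time.Sleep(100 * time.Millisecond)
	stopServer(srv, time.Second)
}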

How would you make this clearer? Extend the godocs?

Yeah, in the documentation, be explicit that the goroutines may be (will be, if we hit the timeout?) left running in the background. It's not really what I think most would expect from the name of the option. Maybe something like 'orphanGoRoutinesAfter' or similar to indicate that we're doing something bad. Personally, I would rather block forever than leak goroutines. GracePeriod implies that we're going to wait for a connection to finish/clean up, but if that time expires, we close the connection immediately.

I'll see if I can create another reproducer to try to eliminate my changes specifically.

@alvaroaleman (Member, Author) commented

@michaelgugino those are just bugs in your tests; you are waiting for channels that are never closed when the manager errors: alvaroaleman/machine-api-operator@4da76a6

I was thinking there must be some way to signal the Go http.Server to stop immediately via a context or some method, or possibly some configurable on http.Server or one of the related components.

That might be possible for the http server but is not possible for other Runnables.

Personally, I would rather block forever than leak goroutines. GracePeriod implies that we're going to wait for a connection to finish/clean up, but if that time expires, we close the connection immediately.

Rename it to WaitPeriod, maybe? As seen in the issues in your PR, blocking forever just makes debugging harder.

Generally it is expected that the binary will end when the manager ends, and doing so is a mandatory requirement when LeaderElection is enabled. Hence returning means the binary ends, which means all goroutines stop.
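
On the "channels that are never closed" point above, a minimal sketch of test wiring that avoids the hang: forward Start's return value on an error channel instead of only waiting for a happy-path signal. startFunc is a stand-in so the example compiles on its own; it is not controller-runtime code.

package main

import (
	"errors"
	"fmt"
)

// startFunc stands in for the manager's Start method of that era
// (Start(stop <-chan struct{}) error).
type startFunc func(stop <-chan struct{}) error

// runManager forwards Start's return value. A test that only waits on a
// separate "started" signal hangs forever when Start errors out early;
// selecting on errCh as well surfaces the failure.
func runManager(start startFunc) (chan struct{}, chan error) {
	stop := make(chan struct{})
	errCh := make(chan error, 1)
	go func() {
		errCh <- start(stop)
	}()
	return stop, errCh
}

func main() {
	stop, errCh := runManager(func(<-chan struct{}) error {
		return errors.New("not all runnables have stopped within the grace period of 1s")
	})
	defer close(stop)
	fmt.Println(<-errCh)
}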

@michaelgugino commented

@michaelgugino those are just bugs in your tests; you are waiting for channels that are never closed when the manager errors: alvaroaleman/machine-api-operator@4da76a6

You're right. We weren't hitting this before as the manager never produced an error, and we never expected it to, but this new patch set returns an error.

I'm kind of on the fence as to what the right thing is. On the one hand, the manager never really returned an error after the stop channel was closed, and it ended quickly, pretty much instantly. For webhooks, this is probably okay, as we probably just want to tear them down as quickly as possible; nobody is going to be too upset if a request goes unfulfilled. On the other hand, the existing behavior is definitely a bug; we should wait for the webhooks to stop.

Let's consider the case where we wait indefinitely by default. The manager will block forever, and that might be unexpected and hard to diagnose. This might impact things that are otherwise working today, and a simple rebase onto this patch set is not going to signal any change to the end user, since we're not breaking the interface. However, it would be pretty straightforward to see that the manager is blocking, though it might be hard to determine what it's waiting on. The flip side of this coin is that if anyone is eagerly restarting a manager without terminating the container (e.g. they have the manager in some while loop and are catching a signal or something to break that loop), we're going to leak goroutines. I can't think of a great reason to do this in reality, but some users might be doing it. IMO, leaking goroutines is definitely worse than blocking forever, as it's going to happen silently and eventually your container is probably going to crash.

My preference is to block indefinitely by default, make the timeout opt-in, and specify in some reasonable way that the timeout is going to orphan goroutines, and that those goroutines could stick around forever unless the process is stopped.

Personally, since I'm aware of this patch, I can live with either scenario. I think we should just consider the potential impact on other users who might not be aware of the change. I'm not super firm on this position, but I do want to highlight it for consideration.

Otherwise, this patch works well for me.

@alvaroaleman (Member, Author) commented

I'm kind of on the fence as to what the right thing is

For the record, leaking goroutines is what we already do today and not something this PR introduces. This PR just makes that visible.

In my opinion, if a runnable doesn't stop within 30 seconds it's unlikely to stop later on. We cannot kill goroutines, and just hanging is guaranteed to result in a lot of bug reports and questions, since it looks like a bug in c-r. Returning an error, OTOH, gives users a clue as to how to debug a problem we cannot solve for them (getting all goroutines to stop).

Co-authored-by: david.benque <david.benque@datadoghq.com>
@vincepri (Member) left a comment

/approve

Gave another review today, the code changes look good and this should be ready to go.

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2020
@vincepri (Member) commented

/lgtm

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2020
@alvaroaleman (Member, Author) commented

/retest

@vincepri (Member) commented

/test pull-controller-runtime-test-master

Successfully merging this pull request may close these issues.

UMBRELLA: design and refactor graceful termination