Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconciler panics should not crash the manager #797

Closed
ekuefler opened this issue Feb 13, 2020 · 10 comments · Fixed by #1627
Closed

Reconciler panics should not crash the manager #797

ekuefler opened this issue Feb 13, 2020 · 10 comments · Fixed by #1627
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/design Categorizes issue or PR as related to design. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@ekuefler
Copy link
Contributor

Currently, an unhandled panic in a reconciler will not be recovered from, and will likely cause the manager binary to crash. This is a problem, since a panic might be triggered by a single resource in an unexpected state, so that one bad resource could prevent all other resources from being processed. Since Kubernetes is likely to restart the manager pod after a crash, this can also cause the manager to DOS the Kubernetes API server as it continually restarts.

In my project, I wrote this utility function:

func MakeSafe(r reconcile.Reconciler) reconcile.Reconciler {
	return safeReconciler{impl: r}
}

type safeReconciler struct {
	impl reconcile.Reconciler
}

func (r safeReconciler) Reconcile(request reconcile.Request) (result reconcile.Result, err error) {
	defer func() {
		if r := recover(); r != nil {
			result = reconcile.Result{}
			err = fmt.Errorf("panic: %v [recovered]\n\n%s", r, debug.Stack())
		}
	}()
	return r.impl.Reconcile(request)
}

Every time I pass a reconciler to Complete, I wrap it with this. It ensures that any panics raised by the reconciler are converted to normal errors.

@alvaroaleman
Copy link
Member

Yeah, this would be great to have :)

@rajathagasthya
Copy link
Contributor

Something like apimachinery's HandleCrash function would also work if we plug it in the right place: https://github.com/kubernetes/apimachinery/blob/3253b0a30d67e7e362b8615e463156bac729c82f/pkg/util/runtime/runtime.go#L45

@vincepri vincepri added this to the v0.6.0 milestone Feb 21, 2020
@vincepri
Copy link
Member

We can revisit the milestone if the design doesn't have breaking changes. Folks might be relying on panics to detect failures today.

/priority important-soon
/help
/kind design

@k8s-ci-robot
Copy link
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

We can revisit the milestone if the design doesn't have breaking changes. Folks might be relying on panics to detect failures today.

/priority important-soon
/help
/kind design

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/design Categorizes issue or PR as related to design. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 21, 2020
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2020
@vincepri
Copy link
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2020
@varshaprasad96
Copy link
Member

/assign

@FillZpp
Copy link
Contributor

FillZpp commented Aug 6, 2021

@varshaprasad96 Hi, are you still pursuing this? Mind if I take a stab at it :) ?

@varshaprasad96
Copy link
Member

@FillZpp sure, please feel free to create a PR for this.

@FillZpp
Copy link
Contributor

FillZpp commented Aug 7, 2021

/assign @FillZpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/design Categorizes issue or PR as related to design. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants