Reconciler panics should not crash the manager #797

ekuefler · 2020-02-13T20:36:56Z

Currently, an unhandled panic in a reconciler is not recovered and will likely crash the manager binary. This is a problem because a panic might be triggered by a single resource in an unexpected state, so one bad resource could prevent all other resources from being processed. Since Kubernetes is likely to restart the manager pod after a crash, this can also cause the manager to DoS the Kubernetes API server as it continually restarts.

In my project, I wrote this utility function:

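import (
	"fmt"
	"runtime/debug"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// MakeSafe wraps r so that a panic raised during Reconcile is returned as an error.
func MakeSafe(r reconcile.Reconciler) reconcile.Reconciler {
	return safeReconciler{impl: r}
}

type safeReconciler struct {
	impl reconcile.Reconciler
}

func (r safeReconciler) Reconcile(request reconcile.Request) (result reconcile.Result, err error) {
	defer func() {
		// The recovered value is named p so it does not shadow the receiver r.
		if p := recover(); p != nil {
			result = reconcile.Result{}
			err = fmt.Errorf("panic: %v [recovered]\n\n%s", p, debug.Stack())
		}
	}()
	return r.impl.Reconcile(request)
}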

Every time I pass a reconciler to Complete, I wrap it with this. It ensures that any panics raised by the reconciler are converted to normal errors.
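
To illustrate the effect of wrapping a reconciler this way, here is a minimal, self-contained sketch that runs with only the standard library. The `Request`, `Result`, `Reconciler`, and `ReconcileFunc` types and the `flaky` function are simplified stand-ins invented for this example; they are not the controller-runtime types, and a real manager would requeue or log the returned error instead of printing it.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// Request, Result and Reconciler are simplified stand-ins for the
// controller-runtime reconcile types (assumptions for this sketch only).
type Request struct{ Name string }
type Result struct{ Requeue bool }

type Reconciler interface {
	Reconcile(req Request) (Result, error)
}

// ReconcileFunc adapts a plain function to the Reconciler interface.
type ReconcileFunc func(Request) (Result, error)

func (f ReconcileFunc) Reconcile(req Request) (Result, error) { return f(req) }

// MakeSafe wraps r so that a panic in Reconcile is reported as an error.
func MakeSafe(r Reconciler) Reconciler {
	return ReconcileFunc(func(req Request) (result Result, err error) {
		defer func() {
			if p := recover(); p != nil {
				result = Result{}
				err = fmt.Errorf("panic: %v [recovered]\n\n%s", p, debug.Stack())
			}
		}()
		return r.Reconcile(req)
	})
}

// flaky panics for one specific resource and succeeds for all others.
func flaky(req Request) (Result, error) {
	if req.Name == "bad" {
		var m map[string]int
		m["x"] = 1 // assignment to a nil map panics at run time
	}
	return Result{}, nil
}

func main() {
	safe := MakeSafe(ReconcileFunc(flaky))
	for _, name := range []string{"good-1", "bad", "good-2"} {
		_, err := safe.Reconcile(Request{Name: name})
		fmt.Printf("%s: failed=%t\n", name, err != nil)
	}
}
```

The loop keeps going after the "bad" request, which is the property the wrapper is meant to provide: one resource in an unexpected state produces an error for that request only.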


alvaroaleman · 2020-02-15T09:55:38Z

Yeah, this would be great to have :)

rajathagasthya · 2020-02-19T17:22:57Z

Something like apimachinery's HandleCrash function would also work if we plug it in the right place: https://github.com/kubernetes/apimachinery/blob/3253b0a30d67e7e362b8615e463156bac729c82f/pkg/util/runtime/runtime.go#L45
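
As a rough illustration of what "plugging in" a crash handler could look like, the sketch below defers a recover helper around each work item. `handleCrash`, `reallyCrash`, `processItem`, and `runWorker` are hypothetical names written for this example and only loosely modeled on the idea behind the linked helper; the real function's signature and its default behavior (including whether it re-panics after logging) should be checked against the linked source.

```go
package main

import "fmt"

// reallyCrash is a hypothetical switch: when true, handleCrash re-panics
// after running the handlers. It is false here so the worker keeps running.
var reallyCrash = false

// handleCrash is a stand-in crash handler. It must be invoked directly by a
// defer statement, because recover only works in the deferred function itself.
func handleCrash(handlers ...func(interface{})) {
	if r := recover(); r != nil {
		for _, h := range handlers {
			h(r)
		}
		if reallyCrash {
			panic(r)
		}
	}
}

// processItem simulates reconciling one item; "bad" triggers a panic.
func processItem(name string, onPanic func(interface{})) {
	defer handleCrash(onPanic)
	if name == "bad" {
		panic("unexpected state in " + name)
	}
}

// runWorker processes every item and collects the recovered panic values.
func runWorker(items []string) (processed int, panics []string) {
	for _, it := range items {
		processItem(it, func(r interface{}) {
			panics = append(panics, fmt.Sprint(r))
		})
		processed++
	}
	return processed, panics
}

func main() {
	processed, panics := runWorker([]string{"a", "bad", "b"})
	fmt.Println("processed:", processed)
	fmt.Println("recovered:", panics)
}
```

Unlike the wrapper approach, a handler of this shape does not turn the panic into a returned error, so the caller would need another way to requeue the failed item.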

vincepri · 2020-02-21T17:50:14Z

We can revisit the milestone if the design doesn't have breaking changes. Folks might be relying on panics to detect failures today.

/priority important-soon
/help
/kind design
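
One way to avoid a breaking change for users who rely on panics today is to make recovery opt-in. The sketch below shows that idea with a boolean option whose zero value keeps the current crash-on-panic behavior. The `Options`, `Controller`, and `propagates` names and the `RecoverPanic` field are assumptions made for this example, not a description of the controller-runtime API.

```go
package main

import "fmt"

// Options is a hypothetical controller configuration.
type Options struct {
	// RecoverPanic converts reconciler panics into errors when true.
	// The zero value (false) preserves the existing behavior.
	RecoverPanic bool
}

// Controller is a stand-in that calls do for each item it reconciles.
type Controller struct {
	opts Options
	do   func(name string) error
}

// reconcile runs do, recovering from panics only if the option is enabled.
func (c *Controller) reconcile(name string) (err error) {
	if c.opts.RecoverPanic {
		// A defer inside the if block still runs when reconcile returns.
		defer func() {
			if p := recover(); p != nil {
				err = fmt.Errorf("panic: %v [recovered]", p)
			}
		}()
	}
	return c.do(name)
}

// propagates reports whether calling f panics.
func propagates(f func()) (panicked bool) {
	defer func() { panicked = recover() != nil }()
	f()
	return false
}

func main() {
	boom := func(name string) error { panic("bad resource " + name) }

	optIn := &Controller{opts: Options{RecoverPanic: true}, do: boom}
	fmt.Println("opt-in returns error:", optIn.reconcile("x") != nil)

	legacy := &Controller{do: boom}
	fmt.Println("default still panics:", propagates(func() { _ = legacy.reconcile("x") }))
}
```

With this shape, existing users see no change unless they set the option, which is the compatibility concern raised in the comment above.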

k8s-ci-robot · 2020-02-21T17:50:15Z

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fejta-bot · 2020-05-21T18:37:31Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

vincepri · 2020-05-21T19:47:40Z

/lifecycle frozen

varshaprasad96 · 2020-06-18T19:20:07Z

/assign

FillZpp · 2021-08-06T06:03:07Z

@varshaprasad96 Hi, are you still pursuing this? Mind if I take a stab at it :) ?

varshaprasad96 · 2021-08-06T21:32:33Z

@FillZpp sure, please feel free to create a PR for this.

FillZpp · 2021-08-07T10:09:05Z

/assign @FillZpp

vincepri added this to the v0.6.0 milestone Feb 21, 2020

ekuefler mentioned this issue Feb 26, 2020

Support global timeouts for reconcilers #798

Open

k8s-ci-robot added the lifecycle/stale label May 21, 2020

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label May 21, 2020

varshaprasad96 mentioned this issue Jun 17, 2020

✨ Add helper to handle panics in reconciler #1001

Closed

k8s-ci-robot assigned varshaprasad96 Jun 18, 2020

k8s-ci-robot assigned FillZpp Aug 7, 2021

FillZpp mentioned this issue Aug 9, 2021

✨Add an option to recover panic for controller reconcile #1627

Merged

k8s-ci-robot closed this as completed in #1627 Aug 19, 2021

FillZpp mentioned this issue Nov 16, 2021

REQUEST: New membership for FillZpp kubernetes/org#3100

Closed


FillZpp mentioned this issue Dec 22, 2021

🌱 Add FillZpp as a reviewer #1753

Merged
