Support global timeouts for reconcilers #798

Open
ekuefler opened this issue Feb 13, 2020 · 14 comments
Labels
help wanted · kind/design · lifecycle/frozen · priority/important-soon

Comments

@ekuefler
Contributor

I ran into an issue today where one of my reconcilers was unexpectedly deadlocking on a particular resource. This appeared to halt that reconciler's processing of all other resources, essentially freezing the system until I restarted the manager binary (where it would eventually freeze again when it got to the bad resource).

It would be nice if I could configure a timeout that would apply to all reconcilers in the manager, such that they would automatically abort and back off if they appear to be deadlocked.

@vincepri
Member

The idea of a global timeout is definitely interesting, and we can probably make it opt-in. I've seen these kinds of situations when resources weren't sufficient; the most common case is deployments with fewer than 2-3 CPUs allocated.

@ekuefler
Contributor Author

Sounds good. My particular situation wasn't related to starving the CPU. I have a resource representing an external server that I need to do TCP health checks on. I'd failed to call SetDeadline on my connection, so when I got a server that responded to the connect request but failed to write back any data, the reconciler would block indefinitely on that resource and lock out processing for all the others. A global timeout would be a useful safety net here.
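For reference, a minimal sketch of the per-connection fix that was missing here; the address, timeout, and buffer size are illustrative:

import (
	"net"
	"time"
)

// checkHealth dials an external server and reads its response, bounding
// each step with a deadline so a silent server can't block forever.
func checkHealth(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Without this deadline, a server that accepts the connection but
	// never writes anything back blocks the Read below indefinitely.
	if err := conn.SetDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}

	buf := make([]byte, 64)
	_, err = conn.Read(buf)
	return err
}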

@vincepri
Member

/priority important-soon
/kind design
/help

@k8s-ci-robot
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/priority important-soon
/kind design
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the priority/important-soon, kind/design, and help wanted labels Feb 21, 2020
@GrigoriyMikhalkin
Contributor

GrigoriyMikhalkin commented Feb 24, 2020

they would automatically abort and back off if they appear to be deadlocked

I'm a little bit confused about what this issue is about. Do you want a timeout mechanism that would terminate the reconciler automatically (i.e. without the user handling the termination in their reconciliation logic)?

@ekuefler
Contributor Author

That's right. Many web frameworks have a built-in timeout that will terminate a request if it takes too long, ensuring request threads aren't occupied indefinitely. I would imagine something similar here. In kubebuilder's case, it would force the reconciler to return an error.

Here's the wrapper I wrote and am using in my project to do this (it also handles panics per #797):

import (
	"fmt"
	"runtime/debug"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// MakeSafe wraps a Reconciler so that a panic is converted into an
// error and any single reconcile is bounded by a timeout.
func MakeSafe(r reconcile.Reconciler) reconcile.Reconciler {
	return safeReconciler{impl: r}
}

type safeReconciler struct {
	impl reconcile.Reconciler
}

func (r safeReconciler) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	type response struct {
		result reconcile.Result
		err    error
	}

	// Buffered so the goroutine can still complete (and be collected)
	// even after the timeout below has fired and nobody is receiving.
	ch := make(chan response, 1)
	go func() {
		// Convert a panic in the wrapped reconciler into an ordinary error.
		defer func() {
			if p := recover(); p != nil {
				ch <- response{
					err: fmt.Errorf("panic: %v [recovered]\n\n%s", p, debug.Stack()),
				}
			}
		}()

		res, err := r.impl.Reconcile(request)
		ch <- response{
			result: res,
			err:    err,
		}
	}()

	select {
	case resp := <-ch:
		return resp.result, resp.err
	case <-time.After(10 * time.Second):
		return reconcile.Result{}, fmt.Errorf("reconciler timed out")
	}
}

This is imperfect since it means we'll leak a goroutine if the wrapped reconciler deadlocks, but it will at least guarantee that kubebuilder's worker goroutines can't deadlock.

@GrigoriyMikhalkin
Contributor

GrigoriyMikhalkin commented Feb 26, 2020

IMO, this looks like a way to conceal bugs in a program instead of fixing them.

@alvaroaleman
Member

It is not possible to terminate a goroutine from the outside, so I don't know how this could be implemented (leaking goroutines is IMHO not an option we should consider; it also means we break the guarantee of "Only one routine will work at a given key at a given time").

Or did I misunderstand something about this?

@luxas

luxas commented Jul 14, 2020

This is really tricky as there is no way to terminate a goroutine which executes indefinitely.
I otherwise like the code that @ekuefler pasted above and hope we could do something in that spirit.

Adding contexts to Reconcile would allow us to signal to the goroutine that it should stop doing things (and then we'd wait a bit longer for it to "cleanup & return" before timing out), but it doesn't guarantee that the goroutine respects the context.
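A sketch of that idea, assuming the proposed context-aware reconcile function (not the current signature); the timeout and grace values are illustrative:

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func reconcileWithDeadline(
	parent context.Context,
	impl func(context.Context, reconcile.Request) (reconcile.Result, error),
	req reconcile.Request,
	timeout, grace time.Duration,
) (reconcile.Result, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	type response struct {
		result reconcile.Result
		err    error
	}
	ch := make(chan response, 1) // buffered: the sender never blocks

	go func() {
		res, err := impl(ctx, req)
		ch <- response{res, err}
	}()

	select {
	case resp := <-ch:
		return resp.result, resp.err
	case <-ctx.Done():
		// The stop signal went out via ctx; wait a bit longer for the
		// reconciler to clean up and return before abandoning it.
		select {
		case resp := <-ch:
			return resp.result, resp.err
		case <-time.After(grace):
			return reconcile.Result{}, fmt.Errorf("reconciler ignored context cancellation")
		}
	}
}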

A poor "solution" would be to swap out the underlying client that the reconcile call uses for a client which simply errors once the timeout has passed (also to make sure the deadlocked goroutine doesn't do any harm), but that definitely has drawbacks / side-effects.
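For illustration, a rough sketch of that erroring-client idea; only Get is shown, and a real wrapper would have to cover the whole client.Client surface:

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deadlineClient embeds a real client and fails fast once its deadline
// has passed, so a stuck reconciler can no longer mutate the cluster.
type deadlineClient struct {
	client.Client
	deadline time.Time
}

func (c *deadlineClient) Get(ctx context.Context, key client.ObjectKey, obj runtime.Object) error {
	if time.Now().After(c.deadline) {
		return fmt.Errorf("reconcile deadline exceeded")
	}
	return c.Client.Get(ctx, key, obj)
}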

@devigned

devigned commented Jul 14, 2020

A poor "solution" would be to swap out the underlying client that the reconcile call uses for a client which simply errors once the timeout has passed (also to make sure the deadlocked goroutine doesn't do any harm), but that definitely has drawbacks / side-effects.

This would help a little, but I'd imagine many reconcilers are going to make calls to external systems which would not use an enlightened client. Eventually, a user will end up in the same situation where they have unbounded execution.

IMO, the controller author must have a context available to them at the root of execution.

I would go a step further than just having a context on reconcilers: mapping functions should also include a context argument to bound their execution. One good example of unbounded execution in a handler is the following, extracted from Cluster API. As you can see, the context.Background() passed into client.List(...) is problematic.

// ClusterToObjectsMapper returns a mapper function that gets a cluster and lists all objects for the object passed in
// and returns a list of requests.
// NB: The objects are required to have `clusterv1.ClusterLabelName` applied.
func ClusterToObjectsMapper(c client.Client, ro runtime.Object, scheme *runtime.Scheme) (handler.Mapper, error) {
	if _, ok := ro.(metav1.ListInterface); !ok {
		return nil, errors.Errorf("expected a metav1.ListInterface, got %T instead", ro)
	}

	gvk, err := apiutil.GVKForObject(ro, scheme)
	if err != nil {
		return nil, err
	}

	return handler.ToRequestsFunc(func(o handler.MapObject) []ctrl.Request {
		cluster, ok := o.Object.(*clusterv1.Cluster)
		if !ok {
			return nil
		}

		list := &unstructured.UnstructuredList{}
		list.SetGroupVersionKind(gvk)
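		// Problematic: context.Background() carries no deadline, so this
		// List call is unbounded and cannot be cancelled by the caller.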
		if err := c.List(context.Background(), list, client.MatchingLabels{clusterv1.ClusterLabelName: cluster.Name}); err != nil {
			return nil
		}

		results := []ctrl.Request{}
		for _, obj := range list.Items {
			results = append(results, ctrl.Request{
				NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()},
			})
		}
		return results

	}), nil
}

context.Context should probably be a first parameter on anything that calls into controller author code.
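As a sketch of that direction (a proposal at this point, not the current API; the exact shapes are illustrative), the relevant signatures would take a context first:

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// A context-first Reconciler, so the manager can impose deadlines.
type Reconciler interface {
	Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error)
}

// A context-aware mapping function, so calls like client.List inside a
// mapper inherit the caller's deadline instead of context.Background().
type MapFunc func(ctx context.Context, obj handler.MapObject) []reconcile.Request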

If a breaking change is to be made, I'd argue it would be best to make one sweeping change rather than multiple small breaking changes.

@luxas

luxas commented Jul 14, 2020

context.Context should probably be a first parameter on anything that calls into controller author code.

Yes

Also see #801 (comment), in particular:

The manager could also e.g. close the ctx given to the reconciler loop upon a Ctrl-C signal. We could just say "if Reconcile doesn't respect the context and deadlocks, you're violating the contract", with the tradeoff that if a reconcile loop doesn't follow the contract, there might be multiple executing reconcile loops at the same time (optionally; guaranteed to never run two Reconcile loops twice for the same NamespacedName, for the example of faulty resources)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 12, 2020
@vincepri
Copy link
Member

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Oct 12, 2020
@vincepri
Copy link
Member

The changes around context above have been implemented. To get cancellation for reconcilers we'll need to inject a context with a set timeout, although that doesn't really guarantee that operations will be cancelled, so there is still potential to leak goroutines.
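A minimal sketch of that injection against the context-aware Reconcile signature; the wrapper name and timeout value are illustrative, and cancellation still depends on the wrapped reconciler honoring the context:

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type timeoutReconciler struct {
	impl    reconcile.Reconciler
	timeout time.Duration
}

func (r timeoutReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Derive a child context with a deadline for this one reconcile.
	ctx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()
	return r.impl.Reconcile(ctx, req)
}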
