Support global timeouts for reconcilers #798

Open
ekuefler opened this issue Feb 13, 2020 · 14 comments
Labels
help wanted · kind/design · lifecycle/frozen · priority/important-soon

Comments

@ekuefler
Contributor

I ran into an issue today where one of my reconcilers was unexpectedly deadlocking on a particular resource. This appeared to halt that reconciler's processing of all other resources, essentially freezing the system until I restarted the manager binary (where it would eventually freeze again when it got to the bad resource).

It would be nice if I could configure a timeout that would apply to all reconcilers in the manager, such that they would automatically abort and back off if they appear to be deadlocked.

@vincepri
Member

The idea of a global timeout is definitely interesting, and we can probably make it opt-in. I've seen these kinds of situations when resources weren't sufficient; the most common case is deployments with fewer than 2-3 CPUs allocated.

@ekuefler
Contributor Author

Sounds good. My particular situation wasn't related to starving the CPU. I have a resource representing an external server that I need to do TCP health checks on. I'd failed to call SetDeadline on my connection, so when I got a server that responded to the connect request but failed to write back any data, the reconciler would block indefinitely on that resource and lock out processing for all the others. A global timeout would be a useful safety net here.
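For reference, a minimal sketch of the per-connection fix that was missing here; the address, timeout, and buffer size are illustrative:

import (
	"net"
	"time"
)

// checkHealth dials an external server and reads its response, bounding
// each step with a deadline so a silent server can't block forever.
func checkHealth(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Without this deadline, a server that accepts the connection but
	// never writes anything back blocks the Read below indefinitely.
	if err := conn.SetDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}

	buf := make([]byte, 64)
	_, err = conn.Read(buf)
	return err
}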

@vincepri
Member

/priority important-soon
/kind design
/help

@k8s-ci-robot
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/priority important-soon
/kind design
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the priority/important-soon, kind/design, and help wanted labels Feb 21, 2020
@GrigoriyMikhalkin
Contributor

GrigoriyMikhalkin commented Feb 24, 2020

they would automatically abort and back off if they appear to be deadlocked

I'm a little bit confused about what this issue is about. Do you want a timeout mechanism that would terminate the reconciler automatically (i.e. without the user handling the termination in their reconciliation logic)?

@ekuefler
Contributor Author

That's right. Many web frameworks have a built-in timeout that will terminate a request if it takes too long, ensuring request threads aren't occupied indefinitely. I would imagine something similar here. In kubebuilder's case, it would force the reconciler to return an error.

Here's the wrapper I wrote and am using in my project to do this (it also handles panics per #797):

import (
	"fmt"
	"runtime/debug"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// MakeSafe wraps a Reconciler so that a panic is converted into an
// error and any single reconcile is bounded by a timeout.
func MakeSafe(r reconcile.Reconciler) reconcile.Reconciler {
	return safeReconciler{impl: r}
}

type safeReconciler struct {
	impl reconcile.Reconciler
}

func (r safeReconciler) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	type response struct {
		result reconcile.Result
		err    error
	}

	// Buffered so the goroutine can still complete (and be collected)
	// even after the timeout below has fired and nobody is receiving.
	ch := make(chan response, 1)
	go func() {
		// Convert a panic in the wrapped reconciler into an ordinary error.
		defer func() {
			if p := recover(); p != nil {
				ch <- response{
					err: fmt.Errorf("panic: %v [recovered]\n\n%s", p, debug.Stack()),
				}
			}
		}()

		res, err := r.impl.Reconcile(request)
		ch <- response{
			result: res,
			err:    err,
		}
	}()

	select {
	case resp := <-ch:
		return resp.result, resp.err
	case <-time.After(10 * time.Second):
		return reconcile.Result{}, fmt.Errorf("reconciler timed out")
	}
}

This is imperfect since it means we'll leak a goroutine if the wrapped reconciler deadlocks, but it will at least guarantee that kubebuilder's worker goroutines can't deadlock.

@GrigoriyMikhalkin
Contributor

GrigoriyMikhalkin commented Feb 26, 2020

IMO, this looks like a way to conceal bugs in a program instead of fixing them.

@alvaroaleman
Member

It is not possible to terminate a goroutine from the outside, so I don't know how this could be implemented (leaking goroutines is IMHO not an option we should consider; it also means we break the guarantee of "Only one routine will work at a given key at a given time").

Or did I misunderstand something about this?

@luxas

luxas commented Jul 14, 2020

This is really tricky as there is no way to terminate a goroutine which executes indefinitely.
I otherwise like the code that @ekuefler pasted above and hope we could do something in that spirit.

Adding contexts to Reconcile would allow us to signal to the goroutine that it should stop doing things (and then we'd wait a bit longer for it to "cleanup & return" before timing out), but it doesn't guarantee that the goroutine respects the context.
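A sketch of that idea, assuming the proposed context-aware reconcile function (not the current signature); the timeout and grace values are illustrative:

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func reconcileWithDeadline(
	parent context.Context,
	impl func(context.Context, reconcile.Request) (reconcile.Result, error),
	req reconcile.Request,
	timeout, grace time.Duration,
) (reconcile.Result, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	type response struct {
		result reconcile.Result
		err    error
	}
	ch := make(chan response, 1) // buffered: the sender never blocks

	go func() {
		res, err := impl(ctx, req)
		ch <- response{res, err}
	}()

	select {
	case resp := <-ch:
		return resp.result, resp.err
	case <-ctx.Done():
		// The stop signal went out via ctx; wait a bit longer for the
		// reconciler to clean up and return before abandoning it.
		select {
		case resp := <-ch:
			return resp.result, resp.err
		case <-time.After(grace):
			return reconcile.Result{}, fmt.Errorf("reconciler ignored context cancellation")
		}
	}
}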

A poor "solution" would be to swap out the underlying client that the reconcile call uses for a client which simply errors once the timeout has passed (also to make sure the deadlocked goroutine doesn't do any harm), but that definitely has drawbacks / side-effects.
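For illustration, a rough sketch of that erroring-client idea; only Get is shown, and a real wrapper would have to cover the whole client.Client surface:

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deadlineClient embeds a real client and fails fast once its deadline
// has passed, so a stuck reconciler can no longer mutate the cluster.
type deadlineClient struct {
	client.Client
	deadline time.Time
}

func (c *deadlineClient) Get(ctx context.Context, key client.ObjectKey, obj runtime.Object) error {
	if time.Now().After(c.deadline) {
		return fmt.Errorf("reconcile deadline exceeded")
	}
	return c.Client.Get(ctx, key, obj)
}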

@devigned

devigned commented Jul 14, 2020

A poor "solution" would be to swap out the underlying client that the reconcile call uses for a client which simply errors once the timeout has passed (also to make sure the deadlocked goroutine doesn't do any harm), but that definitely has drawbacks / side-effects.

This would help a little, but I'd imagine many reconcilers are going to make calls to external systems which would not use an enlightened client. Eventually, a user will end up in the same situation where they have unbounded execution.

IMO, the controller author must have a context available to them at the root of execution.

I would go a step further than just having a context on reconcilers: mapping functions should also include a context argument to bound their execution. One good example of unbounded execution in a handler is the following, extracted from Cluster API. As you can see, the context.Background() passed into client.List(...) is problematic.

// ClusterToObjectsMapper returns a mapper function that gets a cluster and lists all objects for the object passed in
// and returns a list of requests.
// NB: The objects are required to have `clusterv1.ClusterLabelName` applied.
func ClusterToObjectsMapper(c client.Client, ro runtime.Object, scheme *runtime.Scheme) (handler.Mapper, error) {
	if _, ok := ro.(metav1.ListInterface); !ok {
		return nil, errors.Errorf("expected a metav1.ListInterface, got %T instead", ro)
	}

	gvk, err := apiutil.GVKForObject(ro, scheme)
	if err != nil {
		return nil, err
	}

	return handler.ToRequestsFunc(func(o handler.MapObject) []ctrl.Request {
		cluster, ok := o.Object.(*clusterv1.Cluster)
		if !ok {
			return nil
		}

		list := &unstructured.UnstructuredList{}
		list.SetGroupVersionKind(gvk)
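		// Problematic: context.Background() carries no deadline, so this
		// List call is unbounded and cannot be cancelled by the caller.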
		if err := c.List(context.Background(), list, client.MatchingLabels{clusterv1.ClusterLabelName: cluster.Name}); err != nil {
			return nil
		}

		results := []ctrl.Request{}
		for _, obj := range list.Items {
			results = append(results, ctrl.Request{
				NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()},
			})
		}
		return results

	}), nil
}

context.Context should probably be a first parameter on anything that calls into controller author code.
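As a sketch of that direction (a proposal at this point, not the current API; the exact shapes are illustrative), the relevant signatures would take a context first:

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// A context-first Reconciler, so the manager can impose deadlines.
type Reconciler interface {
	Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error)
}

// A context-aware mapping function, so calls like client.List inside a
// mapper inherit the caller's deadline instead of context.Background().
type MapFunc func(ctx context.Context, obj handler.MapObject) []reconcile.Request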

If a breaking change is to be made, I'd argue it would be best to make one sweeping change rather than multiple small breaking changes.

@luxas

luxas commented Jul 14, 2020

context.Context should probably be a first parameter on anything that calls into controller author code.

Yes

Also see #801 (comment), in particular:

The manager could also e.g. close the ctx given to the reconciler loop upon a Ctrl-C signal. We could just say "if Reconcile doesn't respect the context and deadlocks, you're violating the contract", with the tradeoff that if a reconcile loop doesn't follow the contract, there might be multiple executing reconcile loops at the same time (optionally; guaranteed to never run two Reconcile loops twice for the same NamespacedName, for the example of faulty resources)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 12, 2020
@vincepri
Copy link
Member

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Oct 12, 2020
@vincepri
Copy link
Member

The changes around context above have been implemented. To get cancellation for reconcilers we'll need to inject a context with a set timeout, although that doesn't really guarantee that operations will be cancelled, so there is still potential to leak goroutines.
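A minimal sketch of that injection against the context-aware Reconcile signature; the wrapper name and timeout value are illustrative, and cancellation still depends on the wrapped reconciler honoring the context:

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type timeoutReconciler struct {
	impl    reconcile.Reconciler
	timeout time.Duration
}

func (r timeoutReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Derive a child context with a deadline for this one reconcile.
	ctx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()
	return r.impl.Reconcile(ctx, req)
}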
