
wait and timeout options with Helm-based operator #4102

Closed
jfgosselin opened this issue Oct 26, 2020 · 17 comments

@jfgosselin

Feature Request

Describe the problem you need a feature to resolve.

The Helm-based operator does not wait until Deployments have the minimum number of Pods (Desired minus maxUnavailable) in a ready state before marking the release as successful.

Describe the solution you'd like.

Expose wait and timeout options like the Helm CLI does, and update the CustomResource status accordingly.

--timeout: A value in seconds to wait for Kubernetes commands to complete. This defaults to 5m0s.
--wait: Waits until all Pods are in a ready state, PVCs are bound, Deployments have minimum (Desired minus maxUnavailable) Pods in ready state and Services have an IP address (and Ingress if a LoadBalancer) before marking the release as successful. It will wait for as long as the --timeout value. If timeout is reached, the release will be marked as FAILED. Note: In scenarios where Deployment has replicas set to 1 and maxUnavailable is not set to 0 as part of rolling update strategy, --wait will return as ready as it has satisfied the minimum Pod in ready condition.
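
A minimal sketch of where these flags surface in Helm's Go SDK (illustrative values only; cfg and chrt are assumed to be set up by whatever drives the install):

package helmexample

import (
	"time"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/release"
)

// installWithWait shows which fields on Helm's install action correspond to
// --wait and --timeout; the release name and namespace are illustrative.
func installWithWait(cfg *action.Configuration, chrt *chart.Chart) (*release.Release, error) {
	install := action.NewInstall(cfg)
	install.ReleaseName = "example-konk"
	install.Namespace = "konk"
	install.Wait = true               // same behavior as the --wait flag
	install.Timeout = 5 * time.Minute // same behavior as the --timeout flag
	return install.Run(chrt, map[string]interface{}{})
}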

/language helm

@openshift-ci-robot added the language/helm label Oct 26, 2020
@joelanford (Member)

This seems like it would be helpful.

The one concern I have is that Kubernetes controllers generally don't block, waiting for certain conditions to exist. Instead, they make their observations and changes, and then wait until those changes trigger another reconcile loop. The loop then repeats until steady state is reached.

If Helm's --wait conditions could be implemented as I describe above, would that meet your use case?

This has me thinking about --timeout as well. If the above can be implemented, does a timeout like this make sense in the context of an always reconciling operator? I'm not sure it does. With the Helm CLI, after the timeout occurs, the Helm binary exits, presumably leaving the incomplete release resources as is. But the operator is always running, so it will just keep retrying, waiting, timing out, etc. the exact same way every time with an exponential backoff between retries.
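
Roughly, a non-blocking reconcile along those lines could look like the following sketch. releaseResourcesReady and setReadyCondition are hypothetical stand-ins (not existing helm-operator or Helm code), and the GVK is just an example:

package controllers

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical GVK for the example CR used elsewhere in this thread.
var konkGVK = schema.GroupVersionKind{Group: "example.com", Version: "v1alpha1", Kind: "Konk"}

type HelmReleaseReconciler struct {
	client.Client
}

// releaseResourcesReady is a stand-in for a readiness check Helm could expose;
// the real logic would inspect the release's Deployments, PVCs, Services, etc.
func releaseResourcesReady(ctx context.Context, c client.Client, cr *unstructured.Unstructured) (bool, error) {
	return false, nil // placeholder
}

// setReadyCondition is a stand-in for however the operator persists a
// condition on the CR's status subresource.
func setReadyCondition(ctx context.Context, c client.Client, cr *unstructured.Unstructured, cond metav1.Condition) error {
	return nil // placeholder
}

func (r *HelmReleaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cr := &unstructured.Unstructured{}
	cr.SetGroupVersionKind(konkGVK)
	if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Observe the current state; do not block waiting for it to change.
	ready, err := releaseResourcesReady(ctx, r.Client, cr)
	if err != nil {
		return ctrl.Result{}, err
	}

	cond := metav1.Condition{Type: "Ready", Status: metav1.ConditionFalse, Reason: "ResourcesNotReady"}
	if ready {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "ResourcesReady"
	}
	// Record the observation and return; a status change on any watched
	// release resource triggers the next reconcile.
	if err := setReadyCondition(ctx, r.Client, cr, cond); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}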

@jfgosselin (Author)

@joelanford that will meet my use case. You are right, I don't think the timeout makes sense.

@jberkhahn added the help wanted label Oct 26, 2020
@jberkhahn added this to the Backlog milestone Oct 26, 2020
@joelanford (Member)

@jfgosselin Cool! Is this something you would be willing to investigate and contribute?

One thing I'm not sure about is whether the Helm libraries expose the same checks that the --wait flag makes without actually having to wait. If it does, it should be possible to incorporate that check into the reconcile loop and possibly tweak or add a status condition.

@drewwells

This is closely related to #2491. The operator should mark the CR status Ready only after all the resources are ready by k8s's definition of ready.

In Helm, there is an interface that implements readiness checking of multiple resources: https://github.com/helm/helm/blob/9b42702a4bced339ff424a78ad68dd6be6e1a80a/pkg/kube/interface.go#L38-L44

The hooks that perform the checks are here
https://github.com/helm/helm/blob/193850a9e2c509acf1a499d98e8d23c12c134f11/pkg/action/hooks.go#L58-L83

@joelanford (Member)

@drewwells It looks like WatchUntilReady is used only for hook execution, and it also looks like it is always called, even if --wait is not set. So I think the helm-operator is likely already waiting for hooks.

For --wait functionality, we would want to mimic or re-use the KubeClient.Wait() function. Here's the usage in action.Install.

Unfortunately, it looks like the waiter struct and waiter.waitForResources() function are both unexported and they block, so we can't re-use them in the SDK without actually using the action.Install.Wait value on the struct, which I think we should avoid.

Ideally this block of logic would be extracted and exported into a new IsResourceReady(r Resource) (bool, error) function.
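
For illustration, a reconciler could then evaluate readiness without blocking along these lines; isResourceReady below is a hypothetical stand-in for that proposed exported function, not an existing Helm API:

package helmutil

import (
	"helm.sh/helm/v3/pkg/kube"
	"k8s.io/cli-runtime/pkg/resource"
)

// isResourceReady stands in for the proposed exported check; the real body
// would contain the unexported waiter logic, minus the blocking poll loop.
func isResourceReady(r *resource.Info) (bool, error) {
	return true, nil // placeholder
}

// releaseReady reports whether every rendered release resource passes the
// same checks --wait performs, without ever blocking the reconcile.
func releaseReady(resources kube.ResourceList) (bool, error) {
	for _, res := range resources {
		ok, err := isResourceReady(res)
		if err != nil {
			return false, err
		}
		if !ok {
			return false, nil
		}
	}
	return true, nil
}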

@drewwells

I agree about the action package; it's very lacking in flexibility.

You're right about WatchUntilReady. There's another similarly named method that is public and constructs a waiter: https://github.com/helm/helm/blob/593fec6868d7a239032cf234b530aee6f5d52708/pkg/kube/client.go#L116-L128. We won't have any errors to show until the Timeout elapses, though, and possibly no other reconciliations in the meantime.

What is the objection to using the action.Install.Wait flag? I think even using client.Wait here will have limited visibility from within the operator itself.

@drewwells mentioned this issue Oct 26, 2020
@joelanford (Member)

@drewwells There are a few reasons I don't think we should use action.Install.Wait.

As I mentioned in my first comment, it is generally an anti-pattern to wait in Kubernetes controllers. The more accepted pattern is to observe the current state of the CR and its resources (e.g. ask is the release ready?), make any changes (e.g. set CR status based on whether the resources are ready), and immediately return. Then if/when any of the status blocks of any of the release resources change, our reconciler will be triggered again, and we can again check if the release is ready. This process repeats until all of the release resource status blocks achieve steady state. If the final steady state is that all of the resources are ready, then the reconciler will set the CR status to reflect that.

With this design the operator can scale to handle a very large number of CRs and their releases being active at the same time.

If the reconciler is allowed to block, the operator will only be able to handle MaxConcurrentReconciles CRs simultaneously. Once that number of reconcilers is waiting, all other events will be queued until an existing CR is finished waiting or it times out.

Another benefit of this design is that the question of whether a release is ready is always checked. If steady state was reached and the release became ready, that doesn't mean it will ALWAYS be ready. If a pod terminates or a service loses its endpoints, the reconciler will be triggered and it will change the CR status from ready to not ready.

drewwells added a commit to infobloxopen/operator-sdk that referenced this issue Oct 27, 2020
drewwells added a commit to infobloxopen/operator-sdk that referenced this issue Oct 27, 2020
@drewwells

We can cheat here just like the Helm CLI does. It marks the release as ready as soon as it becomes available, even when that is only true for a short period. Programs that do not have sufficiently long startupProbes will pass as ready and then start crash looping; Helm does not catch this.

I think here we need to distinguish between operations helm can handle vs trying to micromanage resources of a release.

If a subresource is deleted, the operator could sort out how to bring it back. This can be handled by watching subresources for CUD events https://godoc.org/sigs.k8s.io/controller-runtime/pkg/builder#Builder.Owns.
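
For example, wiring owned-resource watches with controller-runtime could look roughly like this sketch; the GVK and owned kinds are illustrative, since the helm-operator would derive them from the rendered chart at runtime:

package controllers

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupWatches wires the CR and a couple of owned release resources into one
// controller, so create/update/delete events on the owned objects requeue the
// owning CR.
func setupWatches(mgr ctrl.Manager, r reconcile.Reconciler) error {
	konk := &unstructured.Unstructured{}
	konk.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "example.com", Version: "v1alpha1", Kind: "Konk",
	})

	return ctrl.NewControllerManagedBy(mgr).
		For(konk).
		Owns(&appsv1.Deployment{}). // events on owned Deployments requeue the CR
		Owns(&corev1.Service{}).    // likewise for owned Services
		Complete(r)
}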

The other one, I guess, is to update the status if the release is in bad health. I'm not so certain the operator is responsible for this. Most operators I have used act sort of like k8s ReplicaSets. They will do very simple things like recreate pods when deleted and update statuses if things go wrong. They can't fix a broken pod and don't try to.

Keeping this in mind, it seems ideal that the helm-operator watches all the sub-resources and attempts to make some decisions on them, at least until the initial ready state; continuous watching after that is a nice-to-have. I suspect the current behavior, which appears to be polling, already syncs the state over time: if a ConfigMap is mutated, a reconcile is requeued, the resource owner is looked up, the desired state is found, and the ConfigMap is set back to the expected state. It sounds expensive memory-wise to watch this many objects, so polling is fine here.

@jfgosselin (Author)

> @jfgosselin Cool! Is this something you would be willing to investigate and contribute?
>
> One thing I'm not sure about is whether the Helm libraries expose the same checks that the --wait flag makes without actually having to wait. If it does, it should be possible to incorporate that check into the reconcile loop and possibly tweak or add a status condition.

@joelanford sorry, I've never coded in Go

@joelanford (Member)

@jfgosselin No worries. This is not currently a high priority for the core SDK team maintainers, but it is definitely a nice feature, and we would definitely accept a PR.

I think the direction we would want to go is to:

  1. work upstream in the Helm project to expose the release readiness check as an exported function
  2. incorporate a new release of Helm that contains 1, such that the Helm operator reconcile loop performs Helm's readiness checks during every reconciliation without waiting for all the release resources to actually be ready.

If anyone is interested in taking this work on, let us know!

@jfgosselin (Author)

BTW, I'm not sure how they did it in the Flux Helm Operator, but I think they support this feature (we didn't use Flux for some other reasons).

@joelanford (Member)

I didn't look too closely, but it looks like they just allow the .Wait setting to be used directly. If anyone who sees this has some knowledge about this particular aspect of the Flux Helm Operator, I'd be interested in knowing how/if the concurrency and reconciliation problems I mentioned above are handled.

@drewwells

Workaround: kubectl wait can block on status condition types. It's not currently possible to use it with InstallSuccessful because it is listed as a reason instead of a type. You have to get the CR and then filter the output looking for InstallSuccessful:

kubectl -n konk get konk example-konk -o jsonpath="{range .status.conditions[*]}{@.type}={@.reason}{'\n'}{end}"
Initialized=
Deployed=InstallSuccessful

The current condition looks like this:

- lastTransitionTime: "2020-10-27T21:48:35Z"
  message: |
    1. Get the application URL by running these commands:
      export POD_NAME=$(kubectl get pods --namespace konk -l "app.kubernetes.io/name=konk,app.kubernetes.io/instance=example-konk" -o jsonpath="{.items[0].metadata.name}")
      echo "Visit http://127.0.0.1:8080 to use your application"
      kubectl --namespace konk port-forward $POD_NAME 8080:80
  reason: InstallSuccessful
  status: "True"
  type: Deployed

If there were a condition with type "DeployedSuccessful", then users could use kubectl to wait on that condition before taking action:

until kubectl wait --timeout=1s --for=condition=DeployedSuccessful -n konk konk example-konk; do echo "still deploying"; sleep 1s; done

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Jan 26, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot removed the lifecycle/stale label Feb 25, 2021
@openshift-ci-robot added the lifecycle/rotten label Feb 25, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting /reopen.
> Mark the issue as fresh by commenting /remove-lifecycle rotten.
> Exclude this issue from closing again by commenting /lifecycle frozen.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
