
Release update operation should be retried to avoid being stuck in pending-install #10023

Closed
hferentschik opened this issue Aug 12, 2021 · 7 comments

Comments

@hferentschik

I have a Helm chart which, when installed, always gets stuck in "pending-install", even though the install was successful and the application is running.

The issue is that the chart contains a ValidatingWebhookConfiguration. The resource gets installed and starts, but there is a time window in which API server calls will fail because the webhook is not ready yet. Unfortunately, there seems to be no general way in Kubernetes to wait for a webhook to become ready. It also does not help to specify the --wait option as part of the helm install. The situation is similar to the one described in issue #6339; I think the underlying problem was never really resolved.

The error message I am seeing looks like this (note: the webhook in question is the HNC validating webhook):

install.go:387: [debug] failed to record the release: update: failed to update: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.96.16.139:443: connect: connection refused

The resources, including the webhook, get applied, and directly after that Helm tries to update the Secret containing the release information. This operation fails and leaves the release permanently stuck in the pending-install state.

I think a simple solution, which seems to make sense regardless, is to retry the update operation. In the case of the Secret storage driver, that would concretely mean retrying secrets.impl.Update in the following code (secrets.go):

func (secrets *Secrets) Update(key string, rls *rspb.Release) error {
	// set labels for secrets object meta data
	var lbs labels

	lbs.init()
	lbs.set("modifiedAt", strconv.Itoa(int(time.Now().Unix())))

	// create a new secret object to hold the release
	obj, err := newSecretsObject(key, rls, lbs)
	if err != nil {
		return errors.Wrapf(err, "update: failed to encode release %q", rls.Name)
	}
	// push the secret object out into the kubiverse
	_, err = secrets.impl.Update(context.Background(), obj, metav1.UpdateOptions{})
	return errors.Wrap(err, "update: failed to update")
}
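
For illustration, the retry could be layered on with client-go's retry helper. This is only a sketch of the idea, not a finished proposal: the backoff values and the question of which errors should count as retryable (here internal and service-unavailable errors, which is roughly what a not-yet-ready webhook surfaces as) are up for discussion. It assumes the existing imports of secrets.go plus apierrors "k8s.io/apimachinery/pkg/api/errors" and "k8s.io/client-go/util/retry", and would replace the last two lines of Update above:

	// Sketch only: retry transient API server failures (e.g. a webhook that is
	// not reachable yet) with the default exponential backoff before giving up.
	err = retry.OnError(retry.DefaultBackoff,
		func(err error) bool {
			// Which errors count as transient is debatable; webhook failures
			// typically surface as internal or service-unavailable errors.
			return apierrors.IsInternalError(err) || apierrors.IsServiceUnavailable(err)
		},
		func() error {
			_, uerr := secrets.impl.Update(context.Background(), obj, metav1.UpdateOptions{})
			return uerr
		})
	return errors.Wrap(err, "update: failed to update")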

Output of helm version:

version.BuildInfo{Version:"v3.6.1", GitCommit:"61d8e8c4a6f95540c15c6a65f36a6dd0a45e7a2f", GitTreeState:"dirty", GoVersion:"go1.16.5"}
@mattfarina
Collaborator

How would you handle the situation where concurrency happens? For example, your helm update running twice on the same release. The first one would put it into pending and the second one would see the pending and try again even though the first is still going, if a change like this goes in.

@hferentschik
Author

How would you handle the situation where concurrency happens?

Is this really an issue? I just want to retry a few times, maybe with some sort of back-off in between, for let's say 5 seconds or so. Theoretically, there could be other network-related issues when calling the Kube API server, so retrying seems like a good idea to me regardless.

For example, your helm update running twice on the same release.

Do you mean someone running two helm installs or helm updates in parallel? Is the behaviour/outcome of this deterministic regardless of a retry?

The first one would put it into pending and the second one would see the pending and try again even though the first is still going, if a change like this goes in.

How does a retry of the update operation make things worse here? The retry is there to handle the situation where, due to a webhook not responding or some other (network) error, applying the update does not work. Even without the retry, if there are two parallel installs, you have the same situation. The one which reaches "completion" first updates the release status, not knowing that there is a second install ongoing (unless there is some special code which handles this situation in some form and detects the parallel install, but that would break with the suggested retry).

@mattfarina
Collaborator

How would you handle the situation where concurrency happens?

Is this really an issue? I just want to retry a few times, maybe with some sort of back-off in between, for let's say 5 seconds or so. Theoretically, there could be other network-related issues when calling the Kube API server, so retrying seems like a good idea to me regardless.

Concurrency is a real issue. Helm v2 had a server-side component named Tiller. Some people did not want to run Tiller in the cluster and instead ran it locally, right next to Helm. That setup was similar, in terms of communication, to what people have with Helm v3. Tiller managed the lock to make sure concurrent actions did not step on each other. When people ran it locally, such as in CI, they did run into concurrency problems. It happened in practice. This is why we have a lock based on state: pending tells us a process has a lock.
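
To illustrate the idea only (this is not Helm's actual implementation, just a conceptual sketch using the public release status constants from helm.sh/helm/v3/pkg/release), the state-based lock boils down to refusing to start a new operation while the last release record is in a pending state:

// Conceptual sketch only, not Helm's real code: a release record left in a
// pending state is treated as "another process holds the lock".
package locksketch

import (
	"fmt"

	"helm.sh/helm/v3/pkg/release"
)

func assertNotLocked(last *release.Release) error {
	switch last.Info.Status {
	case release.StatusPendingInstall, release.StatusPendingUpgrade, release.StatusPendingRollback:
		return fmt.Errorf("release %s has status %q: another operation appears to be in progress",
			last.Name, last.Info.Status)
	}
	return nil
}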

It would be useful to have a way to clean up a lock. But we need to be careful about how we do it, and in Helm v3 we need to do it in a backward-compatible way.

If there are network issues, the underlying Kube API client should handle the basic ones. Helm sits at layer 7 (in the OSI model), and it can be tricky to deal with network issues there. If someone is in an ops situation, it can be useful to use a pull-based model for deployment rather than a push-based one, because it is easier to handle the fault tolerance around connection issues.

For example, your helm update running twice on the same release.

Do you mean someone running two helm installs or helm updates in parallel? Is the behaviour/outcome of this deterministic regardless of a retry?

Whether the generated manifests are deterministic depends on the content of the templates.

One can create Helm charts that contain business logic around the applications. A simple example is the way the Bitnami WordPress chart lets you set the blog title when the chart is installed. While charts started out as a method for simple substitution in Kubernetes manifests, people discovered that they could make the process of installing applications simple while configuring application business logic through the chart itself. Consider how running apt install mysql causes the system to ask for the password to use. The Debian package doesn't just put the files in the right place; it also helps you get set up to use it. Helm is a package manager, like apt, and people try to do the same things.

Helm also stores metadata about the release being installed so you can do some basic things, like rollback. That means if you install the same chart 5 times, you have 5 new release versions installed. This adds to the history of information stored by Helm. So, it's complicated. Helm is a package manager, not a template rendering tool; it just happens to use template rendering to help with package management.

The one which reaches "completion" first updates the release status, not knowing that there is a second install ongoing (unless there is some special code which handles this situation in some form and detects the parallel install, but that would break with the suggested retry).

It's more complicated than that. Kubernetes is supposed to be declarative: you declare resources to Kubernetes and then it reconciles until things work. Webhooks cause a wrinkle in this when they are not in place quickly enough or there is some other issue with them. We're open to suggestions on improving the situation. We just have to do things in a way that handles parallelism (a real problem) and maintains backwards compatibility within a major version.

@hferentschik
Author

@mattfarina, thanks for all the clarifications. Most of it makes sense; some things I am still confused about. I guess I am still a bit at a loss as to how exactly this "release lock" works. How does it ensure exclusivity for one of the potentially multiple parallel installs? Would the update retry not originate from the "same" install?

Is the locking mechanism described somewhere?

Kubernetes is supposed to be declarative. So, you declare resources to Kubernetes and then it reconciles until things work. Webhooks cause a wrinkle in this when they are not in place quick enough or there is some other issue with them.

+1. Webhooks might open the door to some really valuable features, but in my experience they also cause a lot of grief, in particular when it comes to installation and uninstallation.

@mattfarina, out of interest, if retrying the update operation is not the right thing to do, do you have any better idea?

@github-actions

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Nov 15, 2021
@rainest

rainest commented Nov 23, 2021

Our chart has the same issue, but I've found a solution for some cases, at least.

This occurs if your controller handles Secrets, since the Helm release info is itself a Secret and becomes subject to the admission webhook after the webhook is created, but probably before it's ready. Fortunately, I only care about validating our Secrets. We don't use a label of our own (though we probably should), but Helm does, so you can exclude its Secrets from validation:

webhooks:
  - name: validations.example.com
    objectSelector:
      matchExpressions:
      - key: owner
        operator: NotIn
        values:
        - helm

If you do in fact need to validate all Secrets, I don't know of any solution.

rainest added a commit to Kong/charts that referenced this issue Nov 23, 2021
Helm charts with webhooks that handle Secrets run into an issue that
prevents changes after an action that enables the webhook:
helm/helm#10023

Because Helm's Secret for release information is subject to the webhook,
Kubernetes will attempt to validate it, likely before the webhook
service comes online (because Helm just created the Pod that will
provide it). If the service is not online, validation fails, and Helm
cannot update its Secret to mark the release status, usually leaving it
stuck in a pending state that blocks future interactions.

This change excludes Helm Secrets from our validation, because we have
no need to validate them.
@github-actions github-actions bot removed the Stale label Nov 24, 2021
rainest added a commit to Kong/charts that referenced this issue Nov 24, 2021
@github-actions

github-actions bot commented Mar 9, 2022

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Mar 9, 2022
@github-actions github-actions bot closed this as completed Apr 8, 2022