
Release update operation should be retried to avoid being stuck in pending-install #10023

Closed
hferentschik opened this issue Aug 12, 2021 · 7 comments

Comments

@hferentschik

I have a Helm chart which, when installed, always gets stuck in "pending-install", even though the install was successful and the application is running.

The issue is that the chart contains a ValidatingWebhookConfiguration. The resource gets installed and starts, but there is a time window in which API server calls will fail because the webhook is not ready yet. Unfortunately, there seems to be no general way in Kubernetes to wait for a webhook to become ready. It also does not help to specify the --wait option as part of the helm install. The situation is similar to the one described in issue #6339; I think the underlying problem was never really resolved.

The error message I am seeing looks like this (note: the webhook in question is the HNC validating webhook):

install.go:387: [debug] failed to record the release: update: failed to update: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.96.16.139:443: connect: connection refused

The resources, including the webhook, get applied, and directly after that Helm tries to update the Secret containing the release information. This operation fails and leaves the release permanently stuck in the pending-install state.

I think a simple solution, which seems to make sense regardless, is to retry the update operation. In the case of the Secret storage driver, that would concretely mean retrying secrets.impl.Update in the following code (secrets.go):

func (secrets *Secrets) Update(key string, rls *rspb.Release) error {
	// set labels for secrets object meta data
	var lbs labels

	lbs.init()
	lbs.set("modifiedAt", strconv.Itoa(int(time.Now().Unix())))

	// create a new secret object to hold the release
	obj, err := newSecretsObject(key, rls, lbs)
	if err != nil {
		return errors.Wrapf(err, "update: failed to encode release %q", rls.Name)
	}
	// push the secret object out into the kubiverse
	_, err = secrets.impl.Update(context.Background(), obj, metav1.UpdateOptions{})
	return errors.Wrap(err, "update: failed to update")
}
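
For illustration, the retry could be layered on with client-go's retry helper. This is only a sketch of the idea, not a finished proposal: the backoff values and the question of which errors should count as retryable (here internal and service-unavailable errors, which is roughly what a not-yet-ready webhook surfaces as) are up for discussion. It assumes the existing imports of secrets.go plus apierrors "k8s.io/apimachinery/pkg/api/errors" and "k8s.io/client-go/util/retry", and would replace the last two lines of Update above:

	// Sketch only: retry transient API server failures (e.g. a webhook that is
	// not reachable yet) with the default exponential backoff before giving up.
	err = retry.OnError(retry.DefaultBackoff,
		func(err error) bool {
			// Which errors count as transient is debatable; webhook failures
			// typically surface as internal or service-unavailable errors.
			return apierrors.IsInternalError(err) || apierrors.IsServiceUnavailable(err)
		},
		func() error {
			_, uerr := secrets.impl.Update(context.Background(), obj, metav1.UpdateOptions{})
			return uerr
		})
	return errors.Wrap(err, "update: failed to update")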

Output of helm version:

version.BuildInfo{Version:"v3.6.1", GitCommit:"61d8e8c4a6f95540c15c6a65f36a6dd0a45e7a2f", GitTreeState:"dirty", GoVersion:"go1.16.5"}
@mattfarina
Collaborator

How would you handle the situation where concurrency happens? For example, your helm update running twice on the same release. The first one would put it into pending and the second one would see the pending and try again even though the first is still going, if a change like this goes in.

@hferentschik
Author

How would you handle the situation where concurrency happens?

Is this really an issue? I just want to retry a few times, maybe with some sort of back-off in between, for let's say 5 seconds or so. Theoretically, there could be other network-related issues when calling the Kube API server, so retrying seems like a good idea to me regardless.

For example, your helm update running twice on the same release.

Do you mean someone running two helm installs or helm updates in parallel? Is the behaviour/outcome of this deterministic regardless of a retry?

The first one would put it into pending and the second one would see the pending and try again even though the first is still going, if a change like this goes in.

How does a retry of the update operation make things worse here? The retry is there to handle the situation where, due to a webhook not responding or some other (network) error, applying the update does not work. Even without the retry, if there are two parallel installs, you have the same situation. The one which reaches "completion" first updates the release status, not knowing that there is a second install ongoing (unless there is some special code which handles this situation in some form and detects the parallel install, but that would break with the suggested retry).

@mattfarina
Collaborator

How would you handle the situation where concurrency happens?

Is this really an issue? I just want to retry a few times, maybe with some sort of back-off in between, for let's say 5 seconds or so. Theoretically, there could be other network-related issues when calling the Kube API server, so retrying seems like a good idea to me regardless.

Concurrency is a real issue. Helm v2 had a server-side component named Tiller. Some people did not want to run Tiller in the cluster and instead ran it locally, right next to Helm. That setup was similar, in terms of communication, to what people have with Helm v3. Tiller managed the lock to make sure concurrent actions did not step on each other. When people ran it locally, such as in CI, they did run into concurrency problems. It happened in practice. This is why we have a lock based on state: pending tells us a process has a lock.
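
To illustrate the idea only (this is not Helm's actual implementation, just a conceptual sketch using the public release status constants from helm.sh/helm/v3/pkg/release), the state-based lock boils down to refusing to start a new operation while the last release record is in a pending state:

// Conceptual sketch only, not Helm's real code: a release record left in a
// pending state is treated as "another process holds the lock".
package locksketch

import (
	"fmt"

	"helm.sh/helm/v3/pkg/release"
)

func assertNotLocked(last *release.Release) error {
	switch last.Info.Status {
	case release.StatusPendingInstall, release.StatusPendingUpgrade, release.StatusPendingRollback:
		return fmt.Errorf("release %s has status %q: another operation appears to be in progress",
			last.Name, last.Info.Status)
	}
	return nil
}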

It would be useful to have a way to clean up a lock. But we need to be careful about how we do it, and in Helm v3 we need to do it in a backward-compatible way.

If there are network issues, the underlying Kube API client should handle the basic ones. Helm sits at layer 7 (in the OSI model), and it can be tricky to deal with network issues there. If someone is in an ops situation, it can be useful to use a pull-based model for deployment rather than a push-based one, because it is easier to handle the fault tolerance around connection issues.

For example, your helm update running twice on the same release.

Do you mean someone running two helm installs or helm updates in parallel? Is the behaviour/outcome of this deterministic regardless of a retry?

Whether the generated manifests are deterministic depends on the content of the templates.

One can create Helm charts that contain business logic around the applications. A simple example is the way the Bitnami WordPress chart lets you set the blog title when the chart is installed. While charts started out as a method for simple substitution in Kubernetes manifests, people discovered that they could make the process of installing applications simple while configuring application business logic through the chart itself. Consider how running apt install mysql causes the system to ask for the password to use. The Debian package doesn't just put the files in the right place; it also helps you get set up to use it. Helm is a package manager, like apt, and people try to do the same things.

Helm also stores metadata about the release being installed so you can do some basic things, like rollback. That means if you install the same chart 5 times, you have 5 new release versions installed. This adds to the history of information stored by Helm. So, it's complicated. Helm is a package manager, not a template rendering tool; it just happens to use template rendering to help with package management.

The one which reaches "completion" first updates the release status, not knowing that there is a second install ongoing (unless there is some special code which handles this situation in some form and detects the parallel install, but that would break with the suggested retry).

It's more complicated than that. Kubernetes is supposed to be declarative: you declare resources to Kubernetes and then it reconciles until things work. Webhooks cause a wrinkle in this when they are not in place quickly enough or there is some other issue with them. We're open to suggestions on improving the situation. We just have to do things in a way that handles parallelism (a real problem) and maintains backwards compatibility within a major version.

@hferentschik
Author

@mattfarina, thanks for all the clarifications. Most of it makes sense; some things I am still confused about. I guess I am still a bit at a loss as to how exactly this "release lock" works. How does it ensure exclusivity for one of the potentially multiple parallel installs? Would the update retry not originate from the "same" install?

Is the locking mechanism described somewhere?

Kubernetes is supposed to be declarative. So, you declare resources to Kubernetes and then it reconciles until things work. Webhooks cause a wrinkle in this when they are not in place quick enough or there is some other issue with them.

+1. Webhooks might open the door to some really valuable features, but in my experience they also cause a lot of grief, in particular when it comes to installation and uninstallation.

@mattfarina, out of interest, if retrying the update operation is not the right thing to do, do you have any better idea?

@github-actions

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Nov 15, 2021
@rainest

rainest commented Nov 23, 2021

Our chart has the same issue, but I've found a solution for some cases, at least.

This occurs if your controller handles Secrets, since the Helm release info is itself a Secret and becomes subject to the admission webhook after the webhook is created, but probably before it's ready. Fortunately, I only care about validating our Secrets. We don't use a label of our own (though we probably should), but Helm does, so you can exclude its Secrets from validation:

webhooks:
  - name: validations.example.com
    objectSelector:
      matchExpressions:
      - key: owner
        operator: NotIn
        values:
        - helm

If you do in fact need to validate all Secrets, I don't know of any solution.

rainest added a commit to Kong/charts that referenced this issue Nov 23, 2021
Helm charts with webhooks that handle Secrets run into an issue that
prevents changes after an action that enables the webhook:
helm/helm#10023

Because Helm's Secret for release information is subject to the webhook,
Kubernetes will attempt to validate it, likely before the webhook
service comes online (because Helm just created the Pod that will
provide it). If the service is not online, validation fails, and Helm
cannot update its Secret to mark the release status, usually leaving it
stuck in a pending state that blocks future interactions.

This change excludes Helm Secrets from our validation, because we have
no need to validate them.
@github-actions github-actions bot removed the Stale label Nov 24, 2021
rainest added a commit to Kong/charts that referenced this issue Nov 24, 2021
@github-actions

github-actions bot commented Mar 9, 2022

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Mar 9, 2022
@github-actions github-actions bot closed this as completed Apr 8, 2022