
Helm operator constantly creates secret for a failed Operand upgrade #6494

Closed
sukhil-suresh opened this issue Jul 7, 2023 · 13 comments · Fixed by #6546
Assignees: OchiengEd
Labels: kind/bug (Categorizes issue or PR as related to a bug.) · triage/support (Indicates an issue that is a support question.)
Milestone: Backlog

@sukhil-suresh

sukhil-suresh commented Jul 7, 2023

Bug Report

When an Operand upgrade fails and the bundled helm chart contains one or more PersistentVolumeClaims and ServiceAccounts, the number of secrets generated by the openshift-controller-manager for each ServiceAccount grows continuously until a successful Operand update is made.

Sample helm operator demonstrating the problem: https://github.com/sukhil-suresh/openshift-operator-bug

What did you do?

I successfully deployed an Operand using a helm-based Operator and then edited the Operand to reduce the size of a PersistentVolumeClaim. (Customers sometimes inadvertently reduce the storage size.)

What did you expect to see?

I expected the Operand upgrade to fail (since reducing the storage is not allowed), followed by a successful rollback.

What did you see instead? Under which circumstances?

The helm-operator controller gets stuck in a loop of repeated upgrade/rollback failures, and the service account secrets keep accumulating.

The screenshot below shows secrets increasing for the sample-sa service account:
[screenshot: secrets]

The helm revision history constantly toggles between listing only Revision 1 and listing Revisions 1, 2, and 3.
[screenshot: helm_revision_history]
Revision 1 has deployed status with the description "Install complete".

Revision 2 is for the upgrade failure:

Upgrade "sample" failed: cannot patch "sample-pvc" with kind PersistentVolumeClaim: PersistentVolumeClaim "sample-pvc" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value

Revision 3 is for the rollback failure:

Rollback "sample" failed: failed to replace object: PersistentVolumeClaim "sample-pvc" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
  	AccessModes:      {"ReadWriteOnce"},
  	Selector:         nil,
  	Resources:        {Requests: {s"storage": {i: {...}, s: "2Gi", Format: "BinarySI"}}},
- 	VolumeName:       "pvc-a5cf5a2c-ae5b-4e4a-bbe8-a25b75a4b9be",
+ 	VolumeName:       "",
- 	StorageClassName: &"gp3-csi",
+ 	StorageClassName: nil,
  	VolumeMode:       &"Filesystem",
  	DataSource:       nil,
  	DataSourceRef:    nil,
  }

Environment

Helm-Operator

/language helm

Kubernetes cluster type:

OpenShift 4.12.32

$ operator-sdk version

operator-sdk version: "v1.28.0-ocp", commit: "9faa5105bfa6ca52490a17dbaee0653dedd0e536", kubernetes version: "v1.26.0", go version: "go1.19.6", GOOS: "darwin", GOARCH: "amd64"

$ kubectl version

Client Version: v1.25.9
Kustomize Version: v4.5.7
Server Version: v1.25.11+c43ddea

Additional context

This problem does not happen with a regular helm install/upgrade on OpenShift and is specific to the helm-based Operator.

@OchiengEd OchiengEd added the triage/support Indicates an issue that is a support question. label Jul 10, 2023
@OchiengEd OchiengEd added this to the Backlog milestone Jul 10, 2023
@OchiengEd OchiengEd self-assigned this Jul 10, 2023
@OchiengEd
Contributor

OchiengEd commented Jul 13, 2023

Hi @sukhil-suresh, I am working to identify the source of the behavior you raised in this issue more precisely. Thanks to the sample project you provided, I was able to confirm the behavior you described; work is ongoing to narrow down the root cause.

As I look into this, we would recommend blocking requests that cannot be completed, such as resizing down a PVC, before they reach the kube-apiserver. This could be done using a validating webhook. Without such a guard, the operator can get stuck in a reconcile loop where it is unable to move the custom resource from the current state to the desired state.
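A minimal sketch of such a shrink check, assuming it could be wired into a validating admission webhook (the package, function name, and wiring are illustrative, not existing operator-sdk functionality):

package webhook

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validatePVCResize is a hypothetical admission check: it rejects
// PersistentVolumeClaim updates that shrink
// spec.resources.requests.storage, which the kube-apiserver forbids.
// In a real webhook this would run inside an admission handler.
func validatePVCResize(oldPVC, newPVC *corev1.PersistentVolumeClaim) error {
	oldSize := oldPVC.Spec.Resources.Requests[corev1.ResourceStorage]
	newSize := newPVC.Spec.Resources.Requests[corev1.ResourceStorage]
	if newSize.Cmp(oldSize) < 0 {
		return fmt.Errorf("requested storage %s is smaller than current %s; shrinking a PVC is not allowed",
			newSize.String(), oldSize.String())
	}
	return nil
}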

We will provide updates as we gather more information on the issue. Feel free to keep us updated with any new findings.

@sukhil-suresh
Author

sukhil-suresh commented Jul 13, 2023

Thanks for looking into this @OchiengEd

As I look into this, we would recommend blocking requests that cannot be completed, such as resizing down a PVC, before they reach the kube-apiserver. This could be done using a validating webhook.

Is it possible to have a validating webhook with an Operator generated using the helm plugin provided by the operator-sdk? The Operator binary is part of the image published by your team - registry.redhat.io/openshift4/ose-helm-operator:v4.13. As far as I know, the only possible customization is via the watches.yaml file which is very limited.

I raised the ticket because this happened in our helm-operator-based product when a customer reduced the PVC size, which is configurable via a helm value.

@OchiengEd
Contributor

@sukhil-suresh from my review, it does not seem possible to use webhooks for helm operators.

For the sake of others interested in this issue: when a helm rollback is executed with the --force option, it attempts to delete and recreate the Kubernetes resources, in this case the service account.

However, because these actions happen in quick succession, the new service account is recreated before the secrets associated with the deleted service account are garbage collected. As a result, more secrets are created, and each retry generates a new set of secrets.

The same behavior is observed when using the helm-operator; however, since the custom resource never reaches the desired state, the reconciliation process does not stop.

@sukhil-suresh
Author

sukhil-suresh commented Jul 24, 2023

@OchiengEd sorry about the late response - I was not at work last week.

from my review, it does not seem possible to use webhooks for helm operators.

Thanks for confirming.

when a helm rollback is executed with the --force option, it attempts to delete and recreate the Kubernetes resources, in this case the service account.

The service account is not deleted and recreated; it is repeatedly modified. I am basing this on the watch event logs of the service account, captured through a successful Operand install and an upgrade-rollback loop. The logs show that the service account uid stays the same even as the secrets field gets set (by the openshift-controller-manager) and cleared during the upgrade-rollback loop.
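A client-go sketch of such a watch, assuming the usual client-go/apimachinery imports; clientset (*kubernetes.Clientset), ctx, the namespace, and the object name are illustrative:

w, err := clientset.CoreV1().ServiceAccounts("default").Watch(ctx,
	metav1.ListOptions{FieldSelector: "metadata.name=sample-sa"})
if err != nil {
	return err
}
for ev := range w.ResultChan() {
	sa := ev.Object.(*corev1.ServiceAccount)
	// the uid stays constant across the upgrade-rollback loop; only
	// the secrets field is cleared and repopulated
	log.Printf("%s uid=%s secrets=%d", ev.Type, sa.UID, len(sa.Secrets))
}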

However, due to the quick succession of these actions, the new service account would be recreated before the secrets associated with the deleted service account are garbage collected.

Since the service account never gets deleted, garbage collection may not be relevant.


Thank you for identifying the significance of a helm rollback with the --force flag. I am able to reproduce the creation of a new set of secrets for a service account with just the helm chart (on OpenShift) when performing a rollback with the --force flag.
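For reference, the forced rollback I reproduced via the CLI corresponds roughly to this sketch with the Helm Go SDK (cfg is assumed to be an initialized *action.Configuration from helm.sh/helm/v3/pkg/action):

rollback := action.NewRollback(cfg)
rollback.Version = 1  // roll back to revision 1
rollback.Force = true // replace resources with a PUT instead of patching
if err := rollback.Run("sample"); err != nil {
	// each forced attempt leaves the service account in place but
	// triggers the openshift-controller-manager to mint new secrets
	return err
}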

However, the unbounded growth of secrets for a service account is still specific to the helm-operator, since the helm-controller deletes any helm revision that does not have a successful deployment status:

// Cleanup non-deployed release versions. If all release versions are
// non-deployed, this will ensure that failed installations are correctly
// retried.
for _, rel := range releases {
	if rel.Info != nil && rel.Info.Status != rpb.StatusDeployed {
		_, err := m.storageBackend.Delete(rel.Name, rel.Version)
		if err != nil && !notFoundErr(err) {
			return fmt.Errorf("failed to delete stale release version: %w", err)
		}
	}
}

@sukhil-suresh
Author

Is there a reason why the helm-controller does a rollback with the Force option hardcoded as true?

rollback.Force = true

Interestingly, the helm-controller upgrade process does not hardcode Force; it is configurable via the helm.sdk.operatorframework.io/upgrade-force annotation on an Operand (I believe):

force := hasAnnotation(helmUpgradeForceAnnotation, o)
previousRelease, upgradedRelease, err := manager.UpgradeRelease(ctx, release.ForceUpgrade(force))

I built a debug helm-operator image with Force set to false for rollback, and used this image to build the sample Operator. This newly built Operator did not exhibit the constantly increasing secrets for a failed Operand upgrade. However, the helm revision history continued to grow (and eventually slowed down), since the user-updated Operand was never deployed and was instead resolved with a rollback.

If the Maintainers are open to making the Operator rollback Force option configurable (similar to upgrade), I can work on a PR to make this change. Please let me know. Meanwhile, I will confirm that this change does not break any of the existing tests.
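For illustration, the change might look roughly like the following, mirroring the upgrade snippet above; helmRollbackForceAnnotation is a hypothetical name, not something defined in the codebase today:

// Hypothetical: read a rollback-force annotation from the custom
// resource, the same way the upgrade path reads
// helmUpgradeForceAnnotation, instead of hardcoding true.
force := hasAnnotation(helmRollbackForceAnnotation, o)
rollback.Force = force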

@sukhil-suresh
Author

sukhil-suresh commented Jul 26, 2023

Skimming through the helm code:

  • For a rollback with the --force option, the live resource is replaced with the manifest copy of the resource (from the revision being rolled back to) using a PUT operation. This explains why the secrets field of the service account gets cleared and then repopulated with newly created secrets.

  • For a regular rollback without the --force option, a three-way merge between the current manifest, the target manifest, and the live copy is computed and applied with a PATCH operation only if there is a change. This explains why the service account does not get updated: no change is detected after the three-way merge.

Ref: https://github.com/helm/helm/blob/343389856b080c7f1043bacfc07ff6ac83cc9393/pkg/kube/client.go#L663-L692
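A minimal sketch of that non-forced path, using the same apimachinery helper helm calls internally; the package name, byte-slice inputs, and the choice of the PVC struct as the patch schema are assumptions for illustration:

package rollback

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
)

// threeWayPatch computes the patch helm applies for a non-forced
// rollback: original is the manifest from the current revision, target
// is the manifest from the revision being rolled back to, and live is
// the object as it exists on the cluster (all JSON-encoded).
func threeWayPatch(original, target, live []byte) ([]byte, error) {
	// an empty patch ("{}") means no PATCH is sent to the apiserver,
	// which is why the service account stays untouched; --force skips
	// this merge and issues a PUT with the target manifest instead
	return strategicpatch.CreateThreeWayMergePatch(
		original, target, live, corev1.PersistentVolumeClaim{}, true)
}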

Given that resources can be updated after they are created from the helm-generated manifest, a three-way merge should be the default behaviour for a helm-controller rollback (as it is for the helm CLI). Force, if required at all, could be made configurable using an annotation on the Operand (similar to the helm-controller upgrade).

A helm rollback with the --force option will always fail for a helm chart that contains a PVC, since volumeName is only set after the claim is bound to a PV and, once set, cannot be cleared. The helm website discusses similar upgrade scenarios where the live state changes.

@OchiengEd
Contributor

I can bring up the issue in the operator-sdk meeting and see what the collective opinion is on making the change.

@varshaprasad96 varshaprasad96 added the kind/bug Categorizes issue or PR as related to a bug. label Jul 31, 2023
@OchiengEd
Contributor

If the Maintainers are open to making the Operator rollback Force option configurable (similar to upgrade), I can work on a PR to make this change. Please let me know. Meanwhile, I will confirm that this change does not break any of the existing tests.

@sukhil-suresh I brought this up in a grooming meeting; making the rollback configurable by the user was welcomed. Let me know if you have something already worked out.

If not, we can open an issue for the feature request.

@sukhil-suresh
Author

sukhil-suresh commented Aug 9, 2023

Thanks for the update, @OchiengEd.

making the rollback configurable by the user was welcomed

Shouldn't the default rollback be changed to not use the --force flag? I say this because a helm rollback with the --force option will always fail for a helm chart whose resources get updated after creation. Making the rollback approach configurable is a nice-to-have on top of that.

Let me know if you have something already worked out.

I haven't put in any effort yet.

@OchiengEd
Contributor

The default for --force could remain true to maintain the current behavior, while allowing users to change it via an annotation as desired.

I should be able to push out at least a draft PR early next week if there is nothing already in place that exposes the option to configure the rollback --force flag.

@sukhil-suresh
Author

sukhil-suresh commented Aug 9, 2023

The default for --force could remain true to maintain the current behavior...

I am curious why the same behaviour has to be maintained when we know it is incorrect and a bug.

The bug was demonstrated with the sample app. A helm rollback with the --force option will always fail for a helm chart that contains a PVC (since volumeName is only set after the claim is bound to a PV and, once set, cannot be cleared). Similar problems could also occur where service meshes and other controller-based applications inject data into Kubernetes objects after they have been created.

I should be able to push out at least a draft PR early next week...

Thank you very much, @OchiengEd.

@OchiengEd
Contributor

It makes more sense to keep --force set to true in this scenario, since that is the current behavior and has been for a while.

Changing the default value of --force to false could result in undesirable behavior changes for other operators that rely on --force being true.

I brought this up in the community meeting and this approach seems to have more support.

@sukhil-suresh
Author

Thanks for the clarification, @OchiengEd. I understand and look forward to the fix.
