
Helm operator constantly creates secret for a failed Operand upgrade #6494

Closed
sukhil-suresh opened this issue Jul 7, 2023 · 13 comments · Fixed by #6546
Assignees: OchiengEd
Labels: kind/bug (Categorizes issue or PR as related to a bug.) · triage/support (Indicates an issue that is a support question.)
Milestone: Backlog

@sukhil-suresh

sukhil-suresh commented Jul 7, 2023

Bug Report

When an Operand upgrade fails and the bundled helm chart contains one or more PersistentVolumeClaims and ServiceAccounts, the number of secrets generated by the openshift-controller-manager for each ServiceAccount grows continuously until a successful Operand update is made.

Sample helm operator demonstrating the problem: https://github.com/sukhil-suresh/openshift-operator-bug

What did you do?

I successfully deployed an Operand using a helm-based Operator and then edited the Operand to reduce the size of a PersistentVolumeClaim. (Customers sometimes inadvertently reduce the storage size.)

What did you expect to see?

I expected the Operand upgrade to fail (since reducing the storage is not allowed), followed by a successful rollback.

What did you see instead? Under which circumstances?

The helm-operator controller gets stuck in a loop of repeated upgrade/rollback failures, and the service account secrets keep accumulating.

The screenshot below shows secrets increasing for the sample-sa service account:
[screenshot: secrets]

The helm revision history constantly toggles between listing only Revision 1 and listing Revisions 1, 2, and 3.
[screenshot: helm_revision_history]
Revision 1 has deployed status with the description "Install complete".

Revision 2 is for the upgrade failure:

Upgrade "sample" failed: cannot patch "sample-pvc" with kind PersistentVolumeClaim: PersistentVolumeClaim "sample-pvc" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value

Revision 3 is for the rollback failure:

Rollback "sample" failed: failed to replace object: PersistentVolumeClaim "sample-pvc" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
  	AccessModes:      {"ReadWriteOnce"},
  	Selector:         nil,
  	Resources:        {Requests: {s"storage": {i: {...}, s: "2Gi", Format: "BinarySI"}}},
- 	VolumeName:       "pvc-a5cf5a2c-ae5b-4e4a-bbe8-a25b75a4b9be",
+ 	VolumeName:       "",
- 	StorageClassName: &"gp3-csi",
+ 	StorageClassName: nil,
  	VolumeMode:       &"Filesystem",
  	DataSource:       nil,
  	DataSourceRef:    nil,
  }

Environment

Helm-Operator

/language helm

Kubernetes cluster type:

OpenShift 4.12.32

$ operator-sdk version

operator-sdk version: "v1.28.0-ocp", commit: "9faa5105bfa6ca52490a17dbaee0653dedd0e536", kubernetes version: "v1.26.0", go version: "go1.19.6", GOOS: "darwin", GOARCH: "amd64"

$ kubectl version

Client Version: v1.25.9
Kustomize Version: v4.5.7
Server Version: v1.25.11+c43ddea

Additional context

This problem does not happen with a regular helm install/upgrade on OpenShift and is specific to the helm-based Operator.

@OchiengEd OchiengEd added the triage/support Indicates an issue that is a support question. label Jul 10, 2023
@OchiengEd OchiengEd added this to the Backlog milestone Jul 10, 2023
@OchiengEd OchiengEd self-assigned this Jul 10, 2023
@OchiengEd
Contributor

OchiengEd commented Jul 13, 2023

Hi @sukhil-suresh, I am working to identify the source of the behavior you raised in this issue more precisely. Thanks to the sample project you provided, I was able to confirm the behavior you described; work is ongoing to narrow down the root cause.

As I look into this, we would recommend blocking requests that cannot be completed, such as resizing down a PVC, before they reach the kube-apiserver. This could be done using a validating webhook. Without such a guard, the operator can get stuck in a reconcile loop where it is unable to move the custom resource from the current state to the desired state.
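A minimal sketch of such a shrink check, assuming it could be wired into a validating admission webhook (the package, function name, and wiring are illustrative, not existing operator-sdk functionality):

package webhook

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validatePVCResize is a hypothetical admission check: it rejects
// PersistentVolumeClaim updates that shrink
// spec.resources.requests.storage, which the kube-apiserver forbids.
// In a real webhook this would run inside an admission handler.
func validatePVCResize(oldPVC, newPVC *corev1.PersistentVolumeClaim) error {
	oldSize := oldPVC.Spec.Resources.Requests[corev1.ResourceStorage]
	newSize := newPVC.Spec.Resources.Requests[corev1.ResourceStorage]
	if newSize.Cmp(oldSize) < 0 {
		return fmt.Errorf("requested storage %s is smaller than current %s; shrinking a PVC is not allowed",
			newSize.String(), oldSize.String())
	}
	return nil
}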

We will provide updates as we gather more information on the issue. Feel free to keep us updated with any new findings.

@sukhil-suresh
Author

sukhil-suresh commented Jul 13, 2023

Thanks for looking into this @OchiengEd

As I look into this, we would recommend blocking requests that cannot be completed, such as resizing down a PVC, before they reach the kube-apiserver. This could be done using a validating webhook.

Is it possible to have a validating webhook with an Operator generated using the helm plugin provided by the operator-sdk? The Operator binary is part of the image published by your team - registry.redhat.io/openshift4/ose-helm-operator:v4.13. As far as I know, the only possible customization is via the watches.yaml file which is very limited.

I raised the ticket because this happened in our helm-operator-based product when a customer reduced the PVC size, which is configurable via a helm value.

@OchiengEd
Contributor

@sukhil-suresh from my review, it does not seem possible to use webhooks for helm operators.

For the sake of others interested in this issue: when a helm rollback is executed with the --force option, it attempts to delete and recreate the Kubernetes resources, in this case the service account.

However, because these actions happen in quick succession, the new service account is recreated before the secrets associated with the deleted service account are garbage collected. As a result, more secrets are created, and each retry generates a new set of secrets.

The same behavior is observed when using the helm-operator; however, since the custom resource never reaches the desired state, the reconciliation process does not stop.

@sukhil-suresh
Author

sukhil-suresh commented Jul 24, 2023

@OchiengEd sorry about the late response - I was not at work last week.

from my review, it does not seem possible to use webhooks for helm operators.

Thanks for confirming.

when a helm rollback is executed with the --force option, it attempts to delete and recreate the Kubernetes resources, in this case the service account.

The service account is not deleted and recreated; it is repeatedly modified. I am basing this on the watch event logs of the service account, captured through a successful Operand install and an upgrade-rollback loop. The logs show that the service account uid stays the same even as the secrets field gets set (by the openshift-controller-manager) and cleared during the upgrade-rollback loop.
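A client-go sketch of such a watch, assuming the usual client-go/apimachinery imports; clientset (*kubernetes.Clientset), ctx, the namespace, and the object name are illustrative:

w, err := clientset.CoreV1().ServiceAccounts("default").Watch(ctx,
	metav1.ListOptions{FieldSelector: "metadata.name=sample-sa"})
if err != nil {
	return err
}
for ev := range w.ResultChan() {
	sa := ev.Object.(*corev1.ServiceAccount)
	// the uid stays constant across the upgrade-rollback loop; only
	// the secrets field is cleared and repopulated
	log.Printf("%s uid=%s secrets=%d", ev.Type, sa.UID, len(sa.Secrets))
}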

However, due to the quick succession of these actions, the new service account would be recreated before the secrets associated with the deleted service account are garbage collected.

Since the service account never gets deleted, garbage collection may not be relevant.


Thank you for identifying the significance of a helm rollback with the --force flag. I am able to reproduce the creation of a new set of secrets for a service account with just the helm chart (on OpenShift) when performing a rollback with the --force flag.
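For reference, the forced rollback I reproduced via the CLI corresponds roughly to this sketch with the Helm Go SDK (cfg is assumed to be an initialized *action.Configuration from helm.sh/helm/v3/pkg/action):

rollback := action.NewRollback(cfg)
rollback.Version = 1  // roll back to revision 1
rollback.Force = true // replace resources with a PUT instead of patching
if err := rollback.Run("sample"); err != nil {
	// each forced attempt leaves the service account in place but
	// triggers the openshift-controller-manager to mint new secrets
	return err
}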

However, the unbounded growth of secrets for a service account is still specific to the helm-operator, since the helm-controller deletes any helm revision that does not have a successful deployment status:

// Cleanup non-deployed release versions. If all release versions are
// non-deployed, this will ensure that failed installations are correctly
// retried.
for _, rel := range releases {
	if rel.Info != nil && rel.Info.Status != rpb.StatusDeployed {
		_, err := m.storageBackend.Delete(rel.Name, rel.Version)
		if err != nil && !notFoundErr(err) {
			return fmt.Errorf("failed to delete stale release version: %w", err)
		}
	}
}

@sukhil-suresh
Author

Is there a reason why the helm-controller does a rollback with the Force option hardcoded as true?

rollback.Force = true

Interestingly, the helm-controller upgrade process does not hardcode Force; it is configurable via the helm.sdk.operatorframework.io/upgrade-force annotation on an Operand (I believe):

force := hasAnnotation(helmUpgradeForceAnnotation, o)
previousRelease, upgradedRelease, err := manager.UpgradeRelease(ctx, release.ForceUpgrade(force))

I built a debug helm-operator image with Force set to false for rollback, and used this image to build the sample Operator. This newly built Operator did not exhibit the constantly increasing secrets for a failed Operand upgrade. However, the helm revision history continued to grow (and eventually slowed down), since the user-updated Operand was never deployed and was instead resolved with a rollback.

If the Maintainers are open to making the Operator rollback Force option configurable (similar to upgrade), I can work on a PR to make this change. Please let me know. Meanwhile, I will confirm that this change does not break any of the existing tests.
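For illustration, the change might look roughly like the following, mirroring the upgrade snippet above; helmRollbackForceAnnotation is a hypothetical name, not something defined in the codebase today:

// Hypothetical: read a rollback-force annotation from the custom
// resource, the same way the upgrade path reads
// helmUpgradeForceAnnotation, instead of hardcoding true.
force := hasAnnotation(helmRollbackForceAnnotation, o)
rollback.Force = force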

@sukhil-suresh
Author

sukhil-suresh commented Jul 26, 2023

Skimming through the helm code:

  • For a rollback with the --force option, the live resource is replaced with the manifest copy of the resource (from the revision being rolled back to) using a PUT operation. This explains why the secrets field of the service account gets cleared and then repopulated with newly created secrets.

  • For a regular rollback without the --force option, a three-way merge between the current manifest, the target manifest, and the live copy is computed and applied with a PATCH operation only if there is a change. This explains why the service account does not get updated: no change is detected after the three-way merge.

Ref: https://github.com/helm/helm/blob/343389856b080c7f1043bacfc07ff6ac83cc9393/pkg/kube/client.go#L663-L692
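A minimal sketch of that non-forced path, using the same apimachinery helper helm calls internally; the package name, byte-slice inputs, and the choice of the PVC struct as the patch schema are assumptions for illustration:

package rollback

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
)

// threeWayPatch computes the patch helm applies for a non-forced
// rollback: original is the manifest from the current revision, target
// is the manifest from the revision being rolled back to, and live is
// the object as it exists on the cluster (all JSON-encoded).
func threeWayPatch(original, target, live []byte) ([]byte, error) {
	// an empty patch ("{}") means no PATCH is sent to the apiserver,
	// which is why the service account stays untouched; --force skips
	// this merge and issues a PUT with the target manifest instead
	return strategicpatch.CreateThreeWayMergePatch(
		original, target, live, corev1.PersistentVolumeClaim{}, true)
}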

Given that resources can be updated after they are created from the helm-generated manifest, a three-way merge should be the default behaviour for a helm-controller rollback (as it is for the helm CLI). Force, if required at all, could be made configurable using an annotation on the Operand (similar to the helm-controller upgrade).

A helm rollback with the --force option will always fail for a helm chart that contains a PVC, since volumeName is only set after the claim is bound to a PV and, once set, cannot be cleared. The helm website discusses similar upgrade scenarios where the live state changes.

@OchiengEd
Contributor

I can bring up the issue in the operator-sdk meeting and see what the collective opinion is on making the change.

@varshaprasad96 varshaprasad96 added the kind/bug Categorizes issue or PR as related to a bug. label Jul 31, 2023
@OchiengEd
Contributor

If the Maintainers are open to making the Operator rollback Force option configurable (similar to upgrade), I can work on a PR to make this change. Please let me know. Meanwhile, I will confirm that this change does not break any of the existing tests.

@sukhil-suresh I brought this up in a grooming meeting; making the rollback configurable by the user was welcomed. Let me know if you have something already worked out.

If not, we can open an issue for the feature request.

@sukhil-suresh
Author

sukhil-suresh commented Aug 9, 2023

Thanks for the update, @OchiengEd.

making the rollback configurable by the user was welcomed

Shouldn't the default rollback be changed to not use the --force flag? I say this because a helm rollback with the --force option will always fail for a helm chart whose resources get updated after creation. Making the rollback approach configurable is a nice-to-have on top of that.

Let me know if you have something already worked out.

I haven't put in any effort yet.

@OchiengEd
Contributor

The default for --force could remain true to maintain the current behavior, while allowing users to change it via an annotation as desired.

I should be able to push out at least a draft PR early next week if there is nothing already in place that exposes the option to configure the rollback --force flag.

@sukhil-suresh
Author

sukhil-suresh commented Aug 9, 2023

The default for --force could remain true to maintain the current behavior...

I am curious why the same behaviour has to be maintained when we know it is incorrect and a bug.

The bug was demonstrated with the sample app. A helm rollback with the --force option will always fail for a helm chart that contains a PVC (since volumeName is only set after the claim is bound to a PV and, once set, cannot be cleared). Similar problems could also occur where service meshes and other controller-based applications inject data into Kubernetes objects after they have been created.

I should be able to push out at least a draft PR early next week...

Thank you very much, @OchiengEd.

@OchiengEd
Contributor

It makes more sense to keep --force set to true in this scenario, since that is the current behavior and has been for a while.

Changing the default value of --force to false could result in undesirable behavior changes for other operators that rely on --force being true.

I brought this up in the community meeting and this approach seems to have more support.

@sukhil-suresh
Author

Thanks for the clarification, @OchiengEd. I understand and look forward to the fix.
