Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow marking releases stuck in a pending state as failed #116

Conversation

SimonBaeumer
Copy link
Member

@SimonBaeumer SimonBaeumer commented Nov 16, 2021

PR to discuss how to recover from pending state.
This is our current implementation used in https://github.com/stackrox/helm-operator

Fixes #94

Currently if the Helm releases exits unexpectedly (i.e. due to node crash or a bug) the pending state of a Helm release is never released and leads to an infinite reconciliation loop.

To automatically resolve pending states the reconciler now takes an option via WithMarkFailedAfter which configures a timeout after the pending state is handled as a failure.

ToDo

  • Add Tests
  • Only mark failed on Operator owned Helm secrets

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 16, 2021
@coveralls
Copy link

coveralls commented Nov 16, 2021

Pull Request Test Coverage Report for Build 1593318507

  • 29 of 44 (65.91%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.5%) to 88.06%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/client/actionclient.go 0 7 0.0%
pkg/reconciler/reconciler.go 29 37 78.38%
Totals Coverage Status
Change from base Build 1579656302: -0.5%
Covered Lines: 1652
Relevant Lines: 1876

💛 - Coveralls

@SimonBaeumer SimonBaeumer force-pushed the recover-from-pending-state branch from 9166fc6 to c26cd1e Compare December 6, 2021 10:15
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2021
@SimonBaeumer SimonBaeumer force-pushed the recover-from-pending-state branch from a48929e to 8805e70 Compare December 17, 2021 17:09
@SimonBaeumer SimonBaeumer force-pushed the recover-from-pending-state branch from 8805e70 to 2961ecb Compare December 17, 2021 17:11
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2021
@SimonBaeumer
Copy link
Member Author

@joelanford @varshaprasad96 @fabianvf On the only recover from operator owned secrets-topic.
I think it does not make sense to limit the recovering of pending-states to operator owned Helm secrets because desired states the operator tries to match.
In example looking at this workflow:

  1. Install CR with operator
  2. Helm release is installed by operator with CR values
  3. Admin installs a release manually, values are updated and do not match desired state from the CR anymore
  4. Reconciliation kicks in, compares the manifest diff and reconciles again
  5. Admin changes reverted by operator

Imho if a user wants to do manually upgrade their release the operator should be disabled.

WDYTH on this?

@SimonBaeumer SimonBaeumer marked this pull request as ready for review December 17, 2021 17:28
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2021
@joelanford
Copy link
Member

@SimonBaeumer

Admin installs a release manually, values are updated and do not match desired state from the CR anymore

In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?

If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.

I was originally thinking of the reverse scenario:

  1. Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
  2. Create CR for the same release with operator
  3. Should operator adopt release and attempt to reconcile?

I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:

  1. Just block creation of the CR if a release secret already exists
  2. Adopt the release

Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

Comment on lines 184 to 190
func (c *actionClient) MarkFailed(rel *release.Release, reason string) error {
infoCopy := *rel.Info
releaseCopy := *rel
releaseCopy.Info = &infoCopy
releaseCopy.SetStatus(release.StatusFailed, reason)
return c.conf.Releases.Update(&releaseCopy)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I see how it's convenient to put this functionality into the actionClient, I'm not convinced it makes a ton of sense otherwise. There's really only one way to mark a release as failed (this is it), so I'm wondering if we pull this back out of the action client interface and just put this logic directly into the reconciler.

The only missing piece I see for that is giving the Reconciler an ActionConfigGetter field, which would likely just involve adding the field, adding a WithActionConfigGetter functional option, and then tweaking the addDefaults function slightly to handle the fact that the reconciler may already have an ActionConfigGetter setup via the new functional option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, extracted it and now head to adding a fake implementation for the ActionConfigGetter. After that it should be finished. The implementation is still a bit rough though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @joelanford.
I've tried to fake the ActionConfig but it was more complex than expected, i.e. also interacting with the storage.Storage interface to call the Update func for the given release.
I stopped from there and wondered if it fits the abstraction if the MarkFailed func is renamed to Update to be more generic, so the ActionClient wraps all interactions with Helm in a single struct.
Also the already existing fake client can be leveraged and easily extended.

If theUpdate func does not match the expectations I would implement a fake ActionConfig with a memory release driver in a separate PR.

@SimonBaeumer
Copy link
Member Author

@SimonBaeumer

Admin installs a release manually, values are updated and do not match desired state from the CR anymore

In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?

If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.

I was originally thinking of the reverse scenario:

  1. Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
  2. Create CR for the same release with operator
  3. Should operator adopt release and attempt to reconcile?

I agree. An operator ideally should only reconcile resources which belong to them or which are explicitly labeled to allow it (opt-in to reconcile).

I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:

  1. Just block creation of the CR if a release secret already exists
  2. Adopt the release

Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

I think adopting the release is reasonable if opted-in. This seems to be outside of the scope of this PR, created an issue for it: #144

@@ -796,8 +871,8 @@ func (r *Reconciler) addDefaults(mgr ctrl.Manager, controllerName string) {
r.log = ctrl.Log.WithName("controllers").WithName("Helm")
}
if r.actionClientGetter == nil {
actionConfigGetter := helmclient.NewActionConfigGetter(mgr.GetConfig(), mgr.GetRESTMapper(), r.log)
r.actionClientGetter = helmclient.NewActionClientGetter(actionConfigGetter)
r.actionConfigGetter = helmclient.NewActionConfigGetter(mgr.GetConfig(), mgr.GetRESTMapper(), r.log)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved out of the r.actionClientGetter == nil if block and into its own, right?

if r.actionConfigGetter == nil {
	r.actionConfigGetter = helmclient.NewActionConfigGetter(mgr.GetConfig(), mgr.GetRESTMapper(), r.log)
}
if r.actionClientGetter == nil {
	r.actionClientGetter = helmclient.NewActionClientGetter(r.actionConfigGetter)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and done

@asmacdo
Copy link
Member

asmacdo commented Feb 8, 2022

@joelanford and @SimonBaeumer to pair on this

@@ -531,6 +544,10 @@ func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (res ctrl.
)
return ctrl.Result{}, err
}
if state == statePending {
return r.handlePending(actionClient, rel, &u, log)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelanford To be honest, I don't get to a good solution here and would like that you take a decision here.

1. handlePending in actionClient
My suggestion would be moving the handlePending to the actionClient.
This makes sense as it is a Helm state which is handled such the actionClient abstracts Helm interactions.
To include handlePending in handleReconcile the statePending must be checked at a different execution time (within the switch state in line 559).

2. Moving handlePending to the switch state
As far as I see the only disadvantage is that pre-hooks are executed before the pending state is resolved. This is not necessarily bad and depends more on our definition of pre-hooks/extensions functions.

@openshift-ci
Copy link

openshift-ci bot commented Mar 7, 2022

@SimonBaeumer: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 7, 2022
@perdasilva
Copy link
Contributor

Closing as stale. Please re-open if necessary.

@perdasilva perdasilva closed this Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sdk area/testing needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Helm operator stucks if install/upgrade were aborted due to unexpected exit of the operators
5 participants