
Failed upgrade may lead to an endless loop of rollbacks #224

Open

kovayur opened this issue Jul 31, 2023 · 4 comments

Comments


kovayur commented Jul 31, 2023

Problem

When the reconciler fails to upgrade the release, it rolls back to the previous revision and returns an error. The controller runtime is expected to retry the reconciliation with exponential backoff, but in practice it keeps reconciling over and over without any delay. I was able to reproduce this behavior for the following cases:

  1. The operator service account lacks the PATCH permission needed to update the K8S object.
  2. An error in the rendered YAML caused by a bug in the chart or by incorrect values. Example: an env variable in a Deployment is set by both the value and valueFrom tags (ROX-18477: operator delete valuesFrom in proxy config if values is set stackrox/stackrox#7105).
  3. A CRD used by the release has been removed from the cluster.

Every rollback increases the revision count. In my case, the operator spawns thousands of revisions in a matter of minutes.

Root cause

A rolled-back revision is no different from an upgraded revision: it has the deployed status, just as after a normal upgrade. There will always be a diff between the expected state calculated from the CR and the rolled-back revision, so the upgrade fails again and again.
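The loop can be illustrated with a small self-contained Go sketch. This is a toy model of the behavior described above, not the actual helm-operator-plugins code: each reconcile sees a diff, attempts an upgrade that fails, and the rollback writes yet another deployed revision with the old values, so the diff never clears and history only grows.

```go
package main

import "fmt"

// Revision is a toy model of a Helm release revision as stored
// in the secrets backend.
type Revision struct {
	Number int
	Status string // "deployed", "failed", "superseded", ...
	Values string // stand-in for the rendered manifest
}

// reconcile models one pass of the reconciler under the failure mode
// described above: the desired state never matches the deployed
// revision, the upgrade fails, and the rollback creates yet another
// deployed revision carrying the OLD values.
func reconcile(history []Revision, desired string) []Revision {
	last := history[len(history)-1]
	if last.Values == desired {
		return history // no diff, nothing to do
	}
	// Upgrade attempt: new revision, then marked failed.
	history = append(history, Revision{last.Number + 1, "failed", desired})
	// Rollback: another revision, deployed with the old values,
	// so the diff against the desired state persists.
	history = append(history, Revision{last.Number + 2, "deployed", last.Values})
	return history
}

func main() {
	history := []Revision{{1, "deployed", "good"}}
	for i := 0; i < 3; i++ {
		history = reconcile(history, "broken")
	}
	// Two new revisions per reconcile, and the tip is still "good",
	// so the next reconcile will fail exactly the same way.
	fmt.Println(len(history))
}
```

Three reconciles already produce seven revisions; at reconcile rates of many per second this matches the "thousands of revisions in minutes" observation.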

There are events that are added to the reconciliation queue outside of the exponential backoff and that trigger reconciliation without any delay. These events are:

  1. The CR status is updated on every reconcile, because the Irreconcilable condition is updated twice per reconcile: to False right before the upgrade and to True after the upgrade fails.
  2. A failed upgrade and the subsequent rollback cause multiple changes in the secrets storage, which is watched, and each change adds an item to the reconciliation queue. Say revision 1 was successful, revision 2 is problematic, and revision 3 is the rollback to revision 1. Upon upgrading to revision 2, the following events are triggered:
    1. Create revision 2 with status pending-upgrade
    2. Mark revision 2 as failed
    3. Create revision 3 with status pending-rollback
    4. Mark revision 1 as superseded
    5. Mark revision 3 as deployed or failed depending on the rollback result.

There is deduplication in the queue, but at least one event will still be queued without delay.
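For the first event source (status-only updates), one plausible mitigation is to filter updates that don't change the spec; controller-runtime ships predicate.GenerationChangedPredicate for exactly this, since .metadata.generation bumps only on spec changes, not on status writes. A minimal self-contained sketch of that idea (Obj and specChanged are illustrative stand-ins, not the real API):

```go
package main

import "fmt"

// Obj is a minimal stand-in for a Kubernetes object's metadata:
// Generation bumps only on spec changes, never on status updates.
type Obj struct {
	Generation int64
	Status     string
}

// specChanged mimics what a generation-based update predicate does:
// enqueue only when .metadata.generation changed, i.e. ignore
// status-only updates such as the Irreconcilable condition flips.
func specChanged(oldObj, newObj Obj) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	before := Obj{Generation: 3, Status: "Irreconcilable=False"}
	statusFlip := Obj{Generation: 3, Status: "Irreconcilable=True"}
	fmt.Println(specChanged(before, statusFlip)) // false: filtered out

	specEdit := Obj{Generation: 4, Status: "Irreconcilable=True"}
	fmt.Println(specChanged(before, specEdit)) // true: still reconciled
}
```

This would stop the status flip-flop from re-enqueuing the CR, but it does not address the second source (watched release secrets), which would need its own filtering or the rollback loop fixed at the source.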

@Jay-Madden

This is biting us as well; going into backoff seems like the best solution here, no?

@acornett21

@kovayur I believe this is the same issue as:

Which we have documented here. If they are different, apologies.

@kovayur

kovayur commented Dec 8, 2023

Hey @acornett21, thanks for sharing the link to this issue.
I don't think it's the same, as the mentioned issue affects operator-sdk and mine is related to helm-operator-plugins.
As far as I understand, the reconciliation logic in these repositories is different.
If operator-sdk is not affected by this issue, perhaps we could learn from its implementation how to fix it in helm-operator-plugins.

@acornett21

@kovayur It does look like the reconcile.go files are unique to each project. I assumed operator-sdk was the source of truth and that helm-operator-plugins imported this logic. It seems the fix implemented in operator-sdk should be carried over into the helm-operator-plugins project.
