Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarf's cleanup-on-failure logic can cause unintended deletion of applications during failed upgrades #2455

Closed
blancharda opened this issue Apr 23, 2024 · 0 comments · Fixed by #2456
Labels
enhancement ✨ New feature or request

Comments

@blancharda
Copy link
Member

Is your feature request related to a problem? Please describe.

When Zarf fails to install a component it attempts to purge the helm install as shown below:

  :package: KEYCLOAK COMPONENT

  ✔  Pushed 2 images to the zarf registry
  ⠹  preparing upgrade for keycloak
 WARNING  Retrying (1/3) in 5s: unable to complete the helm chart install/upgrade: another operation
          (install/upgrade/rollback) is in progress
  ⠏  preparing upgrade for keycloak (6s)
 WARNING  Retrying (2/3) in 10s: unable to complete the helm chart install/upgrade: another
          operation (install/upgrade/rollback) is in progress
  ⠙  preparing upgrade for keycloak (16s)
 WARNING  Retrying (3/3) in 20s: unable to complete the helm chart install/upgrade: another
          operation (install/upgrade/rollback) is in progress
  ⠧  purge requested for keycloak
     ERROR:  Failed to deploy bundle: unable to deploy component "keycloak": unable to install helm chart(s):
             unable to install chart after 3 attempts

On initial install this makes some amount of sense, and there is logic that attempts to prevent removal on a failed upgrade. However, in some edge cases (as shown above), Zarf may mistake an upgrade as an initial install and delete something that should instead be rolled back.

These rough steps caused an existing (system critical) deployment to be removed from the cluster:

  1. Successfully deploy a Zarf package including Keycloak
  2. Attempt an upgrade some time later
  3. Realize the upgrade is doomed
  4. ctrl-c out of the deploy (a timeout would have resulted in a rollback, but would have taken ~45mins)
  5. Make a change and re-redeploy
  6. Laugh/cry a little as Keycloak is removed from the cluster

The deletion is almost certainly explained by the aborted deployment leaving the helm release in an incomplete state, but the existence of the behavior in the first place is very concerning for long-lived production deployments.

Describe the solution you'd like

  • Given a Zarf package
  • When when there is an error during deployment
  • Then Zarf attempts a rollback (if available), and does nothing if not

In this particular case, a rollback would not have been possible either due to the invalid helm state -- but in 9/10 situations, I would rather have a broken deployment than no deployment..

Describe alternatives you've considered

Potentially, Zarf could catch the abort signal from ctrl-c and attempt a rollback immediately on exit rather than leaving the helm release in an incomplete state. (half baked idea.. but maybe there's something there..)

@blancharda blancharda added the enhancement ✨ New feature or request label Apr 23, 2024
lucasrod16 added a commit that referenced this issue May 6, 2024
)

## Description
Do not uninstall helm chart after failed install or upgrade

## Related Issue

Fixes #2455

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/.github/CONTRIBUTING.md#developer-workflow)
followed

---------

Co-authored-by: Austin Abro <37223396+AustinAbro321@users.noreply.github.com>
phillebaba pushed a commit to phillebaba/zarf that referenced this issue May 7, 2024
…fenseunicorns#2456)

## Description
Do not uninstall helm chart after failed install or upgrade

## Related Issue

Fixes defenseunicorns#2455

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/.github/CONTRIBUTING.md#developer-workflow)
followed

---------

Co-authored-by: Austin Abro <37223396+AustinAbro321@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement ✨ New feature or request
Projects
No open projects
Status: New
Development

Successfully merging a pull request may close this issue.

1 participant