
Rework HelmRelease reconciliation logic #738

Merged
merged 76 commits into from
Nov 24, 2023

Conversation


@hiddeco hiddeco commented Jul 17, 2023

This PR is practically a rewrite of the controller as a whole: it moves logic into dedicated packages, introduces a v2beta2 API with a changed Status structure, and determines which Helm action to take by observing writes to — and verifying the state of — the Helm storage.

Changes

  • During a Helm action, newly stored releases (and mutations to a previously observed release) are recorded as a Snapshot in the .status.history list of the HelmRelease.
  • The results of a Helm test are recorded in the Snapshot item for this release, to improve observability of the test results as run by the controller.
  • Tests can now be filtered using .spec.test.filters, which works like --filter of the helm test command.
  • When a release name exceeds the Helm length limit, a checksum of the release name is calculated, and a short hash of it is appended to the trimmed name.
  • When a HelmRelease is mutated in such a way that the release target would be different (change in release namespace, storage namespace and/or chart name), the previous release is first uninstalled.
  • The default for .spec.maxHistory has been changed to 5 for performance improvements.
  • Reconciling and Stalled conditions are now being set to indicate when a HelmRelease is undergoing reconciliation, or has reached a point where it is not expected to proceed without e.g. a change in values, a new chart version, etc.
  • Releases in a pending-install, pending-upgrade or pending-rollback state will be unlocked by the controller to a failed state, and retried with a Helm upgrade.
  • Improved logging to help users reason about decisions the controller took, added timestamps to captured Helm logs, and various improvements to condition and event messages.
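The name-shortening behavior described above can be sketched roughly as follows. The 53-character cap is Helm's release name limit; the hash length and separator below are illustrative assumptions, not the controller's exact implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// helmNameLimit is Helm's release name length limit.
const helmNameLimit = 53

// shortenName trims a release name that exceeds the Helm limit and
// appends a short hash of the full name to keep it unique and stable.
// The hash length and separator are assumptions for illustration.
func shortenName(name string) string {
	if len(name) <= helmNameLimit {
		return name
	}
	const hashLen = 12
	sum := sha256.Sum256([]byte(name))
	shortHash := fmt.Sprintf("%x", sum)[:hashLen]
	// Reserve room for the separator and the hash.
	trimmed := name[:helmNameLimit-hashLen-1]
	return fmt.Sprintf("%s-%s", trimmed, shortHash)
}

func main() {
	long := "a-very-long-helm-release-name-that-exceeds-the-fifty-three-character-limit"
	short := shortenName(long)
	fmt.Println(short, len(short)) // result always fits within the 53-character limit
}
```

Because the hash is derived from the full name, two long names that share a 40-character prefix still map to distinct release names.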

Fixes #265
Fixes #270
Fixes #282
Fixes #312
Fixes #322
Fixes #323
Fixes #324
Fixes #367
Fixes #403
Fixes #439
Fixes #471
Fixes #487
Fixes #534
Fixes #536
Fixes #554
Fixes #555
Fixes #644
Fixes #739
Fixes #804


stefanprodan commented Nov 3, 2023

I'm testing this PR here. Here are my findings so far:

  • On HR install, the Ready status is stuck on `HelmChart X is not ready` even when the chart is ready and the installation is underway. I suggest we set Ready=Unknown with the same message as the Reconciling condition to signal users what's going on; the Flux CLI and the various Flux UIs only look at Ready, and this would make HC behave the same as KC.
  • I think the `determining next Helm action based on current state` log line should be moved to debug level.
  • After a controller restart, some HRs are upgraded even if there is no change, with the log `observed release is out of sync with desired state: release chart changed`; reproducible with kube-prom-stack.
  • On namespace deletion, after logging `skipping Helm release uninstallation`, the controller also logs a `not found` error; I think we could skip that.
  • The controller doesn't unlock releases after a node crash/recovery.
  • On `failed to download chart from 'http://source-controller.flux-system.svc.cluster.local./helmchart/helm-benchmark/helm-benchmark-podinfo-786/podinfo-6.5.3.tgz': 404 Not Found`, the controller sets Ready=False with this error; in KC we retry with backoff on 404 instead.
  • The controller panics at unlock:
{"level":"error","ts":"2023-11-09T06:39:36.536Z","logger":"runtime","msg":"Observed a panic: \"invalid memory address or nil pointer dereference\" (runtime error: invalid memory address or nil pointer dereference)\ngoroutine 193 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x181d1e0?, 0x2e98de0})\n\tk8s.io/apimachinery@v0.27.4/pkg/util/runtime/runtime.go:75 +0x7c\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:107 +0xa4\npanic({0x181d1e0, 0x2e98de0})\n\truntime/panic.go:884 +0x1f4\ngit.luolix.top/fluxcd/helm-controller/api/v2beta2.(*Snapshot).FullReleaseName(...)\n\tgit.luolix.top/fluxcd/helm-controller/api@v0.36.2/v2beta2/snapshot_types.go:79\ngit.luolix.top/fluxcd/helm-controller/internal/reconcile.(*Unlock).success(0x400df18180, 0x4009d2a918, {0x4001d120b0, 0xf})\n\tgit.luolix.top/fluxcd/helm-controller/internal/reconcile/unlock.go:140 +0x90\ngit.luolix.top/fluxcd/helm-controller/internal/reconcile.(*Unlock).Reconcile(0x400df18180, {0x4004e388d0?, 0x4001a46538?}, 0x4009d2a918)\n\tgit.luolix.top/fluxcd/helm-controller/internal/reconcile/unlock.go:93 +0x300\ngit.luolix.top/fluxcd/helm-controller/internal/reconcile.(*AtomicRelease).Reconcile(0x4006530d68, {0x1e15128, 0x4006682ab0}, 0x4009d2a918)\n\tgit.luolix.top/fluxcd/helm-controller/internal/reconcile/atomic_release.go:232 +0x600\ngit.luolix.top/fluxcd/helm-controller/internal/controller.(*HelmReleaseReconciler).reconcileRelease(0x40006b82a0, {0x1e15128, 0x4006682ab0}, 0x400a2c5360, 0x400be93500)\n\tgit.luolix.top/fluxcd/helm-controller/internal/controller/helmrelease_controller.go:324 +0x860\ngit.luolix.top/fluxcd/helm-controller/internal/controller.(*HelmReleaseReconciler).Reconcile(0x40006b82a0, {0x1e15128, 0x4006682ab0}, {{{0x40010af490?, 0x30?}, {0x40010af480?, 0xfffe9003b501?}}})\n\tgit.luolix.top/fluxcd/helm-controller/internal/controller/helmrelease_controller.go:206 
+0x3e4\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1e15128?, {0x1e15128?, 0x4006682ab0?}, {{{0x40010af490?, 0x1722140?}, {0x40010af480?, 0x4001a47e08?}}})\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118 +0x8c\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x400041f7c0, {0x1e15080, 0x400036c820}, {0x18f0460?, 0x400168c800?})\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314 +0x2cc\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x400041f7c0, {0x1e15080, 0x400036c820})\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265 +0x1a0\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226 +0x74\ncreated by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:222 +0x43c\n"}
{"level":"error","ts":"2023-11-09T06:39:36.536Z","msg":"Reconciler error","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"podinfo-72","namespace":"helm-benchmark"},"namespace":"helm-benchmark","name":"podinfo-72","reconcileID":"afd2e81d-a103-4a89-ac98-8d2b1a5d2996","error":"panic: runtime error: invalid memory address or nil pointer dereference [recovered]"}
  • We should set max history to 5; currently it's set to 10, which makes upgrades much slower (5x) due to the Helm SDK querying 10 Secrets every time.
  • status.history.previous.version shows the wrong version after suspend/resume; previous in my case was v12, but the status shows v3.
  • During upgrades and tests, Ready should be set to Unknown instead of False, so that the various Flux UIs will not show the release as failed.
  • After a rollback, when fixing the values, the new release is not reflected in status under Released=True; only TestSuccess=True is updated for the latest release.
  • When setting a chart version that doesn't exist, the release becomes Ready=False; after reverting the version, the controller doesn't see the old version as being available.
  • When re-applying a v2beta1 release over a failed v2beta2, the controller does an upgrade instead of preserving state.
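The unlock panic reported above originates in `Snapshot.FullReleaseName`. A defensive sketch of the kind of nil-receiver guard that prevents this class of panic follows; the struct fields and name format are illustrative assumptions, not the actual v2beta2 types:

```go
package main

import "fmt"

// Snapshot loosely mirrors a v2beta2 release snapshot; the fields
// here are assumptions for illustration only.
type Snapshot struct {
	Name      string
	Namespace string
}

// FullReleaseName guards against a nil receiver, which the stack
// trace above suggests was the trigger for the panic during unlock.
func (s *Snapshot) FullReleaseName() string {
	if s == nil {
		return ""
	}
	return fmt.Sprintf("%s/%s", s.Namespace, s.Name)
}

func main() {
	var s *Snapshot
	fmt.Printf("%q\n", s.FullReleaseName()) // nil receiver no longer panics
	fmt.Println((&Snapshot{Name: "podinfo-72", Namespace: "helm-benchmark"}).FullReleaseName())
}
```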

@hiddeco hiddeco force-pushed the new-reconciler branch 4 times, most recently from 7e7f7d0 to 5f82c20 on November 8, 2023 12:10
This ensures that certain UIs can continue to display information to
their users while they work on making better use of the new data
available in `v2beta2`.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
@hiddeco hiddeco force-pushed the new-reconciler branch 2 times, most recently from 3c86265 to c9ddc0f on November 22, 2023 10:11
@hiddeco hiddeco marked this pull request as ready for review November 22, 2023 12:04
This allows users to delay updating their `apiVersion` declarations,
as the fields will be known in the previous version.

If we did not do this, the fields would get wiped when the `v2beta1`
resource is applied, potentially causing spurious upgrades.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
This allows the controller to be updated from `v2beta1` to `v2beta2`
without triggering an upgrade to settle state.

It does this by looking at the previous successful release as recorded
for the `v2beta1` object, and if found, recording a snapshot for it in
the new `History` field of the status.

This feature can be disabled by setting the `AdoptLegacyReleases`
feature flag to `false`.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
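Flux controllers generally expose feature flags through a `--feature-gates` command-line flag. Assuming helm-controller follows the same convention, disabling the adoption behavior described above would look roughly like:

```shell
# Assumed flag syntax, based on the common Flux controller convention;
# verify against the helm-controller flags for your release.
helm-controller --feature-gates=AdoptLegacyReleases=false
```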
This allows better feedback to the user on why the controller decided
to uninstall the release.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
This improves continuity while the controller attempts to move the
release forward.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
As it does not offer real value, and creates noise when the logs are
included in an emitted event.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
This ensures that when a chart object has a temporary `Ready=False`
state, the predicate will notice the change to `Ready=True` and
cause an enqueue request.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>

@stefanprodan stefanprodan left a comment


LGTM

Awesome work @hiddeco 🎖️ 🏅 🥇

I've tested the HelmRelease v2beta2 API and controller for two weeks using various configurations:

  • install/upgrade/rollback/uninstall (with and without Helm tests)
  • failure conditions (invalid charts, Helm test failures, controller OOM, Kubernetes API downtime)
  • release auto-recovery when Helm storage has been locked
  • multi-tenancy lockdown and account impersonation
  • verify Kubernetes and Slack events for all actions
  • install/upgrade of Flux monitoring stack (kube-prom-stack and other heavy charts)
  • Helm release promotion with GitHub repository dispatch workflow

I've also run extensive benchmarks on an Apple M1 Max and on a GitHub-hosted runner; the results are impressive. Well done!

| Objects | Type | Flux component | Duration Apple M1 | Duration GH Intel | Max Memory |
|--------:|------|----------------|------------------:|------------------:|-----------:|
| 100 | HelmChart | source-controller | 35s | 35s | 40Mi |
| 100 | HelmRelease | helm-controller | 42s | 42s | 140Mi |
| 500 | HelmChart | source-controller | 1m10s | 1m10s | 68Mi |
| 500 | HelmRelease | helm-controller | 1m58s | 4m40s | 350Mi |
| 1000 | HelmChart | source-controller | 1m45s | 1m45s | 110Mi |
| 1000 | HelmRelease | helm-controller | 5m10s | 14m10s | 620Mi |

For this benchmark, 100, 500 and 1000 Helm releases are installed/upgraded at the same time, each with its own Helm chart.

https://github.com/stefanprodan/flux-benchmark
