
Rework HelmRelease reconciliation logic #738

Merged
merged 76 commits into from
Nov 24, 2023

Conversation


@hiddeco hiddeco commented Jul 17, 2023

This PR is practically a rewrite of the controller as a whole: it moves logic into dedicated packages, introduces a v2beta2 API with a changed Status structure, and determines which Helm action to take by observing writes to — and verifying the state of — the Helm storage.

Changes

  • During a Helm action, newly stored releases (and mutations to a previously observed release) are recorded as a Snapshot in the .status.history list of the HelmRelease.
  • The results of a Helm test are recorded in the Snapshot item for this release, to improve observability of the test results as run by the controller.
  • Tests can now be filtered using .spec.test.filters, which works like --filter of the helm test command.
  • When a release name exceeds the Helm length limit, a checksum of the release name is calculated, and a short hash of it is appended to the trimmed name.
  • When a HelmRelease is mutated in such a way that the release target would be different (change in release namespace, storage namespace and/or chart name), the previous release is first uninstalled.
  • The default for .spec.maxHistory has been changed to 5 for performance improvements.
  • Reconciling and Stalled conditions are now being set to indicate when a HelmRelease is undergoing reconciliation, or has reached a point where it is not expected to proceed without e.g. a change in values, a new chart version, etc.
  • Releases in a pending-install, pending-upgrade or pending-rollback state will be unlocked by the controller to a failed state, and retried with a Helm upgrade.
  • Improved logging to help users reason about decisions the controller took, added timestamps to captured Helm logs, and various improvements to condition and event messages.
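The name-shortening behavior described above can be sketched roughly as follows. The 53-character cap is Helm's release name limit; the hash length and separator below are illustrative assumptions, not the controller's exact implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// helmNameLimit is Helm's release name length limit.
const helmNameLimit = 53

// shortenName trims a release name that exceeds the Helm limit and
// appends a short hash of the full name to keep it unique and stable.
// The hash length and separator are assumptions for illustration.
func shortenName(name string) string {
	if len(name) <= helmNameLimit {
		return name
	}
	const hashLen = 12
	sum := sha256.Sum256([]byte(name))
	shortHash := fmt.Sprintf("%x", sum)[:hashLen]
	// Reserve room for the separator and the hash.
	trimmed := name[:helmNameLimit-hashLen-1]
	return fmt.Sprintf("%s-%s", trimmed, shortHash)
}

func main() {
	long := "a-very-long-helm-release-name-that-exceeds-the-fifty-three-character-limit"
	short := shortenName(long)
	fmt.Println(short, len(short)) // result always fits within the 53-character limit
}
```

Because the hash is derived from the full name, two long names that share a 40-character prefix still map to distinct release names.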

Fixes #265
Fixes #270
Fixes #282
Fixes #312
Fixes #322
Fixes #323
Fixes #324
Fixes #367
Fixes #403
Fixes #439
Fixes #471
Fixes #487
Fixes #534
Fixes #536
Fixes #554
Fixes #555
Fixes #644
Fixes #739
Fixes #804


stefanprodan commented Nov 3, 2023

I'm testing this PR here. Here are my findings so far:

  • On HR install, the Ready status is stuck on `HelmChart X is not ready` even when the chart is ready and the installation is underway. I suggest we set Ready=Unknown with the same message as the Reconciling condition to signal users what's going on; the Flux CLI and the various Flux UIs only look at Ready, and this would make HC behave the same as KC.
  • I think the `determining next Helm action based on current state` log line should be moved to debug level.
  • After a controller restart, some HRs are upgraded even if there is no change, with the log `observed release is out of sync with desired state: release chart changed`; reproducible with kube-prom-stack.
  • On namespace deletion, after logging `skipping Helm release uninstallation`, the controller also logs a `not found` error; I think we could skip that.
  • The controller doesn't unlock releases after a node crash/recovery.
  • On `failed to download chart from 'http://source-controller.flux-system.svc.cluster.local./helmchart/helm-benchmark/helm-benchmark-podinfo-786/podinfo-6.5.3.tgz': 404 Not Found`, the controller sets Ready=False with this error; in KC we retry with backoff on 404 instead.
  • The controller panics at unlock:
{"level":"error","ts":"2023-11-09T06:39:36.536Z","logger":"runtime","msg":"Observed a panic: \"invalid memory address or nil pointer dereference\" (runtime error: invalid memory address or nil pointer dereference)\ngoroutine 193 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x181d1e0?, 0x2e98de0})\n\tk8s.io/apimachinery@v0.27.4/pkg/util/runtime/runtime.go:75 +0x7c\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:107 +0xa4\npanic({0x181d1e0, 0x2e98de0})\n\truntime/panic.go:884 +0x1f4\ngit.luolix.top/fluxcd/helm-controller/api/v2beta2.(*Snapshot).FullReleaseName(...)\n\tgit.luolix.top/fluxcd/helm-controller/api@v0.36.2/v2beta2/snapshot_types.go:79\ngit.luolix.top/fluxcd/helm-controller/internal/reconcile.(*Unlock).success(0x400df18180, 0x4009d2a918, {0x4001d120b0, 0xf})\n\tgit.luolix.top/fluxcd/helm-controller/internal/reconcile/unlock.go:140 +0x90\ngit.luolix.top/fluxcd/helm-controller/internal/reconcile.(*Unlock).Reconcile(0x400df18180, {0x4004e388d0?, 0x4001a46538?}, 0x4009d2a918)\n\tgit.luolix.top/fluxcd/helm-controller/internal/reconcile/unlock.go:93 +0x300\ngit.luolix.top/fluxcd/helm-controller/internal/reconcile.(*AtomicRelease).Reconcile(0x4006530d68, {0x1e15128, 0x4006682ab0}, 0x4009d2a918)\n\tgit.luolix.top/fluxcd/helm-controller/internal/reconcile/atomic_release.go:232 +0x600\ngit.luolix.top/fluxcd/helm-controller/internal/controller.(*HelmReleaseReconciler).reconcileRelease(0x40006b82a0, {0x1e15128, 0x4006682ab0}, 0x400a2c5360, 0x400be93500)\n\tgit.luolix.top/fluxcd/helm-controller/internal/controller/helmrelease_controller.go:324 +0x860\ngit.luolix.top/fluxcd/helm-controller/internal/controller.(*HelmReleaseReconciler).Reconcile(0x40006b82a0, {0x1e15128, 0x4006682ab0}, {{{0x40010af490?, 0x30?}, {0x40010af480?, 0xfffe9003b501?}}})\n\tgit.luolix.top/fluxcd/helm-controller/internal/controller/helmrelease_controller.go:206 
+0x3e4\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1e15128?, {0x1e15128?, 0x4006682ab0?}, {{{0x40010af490?, 0x1722140?}, {0x40010af480?, 0x4001a47e08?}}})\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118 +0x8c\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x400041f7c0, {0x1e15080, 0x400036c820}, {0x18f0460?, 0x400168c800?})\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314 +0x2cc\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x400041f7c0, {0x1e15080, 0x400036c820})\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265 +0x1a0\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226 +0x74\ncreated by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\tsigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:222 +0x43c\n"}
{"level":"error","ts":"2023-11-09T06:39:36.536Z","msg":"Reconciler error","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"podinfo-72","namespace":"helm-benchmark"},"namespace":"helm-benchmark","name":"podinfo-72","reconcileID":"afd2e81d-a103-4a89-ac98-8d2b1a5d2996","error":"panic: runtime error: invalid memory address or nil pointer dereference [recovered]"}
  • We should set max history to 5; currently it's set to 10, which makes upgrades much slower (5x) due to the Helm SDK querying 10 Secrets every time.
  • status.history.previous.version shows the wrong version after suspend/resume; previous in my case was v12, but the status shows v3.
  • During upgrades and tests, Ready should be set to Unknown instead of False, so that the various Flux UIs will not show the release as failed.
  • After a rollback, when fixing the values, the new release is not reflected in status under Released=True; only TestSuccess=True is updated for the latest release.
  • When setting a chart version that doesn't exist, the release becomes Ready=False; after reverting the version, the controller doesn't see the old version as being available.
  • When re-applying a v2beta1 release over a failed v2beta2, the controller does an upgrade instead of preserving state.
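The unlock panic reported above originates in `Snapshot.FullReleaseName`. A defensive sketch of the kind of nil-receiver guard that prevents this class of panic follows; the struct fields and name format are illustrative assumptions, not the actual v2beta2 types:

```go
package main

import "fmt"

// Snapshot loosely mirrors a v2beta2 release snapshot; the fields
// here are assumptions for illustration only.
type Snapshot struct {
	Name      string
	Namespace string
}

// FullReleaseName guards against a nil receiver, which the stack
// trace above suggests was the trigger for the panic during unlock.
func (s *Snapshot) FullReleaseName() string {
	if s == nil {
		return ""
	}
	return fmt.Sprintf("%s/%s", s.Namespace, s.Name)
}

func main() {
	var s *Snapshot
	fmt.Printf("%q\n", s.FullReleaseName()) // nil receiver no longer panics
	fmt.Println((&Snapshot{Name: "podinfo-72", Namespace: "helm-benchmark"}).FullReleaseName())
}
```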

@hiddeco hiddeco force-pushed the new-reconciler branch 4 times, most recently from 7e7f7d0 to 5f82c20 on November 8, 2023 12:10
This ensures that certain UIs can continue to display information to
their users while they work on making better use of the new data
available in `v2beta2`.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
@hiddeco hiddeco force-pushed the new-reconciler branch 2 times, most recently from 3c86265 to c9ddc0f on November 22, 2023 10:11
@hiddeco hiddeco marked this pull request as ready for review November 22, 2023 12:04
This allows users to delay updating their `apiVersion` declarations,
as the fields will be known in the previous version.

If we did not do this, the fields would get wiped when the `v2beta1`
resource is applied, potentially causing spurious upgrades.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
This allows the controller to be updated from `v2beta1` to `v2beta2`
without triggering an upgrade to settle state.

It does this by looking at the previous successful release as recorded
for the `v2beta1` object, and if found, recording a snapshot for it in
the new `History` field of the status.

This feature can be disabled by setting the `AdoptLegacyReleases`
feature flag to `false`.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
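Flux controllers generally expose feature flags through a `--feature-gates` command-line flag. Assuming helm-controller follows the same convention, disabling the adoption behavior described above would look roughly like:

```shell
# Assumed flag syntax, based on the common Flux controller convention;
# verify against the helm-controller flags for your release.
helm-controller --feature-gates=AdoptLegacyReleases=false
```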
This allows better feedback to the user on why the controller decided
to uninstall the release.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
This improves continuity while the controller attempts to move the
release forward.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
As it does not offer real value, and creates noise when the logs are
included in an emitted event.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
This ensures that when a chart object has a temporary `Ready=False`
state, the predicate will notice the change to `Ready=True` and
cause an enqueue request.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>

@stefanprodan stefanprodan left a comment


LGTM

Awesome work @hiddeco 🎖️ 🏅 🥇

I've tested the HelmRelease v2beta2 API and controller for two weeks using various configurations:

  • install/upgrade/rollback/uninstall (with and without Helm tests)
  • failure conditions (invalid charts, Helm test failures, controller OOM, Kubernetes API downtime)
  • release auto-recovery when Helm storage has been locked
  • multi-tenancy lockdown and account impersonation
  • verify Kubernetes and Slack events for all actions
  • install/upgrade of Flux monitoring stack (kube-prom-stack and other heavy charts)
  • Helm release promotion with GitHub repository dispatch workflow

I've also run extensive benchmarks on an Apple M1 Max and on a GitHub-hosted runner; the results are impressive. Well done!

| Objects | Type | Flux component | Duration Apple M1 | Duration GH Intel | Max Memory |
|--------:|------|----------------|------------------:|------------------:|-----------:|
| 100 | HelmChart | source-controller | 35s | 35s | 40Mi |
| 100 | HelmRelease | helm-controller | 42s | 42s | 140Mi |
| 500 | HelmChart | source-controller | 1m10s | 1m10s | 68Mi |
| 500 | HelmRelease | helm-controller | 1m58s | 4m40s | 350Mi |
| 1000 | HelmChart | source-controller | 1m45s | 1m45s | 110Mi |
| 1000 | HelmRelease | helm-controller | 5m10s | 14m10s | 620Mi |

For this benchmark, 100, 500 and 1000 Helm releases are installed/upgraded at the same time, each with its own Helm chart.

https://github.com/stefanprodan/flux-benchmark
