
v1: re-implement Node failure recovery #168

Merged
merged 2 commits into main from chris/p/pvc-unbinder on Jul 22, 2024

Conversation

chrisseto
Contributor

Due to the usage of local NVMe disks, redpanda deployments are particularly sensitive to Node failure. Whenever a Node crashes, the resultant redpanda Pod will be stuck in a Pending state due to the NodeAffinity of its PV.

This commit implements a "PVCUnbinder" reconciler that watches for such cases and attempts automatic remediation by "unbinding" PVs. See the implementation for details on the strategy.

This implementation is similar to, yet much more paranoid than, the RedpandaNodePVCReconciler. Ideally, the two implementations should be merged before long.

Fixes #166

@chrisseto chrisseto requested a review from birdayz July 10, 2024 19:34
@chrisseto chrisseto force-pushed the chris/p/pvc-unbinder branch 2 times, most recently from 7a337a9 to cc52fee on July 15, 2024 15:53
@chrisseto chrisseto marked this pull request as ready for review July 15, 2024 15:54
@@ -320,6 +317,20 @@ func main() {
os.Exit(1)
}

if unbindPVCsAfter <= 0 {

I'm not crazy about 0 and negative numbers being the "disabled" case for the controller, but it does save having a separate bool flag. Plus, a user shouldn't ever see it AFAICT.

Contributor Author

I like to avoid flag duplication and additional validation when possible. Zero or a negative duration would be invalid for the "timeout" field it feeds, so instead of throwing an error, it's easier to disable the controller. And as you point out, this isn't a user-facing flag.
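
For illustration, here is a minimal sketch of how that gating in main() might look. The setup call, the Timeout field, and setupLog are assumptions for the sketch, not code taken from this PR:

// Sketch only: a zero or negative duration acts as the "disabled" switch,
// so no separate --enable-pvc-unbinder bool flag is needed.
if unbindPVCsAfter > 0 {
    if err := (&pvcunbinder.Reconciler{
        Client:  mgr.GetClient(),
        Timeout: unbindPVCsAfter, // hypothetical field name
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller", "controller", "PVCUnbinder")
        os.Exit(1)
    }
}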

return ctrl.Result{RequeueAfter: requeueAfter, Requeue: ok}, nil
}

// NB: We denote PVCs that are deleted as a nil entry within this map. If a

Nit: you could encode this logic by hiding the map within a struct as a private field and exposing MarkDeleted(key client.ObjectKey) and Ignore(key client.ObjectKey).
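
For reference, a minimal sketch of the suggested wrapper; the method names come from the comment above, while the map type and the Ignore semantics are assumptions based on the surrounding code:

// pvcTracker hides the map so callers can only express the two intents
// described above. A nil entry marks a PVC that has been deleted.
type pvcTracker struct {
    byKey map[client.ObjectKey]*corev1.PersistentVolumeClaim
}

// MarkDeleted records that the PVC with the given key was deleted.
func (t *pvcTracker) MarkDeleted(key client.ObjectKey) {
    t.byKey[key] = nil
}

// Ignore removes the PVC with the given key from consideration entirely.
func (t *pvcTracker) Ignore(key client.ObjectKey) {
    delete(t.byKey, key)
}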

Contributor Author

Good callout! I'm going to decline making this change unless this package/function becomes more complex. The extra code volume and loss of map ergonomics don't currently feel like a worthwhile trade-off. Once/if iterators land in Go 1.23, this would be a very nice change.

pvcByKey := map[client.ObjectKey]*corev1.PersistentVolumeClaim{}

for _, pvcKey := range StsPVCs(&pod) {
var pvc corev1.PersistentVolumeClaim

If you pull this out above the loop, you can prevent this pointer from getting deleted and recreated on each iteration.

Then again, you'd have to reset it to nil to match your existing behavior. 🤷🏻 Not a huge cost to the runtime as is.

Contributor Author

pvc isn't a pointer. It has to be re-created every loop, otherwise we'd be setting every element in pvcByKey to a pointer to the same value.
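
To make the concern concrete, here is a standalone sketch (not the PR's code; c, ctx, and keys are assumed) showing why hoisting the variable would be a bug:

// BUG variant: pvc is hoisted above the loop, so &pvc is the same address
// on every iteration and all map entries end up aliasing whichever PVC was
// fetched last.
var pvc corev1.PersistentVolumeClaim
for _, pvcKey := range keys {
    if err := c.Get(ctx, pvcKey, &pvc); err == nil {
        pvcByKey[pvcKey] = &pvc
    }
}

// Declaring pvc inside the loop, as the PR does, gives each iteration its
// own value, so every stored pointer refers to a distinct PVC.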

pvs = append(pvs, pv)
}

// 3. Ensure that all PVs have reclaim set to Retain

I really like that you've numbered the major steps in the process here. It helps me know when to clear out my mental "cache" of relevant code I've read.
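
As an aside, a hedged sketch of what step 3 (forcing the Retain reclaim policy) could look like with the controller-runtime client, assuming pvs is a slice of *corev1.PersistentVolume; this is illustrative rather than the PR's exact code:

// Flip each PV to Retain before touching its PVC so that deleting the claim
// can't cause the underlying volume (and its data) to be reclaimed.
for _, pv := range pvs {
    if pv.Spec.PersistentVolumeReclaimPolicy == corev1.PersistentVolumeReclaimRetain {
        continue
    }
    orig := pv.DeepCopy()
    pv.Spec.PersistentVolumeReclaimPolicy = corev1.PersistentVolumeReclaimRetain
    if err := r.Client.Patch(ctx, pv, client.MergeFrom(orig)); err != nil {
        return ctrl.Result{}, err
    }
}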

@@ -0,0 +1,117 @@
//nolint:gosec // this code is for tests
package pvcunbinder

I would hide this from consumers and make it available only to tests by using pvcunbinder_test for both files.

Contributor Author

I've refactored this out into a dedicated k3d package.


@t-eckert t-eckert left a comment

Nice! Very clean and easy to read. I don't see any issues.

@chrisseto
Contributor Author

@c4milo has pointed out that Scylla has a similar feature that we could draw inspiration from.

Some additional notes that likely deserve a first class comment somewhere:

This controller explicitly does not watch for Node deletions. Node deletion events could be missed, and handling them would otherwise require a hefty loop that constantly scans all PVs, which doesn't really fit the reconciler model. Node deletions are also a cloud-specific detail, meaning our test cases wouldn't be able to catch the intricacies. Instead, we rely on a more universal K8s signal: Pods being stuck in Pending due to evictions from the taint-eviction-manager.
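
For illustration, a minimal sketch of the kind of "Pending for too long" check this implies; the PR's actual pvcUnbinderPredicate and timing details may differ:

// pendingTooLong reports whether a Pod has been stuck in Pending longer than
// the configured timeout, using the PodScheduled condition's transition time
// as a rough "stuck since" marker (a simplification).
func pendingTooLong(pod *corev1.Pod, timeout time.Duration) bool {
    if pod.Status.Phase != corev1.PodPending {
        return false
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
            return time.Since(cond.LastTransitionTime.Time) > timeout
        }
    }
    return false
}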

@chrisseto chrisseto force-pushed the chris/p/pvc-unbinder branch 3 times, most recently from 2386622 to 667f4dd on July 16, 2024 18:55
@@ -185,6 +181,7 @@ func main() {
flag.StringSliceVar(&additionalControllers, "additional-controllers", []string{""}, fmt.Sprintf("which controllers to run, available: all, %s", strings.Join(availableControllers, ", ")))
flag.BoolVar(&operatorMode, "operator-mode", true, "enables to run as an operator, setting this to false will disable cluster (deprecated), redpanda resources reconciliation.")
flag.BoolVar(&enableHelmControllers, "enable-helm-controllers", true, "if a namespace is defined and operator mode is true, this enables the use of helm controllers to manage fluxcd helm resources.")
flag.DurationVar(&unbindPVCsAfter, "unbind-pvcs-after", 0, "if not zero, runs the PVCUnbinder controller which attempts to 'unbind' the PVCs' of Pods that are Pending for longer than the given duration")
Member

I like that it only does unbinding if the pod is pending for a given amount of time.

Member

@c4milo c4milo Jul 17, 2024

Is it possible to tell it not to unbind PVCs if the majority of redpanda pods (quorum) are in a Pending state, and instead page a human, to prevent total data loss? Or is the "ensuring Retain policy" constraint taking care of that case already?

Contributor Author

This is copied from our Slack conversation:

is it possible to tell it not to unbind PVCs if the majority of redpanda pods (quorum) are in pending state?

This is the case I don't have a solution for, though I don't think we need one. If all the Pods go into a Pending state, a human should be getting paged anyway. The controller will unbind the PVCs of all the Pods but leave the PVs around.

The worst case here is that a completely new redpanda cluster is created. That would only happen if every redpanda Pod were rolled at the same time and the scheduler had a bug (which we have seen). The controller would unbind all PVs and delete the PVCs. Then Kubernetes would create all-new PVs because none of the existing ones could be used, due to the scheduler bug.

If Pods were rolled one at a time, it's just a race over whether data can up-replicate in time. Though Pods being unavailable should stop any updates unless something really bad is happening.

- "--superusers-prefix=__redpanda_system__"
- "--log-level=trace"
- "--unsafe-decommission-failed-brokers=true"
- "--unbind-pvcs-after=5s"
Member

If we needed to force-unbind a PVC, I guess the route would be through k9s or kubectl, by explicitly deleting the PVC?

// which _might_ reclaim the now freed volume.
// 5. Deleting the Pod to re-trigger PVC creation and rebinding.
type Reconciler struct {
Client client.Client
Member

k8s API client I suppose?

})

// Paranoid check, ensure that the Pod we've fetched still passes our predicate.
if idx == -1 || !pvcUnbinderPredicate(pod) {
Member

@c4milo c4milo Jul 17, 2024

It does indeed seem unnecessary, unless we refetch the Pod's metadata from the k8s API.

Member

@c4milo c4milo left a comment

This is probably the cleanest k8s code I've seen at Redpanda.

@chrisseto
Contributor Author

TFTRs! Need to figure out why CI is suddenly being fussy 🤔

@chrisseto
Contributor Author

CI failure is unrelated to this PR and I'm struggling to understand how the test that's now failing could have ever passed...

This commit disables the funlen, goerr113, prealloc, testpackage, and unparam
linters.

Due to the usage of local NVMe disks, redpanda deployments are particularly
sensitive to Node failure. Whenever a Node crashes, the resultant redpanda Pod
will be stuck in a Pending state due to the NodeAffinity of its PV.

This commit implements a "PVCUnbinder" reconciler that watches for such cases
and attempts automatic remediation by "unbinding" PVs. See the implementation
for details on the strategy.

This implementation is similar to, yet much more paranoid than, the
`RedpandaNodePVCReconciler`. Ideally, the two implementations should be merged
before long.

Fixes #166
@chrisseto chrisseto merged commit 4b49ac9 into main Jul 22, 2024
4 checks passed
@chrisseto chrisseto deleted the chris/p/pvc-unbinder branch July 22, 2024 20:39
Closes #166: V1: Handle node / disk failures for deployments using local-path-provisioner