
v1: re-implement Node failure recovery #168

Merged
merged 2 commits into main from chris/p/pvc-unbinder on Jul 22, 2024

Conversation

chrisseto
Contributor

Due to the usage of local NVMe disks, redpanda deployments are particularly sensitive to Node failure. Whenever a Node crashes, the resultant redpanda Pod will be stuck in a Pending state due to the NodeAffinity of its PV.

This commit implements a "PVCUnbinder" reconciler that watches for such cases and attempts automatic remediation by "unbinding" PVs. See the implementation for details on the strategy.

This implementation is similar to, yet much more paranoid than, the RedpandaNodePVCReconciler. Ideally, the two implementations should be merged before long.

Fixes #166

@chrisseto chrisseto requested a review from birdayz July 10, 2024 19:34
@chrisseto chrisseto force-pushed the chris/p/pvc-unbinder branch 2 times, most recently from 7a337a9 to cc52fee on July 15, 2024 15:53
@chrisseto chrisseto marked this pull request as ready for review July 15, 2024 15:54
@@ -320,6 +317,20 @@ func main() {
os.Exit(1)
}

if unbindPVCsAfter <= 0 {

I'm not crazy about 0 and negative numbers being the "disabled" case for the controller, but it does save having a separate bool flag. Plus, a user shouldn't ever see it AFAICT.

Contributor Author

I like to avoid flag duplication and additional validation when possible. Zero or a negative duration would be invalid for the "timeout" field it feeds, so instead of throwing an error, it's easier to disable the controller. And as you point out, this isn't a user-facing flag.
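
For illustration, here is a minimal sketch of how that gating in main() might look. The setup call, the Timeout field, and setupLog are assumptions for the sketch, not code taken from this PR:

// Sketch only: a zero or negative duration acts as the "disabled" switch,
// so no separate --enable-pvc-unbinder bool flag is needed.
if unbindPVCsAfter > 0 {
    if err := (&pvcunbinder.Reconciler{
        Client:  mgr.GetClient(),
        Timeout: unbindPVCsAfter, // hypothetical field name
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller", "controller", "PVCUnbinder")
        os.Exit(1)
    }
}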

return ctrl.Result{RequeueAfter: requeueAfter, Requeue: ok}, nil
}

// NB: We denote PVCs that are deleted as a nil entry within this map. If a

Nit: you could encode this logic by hiding the map within a struct as a private field and exposing MarkDeleted(key client.ObjectKey) and Ignore(key client.ObjectKey).
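
For reference, a minimal sketch of the suggested wrapper; the method names come from the comment above, while the map type and the Ignore semantics are assumptions based on the surrounding code:

// pvcTracker hides the map so callers can only express the two intents
// described above. A nil entry marks a PVC that has been deleted.
type pvcTracker struct {
    byKey map[client.ObjectKey]*corev1.PersistentVolumeClaim
}

// MarkDeleted records that the PVC with the given key was deleted.
func (t *pvcTracker) MarkDeleted(key client.ObjectKey) {
    t.byKey[key] = nil
}

// Ignore removes the PVC with the given key from consideration entirely.
func (t *pvcTracker) Ignore(key client.ObjectKey) {
    delete(t.byKey, key)
}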

Contributor Author

Good callout! I'm going to decline making this change unless this package/function becomes more complex. The extra code volume and loss of map ergonomics don't currently feel like a worthwhile trade-off. Once/if iterators land in Go 1.23, this would be a very nice change.

pvcByKey := map[client.ObjectKey]*corev1.PersistentVolumeClaim{}

for _, pvcKey := range StsPVCs(&pod) {
var pvc corev1.PersistentVolumeClaim

If you pull this out above the loop, you can prevent this pointer from getting deleted and recreated on each iteration.

Then again, you'd have to reset it to nil to match your existing behavior. 🤷🏻 Not a huge cost to the runtime as is.

Contributor Author

pvc isn't a pointer. It has to be re-created every loop, otherwise we'd be setting every element in pvcByKey to a pointer to the same value.
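
To make the concern concrete, here is a standalone sketch (not the PR's code; c, ctx, and keys are assumed) showing why hoisting the variable would be a bug:

// BUG variant: pvc is hoisted above the loop, so &pvc is the same address
// on every iteration and all map entries end up aliasing whichever PVC was
// fetched last.
var pvc corev1.PersistentVolumeClaim
for _, pvcKey := range keys {
    if err := c.Get(ctx, pvcKey, &pvc); err == nil {
        pvcByKey[pvcKey] = &pvc
    }
}

// Declaring pvc inside the loop, as the PR does, gives each iteration its
// own value, so every stored pointer refers to a distinct PVC.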

pvs = append(pvs, pv)
}

// 3. Ensure that all PVs have reclaim set to Retain

I really like that you've numbered the major steps in the process here. It helps me know when to clear out my mental "cache" of relevant code I've read.
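
As an aside, a hedged sketch of what step 3 (forcing the Retain reclaim policy) could look like with the controller-runtime client, assuming pvs is a slice of *corev1.PersistentVolume; this is illustrative rather than the PR's exact code:

// Flip each PV to Retain before touching its PVC so that deleting the claim
// can't cause the underlying volume (and its data) to be reclaimed.
for _, pv := range pvs {
    if pv.Spec.PersistentVolumeReclaimPolicy == corev1.PersistentVolumeReclaimRetain {
        continue
    }
    orig := pv.DeepCopy()
    pv.Spec.PersistentVolumeReclaimPolicy = corev1.PersistentVolumeReclaimRetain
    if err := r.Client.Patch(ctx, pv, client.MergeFrom(orig)); err != nil {
        return ctrl.Result{}, err
    }
}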

@@ -0,0 +1,117 @@
//nolint:gosec // this code is for tests
package pvcunbinder

I would hide this from consumers and make it available only to tests by using pvcunbinder_test for both files.

Contributor Author

I've refactored this out into a dedicated k3d package.


@t-eckert t-eckert left a comment

Nice! Very clean and easy to read. I don't see any issues.

@chrisseto
Contributor Author

@c4milo has pointed out that Scylla has a similar feature that we could draw inspiration from.

Some additional notes that likely deserve a first class comment somewhere:

This controller explicitly does not watch for Node deletions. Node deletion events could be missed, and handling them would otherwise require a hefty loop that constantly scans all PVs, which doesn't really fit the reconciler model. Node deletions are also a cloud-specific detail, meaning our test cases wouldn't be able to catch the intricacies. Instead, we rely on a more universal K8s signal: Pods being stuck in Pending due to evictions from the taint-eviction-manager.
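
For illustration, a minimal sketch of the kind of "Pending for too long" check this implies; the PR's actual pvcUnbinderPredicate and timing details may differ:

// pendingTooLong reports whether a Pod has been stuck in Pending longer than
// the configured timeout, using the PodScheduled condition's transition time
// as a rough "stuck since" marker (a simplification).
func pendingTooLong(pod *corev1.Pod, timeout time.Duration) bool {
    if pod.Status.Phase != corev1.PodPending {
        return false
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
            return time.Since(cond.LastTransitionTime.Time) > timeout
        }
    }
    return false
}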

@chrisseto chrisseto force-pushed the chris/p/pvc-unbinder branch 3 times, most recently from 2386622 to 667f4dd on July 16, 2024 18:55
@@ -185,6 +181,7 @@ func main() {
flag.StringSliceVar(&additionalControllers, "additional-controllers", []string{""}, fmt.Sprintf("which controllers to run, available: all, %s", strings.Join(availableControllers, ", ")))
flag.BoolVar(&operatorMode, "operator-mode", true, "enables to run as an operator, setting this to false will disable cluster (deprecated), redpanda resources reconciliation.")
flag.BoolVar(&enableHelmControllers, "enable-helm-controllers", true, "if a namespace is defined and operator mode is true, this enables the use of helm controllers to manage fluxcd helm resources.")
flag.DurationVar(&unbindPVCsAfter, "unbind-pvcs-after", 0, "if not zero, runs the PVCUnbinder controller which attempts to 'unbind' the PVCs' of Pods that are Pending for longer than the given duration")
Member

I like that it only does unbinding if the pod is pending for a given amount of time.

Member

@c4milo c4milo Jul 17, 2024

Is it possible to tell it not to unbind PVCs if the majority of redpanda pods (quorum) are in a Pending state, and instead page a human, to prevent total data loss? Or is the "ensuring Retain policy" constraint taking care of that case already?

Contributor Author

This is copied from our Slack conversation:

is it possible to tell it not to unbind PVCs if the majority of redpanda pods (quorum) are in pending state?

This is the case I don't have a solution for, though I don't think we need one. If all the Pods go into a Pending state, a human should be getting paged anyway. The controller will unbind the PVCs of all the Pods but leave the PVs around.

The worst case here is that a completely new redpanda cluster is created. That would only happen if every redpanda Pod were rolled at the same time and the scheduler had a bug (which we have seen). The controller would unbind all PVs and delete the PVCs. Then Kubernetes would create all-new PVs because none of the existing ones could be used, due to the scheduler bug.

If Pods were rolled one at a time, it's just a race over whether data can up-replicate in time. Though Pods being unavailable should stop any updates unless something really bad is happening.

- "--superusers-prefix=__redpanda_system__"
- "--log-level=trace"
- "--unsafe-decommission-failed-brokers=true"
- "--unbind-pvcs-after=5s"
Member

If we needed to force-unbind a PVC, I guess the route would be through k9s or kubectl, by explicitly deleting the PVC?

// which _might_ reclaim the now freed volume.
// 5. Deleting the Pod to re-trigger PVC creation and rebinding.
type Reconciler struct {
Client client.Client
Member

k8s API client I suppose?

})

// Paranoid check, ensure that the Pod we've fetched still passes our predicate.
if idx == -1 || !pvcUnbinderPredicate(pod) {
Member

@c4milo c4milo Jul 17, 2024

It does indeed seem unnecessary, unless we refetch the Pod's metadata from the k8s API.

Member

@c4milo c4milo left a comment

This is probably the cleanest k8s code I've seen at Redpanda.

@chrisseto
Contributor Author

TFTRs! Need to figure out why CI is suddenly being fussy 🤔

@chrisseto
Contributor Author

CI failure is unrelated to this PR and I'm struggling to understand how the test that's now failing could have ever passed...

This commit disables the funlen, goerr113, prealloc, testpackage, and unparam
linters.

Due to the usage of local NVMe disks, redpanda deployments are particularly
sensitive to Node failure. Whenever a Node crashes, the resultant redpanda Pod
will be stuck in a Pending state due to the NodeAffinity of its PV.

This commit implements a "PVCUnbinder" reconciler that watches for such cases
and attempts automatic remediation by "unbinding" PVs. See the implementation
for details on the strategy.

This implementation is similar to, yet much more paranoid than, the
`RedpandaNodePVCReconciler`. Ideally, the two implementations should be merged
before long.

Fixes #166
@chrisseto chrisseto merged commit 4b49ac9 into main Jul 22, 2024
4 checks passed
@chrisseto chrisseto deleted the chris/p/pvc-unbinder branch July 22, 2024 20:39
Closes #166: V1: Handle node / disk failures for deployments using local-path-provisioner