
V1: Handle node / disk failures for deployments using local-path-provisioner #166

Open
chrisseto opened this issue Jun 28, 2024 · 0 comments · May be fixed by #168
chrisseto commented Jun 28, 2024

In #160, some untrustworthy code was removed in response to a series of incidents. We accepted the technical debt that node failures would require manual intervention to delete hanging PVCs. A recent run of chaos tests demonstrated to us that this is not an acceptable intermediate state. We need to handle node failures automatically and safely.

Reverting #160 is not an option (IMO), as recovery from node failures was almost a side effect of that code path rather than its primary focus.

Tersely, our requirements are:

  • This feature needs to be possible to enable and disable (i.e., feature flagged), ideally via either a CLI flag on the operator or an annotation on the V1 CRD. (A rough sketch of the flag wiring follows this list.)
  • When a Node running a Redpanda Broker is suddenly removed from the cluster, a new Pod must be scheduled successfully. Currently, the replacement Pod will fail to schedule due to a leftover PersistentVolume with an affinity to the now removed Node.
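
As a minimal sketch of the CLI-flag variant of the feature flag (the flag name `--enable-pvc-unbinder` is hypothetical, not an existing operator option):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag; the real operator may use a different name or read
	// an annotation on the V1 CRD instead.
	enablePVCUnbinder := flag.Bool("enable-pvc-unbinder", false,
		"automatically unbind PVs whose backing Node has been removed")
	flag.Parse()

	if *enablePVCUnbinder {
		fmt.Println("PVC unbinder enabled")
		// The reconciler would be registered with the controller manager here.
	}
}
```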

Below is my proposal for how to handle this situation. @birdayz potentially has a different viewpoint and will share it on this issue.

IMO, the root cause of this issue is an incorrect representation of the actual cluster state due to the usage of local-path-provisioner. When a Kubernetes node is lost, the disk backing the PersistentVolume associated with that Node is also lost. However, Kubernetes is ignorant of this: the PersistentVolume continues to exist within the cluster, albeit with an affinity to the now-lost Node.

To resolve this, I propose that we run a new reconciler in the operator that resolves the inconsistency. We're understandably a bit hesitant to delete disks that might contain data, so we'll need to be careful and clever about how this is done. Ideally, we'd preserve the ability to remount a disk should the Node hosting it rejoin the cluster for any reason. I think we could reasonably "unbind" the PV and PVC whenever a PV contains an unsatisfiable constraint. Kubernetes' docs suggest that a PV's scheduling constraints are taken into account during the binding process.[1] If this is true (we'll need to verify and test this behavior), we can satisfy all constraints with an "unbind", as the dangling PV would only be rebound if the Node hosting the disk were to rejoin the cluster.[2]
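
To make the "unsatisfiable constraint" check concrete, here is a rough sketch, assuming local-path-provisioner pins its PVs with a required `kubernetes.io/hostname` matchExpression; the helper name `hasUnsatisfiableNodeAffinity` is illustrative, not existing code:

```go
package pvcunbinder

import (
	corev1 "k8s.io/api/core/v1"
)

// hasUnsatisfiableNodeAffinity reports whether pv's required node affinity
// pins it to hostnames that no current Node carries. local-path-provisioner
// sets a required matchExpression on kubernetes.io/hostname, so only that
// key is inspected here.
func hasUnsatisfiableNodeAffinity(pv *corev1.PersistentVolume, nodes []corev1.Node) bool {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return false // no constraint to violate
	}

	// Hostnames that still exist in the cluster.
	existing := map[string]bool{}
	for _, n := range nodes {
		existing[n.Labels[corev1.LabelHostname]] = true
	}

	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		termSatisfiable := true
		for _, expr := range term.MatchExpressions {
			if expr.Key != corev1.LabelHostname || expr.Operator != corev1.NodeSelectorOpIn {
				continue // ignore expressions we don't understand
			}
			matched := false
			for _, v := range expr.Values {
				if existing[v] {
					matched = true
					break
				}
			}
			if !matched {
				termSatisfiable = false
				break
			}
		}
		if termSatisfiable {
			return false // at least one term can still match a live Node
		}
	}
	return true // no term is satisfiable; the backing Node is gone
}
```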

To trigger the unbinding behavior, we could:

  1. Add a finalizer to Nodes that unbinds all PVs with an affinity to that Node.
  2. Periodically check the mountability of all PVs and unbind those deemed unmountable.
  3. Watch for Pods that are unschedulable because of their PVC's PV and unbind that PV after some timeout.
  4. Add a finalizer to all Pods that unbinds their PVCs on Pod deletion and relies on the rebinding behavior.

I like 3 the most, as it's an easy condition to watch for and we can very easily attach timeouts to the unbinding behavior. 4 is likely to cause issues with Redpanda due to the potential to shuffle BrokerIDs and hostnames. 1 feels error-prone, as the exact way a Node is removed from a cluster is unclear. 2 feels error-prone, as we might incorrectly implement the check for schedulability.
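
To illustrate option 3, a rough sketch of the trigger condition (`shouldUnbind` and `unbindTimeout` are illustrative names, not existing code): only act on Pods that have been stuck Unschedulable for longer than a grace period, and confirm separately that the cause is a PV pinned to a missing Node.

```go
package pvcunbinder

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldUnbind reports whether pod has been unschedulable for longer than
// unbindTimeout. The caller would additionally confirm (e.g. with
// hasUnsatisfiableNodeAffinity from the earlier sketch) that the Pod's PVC
// is bound to a PV pinned to a Node that no longer exists before unbinding
// anything.
func shouldUnbind(pod *corev1.Pod, unbindTimeout time.Duration) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodScheduled || cond.Status != corev1.ConditionFalse {
			continue
		}
		if cond.Reason != corev1.PodReasonUnschedulable {
			continue
		}
		// Only act once the Pod has been unschedulable for a while so that
		// transient scheduling hiccups don't trigger an unbind.
		return time.Since(cond.LastTransitionTime.Time) > unbindTimeout
	}
	return false
}
```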

On to the actual "unbind" operation. This could be done in three ways.

  1. Set the PV's reclaim policy to Retain, delete the PVC, and change the PV's status back to Available.
  2. Clear the PVC's VolumeName field, clear the PV's ClaimRef, and change the PV's status back to Available.
  3. Find a way to abuse the Recycle Policy.

I don't recommend 3, as the Recycle Policy is deprecated. 1 is arguably the most idiomatic way to implement this behavior, but 2 should be just as effective. So no strong preference here.
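
For reference, a rough sketch of option 1, assuming a controller-runtime client; error handling and conflict retries are omitted, and the stuck Pod may also need to be deleted so the pvc-protection finalizer releases the claim:

```go
package pvcunbinder

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// unbind is an illustrative implementation of option 1: retain the volume,
// delete the claim, then strip the claimRef so the PV returns to Available.
func unbind(ctx context.Context, c client.Client, pv *corev1.PersistentVolume, pvc *corev1.PersistentVolumeClaim) error {
	// 1. Make sure deleting the PVC won't delete the underlying volume.
	pv.Spec.PersistentVolumeReclaimPolicy = corev1.PersistentVolumeReclaimRetain
	if err := c.Update(ctx, pv); err != nil {
		return err
	}

	// 2. Delete the claim. The PV becomes Released because its claimRef
	//    still points at the now-deleted PVC.
	if err := c.Delete(ctx, pvc); err != nil {
		return err
	}

	// 3. Clear the claimRef. The PV controller then reports the PV as
	//    Available, so a recreated PVC (e.g. from the StatefulSet's
	//    volumeClaimTemplates) can bind to it again if its node affinity
	//    becomes satisfiable.
	pv.Spec.ClaimRef = nil
	return c.Update(ctx, pv)
}
```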

JIRA Link: K8S-281

Footnotes

[1] https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode

[2] Should we see a node rejoin the cluster with the same name but with a new disk, there should be no issue, as this would be identical to the scenario of a new node joining the cluster.

chrisseto added a commit that referenced this issue Jul 10, 2024
Due to the usage of local NVMe disks, Redpanda deployments are particularly
sensitive to Node failure. Whenever a Node crashes, the resulting Redpanda Pod
will be stuck in a Pending state due to the NodeAffinity of its PV.

This commit implements a "PVCUnbinder" reconciler that watches for such cases
and attempts automatic remediation by "unbinding" PVs. See the implementation
for details on the strategy.

This implementation is similar to, yet much more paranoid than, the
`RedpandaNodePVCReconciler`. Ideally the two implementations should be merged
before long.

Fixes #166
chrisseto linked a pull request Jul 10, 2024 that will close this issue
chrisseto added a commit that referenced this issue Jul 12, 2024

chrisseto added a commit that referenced this issue Jul 15, 2024

chrisseto added a commit that referenced this issue Jul 15, 2024

chrisseto added a commit that referenced this issue Jul 16, 2024