
V1: Handle node / disk failures for deployments using local-path-provisioner #166

Open
chrisseto opened this issue Jun 28, 2024 · 0 comments · May be fixed by #168
chrisseto commented Jun 28, 2024

In #160, some untrustworthy code was removed in response to a series of incidents. We accepted the technical debt that node failures would require manual intervention to delete hanging PVCs. A recent run of chaos tests demonstrated to us that this is not an acceptable intermediate state. We need to handle node failures automatically and safely.

Reverting #160 is not an option (IMO), as recovery from node failures was almost a side effect of that code path rather than its primary focus.

Tersely, our requirements are:

  • This feature needs to be possible to enable and disable (i.e., feature flagged), ideally via either a CLI flag on the operator or an annotation on the V1 CRD. (A rough sketch of the flag wiring follows this list.)
  • When a Node running a Redpanda Broker is suddenly removed from the cluster, a new Pod must be scheduled successfully. Currently, the replacement Pod will fail to schedule due to a leftover PersistentVolume with an affinity to the now removed Node.
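
As a minimal sketch of the CLI-flag variant of the feature flag (the flag name `--enable-pvc-unbinder` is hypothetical, not an existing operator option):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag; the real operator may use a different name or read
	// an annotation on the V1 CRD instead.
	enablePVCUnbinder := flag.Bool("enable-pvc-unbinder", false,
		"automatically unbind PVs whose backing Node has been removed")
	flag.Parse()

	if *enablePVCUnbinder {
		fmt.Println("PVC unbinder enabled")
		// The reconciler would be registered with the controller manager here.
	}
}
```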

Below is my proposal for how to handle this situation. @birdayz potentially has a different viewpoint and will share it on this issue.

IMO, the root cause of this issue is an incorrect representation of the actual cluster state due to the usage of local-path-provisioner. When a Kubernetes node is lost, the disk backing the PersistentVolume associated with that Node is also lost. However, Kubernetes is ignorant of this: the PersistentVolume continues to exist within the cluster, albeit with an affinity to the now-lost Node.

To resolve this, I propose that we run a new reconciler in the operator that resolves the inconsistency. We're understandably a bit hesitant to delete disks that might contain data, so we'll need to be careful and clever about how this is done. Ideally, we'd preserve the ability to remount a disk should the Node hosting it rejoin the cluster for any reason. I think we could reasonably "unbind" the PV and PVC whenever a PV contains an unsatisfiable constraint. Kubernetes' docs suggest that a PV's scheduling constraints are taken into account during the binding process.[1] If this is true (we'll need to verify and test this behavior), we can satisfy all constraints with an "unbind", as the dangling PV would only be rebound if the Node hosting the disk were to rejoin the cluster.[2]
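
To make the "unsatisfiable constraint" check concrete, here is a rough sketch, assuming local-path-provisioner pins its PVs with a required `kubernetes.io/hostname` matchExpression; the helper name `hasUnsatisfiableNodeAffinity` is illustrative, not existing code:

```go
package pvcunbinder

import (
	corev1 "k8s.io/api/core/v1"
)

// hasUnsatisfiableNodeAffinity reports whether pv's required node affinity
// pins it to hostnames that no current Node carries. local-path-provisioner
// sets a required matchExpression on kubernetes.io/hostname, so only that
// key is inspected here.
func hasUnsatisfiableNodeAffinity(pv *corev1.PersistentVolume, nodes []corev1.Node) bool {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return false // no constraint to violate
	}

	// Hostnames that still exist in the cluster.
	existing := map[string]bool{}
	for _, n := range nodes {
		existing[n.Labels[corev1.LabelHostname]] = true
	}

	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		termSatisfiable := true
		for _, expr := range term.MatchExpressions {
			if expr.Key != corev1.LabelHostname || expr.Operator != corev1.NodeSelectorOpIn {
				continue // ignore expressions we don't understand
			}
			matched := false
			for _, v := range expr.Values {
				if existing[v] {
					matched = true
					break
				}
			}
			if !matched {
				termSatisfiable = false
				break
			}
		}
		if termSatisfiable {
			return false // at least one term can still match a live Node
		}
	}
	return true // no term is satisfiable; the backing Node is gone
}
```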

To trigger the unbinding behavior, we could:

  1. Add a finalizer to Nodes that unbinds all PVs with an affinity to that Node.
  2. Periodically check the mountability of all PVs and unbind those deemed unmountable.
  3. Watch for Pods that are unschedulable because of their PVC's PV and unbind that PV after some timeout.
  4. Add a finalizer to all Pods that unbinds their PVCs on Pod deletion and relies on the rebinding behavior.

I like 3 the most, as it's an easy condition to watch for and we can very easily attach timeouts to the unbinding behavior. 4 is likely to cause issues with Redpanda due to the potential to shuffle BrokerIDs and hostnames. 1 feels error-prone, as the exact way a Node is removed from a cluster is unclear. 2 feels error-prone, as we might incorrectly implement the check for schedulability.
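
To illustrate option 3, a rough sketch of the trigger condition (`shouldUnbind` and `unbindTimeout` are illustrative names, not existing code): only act on Pods that have been stuck Unschedulable for longer than a grace period, and confirm separately that the cause is a PV pinned to a missing Node.

```go
package pvcunbinder

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldUnbind reports whether pod has been unschedulable for longer than
// unbindTimeout. The caller would additionally confirm (e.g. with
// hasUnsatisfiableNodeAffinity from the earlier sketch) that the Pod's PVC
// is bound to a PV pinned to a Node that no longer exists before unbinding
// anything.
func shouldUnbind(pod *corev1.Pod, unbindTimeout time.Duration) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodScheduled || cond.Status != corev1.ConditionFalse {
			continue
		}
		if cond.Reason != corev1.PodReasonUnschedulable {
			continue
		}
		// Only act once the Pod has been unschedulable for a while so that
		// transient scheduling hiccups don't trigger an unbind.
		return time.Since(cond.LastTransitionTime.Time) > unbindTimeout
	}
	return false
}
```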

On to the actual "unbind" operation. This could be done in three ways.

  1. Set the PV's reclaim policy to Retain, delete the PVC, and change the PV's status back to Available.
  2. Clear the PVC's VolumeName field, clear the PV's ClaimRef, and change the PV's status back to Available.
  3. Find a way to abuse the Recycle Policy.

I don't recommend 3, as the Recycle Policy is deprecated. 1 is arguably the most idiomatic way to implement this behavior, but 2 should be just as effective. So no strong preference here.
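
For reference, a rough sketch of option 1, assuming a controller-runtime client; error handling and conflict retries are omitted, and the stuck Pod may also need to be deleted so the pvc-protection finalizer releases the claim:

```go
package pvcunbinder

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// unbind is an illustrative implementation of option 1: retain the volume,
// delete the claim, then strip the claimRef so the PV returns to Available.
func unbind(ctx context.Context, c client.Client, pv *corev1.PersistentVolume, pvc *corev1.PersistentVolumeClaim) error {
	// 1. Make sure deleting the PVC won't delete the underlying volume.
	pv.Spec.PersistentVolumeReclaimPolicy = corev1.PersistentVolumeReclaimRetain
	if err := c.Update(ctx, pv); err != nil {
		return err
	}

	// 2. Delete the claim. The PV becomes Released because its claimRef
	//    still points at the now-deleted PVC.
	if err := c.Delete(ctx, pvc); err != nil {
		return err
	}

	// 3. Clear the claimRef. The PV controller then reports the PV as
	//    Available, so a recreated PVC (e.g. from the StatefulSet's
	//    volumeClaimTemplates) can bind to it again if its node affinity
	//    becomes satisfiable.
	pv.Spec.ClaimRef = nil
	return c.Update(ctx, pv)
}
```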

JIRA Link: K8S-281

Footnotes

[1] https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode

[2] Should we see a node rejoin the cluster with the same name but with a new disk, there should be no issue, as this would be identical to the scenario of a new node joining the cluster.

chrisseto added a commit that referenced this issue Jul 10, 2024
Due to the usage of local NVMe disks, Redpanda deployments are particularly
sensitive to Node failure. Whenever a Node crashes, the resulting Redpanda Pod
will be stuck in a Pending state due to the NodeAffinity of its PV.

This commit implements a "PVCUnbinder" reconciler that watches for such cases
and attempts automatic remediation by "unbinding" PVs. See the implementation
for details on the strategy.

This implementation is similar to, yet much more paranoid than, the
`RedpandaNodePVCReconciler`. Ideally the two implementations should be merged
before long.

Fixes #166
chrisseto linked a pull request Jul 10, 2024 that will close this issue
chrisseto added a commit that referenced this issue Jul 12, 2024

chrisseto added a commit that referenced this issue Jul 15, 2024

chrisseto added a commit that referenced this issue Jul 15, 2024

chrisseto added a commit that referenced this issue Jul 16, 2024