V1: Handle node / disk failures for deployments using local-path-provisioner
#166
chrisseto added a commit that referenced this issue on Jul 10, 2024:

Due to the use of local NVMe disks, redpanda deployments are particularly sensitive to Node failure. Whenever a Node crashes, the resulting redpanda Pod will be stuck in a Pending state due to the NodeAffinity of its PV. This commit implements a "PVCUnbinder" reconciler that watches for such cases and attempts automatic remediation by "unbinding" PVs. See the implementation for details on the strategy. This implementation is similar to, yet much more paranoid than, the `RedpandaNodePVCReconciler`. Ideally the two implementations should merge before long. Fixes #166
chrisseto added further commits with the same message that referenced this issue on Jul 12, Jul 15, and Jul 16, 2024.
In #160, some untrustworthy code was removed in response to a series of incidents. We accepted the technical debt that node failures would require manual intervention to delete hanging PVCs. A recent run of chaos tests demonstrated that this is not an acceptable intermediate state. We need to handle node failures automatically and safely.
Reverting #160 is not an option (IMO), as recovery from node failures was more a side effect than the primary focus of that code path.
Tersely, our requirements are:
Below is my proposal of how to handle this situation. @birdayz potentially has a different viewpoint and will share that on this issue.
IMO, the root cause of this issue is an incorrect representation of the actual cluster state due to the usage of local-path-provisioner. When a Kubernetes Node is lost, the disk backing the PersistentVolume associated with that Node is also lost. However, Kubernetes is ignorant of this: the PersistentVolume continues to exist within the cluster, albeit with an affinity to the now-lost Node.

To resolve this, I propose that we run a new reconciler in the operator that resolves the inconsistency. We're understandably a bit hesitant to delete disks that might contain data, so we'll need to be careful and clever about how this is done. Ideally, we'd preserve the ability to remount a disk should the Node hosting it rejoin the cluster for any reason. I think we could reasonably "unbind" the PV and PVC whenever a PV contains an unsatisfiable constraint. Kubernetes' docs suggest that a PV's scheduling constraints are taken into account during the binding process [1]. If this is true (we'll need to verify and test this behavior), we can satisfy all constraints with an "unbind", as the dangling PV would only be rebound if the Node hosting the disk were to rejoin the cluster [2].
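The core "unsatisfiable constraint" check can be sketched as follows. This is a minimal sketch with simplified stand-in types (the real reconciler would use `k8s.io/api/core/v1` objects via controller-runtime; the field names here are illustrative, not the actual API schema):

```go
package main

import "fmt"

// PersistentVolume is a simplified stand-in for v1.PersistentVolume.
// local-path-provisioner pins each PV to a single Node via NodeAffinity;
// RequiredNode models that pinned hostname.
type PersistentVolume struct {
	Name         string
	RequiredNode string // hostname from the PV's NodeAffinity
}

// unsatisfiable reports whether the PV's node affinity references a Node
// that no longer exists in the cluster, i.e. the backing disk is gone and
// the PV is a candidate for unbinding.
func unsatisfiable(pv PersistentVolume, liveNodes map[string]bool) bool {
	return pv.RequiredNode != "" && !liveNodes[pv.RequiredNode]
}

func main() {
	nodes := map[string]bool{"node-a": true, "node-b": true}
	pv := PersistentVolume{Name: "pv-0", RequiredNode: "node-c"}
	fmt.Println(unsatisfiable(pv, nodes)) // node-c is gone, so: true
}
```

A PV pinned to a still-live Node, or one with no node affinity at all, would be left alone.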
To trigger the unbinding behavior, we could:
I like 3 the most, as it's an easy condition to watch for and we can very easily attach timeouts to the unbinding behavior. 4 is likely to cause issues with redpanda due to the potential of shuffling broker IDs and hostnames. 1 feels error-prone, as the exact way a Node is removed from a cluster is unclear. 2 feels error-prone, as we might incorrectly implement the schedulability check.
Onto the actual "unbind" operation. This could be done in one of three ways.
I don't recommend 3, as the Recycle reclaim policy is deprecated. 1 is arguably the most idiomatic way to implement this behavior, but 2 should be just as effective. So, no preference here.
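One common way to "unbind" is to clear the PV's claim reference so the controller treats it as Available again, leaving the volume (and its data) in place; whether this corresponds to option 1 or 2 above is an assumption on my part. A minimal sketch with simplified stand-in types (in the real API, the claim reference is `spec.claimRef` and the phase lives in `status.phase`):

```go
package main

import "fmt"

// PersistentVolume is a simplified stand-in for v1.PersistentVolume.
type PersistentVolume struct {
	Name     string
	ClaimRef string // "namespace/name" of the bound PVC; "" when unbound
	Phase    string // "Bound", "Released", "Available"
}

// unbind clears the claim reference so the PV returns to the pool of
// volumes eligible for binding, retaining the disk so it could be rebound
// if its Node ever returns. The stale PVC would be deleted and recreated
// separately so the StatefulSet's Pod can bind a fresh (or returning) PV.
func unbind(pv *PersistentVolume) {
	pv.ClaimRef = ""
	pv.Phase = "Available"
}

func main() {
	pv := &PersistentVolume{Name: "pv-0", ClaimRef: "redpanda/datadir-redpanda-0", Phase: "Bound"}
	unbind(pv)
	fmt.Println(pv.Phase, pv.ClaimRef == "") // Available true
}
```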
JIRA Link: K8S-281
Footnotes
1. https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode
2. Should we see a node rejoin the cluster with the same name but with a new disk, there should be no issue, as this would be identical to the scenario of a new node joining the cluster.