📖 KCP remediation proposal #3676
Conversation
- KCP remediation is triggered by the MachineHealthCheck controller marking a machine for remediation. See the
[machine-health-checking proposal](https://github.com/kubernetes-sigs/cluster-api/blob/11485f4f817766c444840d8ea7e4e7d1a6b94cc9/docs/proposals/20191030-machine-health-checking.md)
for additional details. When multiple machines are marked for remediation, the oldest one will be remediated first.
Is this true in every case? What if the oldest one shouldn't be remediated because remediating it could impact quorum?
I agree this is not optimal, but it is by far the simplest solution, so I would advocate it is an acceptable one for the first iteration on KCP remediation, unless there is evidence that multiple remediations with concurrent, sparse etcd failures are a frequent use case.
However, if during the implementation we find a smarter way to determine the machine to remediate, I will be more than happy to amend this proposal.
I have one important comment about relaxing the requirement of depending on MHC for Kubeadm control plane remediation. The others are nits.
@@ -101,6 +109,9 @@ for the default implementation would not preclude the use of alternative control
- To manage control plane deployments across failure domains.
- To support user-initiated remediation:
  E.g. user deletes a Machine. Control Plane Provider reconciles by removing the corresponding etcd member and updating related metadata.
- To support auto remediation triggered by MachineHealthCheck objects:
Since MachineHealthCheck only looks at Node conditions, it may not capture everything needed to determine whether a control-plane machine needs remediation.
Should this be more generic?
I kind of wish it were a little smarter, but maybe we can split the control plane vs. worker node health checkers later in v1alpha4, after #3547.
I agree, but IMO in order to address this we should extend MHC to observe machine conditions, so we could benefit from CAPI conditions for worker nodes as well.
/lgtm
@fabriziopandini squash?
8ab8d17 to 0d91562
/test pull-cluster-api-test
/lgtm
/approve
/milestone v0.3.10
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What this PR does / why we need it:
This PR updates the KCP document by introducing support for automatic remediation of unhealthy control-plane machines.
Kudos and credits to @benmoss who kicked off this effort and laid the ground for this PR with https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit
Which issue(s) this PR fixes:
Ref #2976