Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

📖 KCP remediation proposal #3676

Merged

Conversation

fabriziopandini
Copy link
Member

What this PR does / why we need it:
This PR updates the KCP document by introducing support for automatic remediation of unhealthy control-plane machines.

Kudos and credits to @benmoss who kicked off this effort and laid the ground for this PR with https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit

Which issue(s) this PR fixes:
Rif #2976

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 22, 2020
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 22, 2020
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved

- KCP remediation is triggered by the MachineHealthCheck controller marking a machine for remediation. see
[machine-health-checking proposal](https://github.com/kubernetes-sigs/cluster-api/blob/11485f4f817766c444840d8ea7e4e7d1a6b94cc9/docs/proposals/20191030-machine-health-checking.md)
for additional details. When there are multiple machines that are marked for remediation, the oldest one will be remediate first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true in any case? What if the oldest one shouldn't be remediated because it could impact quorum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is not optimal, but this is by far the simplest solution and so I would advocate this is an acceptable solution for the first iteration on KCP remediations, unless there are evidences that multiple remediations with concurrent sparse etcd failures is a frequent use cases.
However, if during the implementation we find a smarter way to determine the machine to remediate, I will more than happy to amend this proposal ....

docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
Copy link

@sedefsavas sedefsavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one important comment regarding relaxing the requirement of depending MHC for Kubeadm control plane remediation. Others are nit.

@@ -101,6 +109,9 @@ for the default implementation would not preclude the use of alternative control
- To manage control plane deployments across failure domains.
- To support user-initiated remediation:
E.g. user deletes a Machine. Control Plane Provider reconciles by removing the corresponding etcd member and updating related metadata
- To support auto remediation triggered by MachineHealthCheck objects:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since MachineHealthCheck only looks at Node conditions, may not capture all the things to understand if a control-plane machine needs remediation.

should this be more generic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of wish it was a little smarter, but maybe we can split control plane vs worker nodes health checker later in v1alpha4 after #3547

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but IMO in order to address this we should extend MHC to observe machine conditions, so we could benefit on CAPI conditions for worker nodes as well

docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
docs/proposals/20191017-kubeadm-based-control-plane.md Outdated Show resolved Hide resolved
@sedefsavas
Copy link

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2020
@vincepri
Copy link
Member

vincepri commented Sep 24, 2020

@fabriziopandini squash?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 25, 2020
@vincepri
Copy link
Member

/test pull-cluster-api-test

@sedefsavas
Copy link

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 25, 2020
@vincepri
Copy link
Member

/approve

@vincepri
Copy link
Member

/milestone v0.3.10

@k8s-ci-robot k8s-ci-robot added this to the v0.3.10 milestone Sep 25, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2020
@k8s-ci-robot k8s-ci-robot merged commit b0236f2 into kubernetes-sigs:master Sep 25, 2020
@fabriziopandini fabriziopandini deleted the kcp-remediation-proposal branch September 30, 2020 10:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants