📖 KCP remediation proposal #3676
Conversation
- KCP remediation is triggered by the MachineHealthCheck controller marking a machine for remediation. See the
[machine-health-checking proposal](https://github.com/kubernetes-sigs/cluster-api/blob/11485f4f817766c444840d8ea7e4e7d1a6b94cc9/docs/proposals/20191030-machine-health-checking.md)
for additional details. When multiple machines are marked for remediation, the oldest one will be remediated first.
Is this true in every case? What if the oldest one shouldn't be remediated because remediating it could impact quorum?
I agree this is not optimal, but it is by far the simplest solution, so I would advocate it is an acceptable one for the first iteration on KCP remediation, unless there is evidence that multiple remediations with concurrent, sparse etcd failures are a frequent use case.
However, if during the implementation we find a smarter way to determine the machine to remediate, I will be more than happy to amend this proposal.
I have one important comment about relaxing the requirement of depending on MHC for Kubeadm control plane remediation. The others are nits.
@@ -101,6 +109,9 @@ for the default implementation would not preclude the use of alternative control
- To manage control plane deployments across failure domains.
- To support user-initiated remediation:
  E.g. user deletes a Machine. Control Plane Provider reconciles by removing the corresponding etcd member and updating related metadata.
- To support auto remediation triggered by MachineHealthCheck objects:
Since MachineHealthCheck only looks at Node conditions, it may not capture everything needed to determine whether a control-plane machine needs remediation.
Should this be more generic?
I kind of wish it were a little smarter, but maybe we can split the control plane vs. worker node health checkers later in v1alpha4, after #3547.
I agree, but IMO in order to address this we should extend MHC to observe machine conditions, so we could benefit from CAPI conditions for worker nodes as well.
/lgtm
@fabriziopandini squash?
8ab8d17 to 0d91562
/test pull-cluster-api-test
/lgtm
/approve
/milestone v0.3.10
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What this PR does / why we need it:
This PR updates the KCP document by introducing support for automatic remediation of unhealthy control-plane machines.
Kudos and credits to @benmoss who kicked off this effort and laid the ground for this PR with https://docs.google.com/document/d/1hJza3X-XbVV_yczB5N6vXbl_97D0bOVQ0OwGovcnth0/edit
Which issue(s) this PR fixes:
Ref #2976