✨ KCP remediates unhealthy machines #3830

Merged

Conversation

fabriziopandini
Member

What this PR does / why we need it:
This PR adds support to KCP for remediating unhealthy machines, following the KCP proposal changes defined in #3676.

Which issue(s) this PR fixes:
Fixes #2976

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 20, 2020
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 20, 2020
@fabriziopandini
Member Author

The implementation should be complete, so people can start reviewing (thanks in advance for the feedback).
However, I prefer to keep this on hold while I do some additional local testing.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2020
Contributor

@srm09 srm09 left a comment

minor nit suggestions

api/v1alpha3/condition_consts.go Outdated
controlplane/kubeadm/internal/workload_cluster_etcd.go Outdated
controlplane/kubeadm/internal/control_plane_test.go Outdated
controlplane/kubeadm/internal/control_plane_test.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/controller.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
Comment on lines 92 to 93
logger.Info("A control plane machine needs remediation, but the current number of replicas is currently lower that expected. Skipping remediation", "UnhealthyMachine", machineToBeRemediated.Name, "Replicas", desiredReplicas, "CurrentReplicas", c.Machines.Len())
conditions.MarkFalse(machineToBeRemediated, clusterv1.MachineOwnerRemediatedCondition, clusterv1.WaitingForRemediationReason, clusterv1.ConditionSeverityWarning, "KCP waiting for having at least %d control plane machines before triggering remediation", desiredReplicas)
Member

Side note: it'd be great to have the conditions functions log things out for us.
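
The side note above could be sketched as a small wrapper that sets a condition to false and logs the same message in one call. This is a minimal, self-contained sketch under stated assumptions: `Condition`, `Machine`, and `markFalseAndLog` are hypothetical stand-ins for illustration and are not part of the Cluster API `conditions` package.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for clusterv1.Condition.
type Condition struct {
	Type     string
	Status   string
	Reason   string
	Severity string
	Message  string
}

// Machine is a hypothetical stand-in holding only a name and its conditions.
type Machine struct {
	Name       string
	Conditions map[string]Condition
}

// markFalseAndLog records a "False" condition on the machine and emits the
// same formatted message through logf, so callers do not need a separate
// log statement that repeats the condition text.
func markFalseAndLog(logf func(string, ...interface{}), m *Machine, condType, reason, severity, format string, args ...interface{}) Condition {
	c := Condition{
		Type:     condType,
		Status:   "False",
		Reason:   reason,
		Severity: severity,
		Message:  fmt.Sprintf(format, args...),
	}
	if m.Conditions == nil {
		m.Conditions = map[string]Condition{}
	}
	m.Conditions[condType] = c
	logf("machine=%s condition=%s reason=%s: %s", m.Name, condType, reason, c.Message)
	return c
}
```

With a helper of this shape, the two quoted lines (a `logger.Info` call followed by `conditions.MarkFalse`) would collapse into a single call, at the cost of coupling the log text to the condition message.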

@vincepri
Member

/milestone v0.3.11

@k8s-ci-robot k8s-ci-robot added this to the v0.3.11 milestone Oct 21, 2020
@fabriziopandini
Member Author

@vincepri @sedefsavas thanks for the feedback!
I have added some small improvements after today's tests, each in a separate commit to help with the review.

/hold cancel
Given the positive results of the additional local tests

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 21, 2020
@fabriziopandini
Member Author

/test pull-cluster-api-test-release-0-3

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2020
@fabriziopandini
Member Author

/hold
for #3863 to merge

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 23, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2020
@sedefsavas sedefsavas left a comment

Some nits.

I will LGTM after it is approved.

api/v1alpha3/condition_consts.go Outdated
controlplane/kubeadm/internal/control_plane.go Outdated

defer func() {
// Always attempt to Patch the Machine conditions after each reconcileUnhealthyMachines.
if err := patchHelper.Patch(ctx, machineToBeRemediated, patch.WithOwnedConditions{Conditions: []clusterv1.ConditionType{

nit.
WithOwnedConditions is deprecated; is there an alternative we can use here?

Member Author

For now, I'm keeping the same approach we use in the other controllers (WithOwnedConditions), but sooner or later we should switch everything to WithForceOverwriteConditions, unless we finally switch to server-side apply.

Member

We should un-deprecate it; we shouldn't use WithForceOverwriteConditions here.
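
The distinction discussed in this thread can be illustrated with a simplified conflict-resolution sketch. It assumes that, for condition types a controller declares as owned, the controller's value wins on conflict, while every other condition type keeps the value currently stored on the server. `Condition` and `mergeConditions` are hypothetical names used only for illustration; the actual patch helper is more involved than this.

```go
package main

// Condition is a simplified, hypothetical stand-in for clusterv1.Condition.
type Condition struct {
	Type   string
	Status string
}

// mergeConditions combines the conditions a controller wants to write (local)
// with the conditions currently stored on the server (latest).
// For types listed in owned, the controller's value replaces the server's
// value; owned types missing on the server are appended. All other types
// keep the server's value, and non-owned local-only types are ignored.
func mergeConditions(local, latest []Condition, owned map[string]bool) []Condition {
	localByType := make(map[string]Condition, len(local))
	for _, c := range local {
		localByType[c.Type] = c
	}

	merged := make([]Condition, 0, len(latest)+len(local))
	seen := make(map[string]bool, len(latest))
	for _, c := range latest {
		if l, ok := localByType[c.Type]; ok && owned[c.Type] {
			c = l // the controller owns this type, so its value wins
		}
		merged = append(merged, c)
		seen[c.Type] = true
	}
	for _, c := range local {
		if owned[c.Type] && !seen[c.Type] {
			merged = append(merged, c)
		}
	}
	return merged
}
```

In this simplified model, forcing an overwrite would correspond to treating every condition type as owned, which is why limiting ownership to the specific types a controller manages is the narrower choice.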

@fabriziopandini fabriziopandini linked an issue Oct 27, 2020 that may be closed by this pull request
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2020
@ncdc
Contributor

ncdc commented Oct 27, 2020

@sedefsavas

> I will LGTM after it is approved.

FYI, the documented process in CONTRIBUTING.md is that LGTM is supposed to come before approval.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2020
@fabriziopandini
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2020
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 4, 2020
@fabriziopandini
Member Author

/ hold
for #3900 to merge
only the latest commit is relevant for this PR

@fabriziopandini fabriziopandini force-pushed the kcp-remediation branch 3 times, most recently from 32b4bde to fbd23d4 Compare November 11, 2020 14:04
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 11, 2020
@fabriziopandini
Member Author

/test pull-cluster-api-e2e-full-release-0-3

@fabriziopandini
Member Author

/retest

@k8s-ci-robot
Contributor

@fabriziopandini: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-e2e-full-release-0-3 | fbd23d4 | link | /test pull-cluster-api-e2e-full-release-0-3 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

controllers/machinehealthcheck_controller.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
controlplane/kubeadm/controllers/remediation.go Outdated
return ctrl.Result{}, errors.Wrapf(err, "failed to delete unhealthy machine %s", machineToBeRemediated.Name)
}

logger.Info("Remediating unhealthy machine", "UnhealthyMachine", machineToBeRemediated.Name)
Member

Hasn't remediation already happened at this point? Or is it in progress?

Member Author

Yes, the machine object is deleted at this point and the KCP job is completed, but the actual machine deletion will take some time to happen, so from a user PoV I think it is correct to say that remediation is in progress (and it shows up in the same way in the conditions as well).
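
The two steps visible in this conversation (skipping remediation while replicas are below the desired count, and reporting remediation as in progress once deletion has been requested) can be sketched as follows. This is a minimal sketch, not the controller's implementation: `remediate`, the `deleteMachine` callback, and the reason strings are hypothetical, and any other checks the real controller performs are left out.

```go
package main

import "fmt"

// Hypothetical reason strings; the real constants live in the Cluster API
// clusterv1 package and may be named or valued differently.
const (
	reasonWaiting    = "WaitingForRemediation"
	reasonInProgress = "RemediationInProgress"
)

// remediate models two steps discussed in this review: skip while the control
// plane has fewer machines than desired, otherwise request deletion of the
// unhealthy machine and report remediation as in progress.
// deleteMachine is a hypothetical callback standing in for the API delete call.
func remediate(name string, current, desired int, deleteMachine func(string) error) (string, error) {
	if current < desired {
		// Not enough control plane machines yet; do not touch the machine.
		return reasonWaiting, nil
	}
	if err := deleteMachine(name); err != nil {
		return "", fmt.Errorf("failed to delete unhealthy machine %s: %w", name, err)
	}
	// Deletion was only requested; the machine goes away asynchronously,
	// so from the user's point of view remediation is still in progress.
	return reasonInProgress, nil
}
```

Returning the reason rather than a boolean keeps the outcome aligned with what would be shown in the machine's conditions.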

controlplane/kubeadm/controllers/remediation.go Outdated
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 12, 2020
@vincepri
Member

LGTM, squash? :)

@fabriziopandini
Member Author

/test pull-cluster-api-e2e-full-release-0-3

Member

@vincepri vincepri left a comment

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 12, 2020
@k8s-ci-robot k8s-ci-robot merged commit 24f10f8 into kubernetes-sigs:release-0.3 Nov 12, 2020
@fabriziopandini fabriziopandini deleted the kcp-remediation branch November 17, 2020 12:34
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

KCP unhealthy remediation support
6 participants