✨ Update MHC proposal with new annotation strategy #2920
Conversation
@@ -108,6 +109,11 @@ As an operator of a Management Cluster, I want my machines to be self-healing an

### Implementation Details/Notes/Constraints

#### Machine annotation:

```go
const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"
```
I'm wondering if this should have a timestamp as its value?
We need to expand this section to give some guarantees, for example if this is a timestamp, do we keep updating the annotation at every reconciliation?
I think if we track timestamps, we likely need two (unhealthy since, and last health check) to ensure that we are operating against the information the way we expect to. That way we could differentiate between a persistent unhealthy state and lack of reconciliation.
As an aside, to support different external remediation (outside of the owning controller), such as the request from #2846, would it make sense to allow for overriding the annotation key here?
Having different timestamps, while interesting, unfortunately also means that the Machine and everything that watches it would be updated every time the check is performed.
to support different external remediation (outside of the owning controller), such as the request from #2846, would it make sense to allow for overriding the annotation key here?
I'd expect that at some point (proposal tbd) that we'll allow different strategies for remediation.
Maybe to start, we should have a simple annotation with no timestamp on it?
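The simple, timestamp-free variant could look like this sketch. The `markUnhealthy` helper is an illustrative assumption about how the check stays idempotent, not actual MHC code:

```go
package main

import "fmt"

// MachineUnhealthyAnnotation is the key from the proposal; the empty-value
// handling below is an illustrative sketch of the timestamp-free option.
const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"

// markUnhealthy sets the annotation with no value, so repeated health checks
// are idempotent and do not generate spurious updates for watchers.
// It reports whether the annotations actually changed.
func markUnhealthy(annotations map[string]string) (map[string]string, bool) {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if _, ok := annotations[MachineUnhealthyAnnotation]; ok {
		return annotations, false // already marked; no update needed
	}
	annotations[MachineUnhealthyAnnotation] = ""
	return annotations, true
}

func main() {
	ann, changed := markUnhealthy(nil)
	fmt.Println(changed)
	_, changed = markUnhealthy(ann)
	fmt.Println(changed)
}
```

Because the value never changes once set, repeated reconciliations produce no writes, avoiding the watch churn raised above.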
Would it also be possible to have an annotation that signals when the machine is unhealthy but the timeout has not yet been reached (machine under probation)?
I'd say that until we have a use case for it being a timestamp, let's not add that extra complexity
Also, you can't trust that the timestamp is not skewed (best case) or actively wrong.
/assign @JoelSpeed @enxebre
/milestone v0.3.4
LGTM from my side.
This work might be relevant for conditions as well.
Two thoughts crossed my mind while reading this as it currently stands:
- Have we noted down anywhere how remediation will happen for any of the controllers? Should this be documented somewhere?
- What does the annotation mean? Are we expecting it to always be present on an unhealthy Machine? Are we expecting other controllers to remove it? Does this really matter right now, and can we change it later if we find the model needs changing?
I'd say that until we have a use case for it being a timestamp, let's not add that extra complexity
Let's take a little bit more time to think about the annotation naming and key-value definition. If the default strategy is "controller owner needs to delete", we should communicate it.
Related to the above, the expectation for this iteration should be that the Machine is deleted after we set the annotation; it doesn't need to be removed. What do you all think?
As it looks at the moment, this update would effectively enable any "remediation" controller to do anything, with no clear expectations or contract with the core MHC. It would leave a lot of unresolved ambiguity that might result in consumer confusion and antipatterns: e.g. what happens if the remediator chooses not to delete the machine but to do something else, or if two remediators race against the annotation, etc. We should not enable this programmatically until there's a clear, particular proposal and plan for it. For the scope of this PR I'd suggest we keep it to solving the particular problem you are trying to overcome here, i.e. remediating machines owned by the KCP. To that end I'd suggest we programmatically enforce this in the MHC by checking whether a machine is a control plane one and only setting the annotation in that case. Then, if we want to generalise the annotation mechanism, I think that deserves a fleshed-out proposal covering all the scenarios mentioned above, with clear expectations and contracts between components.
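The KCP-only enforcement suggested above could be sketched roughly as follows. `ownerRef` and `shouldAnnotate` are hypothetical stand-ins for Cluster API's real owner-reference types, not the actual MHC implementation:

```go
package main

import "fmt"

// ownerRef is a reduced stand-in for metav1.OwnerReference, keeping only
// the fields this sketch needs.
type ownerRef struct {
	kind       string
	controller bool
}

// shouldAnnotate encodes the suggested enforcement: the MHC only sets the
// unhealthy annotation when the machine's controller owner is a
// KubeadmControlPlane.
func shouldAnnotate(owners []ownerRef) bool {
	for _, o := range owners {
		if o.controller && o.kind == "KubeadmControlPlane" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldAnnotate([]ownerRef{{kind: "KubeadmControlPlane", controller: true}}))
	fmt.Println(shouldAnnotate([]ownerRef{{kind: "MachineSet", controller: true}}))
}
```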
We need to call out that the expectation is that with this strategy the owner deletes the Machine.
During our discussion we agreed to rework the proposal in a way that touches any Machine with a controller owner, not just KCP-owned Machines. /cc @ncdc
@fmuyassarov Thanks for the reminder, we won't be tackling external machine remediation in this amendment to the MHC proposal.
@enxebre Is it sufficiently explicit now that no controller that works with MHC is to do anything but handle deletion? We can generalize it and relax the contract once we have that subsequent proposal, but for now can't it be understood that the two implementing controllers are
Added a few comments that might improve clarity, but otherwise I'm happy with the changes described.
Yeah, that's exactly my point: it does not need to be understood, it's enforced by code already. You can just enable annotations for KCP now and generalise afterwards based on the upcoming proposal. Anyway, don't take my feedback as a blocker, I'm fine with whatever path you choose to go :)
Thanks @enxebre, I think it's great feedback and I'll make sure to broadcast it when we get there in the external remediation proposal. The main reason I want to avoid building something just for KCP is that it sets a precedent for tying controller-specific behavior to another controller. In general, Cluster API should strive (whenever possible) to be agnostic and provide generic interfaces.
- The owning controller, e.g. the MachineSet controller, reconciles to meet the number of replicas and starts the process of bringing up a new machine/node.
- The owning controller observes the annotation and is responsible for remediating the machine.
- The owning controller performs any pre-deletion tasks required.
- The owning controller MUST delete the machine.
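The owning controller's side of this contract could be sketched as below. The `machine` type and the `drainNode` hook are hypothetical stand-ins for `clusterv1.Machine` and the owner's pre-deletion tasks, not actual controller code:

```go
package main

import "fmt"

const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"

// machine is a stand-in for clusterv1.Machine, reduced to what this
// sketch needs.
type machine struct {
	name        string
	annotations map[string]string
	deleted     bool
}

// remediate sketches the owning controller's responsibilities: observe the
// annotation, run any pre-deletion tasks, then delete the Machine. It
// reports whether remediation was performed.
func remediate(m *machine, drainNode func(*machine)) bool {
	if _, unhealthy := m.annotations[MachineUnhealthyAnnotation]; !unhealthy {
		return false // healthy machines are left alone
	}
	drainNode(m)     // pre-deletion tasks required by the owner
	m.deleted = true // the owning controller MUST delete the machine
	return true
}

func main() {
	m := &machine{
		name:        "cp-0",
		annotations: map[string]string{MachineUnhealthyAnnotation: ""},
	}
	fmt.Println(remediate(m, func(*machine) {}), m.deleted)
}
```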
Why must the machine be deleted?
For the purpose of this effort, which is mainly to separate the responsibilities between controllers, we want to make sure the expected outcomes are spelled out.
In the future, when new proposals come out and have new strategies, we'll redesign this.
I'm happy that the changes are sufficiently descriptive now so "owning controllers" know what they should be doing
/lgtm
/test pull-cluster-api-capd-e2e
Just a small nit.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: benmoss, vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@JoelSpeed @enxebre
/hold Should we squash this down before merging? If not, cancel the hold; otherwise I'm happy with the content.
Apply suggestions from code review (Co-Authored-By: Vince Prignano <vince@vincepri.com>)
More edits
Add actual implementation history
Rewrite latest alternative to represent that it was actually implemented
Apply suggestions from code review (Co-Authored-By: Jason DeTiberus <detiberusj@vmware.com>)
Make deletion explicit
Apply suggestions from code review (Co-Authored-By: Joel Speed <Joel.speed@hotmail.co.uk>)
Reorder downstream controller actions
Add missing word
I hope this unblocks KCP and enables exploration, but I really hope eventually the MachineSet does not need to know anything about an MHC annotation. Happy to leave it to @JoelSpeed to unhold as appropriate.
I echo this sentiment. /hold cancel
@enxebre @JoelSpeed The unhealthy annotation is a placeholder until we have a formalized condition workflow coming in v1alpha3 (as informational only) and with extended use in v1alpha4.
What this PR does / why we need it:
Updates the MHC proposal with a new strategy so that MHC can support machines managed by KCP.
Related to #2836
@enxebre @vincepri @JoelSpeed
/kind proposal