
✨ Update MHC proposal to use conditions instead of annotations #3056

Merged: 1 commit, May 27, 2020

Conversation

@benmoss commented May 13, 2020

What this PR does / why we need it:
We realized during the review of #2975 that using annotations would amount to a privilege escalation for users who had machine.edit access but not machine.delete.

So instead we're proposing that we use conditions (#3017) because that has the benefit of being a part of the status, which requires a higher level of privilege that is not commonly held by users.

That CAEP is still pending so this is effectively blocked on that proposal being merged first.

/hold

xref #2836, #2920

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 13, 2020
@vincepri (Member) commented May 15, 2020

@n1r1 @enxebre @JoelSpeed Lots of conversations are happening above; I'll try to summarize here what we're trying to achieve and why.

Goals

  • Migrate the logic from annotations to status.conditions.
  • Add a HealthCheckSucceeded condition to Machine objects.
    • The MachineHealthCheck controller watches Machines, and if it doesn't find the condition, it writes one.
    • During subsequent checks, if a health check fails, the MachineHealthCheck controller transitions the condition's State to False.

      Note: for now we'll only do this once; in the future we can chat about allowing it to be flipped back, but probably only after remediation (this discussion should happen in the external proposal).

  • Add an OwnerRemediated condition to Machine objects.
    • When the MachineHealthCheck controller sees that a Machine has gone unhealthy, it also writes this condition with its State set to False.
    • Owner-Controllers watching these Machines (e.g. KCP or MachineSet) trigger their own built-in remediation based on OwnerRemediated being False.
    • Owner-Controllers will, in this case, ultimately try to delete the Machine and set OwnerRemediated to True after setting the deletion timestamp.

      Note: if the patch to set the condition state after deleting the Machine fails, we'll fail gracefully given that this is expected. In most cases, we'll be able to patch the condition given that deleting a Machine isn't a quick operation (except probably for CAPD).

Observations

  • Adding two conditions instead of a single one allows us to expand (in the future) the remediations we want to support (like external) by either not writing the second condition, or writing a different one based on some strategy (details tbd in the other proposal).
  • In the future, I can see users potentially asking to disable remediation completely (e.g. by setting a remediation.strategy = none) — this use case is going to be fully covered by allowing MHC to not write any remediation condition.
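
For concreteness, here is a minimal sketch (not part of this PR) of how the MHC controller could write the two conditions described in the Goals above. It assumes the conditions utilities proposed in #3017 land as something like a `util/conditions` package with `MarkTrue`/`MarkFalse` helpers; the condition type constants, reasons, and messages are hypothetical names used only for illustration.

```go
// Hypothetical sketch: how the MHC reconciler could record a health check
// result on a Machine, assuming conditions helpers along the lines of #3017.
package machinehealthcheck

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/cluster-api/util/conditions"
)

const (
	// Hypothetical condition types mirroring the Goals above.
	HealthCheckSucceededCondition clusterv1.ConditionType = "HealthCheckSucceeded"
	OwnerRemediatedCondition      clusterv1.ConditionType = "OwnerRemediated"
)

// recordHealthCheckResult records the outcome of a single health check on a
// Machine's status conditions.
func recordHealthCheckResult(machine *clusterv1.Machine, healthy bool) {
	if healthy {
		// Health check passed: mark the condition True.
		conditions.MarkTrue(machine, HealthCheckSucceededCondition)
		return
	}
	// Health check failed: flip HealthCheckSucceeded to False and ask the
	// owner controller (KCP, MachineSet, ...) to remediate by also setting
	// OwnerRemediated to False.
	conditions.MarkFalse(machine, HealthCheckSucceededCondition,
		"UnhealthyNode", clusterv1.ConditionSeverityWarning,
		"Machine failed the MachineHealthCheck criteria")
	conditions.MarkFalse(machine, OwnerRemediatedCondition,
		"WaitingForRemediation", clusterv1.ConditionSeverityWarning,
		"Machine is unhealthy and waiting for its owner controller to remediate")
}
```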

This annotation is applied by the MHC controller and it is expected that it
will never be removed. The only current remediation strategy is deletion, which
is to be handled by the machine's owning controller.
Both of these conditions are applied by the MHC controller. ConditionHealthCheckSucceeded should only be updated by the MHC after running the health check.
Member

I'm wondering if it would make sense to invert the logic here and have the condition managed by the MHC controller represent a failed status; that way one could interpret the absence of the condition as being false.

@benmoss (Author) May 18, 2020

I take it this is related to your comment about relaxing the need for truthy conditions always being "good"? So you mean ConditionHealthCheckFailed would be true here, and the absence can be false?

Member

Indeed, I'm just thinking of the fact that we aren't really able to verify that a Machine is actually healthy; we are looking to trigger behavior when the health check fails rather than when it succeeds, and we aren't really trying to block any additional behavior/processes until the health check passes.

will never be removed. The only current remediation strategy is deletion, which
is to be handled by the machine's owning controller.
Both of these conditions are applied by the MHC controller. ConditionHealthCheckSucceeded should only be updated by the MHC after running the health check.
If a health check passes after it has failed, it should remove the ConditionOwnerRemediationSucceeded condition and set ConditionHealthCheckSucceeded to True.
Member

Since it looks like we are going to have multiple controllers modifying conditions (a slice), including operations that remove an entry from the slice, do we have plans on how to perform these operations safely, especially since removal will require an index?

I'm wondering if some type of locking will need to be done around potentially dangerous operations or if we should rely on Update semantics vs Patch for these types of changes.

@benmoss (Author) May 18, 2020

This is a good point; kubernetes/kubernetes#7856 (comment) brings this up. This seems like a broader concern around conditions that should maybe be raised in that CAEP.

If a health check passes after it has failed, it should remove the ConditionOwnerRemediationSucceeded condition and set ConditionHealthCheckSucceeded to True.
ConditionOwnerRemediationSucceeded is set to False after a health check fails, but should be changed to True by the owning controller after remediation succeeds.
If remediation fails, ConditionOwnerRemediationSucceeded can be updated to a higher severity and the reason can be updated to aid in troubleshooting.
In the event that ConditionOwnerRemediationSucceeded is still False when the next health check is run, the MHC controller will not update this condition.
Member

It seems like we are overloading the use of ConditionOwnerRemediationSucceeded here to mean both that we intend for the resource owner to remediate and the success/failure of that remediation. Would it make sense to break this out into two separate conditions to make the distinction between the two intentions/uses clear?

@benmoss (Author) May 18, 2020

Hmm, this one is stretching my brain a little. When we increase the number of conditions, we increase the number of weird state combinations we can get.

h = healthcheck succeeded True
s = remediation succeeded True
H = healthcheck succeeded False
S = remediation succeeded False

with 2 conditions (each of which can be absent, True, or False) that's 3² = 9 combinations; leaving out the initial state where neither is set:

  • h = good, passing healthcheck
  • H = undefined, failing healthcheck but no remediation
  • s = undefined
  • S = undefined
  • Hs = good, "happy path" of a failing healthcheck that has been remediated but not yet deleted or returned to passing
  • HS = good, this is what we expect the controller to see before it remediates
  • hs = good, this will be temporarily the state when the deletion of the machine is pending or if/when we support non-deletion remediation
  • hS = undefined, if remediation fails but the health check is passing it's not clear how we got here. Maybe subsequent healthchecks passed for some reason, or the healthcheck was changed? Same concerns as raised in your comment here.

with 3 conditions it's 3³ = 27 and I can't be bothered 😄
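
To make the transition rules quoted in the hunk above concrete, here is a rough sketch (illustration only). It reuses the hypothetical HealthCheckSucceededCondition / OwnerRemediatedCondition constants and the #3017-style helpers from the earlier sketch, and assumes a `conditions.Delete` helper exists; note that later in this thread the decision is to never flip a failing Machine back to healthy, so this recovery branch may not survive into the final proposal.

```go
// Hypothetical sketch of the "health check passes after it has failed" rule
// from the proposal text quoted above; constants and helpers are the
// illustrative ones from the earlier sketch.
func handleRecovery(machine *clusterv1.Machine) {
	// Mark the Machine as passing its health check again ...
	conditions.MarkTrue(machine, HealthCheckSucceededCondition)
	// ... and drop the remediation request so the owner controller stops
	// remediating. If OwnerRemediated is still False on a later check and the
	// Machine has not recovered, the MHC leaves that condition untouched.
	conditions.Delete(machine, OwnerRemediatedCondition)
}
```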

@benmoss benmoss mentioned this pull request May 18, 2020
@vincepri (Member)

@benmoss What's the status of this proposal? Are there any conversations we can resolve? Let's try to get it merge-able and implementable by EOW if possible.

@benmoss (Author) commented May 20, 2020

Remaining questions I know of:

  • @JoelSpeed @n1r1 and @enxebre want to have a conversation about external remediation and how this fits in ref
  • @enxebre feels strongly that the fact that this proposal introduces weird behavior around overlapping MHCs is not acceptable ref
  • If the healthcheck passes but remediation has already been performed and failed, what should be done? ref
  • Should we go false-y with the conditions in order to make their absence imply everything is AOK, and fix the weird semantics around ConditionOwnerRemediationSucceeded=False? ref
  • Is there a possibility of updates to Conditions clobbering other conditions due to array update/patch semantics? ref

And we still need to wait for #3017 to merge, obviously.

@vincepri (Member)

@benmoss Any updates for this proposal after last week's meeting?

@benmoss (Author) commented May 26, 2020

We got consensus around moving forward on this and leaving the bigger problems around external remediation to that proposal. I'm not sure if @enxebre wants to discuss the problem around overlapping checks more. I think we are going to punt on partial remediation failures by never changing a failing machine back to healthy.

I spoke to @detiber about the update/patch semantics problem; it seems like we're going to have to use Update exclusively whenever we change Machine conditions, since we will have two controllers that are both updating and we don't want them to clobber each other's changes.
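
As a rough illustration of what "use Update exclusively" could look like (a sketch under assumptions, not part of this PR): re-read the Machine, mutate its conditions, and write the status back with Update inside a conflict retry, so a stale copy never overwrites another controller's changes. The helper name and wiring are hypothetical; `retry.RetryOnConflict` is from client-go and the client is a controller-runtime client.

```go
// Hypothetical sketch: updating Machine conditions with Update (optimistic
// concurrency) rather than Patch, so two controllers cannot silently clobber
// each other's entries in the conditions slice.
package machinehealthcheck

import (
	"context"

	"k8s.io/client-go/util/retry"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateMachineConditions re-reads the Machine, applies mutate to it, and
// writes the status back, retrying on resourceVersion conflicts.
func updateMachineConditions(ctx context.Context, c client.Client, key client.ObjectKey, mutate func(*clusterv1.Machine)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		machine := &clusterv1.Machine{}
		if err := c.Get(ctx, key, machine); err != nil {
			return err
		}
		mutate(machine)
		// Update fails with a Conflict error if another controller wrote the
		// Machine in the meantime; in that case we retry with a fresh copy
		// instead of overwriting its changes.
		return c.Status().Update(ctx, machine)
	})
}
```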

@benmoss (Author) commented May 26, 2020

/hold cancel

If this looks good to people in its current form, can we get some lgtms?

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 26, 2020
@vincepri (Member)

Needs squash

@vincepri (Member) left a comment

/approve
/milestone v0.3.7

@k8s-ci-robot k8s-ci-robot added this to the v0.3.7 milestone May 26, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benmoss, vincepri


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2020
@ncdc (Contributor) commented May 27, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7d12a45 into kubernetes-sigs:master May 27, 2020