
✨ Update MHC proposal to use conditions instead of annotations #3056

Merged: 1 commit, May 27, 2020

Conversation

@benmoss commented May 13, 2020

What this PR does / why we need it:
We realized during the review of #2975 that using annotations would amount to a privilege escalation for users who had machine.edit access but not machine.delete.

So instead we're proposing that we use conditions (#3017) because that has the benefit of being a part of the status, which requires a higher level of privilege that is not commonly held by users.

That CAEP is still pending so this is effectively blocked on that proposal being merged first.

/hold

xref #2836, #2920

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 13, 2020
@vincepri (Member) commented May 15, 2020

@n1r1 @enxebre @JoelSpeed Lots of conversations are happening above; I'll try to summarize here what we're trying to achieve and why.

Goals

  • Migrate the logic from annotations to status.conditions.
  • Add a HealthCheckSucceeded condition to Machine objects.
    • The MachineHealthCheck controller watches Machines, and if it doesn't find the condition, it writes one.
    • During subsequent checks, if a health check fails, the MachineHealthCheck controller transitions the condition's State to False.

      Note: for now we'll only do this once; in the future we can chat about allowing it to be flipped back, but probably only after remediation (this discussion should happen in the external proposal).

  • Add an OwnerRemediated condition to Machine objects.
    • When the MachineHealthCheck controller sees that a Machine has gone unhealthy, it also writes this condition with its State set to False.
    • Owner-Controllers watching these Machines (e.g. KCP or MachineSet) trigger their own built-in remediation based on OwnerRemediated being False.
    • Owner-Controllers will, in this case, ultimately try to delete the Machine and set OwnerRemediated to True after setting the deletion timestamp.

      Note: if the patch to set the condition state after deleting the Machine fails, we'll fail gracefully given that this is expected. In most cases, we'll be able to patch the condition given that deleting a Machine isn't a quick operation (except probably for CAPD).

Observations

  • Adding two conditions instead of a single one allows us to expand (in the future) the remediations we want to support (like external) by either not writing the second condition, or writing a different one based on some strategy (details tbd in the other proposal).
  • In the future, I can see users potentially asking to disable remediation completely (e.g. by setting a remediation.strategy = none) — this use case is going to be fully covered by allowing MHC to not write any remediation condition.
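
For concreteness, here is a minimal sketch (not part of this PR) of how the MHC controller could write the two conditions described in the Goals above. It assumes the conditions utilities proposed in #3017 land as something like a `util/conditions` package with `MarkTrue`/`MarkFalse` helpers; the condition type constants, reasons, and messages are hypothetical names used only for illustration.

```go
// Hypothetical sketch: how the MHC reconciler could record a health check
// result on a Machine, assuming conditions helpers along the lines of #3017.
package machinehealthcheck

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/cluster-api/util/conditions"
)

const (
	// Hypothetical condition types mirroring the Goals above.
	HealthCheckSucceededCondition clusterv1.ConditionType = "HealthCheckSucceeded"
	OwnerRemediatedCondition      clusterv1.ConditionType = "OwnerRemediated"
)

// recordHealthCheckResult records the outcome of a single health check on a
// Machine's status conditions.
func recordHealthCheckResult(machine *clusterv1.Machine, healthy bool) {
	if healthy {
		// Health check passed: mark the condition True.
		conditions.MarkTrue(machine, HealthCheckSucceededCondition)
		return
	}
	// Health check failed: flip HealthCheckSucceeded to False and ask the
	// owner controller (KCP, MachineSet, ...) to remediate by also setting
	// OwnerRemediated to False.
	conditions.MarkFalse(machine, HealthCheckSucceededCondition,
		"UnhealthyNode", clusterv1.ConditionSeverityWarning,
		"Machine failed the MachineHealthCheck criteria")
	conditions.MarkFalse(machine, OwnerRemediatedCondition,
		"WaitingForRemediation", clusterv1.ConditionSeverityWarning,
		"Machine is unhealthy and waiting for its owner controller to remediate")
}
```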

This annotation is applied by the MHC controller and it is expected that it
will never be removed. The only current remediation strategy is deletion, which
is to be handled by the machine's owning controller.
Both of these conditions are applied by the MHC controller. ConditionHealthCheckSucceeded should only be updated by the MHC after running the health check.
Member

I'm wondering if it would make sense to invert the logic here and have the condition managed by the MHC controller represent a failed status; that way one could interpret the absence of the condition as being false.

@benmoss (Author) May 18, 2020

I take it this is related to your comment about relaxing the need for truthy conditions always being "good"? So you mean ConditionHealthCheckFailed would be true here, and the absence can be false?

Member

Indeed, I'm just thinking of the fact that we aren't really able to verify that a Machine is actually healthy; we are looking to trigger behavior when the health check fails rather than when it succeeds, and we aren't really trying to block any additional behavior/processes until the health check passes.

will never be removed. The only current remediation strategy is deletion, which
is to be handled by the machine's owning controller.
Both of these conditions are applied by the MHC controller. ConditionHealthCheckSucceeded should only be updated by the MHC after running the health check.
If a health check passes after it has failed, it should remove the ConditionOwnerRemediationSucceeded condition and set ConditionHealthCheckSucceeded to True.
Member

Since it looks like we are going to have multiple controllers modifying conditions (a slice), including operations that remove an entry from the slice, do we have plans on how to perform these operations safely, especially since removal will require an index?

I'm wondering if some type of locking will need to be done around potentially dangerous operations or if we should rely on Update semantics vs Patch for these types of changes.

@benmoss (Author) May 18, 2020

This is a good point; kubernetes/kubernetes#7856 (comment) brings this up. This seems like a broader concern around conditions that should maybe be raised in that CAEP.

If a health check passes after it has failed, it should remove the ConditionOwnerRemediationSucceeded condition and set ConditionHealthCheckSucceeded to True.
ConditionOwnerRemediationSucceeded is set to False after a health check fails, but should be changed to True by the owning controller after remediation succeeds.
If remediation fails, ConditionOwnerRemediationSucceeded can be updated to a higher severity and the reason can be updated to aid in troubleshooting.
In the event that ConditionOwnerRemediationSucceeded is still False when the next health check is run, the MHC controller will not update this condition.
Member

It seems like we are overloading the use of ConditionOwnerRemediationSucceeded here to mean both that we intend for the resource owner to remediate and the success/failure of that remediation. Would it make sense to break this out into two separate conditions to make the distinction between the two intentions/uses clear?

@benmoss (Author) May 18, 2020

Hmm, this one is stretching my brain a little. When we increase the number of conditions, we increase the number of weird state combinations we can get.

h = healthcheck succeeded True
s = remediation succeeded True
H = healthcheck succeeded False
S = remediation succeeded False

with 2 conditions (each of which can be absent, True, or False) that's 3² = 9 combinations; leaving out the initial state where neither is set:

  • h = good, passing healthcheck
  • H = undefined, failing healthcheck but no remediation
  • s = undefined
  • S = undefined
  • Hs = good, "happy path" of a failing healthcheck that has been remediated but not yet deleted or returned to passing
  • HS = good, this is what we expect the controller to see before it remediates
  • hs = good, this will be temporarily the state when the deletion of the machine is pending or if/when we support non-deletion remediation
  • hS = undefined, if remediation fails but the health check is passing it's not clear how we got here. Maybe subsequent healthchecks passed for some reason, or the healthcheck was changed? Same concerns as raised in your comment here.

with 3 conditions it's 3³ = 27 and I can't be bothered 😄
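
To make the transition rules quoted in the hunk above concrete, here is a rough sketch (illustration only). It reuses the hypothetical HealthCheckSucceededCondition / OwnerRemediatedCondition constants and the #3017-style helpers from the earlier sketch, and assumes a `conditions.Delete` helper exists; note that later in this thread the decision is to never flip a failing Machine back to healthy, so this recovery branch may not survive into the final proposal.

```go
// Hypothetical sketch of the "health check passes after it has failed" rule
// from the proposal text quoted above; constants and helpers are the
// illustrative ones from the earlier sketch.
func handleRecovery(machine *clusterv1.Machine) {
	// Mark the Machine as passing its health check again ...
	conditions.MarkTrue(machine, HealthCheckSucceededCondition)
	// ... and drop the remediation request so the owner controller stops
	// remediating. If OwnerRemediated is still False on a later check and the
	// Machine has not recovered, the MHC leaves that condition untouched.
	conditions.Delete(machine, OwnerRemediatedCondition)
}
```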

@benmoss benmoss mentioned this pull request May 18, 2020
@vincepri (Member)

@benmoss What's the status of this proposal? Are there any conversations we can resolve? Let's try to get it merge-able and implementable by EOW if possible.

@benmoss (Author) commented May 20, 2020

Remaining questions I know of:

  • @JoelSpeed @n1r1 and @enxebre want to have a conversation about external remediation and how this fits in ref
  • @enxebre feels strongly that the fact that this proposal introduces weird behavior around overlapping MHCs is not acceptable ref
  • If the healthcheck passes but remediation has already been performed and failed, what should be done? ref
  • Should we go false-y with the conditions in order to make their absence imply everything is AOK, and fix the weird semantics around ConditionOwnerRemediationSucceeded=False? ref
  • Is there a possibility of updates to Conditions clobbering other conditions due to array update/patch semantics? ref

And we still need to wait for #3017 to merge, obviously.

@vincepri (Member)

@benmoss Any updates for this proposal after last week's meeting?

@benmoss (Author) commented May 26, 2020

We got consensus around moving forward on this and leaving the bigger problems around external remediation to that proposal. I'm not sure if @enxebre wants to discuss the problem around overlapping checks more. I think we are going to punt on partial remediation failures by never changing a failing machine back to healthy.

I spoke to @detiber about the update/patch semantics problem; it seems like we're going to have to use Update exclusively whenever we change Machine conditions, since we will have two controllers that are both updating and we don't want them to clobber each other's changes.
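
As a rough illustration of what "use Update exclusively" could look like (a sketch under assumptions, not part of this PR): re-read the Machine, mutate its conditions, and write the status back with Update inside a conflict retry, so a stale copy never overwrites another controller's changes. The helper name and wiring are hypothetical; `retry.RetryOnConflict` is from client-go and the client is a controller-runtime client.

```go
// Hypothetical sketch: updating Machine conditions with Update (optimistic
// concurrency) rather than Patch, so two controllers cannot silently clobber
// each other's entries in the conditions slice.
package machinehealthcheck

import (
	"context"

	"k8s.io/client-go/util/retry"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateMachineConditions re-reads the Machine, applies mutate to it, and
// writes the status back, retrying on resourceVersion conflicts.
func updateMachineConditions(ctx context.Context, c client.Client, key client.ObjectKey, mutate func(*clusterv1.Machine)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		machine := &clusterv1.Machine{}
		if err := c.Get(ctx, key, machine); err != nil {
			return err
		}
		mutate(machine)
		// Update fails with a Conflict error if another controller wrote the
		// Machine in the meantime; in that case we retry with a fresh copy
		// instead of overwriting its changes.
		return c.Status().Update(ctx, machine)
	})
}
```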

@benmoss (Author) commented May 26, 2020

/hold cancel

If this looks good to people in its current form, can we get some lgtms?

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 26, 2020
@vincepri (Member)

Needs squash

@vincepri (Member) left a comment

/approve
/milestone v0.3.7

@k8s-ci-robot k8s-ci-robot added this to the v0.3.7 milestone May 26, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benmoss, vincepri


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2020
@ncdc (Contributor) commented May 27, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7d12a45 into kubernetes-sigs:master May 27, 2020