✨ Update MHC proposal to use conditions instead of annotations #3056
Conversation
@n1r1 @enxebre @JoelSpeed Lots of conversations happening above, I'll try to summarize here what we're trying to achieve and why.

Goals

Observations
This annotation is applied by the MHC controller and it is expected that it
will never be removed. The only current remediation strategy is deletion, which
is to be handled by the machine's owning controller.
Both of these conditions are applied by the MHC controller. ConditionHealthCheckSucceeded should only be updated by the MHC after running the health check.
I'm wondering if it would make sense to invert the logic here and have the condition managed by the MHC controller represent failed status, that way one could interpret the absence of the condition as being false.
I take it this is related to your comment about relaxing the need for truthy conditions always being "good"? So you mean ConditionHealthCheckFailed would be True here, and its absence can be interpreted as False?
Indeed. I'm just thinking of the fact that we aren't really able to verify that a Machine is actually healthy: we're looking to trigger behavior when the health check fails rather than when it succeeds, and we aren't trying to block any additional behavior or processes until the health check passes.
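The inverted interpretation being discussed could look something like the sketch below. The `HealthCheckFailed` type name, the struct fields, and the `needsRemediation` helper are all hypothetical stand-ins, not the proposal's API; the point is only that absence of the condition reads as "not failed".

```go
package main

import "fmt"

// Minimal stand-ins for the API under discussion; these names and
// fields are illustrative only.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type Condition struct {
	Type   string
	Status ConditionStatus
}

// Hypothetical inverted condition type.
const HealthCheckFailed = "HealthCheckFailed"

// needsRemediation treats the absence of a HealthCheckFailed condition
// as "not failed", so the MHC only ever has to assert failure.
func needsRemediation(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == HealthCheckFailed && c.Status == ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsRemediation(nil)) // no condition at all reads as healthy
	fmt.Println(needsRemediation([]Condition{{Type: HealthCheckFailed, Status: ConditionTrue}}))
}
```

With this shape nothing ever has to write a "healthy" state, which matches the observation that we only trigger behavior on failure.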
will never be removed. The only current remediation strategy is deletion, which
is to be handled by the machine's owning controller.
Both of these conditions are applied by the MHC controller. ConditionHealthCheckSucceeded should only be updated by the MHC after running the health check.
If a health check passes after it has failed, it should remove the ConditionOwnerRemediationSucceeded condition and set ConditionHealthCheckSucceeded to True.
Since it looks like we are going to have multiple controllers modifying conditions (a slice), including operations that remove an entry from the slice, do we have plans on how to perform these operations safely, especially since removal will require an index?
I'm wondering if some type of locking will need to be done around potentially dangerous operations or if we should rely on Update semantics vs Patch for these types of changes.
This is a good point, kubernetes/kubernetes#7856 (comment) brings this up. This seems like a broader concern around conditions that maybe should be raised in that CAEP.
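For reference, the usual way to sidestep index-based mutation is to upsert conditions by Type, so no caller ever holds a slice index, and to rely on Update semantics (optimistic concurrency on resourceVersion) so a concurrent writer gets a conflict to retry instead of silently clobbering another controller's entry, as a merge Patch replacing the whole slice could. A minimal sketch with hypothetical names:

```go
package main

import "fmt"

// Condition is a minimal stand-in; field names are illustrative.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// setCondition upserts by Type rather than by position, so removal
// or replacement never requires coordinating on a slice index.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	conds := []Condition{{Type: "HealthCheckSucceeded", Status: "True"}}
	conds = setCondition(conds, Condition{Type: "HealthCheckSucceeded", Status: "False", Reason: "NodeNotReady"})
	conds = setCondition(conds, Condition{Type: "OwnerRemediationSucceeded", Status: "False"})
	fmt.Println(len(conds), conds[0].Status)
}
```

This only addresses addressing-by-type; the cross-controller write conflict itself still needs Update-with-retry (or server-side apply) as discussed in the linked issue.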
If a health check passes after it has failed, it should remove the ConditionOwnerRemediationSucceeded condition and set ConditionHealthCheckSucceeded to True.
ConditionOwnerRemediationSucceeded is set to False after a health check fails, but should be changed to True by the owning controller after remediation succeeds.
If remediation fails, ConditionOwnerRemediationSucceeded can be updated to a higher severity and the reason can be updated to aid in troubleshooting.
In the event that ConditionOwnerRemediationSucceeded is still False when the next health check is run, the MHC controller will not update this condition.
It seems like we are overloading the use of ConditionOwnerRemediationSucceeded here to mean both that we intend for the resource owner to remediate and the success/failure of that remediation. Would it make sense to break this out into two separate conditions to make the distinction between the two intentions/uses clear?
Hmm, this one is stretching my brain a little. Every condition we add increases the number of weird state combinations we can get.
h = healthcheck succeeded True
s = remediation succeeded True
H = healthcheck succeeded False
S = remediation succeeded False
with 2 conditions (each absent, True, or False) it's 3^2, minus the all-absent state:
- h = good, passing healthcheck
- H = undefined, failing healthcheck but no remediation
- s = undefined
- S = undefined
- Hs = good, the "happy path" of a failing healthcheck that has been remediated but not yet deleted or returned to passing
- HS = good, this is what we expect the controller to see before it remediates
- hs = good, this will temporarily be the state while deletion of the machine is pending, or if/when we support non-deletion remediation
- hS = undefined, if remediation failed but the health check is passing it's not clear how we got here. Maybe subsequent healthchecks passed for some reason, or the healthcheck was changed? Same concerns as raised in your comment here.
with 3 conditions it's 3^3 and I can't be bothered 😄
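To make the counting concrete: treating each condition as absent, True, or False gives 3 states per condition, so n conditions yield 3^n combinations. A throwaway sketch (condition names as used in the shorthand above):

```go
package main

import "fmt"

func main() {
	states := []string{"absent", "True", "False"}
	// Enumerate every combination for the two conditions discussed above.
	n := 0
	for _, h := range states { // HealthCheckSucceeded
		for _, s := range states { // OwnerRemediationSucceeded
			fmt.Printf("healthcheck=%-6s remediation=%s\n", h, s)
			n++
		}
	}
	fmt.Println(n, "combinations") // 3^2 = 9; a third condition gives 3^3 = 27
}
```

Each extra condition multiplies the state space by 3, which is the practical cost of splitting one condition into two.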
@benmoss What's the status of this proposal? Are there any conversations we can resolve? Let's try to get it mergeable and implementable by EOW if possible.
Remaining questions I know of:
And we still need to wait for #3017 to merge, obviously.
@benmoss Any updates for this proposal after last week's meeting?
We got consensus around moving forward on this and leaving the bigger problems around external remediation to that proposal. I'm not sure if @enxebre wants to discuss the problem around overlapping checks more. I think we are going to punt on partial remediation failures by never changing a failing machine back to healthy. I spoke to @detiber about the update/patch semantics problem; it seems like we're going to have to use
/hold cancel

If this looks good to people in its current form, can we get some lgtms?
Needs squash
/approve
/milestone v0.3.7
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benmoss, vincepri

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/lgtm
What this PR does / why we need it:
We realized during the review of #2975 that using annotations would amount to a privilege escalation for users who had machine.edit access but not machine.delete. So instead we're proposing that we use conditions (#3017), because conditions have the benefit of being part of the status, which requires a higher level of privilege that is not commonly held by users.
That CAEP is still pending, so this is effectively blocked on that proposal being merged first.
/hold
xref #2836, #2920