✨ Update MHC proposal with new annotation strategy #2920
Conversation
@@ -108,6 +109,11 @@ As an operator of a Management Cluster, I want my machines to be self-healing an

### Implementation Details/Notes/Constraints

#### Machine annotation:

```go
const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"
```
I'm wondering if this should have a timestamp as its value?
We need to expand this section to give some guarantees, for example if this is a timestamp, do we keep updating the annotation at every reconciliation?
I think if we track timestamps, we likely need two (unhealthy since, and last health check) to ensure that we are operating against the information the way we expect to. That way we could differentiate between a persistent unhealthy state and lack of reconciliation.
As an aside, to support different external remediation (outside of the owning controller), such as the request from #2846, would it make sense to allow for overriding the annotation key here?
Having different timestamps, while interesting, unfortunately also means that the Machine and everything that watches it would be updated every time the check is performed.
to support different external remediation (outside of the owning controller), such as the request from #2846, would it make sense to allow for overriding the annotation key here?
I'd expect that at some point (proposal tbd) that we'll allow different strategies for remediation.
Maybe to start, we should have a simple annotation with no timestamp on it?
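The simple, timestamp-free variant could look like this sketch. The `markUnhealthy` helper is an illustrative assumption about how the check stays idempotent, not actual MHC code:

```go
package main

import "fmt"

// MachineUnhealthyAnnotation is the key from the proposal; the empty-value
// handling below is an illustrative sketch of the timestamp-free option.
const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"

// markUnhealthy sets the annotation with no value, so repeated health checks
// are idempotent and do not generate spurious updates for watchers.
// It reports whether the annotations actually changed.
func markUnhealthy(annotations map[string]string) (map[string]string, bool) {
	if annotations == nil {
		annotations = map[string]string{}
	}
	if _, ok := annotations[MachineUnhealthyAnnotation]; ok {
		return annotations, false // already marked; no update needed
	}
	annotations[MachineUnhealthyAnnotation] = ""
	return annotations, true
}

func main() {
	ann, changed := markUnhealthy(nil)
	fmt.Println(changed)
	_, changed = markUnhealthy(ann)
	fmt.Println(changed)
}
```

Because the value never changes once set, repeated reconciliations produce no writes, avoiding the watch churn raised above.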
Would it also be possible to have an annotation that signals when the machine is unhealthy but the timeout has not yet been reached (machine under probation)?
I'd say that until we have a use case for it being a timestamp, let's not add that extra complexity
Also, you can't trust that the timestamp is not skewed (best case) or actively wrong.
/assign @JoelSpeed @enxebre
/milestone v0.3.4
LGTM from my side.
This work might be relevant for conditions as well.
Two thoughts crossed my mind while reading this as it currently stands:
- Have we noted down anywhere how remediation will happen for any of the controllers? Should this be documented somewhere?
- What does the annotation mean? Are we expecting it to always be present on an unhealthy Machine? Are we expecting other controllers to remove it? Does this really matter right now, and can we change it later if we find the model needs changing?
I'd say that until we have a use case for it being a timestamp, let's not add that extra complexity
Let's take a little bit more time to think about the annotation naming and key-value definition. If the default strategy is "controller owner needs to delete", we should communicate it.
Related to the above, the expectation for this iteration should be that the Machine is deleted after we set the annotation; it doesn't need to be removed. What do you all think?
As it looks at the moment, this update would effectively enable any "remediation" controller to do anything, with no clear expectations or contract with the core MHC. It would leave a lot of unresolved ambiguity that might result in consumer confusion and antipatterns: e.g. what happens if the remediator chooses not to delete the machine but to do something else, or if two remediators race against the annotation, etc. We should not enable this programmatically until there's a clear, particular proposal and plan for it. For the scope of this PR I'd suggest we keep it to solving the particular problem you are trying to overcome here, i.e. remediating machines owned by the KCP. To that end I'd suggest we programmatically enforce this in the MHC by checking whether a machine is a control plane one and only setting the annotation in that case. Then, if we want to generalise the annotation mechanism, I think that deserves a fleshed-out proposal covering all the scenarios mentioned above, with clear expectations and contracts between components.
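The KCP-only enforcement suggested above could be sketched roughly as follows. `ownerRef` and `shouldAnnotate` are hypothetical stand-ins for Cluster API's real owner-reference types, not the actual MHC implementation:

```go
package main

import "fmt"

// ownerRef is a reduced stand-in for metav1.OwnerReference, keeping only
// the fields this sketch needs.
type ownerRef struct {
	kind       string
	controller bool
}

// shouldAnnotate encodes the suggested enforcement: the MHC only sets the
// unhealthy annotation when the machine's controller owner is a
// KubeadmControlPlane.
func shouldAnnotate(owners []ownerRef) bool {
	for _, o := range owners {
		if o.controller && o.kind == "KubeadmControlPlane" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldAnnotate([]ownerRef{{kind: "KubeadmControlPlane", controller: true}}))
	fmt.Println(shouldAnnotate([]ownerRef{{kind: "MachineSet", controller: true}}))
}
```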
We need to call out that the expectation is that with this strategy the owner deletes the Machine.
During our discussion we agreed to rework the proposal in a way that touches any Machine with a controller owner, not just KCP-owned Machines. /cc @ncdc
@fmuyassarov Thanks for the reminder, we won't be tackling external machine remediation in this amendment to the MHC proposal.
@enxebre Is it sufficiently explicit now that no controller that works with MHC is to do anything but handle deletion? We can generalize it and relax the contract once we have that subsequent proposal, but for now can't it be understood that the two implementing controllers are
Added a few comments that might improve clarity, but otherwise I'm happy with the changes described.
Yeah, that's exactly my point: it does not need to be understood, it's enforced by code already. You can just enable annotations for KCP now and generalise afterwards based on the upcoming proposal. Anyway, don't take my feedback as a blocker, I'm fine with whatever path you choose to go :)
Thanks @enxebre, I think it's great feedback and I'll make sure to broadcast it when we get there in the external remediation proposal. The main reason I want to avoid building something just for KCP is that it sets a precedent for tying controller-specific behavior to another controller. In general, Cluster API should strive (whenever possible) to be agnostic and provide generic interfaces.
- The owning controller, e.g. the MachineSet controller, reconciles to meet the number of replicas and starts the process of bringing up a new machine/node.
- The owning controller observes the annotation and is responsible for remediating the machine.
- The owning controller performs any pre-deletion tasks required.
- The owning controller MUST delete the machine.
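The owning controller's side of this contract could be sketched as below. The `machine` type and the `drainNode` hook are hypothetical stand-ins for `clusterv1.Machine` and the owner's pre-deletion tasks, not actual controller code:

```go
package main

import "fmt"

const MachineUnhealthyAnnotation = "machine.cluster.x-k8s.io/unhealthy"

// machine is a stand-in for clusterv1.Machine, reduced to what this
// sketch needs.
type machine struct {
	name        string
	annotations map[string]string
	deleted     bool
}

// remediate sketches the owning controller's responsibilities: observe the
// annotation, run any pre-deletion tasks, then delete the Machine. It
// reports whether remediation was performed.
func remediate(m *machine, drainNode func(*machine)) bool {
	if _, unhealthy := m.annotations[MachineUnhealthyAnnotation]; !unhealthy {
		return false // healthy machines are left alone
	}
	drainNode(m)     // pre-deletion tasks required by the owner
	m.deleted = true // the owning controller MUST delete the machine
	return true
}

func main() {
	m := &machine{
		name:        "cp-0",
		annotations: map[string]string{MachineUnhealthyAnnotation: ""},
	}
	fmt.Println(remediate(m, func(*machine) {}), m.deleted)
}
```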
Why must the machine be deleted?
For the purpose of this effort, which is mainly to separate the responsibilities between controllers, we want to make sure the expected outcomes are spelled out.
In the future, when new proposals come out and have new strategies, we'll redesign this.
I'm happy that the changes are sufficiently descriptive now so "owning controllers" know what they should be doing
/lgtm
/test pull-cluster-api-capd-e2e
Just a small nit.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: benmoss, vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@JoelSpeed @enxebre
/hold Should we squash this down before merging? If not, cancel the hold; otherwise I'm happy with the content.
Apply suggestions from code review (Co-Authored-By: Vince Prignano <vince@vincepri.com>)
More edits
Add actual implementation history
Rewrite latest alternative to represent that it was actually implemented
Apply suggestions from code review (Co-Authored-By: Jason DeTiberus <detiberusj@vmware.com>)
Make deletion explicit
Apply suggestions from code review (Co-Authored-By: Joel Speed <Joel.speed@hotmail.co.uk>)
Reorder downstream controller actions
Add missing word
I hope this unblocks KCP and enables exploration, but I really hope eventually the MachineSet does not need to know anything about an MHC annotation. Happy to leave it to @JoelSpeed to unhold as appropriate.
I echo this sentiment. /hold cancel
@enxebre @JoelSpeed The unhealthy annotation is a placeholder until we have a formalized condition workflow coming in v1alpha3 (as informational only) and with extended use in v1alpha4.
What this PR does / why we need it:
Updates the MHC proposal with a new strategy so that MHC can support machines managed by KCP.
Related to #2836
@enxebre @vincepri @JoelSpeed
/kind proposal