📖 External Remediation Proposal #3190
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @n1r1!
Hi @n1r1. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/check-cla
/kind feature
Hi @benmoss, we have spoken at length with the authors of this proposal, and my recollection was that everyone in those sessions agreed to the following plans:
This proposal represents the 2nd item. I don't believe the proposal is specific to rebooting, although it does mention it a couple of times. We do want to offer flexibility to our users for various aspects of the system, and this is one area where we can do so.
A bunch of nits but overall looks great, happy we're moving forward on this
Could this make use of the 'Node Maintenance' proposal?
@bboreham, once the Node Maintenance Lease is implemented we can use it in the following cases:
Any other uses you had in mind?
@n1r1 thanks for working on this proposal, really appreciated!
I like the project embracing new use cases, and with this proposal and the lifecycle hooks proposal, I'm pretty sure we are offering CAPI users a set of powerful new extension points.
Overall LGTM, with a small nit/suggestion about the field name that can be addressed/rediscussed later in the implementation phase (not blocking now).
...
// +optional
RemediationTemplate ObjectReference `json:"remediationTemplate,omitempty"`
nit
godoc is missing (it would help to clarify the intent of this field)
Also, what about adding a prefix like "external" or "custom" in order to make it more explicit that this field is optional and to suggest that if the value is empty a default behavior applies?
"ExternalRemediationTemplate" could make sense.
This will need to be a pointer
This should also likely be a custom type scoped to the exact type of information needed, rather than corev1.ObjectReference.
Thinking aloud, I suspect the following information would be needed:
apigroup
kind
name
We should also likely avoid storing the version for the referenced type, otherwise we potentially need to worry about migration strategies for these loosely coupled types.
However, if we don't encode the version, then instantiating the template would require a way to look up (through discovery or some other means) the correct version of the resource to create.
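For illustration only, a minimal sketch of the kind of scoped reference type being discussed, using hypothetical type and field names (not an agreed API), with the version deliberately omitted as suggested above:

```go
// Hypothetical sketch only; names are illustrative, not the final API.
// The version is deliberately left out, so a controller instantiating the
// template would resolve the served version via discovery at runtime.
type ExternalRemediationTemplateRef struct {
	// APIGroup of the referenced template, e.g. "infrastructure.cluster.x-k8s.io".
	APIGroup string `json:"apiGroup"`
	// Kind of the referenced template CRD.
	Kind string `json:"kind"`
	// Name of the template object to clone.
	Name string `json:"name"`
}
```

In MachineHealthCheckSpec the field would then become an optional pointer (e.g. `RemediationTemplate *ExternalRemediationTemplateRef`), so that an unset value clearly selects the default remediation flow.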
"ExternalRemediationTemplate" could make sense.
👍
This will need to be a pointer
👍
I suspect the following information would be needed:
apigroup
kind
name
isn't it what we have in MHC CR under externalRemediationTemplate
?
If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow passed to an External Remediation Controller (ERC) watching for that CR.

No further action (deletion or applying conditions) will be taken by the MachineHealthCheck controller until the Node becomes healthy, when it will locate and delete the instantiated MachineRemediation CR.
We are currently working on the Remediation Controller proposal in CAPM3. This is something we would like to see in place when the external remediation proposal is approved. Is anyone against the MHC deleting the MachineRemediation CR when the Node becomes healthy?
MHC deleting MachineRemediation 👍
I think the concept here should be elaborated on a bit more... Some questions that I have:
Is the idea that MHC continues to perform health checks, just not attempts to mark the Machine for the "normal" remediation paths?
Does MachineHealthCheck wait for the machine remediation CR to report that it's finished prior to allowing for the deletion of the machine remediation CR? If not, I suspect we need to potentially worry about race conditions.
What should be done if the machine remediation controller has performed its actions and the MHC is still failing for the Machine? Do we have a way to signal that and fall back to the default workflow, or do we require the machine remediation controller to somehow handle that?
> Is the idea that MHC continues to perform health checks, just not attempts to mark the Machine for the "normal" remediation paths?

Correct.

> Does MachineHealthCheck wait for the machine remediation CR to report that it's finished prior to allowing for the deletion of the machine remediation CR? If not, I suspect we need to potentially worry about race conditions.

MHC doesn't wait for anything from the ERC or the EMR CR. Can you provide examples of race conditions that may occur? We thought the ERC could add a finalizer to the machine remediation CR, if it needs to, which might help avoid race conditions.
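To make the finalizer idea concrete, here is a rough sketch (assuming controller-runtime's controllerutil helpers and an illustrative finalizer name; this is not part of the proposal) of how an ERC could hold the remediation CR until it has finished, so the MHC's delete cannot race with in-flight remediation:

```go
package erc

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// remediationFinalizer is a hypothetical finalizer name owned by the ERC.
const remediationFinalizer = "remediation.example.com/erc-cleanup"

// reconcileRemediation sketches the guard: while the finalizer is present,
// the MHC's delete only marks the CR for deletion; the CR actually goes away
// once the ERC has finished its work and removed the finalizer.
func reconcileRemediation(ctx context.Context, c client.Client, rem client.Object) (ctrl.Result, error) {
	if rem.GetDeletionTimestamp().IsZero() {
		// CR is active: ensure our finalizer is set before starting any work.
		if !controllerutil.ContainsFinalizer(rem, remediationFinalizer) {
			controllerutil.AddFinalizer(rem, remediationFinalizer)
			return ctrl.Result{}, c.Update(ctx, rem)
		}
		// ... perform or continue the remediation (e.g. power-cycle the host) ...
		return ctrl.Result{}, nil
	}

	// MHC deleted the CR (the Node became healthy again): finish any cleanup,
	// then release the object by removing the finalizer.
	controllerutil.RemoveFinalizer(rem, remediationFinalizer)
	return ctrl.Result{}, c.Update(ctx, rem)
}
```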
Hi, what is the situation with this proposal? Could it be merged soon? And how should we proceed with the implementation after merging?
@jan-est hi, apologies for not getting back to this in a more timely manner. A lot of us have been heads-down on other things. I hope some of us can free up in the next few days to give this a proper review. Thanks for your continued patience.
I'll take a look early next week. Apologies again for the delay; I haven't had time to go over it yet.
We introduce a generic mechanism for supporting externally provided custom remediation strategies.

We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.
nit: we should avoid any new additions of corev1.ObjectReference to align with future goals of removing the current uses: #2318
Suggested change:
- We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.
+ We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, a reference to a provider-specific template CRD.
Hmm, I believe that this relies on generateTemplate(..), which receives an ObjectReference. Can we use generateTemplate() without having an ObjectReference?
We could likely refactor the core logic of generateTemplate into a separate function that takes individual parameters (or an input struct) rather than an ObjectReference, calling that function from both net new code and still providing generateTemplate() for backward compatibility until we've removed existing use of ObjectReference.
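As a rough illustration of that refactor (the real generateTemplate signature isn't shown in this thread, so the shapes below are assumptions, not the existing code):

```go
package external

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// TemplateParams is a hypothetical input struct carrying only what the core
// logic needs, decoupled from corev1.ObjectReference.
type TemplateParams struct {
	APIVersion string
	Kind       string
	Name       string
	Namespace  string
}

// generateTemplateFromParams would hold the core template-cloning logic.
func generateTemplateFromParams(p TemplateParams) (*unstructured.Unstructured, error) {
	obj := &unstructured.Unstructured{}
	obj.SetAPIVersion(p.APIVersion)
	obj.SetKind(p.Kind)
	obj.SetName(p.Name)
	obj.SetNamespace(p.Namespace)
	// ... fetch the referenced template and clone its spec into obj, as the
	// existing implementation does today ...
	return obj, nil
}

// generateTemplate stays as a thin wrapper for backward compatibility until
// the remaining corev1.ObjectReference uses are removed.
func generateTemplate(ref *corev1.ObjectReference) (*unstructured.Unstructured, error) {
	return generateTemplateFromParams(TemplateParams{
		APIVersion: ref.APIVersion,
		Kind:       ref.Kind,
		Name:       ref.Name,
		Namespace:  ref.Namespace,
	})
}
```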
If no value for remediationTemplate is defined for the MachineHealthCheck CR, the existing condition-based deletion flow is preserved.

If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow passed to an External Remediation Controller (ERC) watching for that CR.
I'm wondering if the name for the instantiated remediation resource should be generated, to avoid potential issues if there is a previous resource that hadn't been cleaned up properly.
Or maybe the potential for a name collision is actually a good thing and an indication that we shouldn't be creating a new resource, since there is likely a remediation operation already underway...
As I see it, if the CR exists, it means that the Machine is unhealthy.
If the machine is healthy, it's the MHC's responsibility to delete that CR.

> Or maybe the potential for a name collision is actually a good thing and an indication that we shouldn't be creating a new resource, since there is likely a remediation operation already underway...

Exactly.
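For clarity, a small sketch of the MHC-side flow being described (controller-runtime client assumed; function and parameter names are illustrative, not proposal text): the remediation CR is created with the Machine's name/namespace when the Machine is unhealthy, an existing CR is treated as remediation already underway, and the CR is deleted once the Node is healthy again.

```go
package mhc

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureExternalRemediation is an illustrative helper: rem is assumed to be an
// object already instantiated from the remediation template.
func ensureExternalRemediation(ctx context.Context, c client.Client, machineName, machineNamespace string, rem *unstructured.Unstructured, nodeHealthy bool) error {
	rem.SetName(machineName)           // same name as the target Machine
	rem.SetNamespace(machineNamespace) // same namespace as the target Machine

	if nodeHealthy {
		// Node recovered: MHC locates and deletes the remediation CR.
		if err := c.Delete(ctx, rem); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
		return nil
	}

	// Node unhealthy: create the CR for the ERC to act on. An AlreadyExists
	// error means a remediation is already underway, so it is not treated as
	// a failure.
	if err := c.Create(ctx, rem); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```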
#### Story 1
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover from transient errors faster and begin application recovery sooner.

#### Story 2
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect hardware issues faster.
Can you elaborate on this user story a bit? I'm not quite sure how power cycling results in being able to detect hardware issues faster.
A machine can go unhealthy due to software or hardware problems.
If you automatically power-cycle the host, it saves some time for an admin who would do exactly that to see whether it was a temporary failure or something consistent.
But even if it's consistent, it can still be a software issue, so maybe this needs some clarification.
@beekhof - was this the intention here?
Discussed this with beekhof.
The intention was what I wrote in the previous comment - to eliminate transient issues, whether caused by software or hardware.
I'll rephrase to reflect that.
Thanks.
#### Story 2
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect hardware issues faster.

#### Story 3
As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes, so that they are automatically added back to the cluster when I fix the underlying problem.
Can you provide an example of when this would be intended behavior? I worry a bit that trying to solve this for the generic case could potentially result in less efficient remediation of at least a subset of problems than if the default remediation processes were followed.
For example, if the underlying server that is being used has a hardware fault preventing bootstrapping from being completed, why would we want to continually restart until a technician can fix the server rather than trying to provision on a different server?
> Can you provide an example of when this would be intended behavior?

Routing issues, for example, that are external to the server itself but prevent it from reaching the cluster network.

> why would we want to continually restart until a technician can fix the server rather than trying to provision on a different server?

That's a good point, and I think that in one of the discussions it was suggested to have some kind of max reboot attempts, but I think this is up to the external remediation controller to decide.
Is there any documentation on the other controllers that watch that condition? I'll be happy to read it and relate to it in the proposal.
I can add a link from the existing MHC doc to this one, or merge them together - whatever works for you.
@n1r1 take a look at the current state of the MHC proposal - https://github.com/kubernetes-sigs/cluster-api/blob/95fe9e2c2c48cb7c765e40fe97861f22765441ff/docs/proposals/20191030-machine-health-checking.md - it talks about OwnerRemediated. My preference would be for you to update the existing MHC doc instead of creating a new one. WDYT @CecileRobertMichon @vincepri @benmoss @detiber @JoelSpeed?
+1, I think it would be good to update the MHC doc rather than having content related to remediation spread across multiple separate design docs.
Sounds good to me as well.
Agreed, let's merge them together.
Doing a final review today; from a quick scan it looks pretty straightforward.
/milestone v0.3.9
Hey @vincepri, did you have a chance to review it eventually? Thanks.
@n1r1 sorry for the delay! /lgtm
Squash commits? We should be ready to merge today.
…iation for unhealthy nodes backed by machines. Signed-off-by: Nir <niry@redhat.com>
(branch updated from b49a143 to 11485f4)
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test pull-cluster-api-e2e
What this PR does / why we need it:
The Cluster API includes an optional MachineHealthCheck controller component that implements automated health checking, but it doesn't offer any remediation other than replacing the underlying infrastructure.
Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines, so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.
Other environments and vendors also have specific remediation requirements (KCP, for example), so there is a need to provide a generic mechanism for implementing custom remediation logic.
Which issue(s) this PR fixes
Fixes #2846