
📖 External Remediation Proposal #3190

Merged
merged 1 commit on Aug 24, 2020

Conversation

n1r1
Contributor

@n1r1 n1r1 commented Jun 15, 2020

What this PR does / why we need it:
The Cluster API includes an optional Machine Healthcheck Controller component that implements automated health checking; however, it doesn't offer any remediation other than replacing the underlying infrastructure.

Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines, so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.

Other environments and components, such as KCP (KubeadmControlPlane), also have specific remediation requirements, so there is a need for a generic mechanism for implementing custom remediation logic.

Which issue(s) this PR fixes
Fixes #2846

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 15, 2020
@k8s-ci-robot
Contributor

Welcome @n1r1!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @n1r1. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 15, 2020
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 15, 2020
@n1r1
Contributor Author

n1r1 commented Jun 15, 2020

/check-cla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 15, 2020
@n1r1 n1r1 changed the title 📖 External Remediation Proposal 📖 External Remediation Proposal Jun 15, 2020
@neolit123
Member

/kind feature
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 15, 2020
@ncdc
Contributor

ncdc commented Jun 16, 2020

Hi @benmoss, we have spoken at length with the authors of this proposal, and my recollection was that everyone in those sessions agreed to the following plans:

  1. Modify MachineHealthCheck to use conditions instead of directly remediating; update owning controllers (MachineSet, KubeadmControlPlane) to implement remediation based on the presence of those conditions
  2. Add a proposal that adds optional external remediation

This proposal represents the 2nd item. I don't believe the proposal is specific to rebooting, although it does mention it a couple of times.

We do want to offer flexibility to our users for various aspects of the system, and this is one area where we can do so.
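As a rough illustration of item 1 above (conditions instead of direct remediation), an owning controller could key off a condition set on the Machine by the health check. The types and the condition name below are simplified stand-ins for illustration, not the actual Cluster API definitions:

package main

import "fmt"

// Simplified stand-ins for the real Cluster API types; names are illustrative only.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

type Machine struct {
	Name       string
	Conditions []Condition
}

// needsOwnerRemediation reports whether a health check has marked the Machine
// as unhealthy (condition set to "False"), signalling the owning controller
// (e.g. MachineSet or KubeadmControlPlane) to remediate it.
func needsOwnerRemediation(m Machine) bool {
	for _, c := range m.Conditions {
		if c.Type == "OwnerRemediated" && c.Status == "False" {
			return true
		}
	}
	return false
}

func main() {
	m := Machine{
		Name:       "worker-0",
		Conditions: []Condition{{Type: "OwnerRemediated", Status: "False"}},
	}
	if needsOwnerRemediation(m) {
		fmt.Printf("machine %s marked for owner remediation\n", m.Name)
	}
}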

@benmoss benmoss left a comment

A bunch of nits but overall looks great, happy we're moving forward on this

@bboreham
Contributor

Could this make use of the 'Node Maintenance' proposal?

@n1r1
Contributor Author

n1r1 commented Jun 21, 2020

@bboreham, once the Node Maintenance Lease is implemented we can use it in the following cases:

  1. If a lease exists on the node, it should be excluded from MHC until the lease expires
  2. If a lease exists on the node, remediation should be kept on hold until the lease expires
  3. The remediation process (ERC) should obtain a lease before taking any remediation actions and release that lease after it's done

Any other uses you had in mind?
In any case, I think we should leave this out for now, as the proposal is valid without that lease, which is currently not implemented anyway.

Member

@fabriziopandini fabriziopandini left a comment

@n1r1 thanks for working on this proposal, really appreciated!
I like the project embracing new use cases; with this proposal and the lifecycle hooks proposal, I'm pretty sure we are offering CAPI users a set of powerful new extension points.

Overall LGTM, with a small nit/suggestion about the field name that can be addressed or revisited later in the implementation phase (not blocking now).

...

// +optional
RemediationTemplate ObjectReference `json:"remediationTemplate,omitempty"`
Member

nit:
Go doc comment is missing (it will help clarify the intent of this field).
Also, what about adding a prefix like "external" or "custom" in order to make it more explicit that this field is optional and to suggest that a default behavior applies if the value is empty?

Contributor

"ExternalRemediationTemplate" could make sense.

Contributor

This will need to be a pointer

Member

This should also likely be a custom type scoped to the exact type of information needed, rather than corev1.ObjectReference.

Thinking aloud, I suspect the following information would be needed:
apigroup
kind
name

We should also likely avoid storing the version for the referenced type; otherwise we potentially need to worry about migration strategies for these loosely coupled types.

However, if we don't encode the version, instantiating the template would require a way to look up (through discovery or some other means) the correct version of the resource to create.

Contributor Author

"ExternalRemediationTemplate" could make sense.

👍

This will need to be a pointer

👍

I suspect the following information would be needed:
apigroup
kind
name

Isn't that what we have in the MHC CR under externalRemediationTemplate?
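Pulling this thread's suggestions together (an "External" prefix, a pointer, and a narrowly scoped reference type), a hypothetical sketch of the field might look like the following; the type and field names here are illustrative only, not necessarily what was eventually merged:

// ExternalRemediationTemplateRef is a hypothetical, narrowly scoped reference
// type used instead of corev1.ObjectReference; whether to also carry an API
// version is the open question discussed above.
type ExternalRemediationTemplateRef struct {
	// APIGroup of the template resource, e.g. "infrastructure.cluster.x-k8s.io".
	APIGroup string `json:"apiGroup"`
	// Kind of the template resource to instantiate.
	Kind string `json:"kind"`
	// Name of the template, assumed to live in the MachineHealthCheck's namespace.
	Name string `json:"name"`
}

// Excerpt of MachineHealthCheckSpec (other fields omitted).
type MachineHealthCheckSpec struct {
	// ExternalRemediationTemplate, when set, switches the MachineHealthCheck to
	// the external remediation flow; when nil, the default flow applies.
	// A pointer, per the review feedback above.
	// +optional
	ExternalRemediationTemplate *ExternalRemediationTemplateRef `json:"externalRemediationTemplate,omitempty"`
}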


If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow passed to an External Remediation Controller (ERC) watching for that CR.

No further action (deletion or applying conditions) will be taken by the MachineHealthCheck controller until the Node becomes healthy, when it will locate and delete the instantiated MachineRemediation CR.
Contributor

@jan-est jan-est Jun 30, 2020

We are currently working on the Remediation Controller proposal in CAPM3. This is something we would like to see in place when the external remediation proposal is approved. Is anyone against the MHC deleting the MachineRemediation CR when the Node becomes healthy?

Contributor

MHC deleting MachineRemediation 👍

Member

I think the concept here should be elaborated on a bit more... Some questions that I have:

Is the idea that MHC continues to perform health checks, but does not attempt to mark the Machine for the "normal" remediation paths?

Does MachineHealthCheck wait for the machine remediation CR to report that it's finished prior to allowing the deletion of the machine remediation CR? If not, I suspect we need to potentially worry about race conditions.

What should be done if the machine remediation controller has performed its actions and the MHC is still failing for the Machine? Do we have a way to signal that and fall back to the default workflow, or do we require the machine remediation controller to somehow handle that?

Contributor Author

@n1r1 n1r1 Jul 29, 2020

Is the idea that MHC continues to perform health checks, just not attempts to mark the Machine for the "normal" remediation paths?

correct

Does MachineHealthCheck wait for the machine remediation CR to report that it's finished prior to allowing for the deletion of the machine remediation CR? If not, I suspect we need to potentially worry about race conditions.

MHC doesn't wait for anything from the ERC or the EMR CR. Can you provide examples of race conditions that may occur? We thought the ERC could add a finalizer to the machine remediation CR if it needs to, which might help avoid race conditions.
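For illustration, a minimal sketch of the create/delete behaviour described in this thread, with an in-memory map standing in for the cluster and no claim to match the real controller code:

package main

import "fmt"

// remediationCRs simulates the cluster's store of instantiated remediation CRs,
// keyed by namespace/name. Purely illustrative; the real flow would go through
// the API server.
var remediationCRs = map[string]bool{}

// reconcileExternalRemediation sketches the flow described above: when the
// target is unhealthy, instantiate the remediation CR (same name and namespace
// as the Machine) if it does not already exist; when the Node is healthy again,
// the MachineHealthCheck deletes it. An existing CR while unhealthy means
// remediation is already underway, so nothing is created.
func reconcileExternalRemediation(namespace, machineName string, healthy bool) {
	key := namespace + "/" + machineName
	switch {
	case !healthy && !remediationCRs[key]:
		remediationCRs[key] = true
		fmt.Println("created remediation CR", key)
	case healthy && remediationCRs[key]:
		delete(remediationCRs, key)
		fmt.Println("deleted remediation CR", key)
	}
}

func main() {
	reconcileExternalRemediation("default", "worker-0", false) // creates the CR
	reconcileExternalRemediation("default", "worker-0", false) // no-op: remediation already underway
	reconcileExternalRemediation("default", "worker-0", true)  // Node healthy again: MHC deletes the CR
}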

@jan-est
Contributor

jan-est commented Jul 14, 2020

Hi, what is the status of this proposal? Could it be merged soon? And how should we proceed with the implementation after it merges?

@ncdc
Contributor

ncdc commented Jul 14, 2020

@jan-est hi, apologies for not getting back to this in a more timely manner. A lot of us have been heads down on other things. I hope some of us can free up in the next few days to give this a proper review. Thanks for your continued patience.

@vincepri
Member

I'll take a look early next week. Apologies again for the delay; I haven't had time to go over it yet.


We introduce a generic mechanism for supporting externally provided custom remediation strategies.

We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.
Member

nit: we should avoid any new additions of corev1.ObjectReference to align with future goals of removing the current uses: #2318

Suggested change
We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.
We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, a reference to a provider-specific template CRD.

Contributor Author

Hmm, I believe that this relies on generateTemplate(..) which receives an ObjectReference.
Can we use generateTemplate() without having an ObjectReference?

Member

We could likely refactor the core logic of generateTemplate into a separate function that takes individual parameters (or an input struct) rather than an ObjectReference. Net new code would call that function directly, while generateTemplate() remains for backward compatibility until we've removed the existing uses of ObjectReference.
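As a sketch of the suggested refactor; the existing generateTemplate signature is not shown in this thread, so the names and parameters below are assumptions:

// generateTemplateInput carries the parameters the template-cloning logic needs,
// so callers no longer have to build a corev1.ObjectReference. The field and
// function names here are assumptions for illustration.
type generateTemplateInput struct {
	APIGroup  string
	Kind      string
	Name      string
	Namespace string
}

// generateFromTemplate would hold the core cloning logic; net new code calls it
// directly, while the existing generateTemplate() wrapper keeps building the
// input from its ObjectReference argument for backward compatibility.
func generateFromTemplate(in generateTemplateInput) error {
	// ...clone the referenced template and create the resulting object...
	_ = in
	return nil
}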


If no value for remediationTemplate is defined for the MachineHealthCheck CR, the existing condition-based deletion flow is preserved.

If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow passed to an External Remediation Controller (ERC) watching for that CR.
Member

I'm wondering if the name for the instantiated remediation resource should be generated, to avoid potential issues if a previous resource exists that wasn't cleaned up properly.

Or maybe the potential for a name collision is actually a good thing and an indication that we shouldn't be creating a new resource, since there is likely a remediation operation already underway...

Contributor Author

As I see it, if it exists, it means that the Machine is unhealthy.
If the machine is healthy, it's the MHC's responsibility to delete that CR.

Or maybe the potential for a name collision is actually a good thing and an indication that we shouldn't be creating a new resource, since there is likely a remediation operation already underway...

exactly



#### Story 1
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover from transient errors faster and begin application recovery sooner.
#### Story 2
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect hardware issues faster.
Member

Can you elaborate on this user story a bit? I'm not quite sure how power cycling results in being able to detect hardware issues faster.

Contributor Author

A machine can go unhealthy due to software or hardware problems.
If you automatically power-cycle the host, it saves time for an admin who would do exactly that to see whether it was a temporary failure or something persistent.

But even if the failure is persistent, it can still be a software issue, so maybe this needs some clarification.
@beekhof - was this the intention here?

Contributor Author

Discussed this with beekhof.
The intention was what I wrote in the previous comment: to eliminate transient issues, whether caused by software or hardware.
I'll rephrase to reflect that.
Thanks.

#### Story 2
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect hardware issues faster.
#### Story 3
As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes, so that they are automatically added back to the cluster when I fix the underlying problem.
Member

Can you provide an example of when this would be intended behavior? I worry a bit that trying to solve this for the generic case could result in less efficient remediation of at least a subset of problems than if the default remediation processes were followed.

For example, if the underlying server that is being used has a hardware fault preventing bootstrapping from being completed, why would we want to continually restart until a technician can fix the server rather than trying to provision on a different server?

Contributor Author

Can you provide an example of when this would be intended behavior?

Routing issues, for example, that are external to the server itself but prevent it from reaching the cluster network.

why would we want to continually restart until a technician can fix the server rather than trying to provision on a different server?

That's a good point. I think in one of the discussions it was suggested to have some kind of maximum number of reboot attempts, but I think this is up to the external remediation controller to decide.


@n1r1
Contributor Author

n1r1 commented Jul 30, 2020

@ncdc

We probably should add some details around how the controllers that currently operate on the OwnerRemediated condition may need to change to take into account both owner and external remediation. For example, the MachineSet controller checks to see if OwnerRemediated == false, and if so, it knows to perform owner remediation.

Is there any documentation on the other controllers that watch that condition? I'll be happy to read it and address this in the proposal.

Also, I'm wondering if instead of creating a separate file for this, we should update the existing doc (https://github.com/kubernetes-sigs/cluster-api/blob/49f88d511f9584cb37bd38522a869ceaea9f2242/docs/proposals/20191030-machine-health-checking.md)?

I can add a link from the existing MHC doc to this one, or merge them together. Whatever works for you.

@ncdc
Contributor

ncdc commented Jul 30, 2020

@n1r1 take a look at the current state of the MHC proposal - https://github.com/kubernetes-sigs/cluster-api/blob/95fe9e2c2c48cb7c765e40fe97861f22765441ff/docs/proposals/20191030-machine-health-checking.md - it talks about OwnerRemediated.

My preference would be for you to update the existing MHC doc instead of creating a new one. WDYT @CecileRobertMichon @vincepri @benmoss @detiber @JoelSpeed?

@detiber
Member

detiber commented Jul 30, 2020

My preference would be for you to update the existing MHC doc instead of creating a new one.

+1, I think it would be good to update the MHC doc rather than having content related to remediation spread across multiple separate design docs.

@vincepri
Member

Sounds good to me as well

@JoelSpeed
Contributor

Agreed, let's merge them together

@n1r1 n1r1 requested a review from ncdc August 6, 2020 10:10
@vincepri
Member

Doing a final review today; from a quick scan it looks pretty straightforward.

@vincepri
Member

/milestone v0.3.9

@k8s-ci-robot k8s-ci-robot added this to the v0.3.9 milestone Aug 20, 2020
@n1r1
Contributor Author

n1r1 commented Aug 24, 2020

Hey @vincepri, did you eventually get a chance to review it?

Thanks!

@ncdc
Contributor

ncdc commented Aug 24, 2020

@n1r1 sorry for the delay!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2020
@vincepri
Member

Squash commits? We should be ready to merge today

…iation for unhealthy nodes backed by machines.

Signed-off-by: Nir <niry@redhat.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2020
@n1r1
Contributor Author

n1r1 commented Aug 24, 2020

Thanks @vincepri and @ncdc.

Squashed.

@ncdc
Contributor

ncdc commented Aug 24, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2020
@vincepri
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2020
@vincepri
Member

/test pull-cluster-api-e2e
