
📖 External Remediation Proposal #3190

Merged
merged 1 commit on Aug 24, 2020

Conversation

n1r1
Contributor

@n1r1 n1r1 commented Jun 15, 2020

What this PR does / why we need it:
The Cluster API includes an optional Machine Healthcheck Controller component that implements automated health checking; however, it doesn't offer any remediation other than replacing the underlying infrastructure.

Environments consisting of hardware-based clusters are significantly slower to (re)provision unhealthy machines, so they need a remediation flow that includes at least one attempt at power-cycling unhealthy nodes.

Other environments and components, such as KCP (KubeadmControlPlane), also have specific remediation requirements, so there is a need for a generic mechanism for implementing custom remediation logic.

Which issue(s) this PR fixes
Fixes #2846

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 15, 2020
@k8s-ci-robot
Contributor

Welcome @n1r1!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @n1r1. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 15, 2020
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 15, 2020
@n1r1
Contributor Author

n1r1 commented Jun 15, 2020

/check-cla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 15, 2020
@n1r1 n1r1 changed the title 📖 External Remediation Proposal 📖 External Remediation Proposal Jun 15, 2020
@neolit123
Member

/kind feature
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 15, 2020
@ncdc
Contributor

ncdc commented Jun 16, 2020

Hi @benmoss, we have spoken at length with the authors of this proposal, and my recollection was that everyone in those sessions agreed to the following plans:

  1. Modify MachineHealthCheck to use conditions instead of directly remediating; update owning controllers (MachineSet, KubeadmControlPlane) to implement remediation based on the presence of those conditions
  2. Add a proposal that adds optional external remediation

This proposal represents the 2nd item. I don't believe the proposal is specific to rebooting, although it does mention it a couple of times.

We do want to offer flexibility to our users for various aspects of the system, and this is one area where we can do so.
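As a rough illustration of item 1 above (conditions instead of direct remediation), an owning controller could key off a condition set on the Machine by the health check. The types and the condition name below are simplified stand-ins for illustration, not the actual Cluster API definitions:

package main

import "fmt"

// Simplified stand-ins for the real Cluster API types; names are illustrative only.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

type Machine struct {
	Name       string
	Conditions []Condition
}

// needsOwnerRemediation reports whether a health check has marked the Machine
// as unhealthy (condition set to "False"), signalling the owning controller
// (e.g. MachineSet or KubeadmControlPlane) to remediate it.
func needsOwnerRemediation(m Machine) bool {
	for _, c := range m.Conditions {
		if c.Type == "OwnerRemediated" && c.Status == "False" {
			return true
		}
	}
	return false
}

func main() {
	m := Machine{
		Name:       "worker-0",
		Conditions: []Condition{{Type: "OwnerRemediated", Status: "False"}},
	}
	if needsOwnerRemediation(m) {
		fmt.Printf("machine %s marked for owner remediation\n", m.Name)
	}
}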

@benmoss benmoss left a comment

A bunch of nits but overall looks great, happy we're moving forward on this

@bboreham
Contributor

Could this make use of the 'Node Maintenance' proposal?

@n1r1
Contributor Author

n1r1 commented Jun 21, 2020

@bboreham, once the Node Maintenance Lease is implemented we can use it in the following cases:

  1. If a lease exists on the node, it should be excluded from MHC until the lease expires
  2. If a lease exists on the node, remediation should be kept on hold until the lease expires
  3. The remediation process (ERC) should obtain a lease before taking any remediation actions and release that lease after it's done

Any other uses you had in mind?
In any case, I think we should leave this out for now, as the proposal is valid without that lease, which is currently not implemented anyway.

Member

@fabriziopandini fabriziopandini left a comment

@n1r1 thanks for working on this proposal, really appreciated!
I like the project embracing new use cases; with this proposal and the lifecycle hooks proposal, I'm pretty sure we are offering CAPI users a set of powerful new extension points.

Overall LGTM, with a small nit/suggestion about the field name that can be addressed or revisited later in the implementation phase (not blocking now).

...

// +optional
RemediationTemplate ObjectReference `json:"remediationTemplate,omitempty"`
Member

nit:
Go doc comment is missing (it will help clarify the intent of this field).
Also, what about adding a prefix like "external" or "custom" in order to make it more explicit that this field is optional and to suggest that a default behavior applies if the value is empty?

Contributor

"ExternalRemediationTemplate" could make sense.

Contributor

This will need to be a pointer

Member

This should also likely be a custom type scoped to the exact type of information needed, rather than corev1.ObjectReference.

Thinking aloud, I suspect the following information would be needed:
apigroup
kind
name

We should also likely avoid storing the version for the referenced type; otherwise we potentially need to worry about migration strategies for these loosely coupled types.

However, if we don't encode the version, instantiating the template would require a way to look up (through discovery or some other means) the correct version of the resource to create.

Contributor Author

"ExternalRemediationTemplate" could make sense.

👍

This will need to be a pointer

👍

I suspect the following information would be needed:
apigroup
kind
name

Isn't that what we have in the MHC CR under externalRemediationTemplate?
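Pulling this thread's suggestions together (an "External" prefix, a pointer, and a narrowly scoped reference type), a hypothetical sketch of the field might look like the following; the type and field names here are illustrative only, not necessarily what was eventually merged:

// ExternalRemediationTemplateRef is a hypothetical, narrowly scoped reference
// type used instead of corev1.ObjectReference; whether to also carry an API
// version is the open question discussed above.
type ExternalRemediationTemplateRef struct {
	// APIGroup of the template resource, e.g. "infrastructure.cluster.x-k8s.io".
	APIGroup string `json:"apiGroup"`
	// Kind of the template resource to instantiate.
	Kind string `json:"kind"`
	// Name of the template, assumed to live in the MachineHealthCheck's namespace.
	Name string `json:"name"`
}

// Excerpt of MachineHealthCheckSpec (other fields omitted).
type MachineHealthCheckSpec struct {
	// ExternalRemediationTemplate, when set, switches the MachineHealthCheck to
	// the external remediation flow; when nil, the default flow applies.
	// A pointer, per the review feedback above.
	// +optional
	ExternalRemediationTemplate *ExternalRemediationTemplateRef `json:"externalRemediationTemplate,omitempty"`
}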


If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow passed to an External Remediation Controller (ERC) watching for that CR.

No further action (deletion or applying conditions) will be taken by the MachineHealthCheck controller until the Node becomes healthy, when it will locate and delete the instantiated MachineRemediation CR.
Contributor

@jan-est jan-est Jun 30, 2020

We are currently working on the Remediation Controller proposal in CAPM3. This is something we would like to see in place when the external remediation proposal is approved. Is anyone against the MHC deleting the MachineRemediation CR when the Node becomes healthy?

Contributor

MHC deleting MachineRemediation 👍

Member

I think the concept here should be elaborated on a bit more... Some questions that I have:

Is the idea that MHC continues to perform health checks, but does not attempt to mark the Machine for the "normal" remediation paths?

Does MachineHealthCheck wait for the machine remediation CR to report that it's finished prior to allowing the deletion of the machine remediation CR? If not, I suspect we need to potentially worry about race conditions.

What should be done if the machine remediation controller has performed its actions and the MHC is still failing for the Machine? Do we have a way to signal that and fall back to the default workflow, or do we require the machine remediation controller to somehow handle that?

Contributor Author

@n1r1 n1r1 Jul 29, 2020

Is the idea that MHC continues to perform health checks, just not attempts to mark the Machine for the "normal" remediation paths?

correct

Does MachineHealthCheck wait for the machine remediation CR to report that it's finished prior to allowing for the deletion of the machine remediation CR? If not, I suspect we need to potentially worry about race conditions.

MHC doesn't wait for anything from the ERC or the EMR CR. Can you provide examples of race conditions that may occur? We thought the ERC could add a finalizer to the machine remediation CR if it needs to, which might help avoid race conditions.
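For illustration, a minimal sketch of the create/delete behaviour described in this thread, with an in-memory map standing in for the cluster and no claim to match the real controller code:

package main

import "fmt"

// remediationCRs simulates the cluster's store of instantiated remediation CRs,
// keyed by namespace/name. Purely illustrative; the real flow would go through
// the API server.
var remediationCRs = map[string]bool{}

// reconcileExternalRemediation sketches the flow described above: when the
// target is unhealthy, instantiate the remediation CR (same name and namespace
// as the Machine) if it does not already exist; when the Node is healthy again,
// the MachineHealthCheck deletes it. An existing CR while unhealthy means
// remediation is already underway, so nothing is created.
func reconcileExternalRemediation(namespace, machineName string, healthy bool) {
	key := namespace + "/" + machineName
	switch {
	case !healthy && !remediationCRs[key]:
		remediationCRs[key] = true
		fmt.Println("created remediation CR", key)
	case healthy && remediationCRs[key]:
		delete(remediationCRs, key)
		fmt.Println("deleted remediation CR", key)
	}
}

func main() {
	reconcileExternalRemediation("default", "worker-0", false) // creates the CR
	reconcileExternalRemediation("default", "worker-0", false) // no-op: remediation already underway
	reconcileExternalRemediation("default", "worker-0", true)  // Node healthy again: MHC deletes the CR
}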

@jan-est
Contributor

jan-est commented Jul 14, 2020

Hi, what is the status of this proposal? Could it be merged soon? And how should we proceed with the implementation after it merges?

@ncdc
Contributor

ncdc commented Jul 14, 2020

@jan-est hi, apologies for not getting back to this in a more timely manner. A lot of us have been heads down on other things. I hope some of us can free up in the next few days to give this a proper review. Thanks for your continued patience.

@vincepri
Member

I'll take a look early next week. Apologies again for the delay; I haven't had time to go over it yet.


We introduce a generic mechanism for supporting externally provided custom remediation strategies.

We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.
Member

nit: we should avoid any new additions of corev1.ObjectReference to align with future goals of removing the current uses: #2318

Suggested change
We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, an ObjectReference to a provider-specific template CRD.
We propose modifying the MachineHealthCheck CRD to support a remediationTemplate, a reference to a provider-specific template CRD.

Contributor Author

Hmm, I believe that this relies on generateTemplate(..) which receives an ObjectReference.
Can we use generateTemplate() without having an ObjectReference?

Member

We could likely refactor the core logic of generateTemplate into a separate function that takes individual parameters (or an input struct) rather than an ObjectReference. Net new code would call that function directly, while generateTemplate() remains for backward compatibility until we've removed the existing uses of ObjectReference.
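As a sketch of the suggested refactor; the existing generateTemplate signature is not shown in this thread, so the names and parameters below are assumptions:

// generateTemplateInput carries the parameters the template-cloning logic needs,
// so callers no longer have to build a corev1.ObjectReference. The field and
// function names here are assumptions for illustration.
type generateTemplateInput struct {
	APIGroup  string
	Kind      string
	Name      string
	Namespace string
}

// generateFromTemplate would hold the core cloning logic; net new code calls it
// directly, while the existing generateTemplate() wrapper keeps building the
// input from its ObjectReference argument for backward compatibility.
func generateFromTemplate(in generateTemplateInput) error {
	// ...clone the referenced template and create the resulting object...
	_ = in
	return nil
}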


If no value for remediationTemplate is defined for the MachineHealthCheck CR, the existing condition-based deletion flow is preserved.

If a value for remediationTemplate is supplied and the Machine enters an unhealthy state, the template will be instantiated using existing CAPI functionality, with the same name and namespace as the target Machine, and the remediation flow passed to an External Remediation Controller (ERC) watching for that CR.
Member

I'm wondering if the name for the instantiated remediation resource should be generated, to avoid potential issues if a previous resource exists that wasn't cleaned up properly.

Or maybe the potential for a name collision is actually a good thing and an indication that we shouldn't be creating a new resource, since there is likely a remediation operation already underway...

Contributor Author

As I see it, if it exists, it means that the Machine is unhealthy.
If the machine is healthy, it's the MHC's responsibility to delete that CR.

Or maybe the potential for a name collision is actually a good thing and an indication that we shouldn't be creating a new resource, since there is likely a remediation operation already underway...

exactly



#### Story 1
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can recover from transient errors faster and begin application recovery sooner.
#### Story 2
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect hardware issues faster.
Member

Can you elaborate on this user story a bit? I'm not quite sure how power cycling results in being able to detect hardware issues faster.

Contributor Author

A machine can go unhealthy due to software or hardware problems.
If you automatically power-cycle the host, it saves time for an admin who would do exactly that to see whether it was a temporary failure or something persistent.

But even if the failure is persistent, it can still be a software issue, so maybe this needs some clarification.
@beekhof - was this the intention here?

Contributor Author

Discussed this with beekhof.
The intention was what I wrote in the previous comment: to eliminate transient issues, whether caused by software or hardware.
I'll rephrase to reflect that.
Thanks.

#### Story 2
As an admin of a hardware based cluster, I would like unhealthy nodes to be power-cycled, so that I can detect hardware issues faster.
#### Story 3
As an admin of a hardware based cluster, I would like the system to keep attempting to power-cycle unhealthy nodes, so that they are automatically added back to the cluster when I fix the underlying problem.
Member

Can you provide an example of when this would be intended behavior? I worry a bit that trying to solve this for the generic case could result in less efficient remediation of at least a subset of problems than if the default remediation processes were followed.

For example, if the underlying server that is being used has a hardware fault preventing bootstrapping from being completed, why would we want to continually restart until a technician can fix the server rather than trying to provision on a different server?

Contributor Author

Can you provide an example of when this would be intended behavior?

Routing issues, for example, that are external to the server itself but prevent it from reaching the cluster network.

why would we want to continually restart until a technician can fix the server rather than trying to provision on a different server?

That's a good point. I think in one of the discussions it was suggested to have some kind of maximum number of reboot attempts, but I think this is up to the external remediation controller to decide.


@n1r1
Contributor Author

n1r1 commented Jul 30, 2020

@ncdc

We probably should add some details around how the controllers that currently operate on the OwnerRemediated condition may need to change to take into account both owner and external remediation. For example, the MachineSet controller checks to see if OwnerRemediated == false, and if so, it knows to perform owner remediation.

Is there any documentation on the other controllers that watch that condition? I'll be happy to read it and address this in the proposal.

Also, I'm wondering if instead of creating a separate file for this, we should update the existing doc (https://github.com/kubernetes-sigs/cluster-api/blob/49f88d511f9584cb37bd38522a869ceaea9f2242/docs/proposals/20191030-machine-health-checking.md)?

I can add a link from the existing MHC doc to this one, or merge them together. Whatever works for you.

@ncdc
Contributor

ncdc commented Jul 30, 2020

@n1r1 take a look at the current state of the MHC proposal - https://github.com/kubernetes-sigs/cluster-api/blob/95fe9e2c2c48cb7c765e40fe97861f22765441ff/docs/proposals/20191030-machine-health-checking.md - it talks about OwnerRemediated.

My preference would be for you to update the existing MHC doc instead of creating a new one. WDYT @CecileRobertMichon @vincepri @benmoss @detiber @JoelSpeed?

@detiber
Member

detiber commented Jul 30, 2020

My preference would be for you to update the existing MHC doc instead of creating a new one.

+1, I think it would be good to update the MHC doc rather than having content related to remediation spread across multiple separate design docs.

@vincepri
Member

Sounds good to me as well

@JoelSpeed
Contributor

Agreed, let's merge them together

@n1r1 n1r1 requested a review from ncdc August 6, 2020 10:10
@vincepri
Member

Doing a final review today; from a quick scan it looks pretty straightforward.

@vincepri
Member

/milestone v0.3.9

@k8s-ci-robot k8s-ci-robot added this to the v0.3.9 milestone Aug 20, 2020
@n1r1
Contributor Author

n1r1 commented Aug 24, 2020

Hey @vincepri, did you eventually get a chance to review it?

Thanks!

@ncdc
Contributor

ncdc commented Aug 24, 2020

@n1r1 sorry for the delay!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2020
@vincepri
Member

Squash commits? We should be ready to merge today

…iation for unhealthy nodes backed by machines.

Signed-off-by: Nir <niry@redhat.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2020
@n1r1
Contributor Author

n1r1 commented Aug 24, 2020

Thanks @vincepri and @ncdc.

Squashed.

@ncdc
Contributor

ncdc commented Aug 24, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2020
@vincepri
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2020
@vincepri
Member

/test pull-cluster-api-e2e
