
Resources are orphaned when MachineDeployment/MachineSet get deleted #8172

Closed
srm09 opened this issue Feb 24, 2023 · 14 comments · Fixed by #8463
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@srm09
Contributor

srm09 commented Feb 24, 2023

What steps did you take and what happened:
This is a special case: if the MachineDeployment and/or MachineSet is deleted before the controllers have a chance to set the finalizer on the child objects (i.e. Machine and VSphereMachine), the resources created by these objects get orphaned, because the objects are deleted immediately due to the absence of a finalizer.

What did you expect to happen:
Consider changing the logic to add a finalizer during resource creation, instead of relying on the controller to set the finalizer during reconciliation of the resource.
For example, a temporary finalizer could be set during resource creation and then replaced by the controller's finalizer during the first reconciliation loop of the object.

Anything else you would like to add:
Part of the problem arises because there is a time lag between the object being created by the MachineSet controller and the Machine controller setting the finalizer on the object. Similar behavior is observed in the VSphereMachine controller as well. This issue was highlighted by a CAPV e2e test orphaning resources.

Environment:

  • Cluster-api version: `v1.3.0`

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 24, 2023
@srm09
Contributor Author

srm09 commented Feb 24, 2023

@sbueringer opened this one for further discussion

@fabriziopandini
Member

@srm09 thanks for reporting this edge case!
/triage accepted

If I'm not mistaken we are already doing something similar for ownerReferences, so I'm not opposed to this given that it makes our system less fragile.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 24, 2023
@sbueringer
Member

+1 as discussed.

We just have to check if we run into problems with server-side apply. I.e. we might end up with co-ownership and then the machine controller might be unable to remove the finalizer.

But this should be easy to test.

@fabriziopandini
Member

/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 21, 2023
@cahillsf
Member

👋 I'd like to take a shot at this one. To make sure I understand the path forward, the idea is to:

  • add a finalizer during the initial creation of the Machine in the machineset_controller
    • this prevents the resource from being deleted prior to the machine_controller picking up the Machine in the Reconcile method
  • update this logic in machine_controller to remove the creation finalizer and add the clusterv1.MachineFinalizer

@killianmuldoon
Contributor

@cahillsf - I think the easiest first step would be to add the clusterv1.MachineFinalizer during Machine creation. Then we need to test and ensure that this finalizer can actually be removed by the machine controller during deletion. If this works there shouldn't be any need to update the logic in the Machine controller.
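
As a rough illustration of that first step (a simplified sketch, not the actual MachineSet controller code; the helper name is hypothetical):

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// createMachine is a hypothetical helper showing the idea: the finalizer is
// set on the Machine before it is created, so the object is never stored
// without it and the garbage collector cannot delete it before the Machine
// controller has had a chance to run its deletion flow.
func createMachine(ctx context.Context, c client.Client, machine *clusterv1.Machine) error {
	// Add the Machine controller's finalizer up front instead of waiting for
	// the Machine controller's first reconcile to add it.
	controllerutil.AddFinalizer(machine, clusterv1.MachineFinalizer)
	return c.Create(ctx, machine)
}
```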

@cahillsf
Member

cahillsf commented Apr 3, 2023

/assign @cahillsf

@enxebre
Member

enxebre commented Jun 28, 2023

@srm09 can you help me understand the original issue? I'd expect each controller to block reconciliation until they set their finaliser or let it be a bug otherwise

@sbueringer
Member

sbueringer commented Jun 29, 2023

@srm09 can you help me understand the original issue? I'd expect each controller to block reconciliation until they set their finaliser or let it be a bug otherwise

As far as I'm aware the usual pattern is that each controller sets the finalizer during normal reconcile. And then as soon as an object has been deleted (e.g. by the user) the controller does some sort of cleanup before it removes its finalizer.
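
Roughly, that pattern looks like this (a simplified sketch with illustrative names, not the actual Machine controller code):

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// MachineReconciler is a simplified stand-in for the real Machine controller.
type MachineReconciler struct {
	Client client.Client
}

func (r *MachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	machine := &clusterv1.Machine{}
	if err := r.Client.Get(ctx, req.NamespacedName, machine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Deletion flow: clean up external resources, then remove the finalizer
	// so the object can actually go away.
	if !machine.DeletionTimestamp.IsZero() {
		if controllerutil.ContainsFinalizer(machine, clusterv1.MachineFinalizer) {
			// ... cleanup (reconcileDelete) ...
			controllerutil.RemoveFinalizer(machine, clusterv1.MachineFinalizer)
			if err := r.Client.Update(ctx, machine); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	// Normal flow: ensure the finalizer is present before doing anything else.
	if controllerutil.AddFinalizer(machine, clusterv1.MachineFinalizer) {
		if err := r.Client.Update(ctx, machine); err != nil {
			return ctrl.Result{}, err
		}
	}

	// ... normal reconciliation ...
	return ctrl.Result{}, nil
}
```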

Are you suggesting to instead block the deletion call in the deletion webhook? (in this case the Machine deletion webhook)

I think we end up with better UX if, for the limited cases where we run into problems, we add the finalizer already at object creation. This avoids unnecessary retries during deletion by "whoever wants to delete the object" (Kubernetes garbage collector, users, other controllers).

Some more context. Let's say we're running at a high scale. In those scenarios it could take minutes for the Machine controller to set the finalizer if the queue of the Machine controller is overloaded. If in the meantime the MachineSet goes away (either through MD deletion or just a simple rollout of the MD) the Kubernetes garbage collector will trigger the Machine deletion. Because the Machine controller never had a chance to set the finalizer, reconcileDelete is never run. This means we could orphan resources because we never get a chance to run through the proper deletion flow.

I think the current proposed solution is the best approach.

@enxebre
Member

enxebre commented Jun 29, 2023

As far as I'm aware the usual pattern is that each controller sets the finalizer during normal reconcile.

Right that is what I was suggesting as well.

Because the Machine controller never had a chance to set the finalizer, reconcileDelete is never run. This means we could orphan resources because we never get a chance to run through the proper deletion flow.

I was asking for some original context because I'm missing how by following the above pattern we could orphan things. i.e. if the Machine controller didn't set the finalizer on a Machine yet then it didn't have the chance to create anything that can be orphaned (It can only create things after setting the finaliser).

I think the current proposed solution is the best approach.

I agree, I just can't see what the particular issue was in the first place that got us to look into this.

@sbueringer
Member

Ah, got it!

@sbueringer
Member

I was asking for some original context because I'm missing how by following the above pattern we could orphan things. i.e. if the Machine controller didn't set the finalizer on a Machine yet then it didn't have the chance to create anything that can be orphaned (It can only create things after setting the finaliser).

The problem in this specific case is that the MachineSet controller creates the Machine, InfraMachine and BootstrapConfig.

@sbueringer
Member

To summarize what I wrote in the related Slack discussion (https://kubernetes.slack.com/archives/C8TSNPY4T/p1688058691112519):

The normal creation flow looks like this:

  • MachineSet is created
  • Machine, InfraMachine, and BootstrapConfig are created
  • Machine controller: First reconcile => adds finalizer on Machine
  • InfraMachine controller: First reconcile => waits for ownerRef
  • Machine controller: Next reconcile => sets ownerRef on InfraMachine & BootstrapConfig
  • InfraMachine controller: Next reconcile => sets finalizer synchronously before creating infra, then creates infra

This way finalizers are always on the objects before actual infrastructure is created and it doesn't matter when the MachineSet is deleted.

@srm09 As far as I can see, the VSphereMachine controller is not adding the finalizer synchronously. It just adds it in memory and patches the VSphereMachine via defer, while, as far as I can tell, creating infra in the same reconcile. But that would be (just) a bug in CAPV.
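
For comparison, the synchronous variant from the flow above could look roughly like this (the finalizer constant and helper name are hypothetical, not the actual CAPV code):

```go
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// infraMachineFinalizer stands in for the provider's real finalizer constant.
const infraMachineFinalizer = "example.infrastructure.cluster.x-k8s.io"

// reconcileNormal sketches the ordering described above: the finalizer is
// persisted to the API server before any infrastructure is created, instead
// of only being added in memory and patched via defer at the end of Reconcile.
func reconcileNormal(ctx context.Context, c client.Client, infraMachine client.Object) error {
	if controllerutil.AddFinalizer(infraMachine, infraMachineFinalizer) {
		// Persist the finalizer synchronously; if this fails, return before
		// creating anything external that could otherwise be orphaned.
		if err := c.Update(ctx, infraMachine); err != nil {
			return err
		}
	}

	// Only after the finalizer is stored do we create external infrastructure
	// (VMs, disks, ...), so deletion always runs through reconcileDelete.
	// ... provider-specific creation logic would go here ...
	return nil
}
```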

That being said, I tested the current PR and went through it with the debugger. It definitely works as expected.
It guarantees that we run through reconcileDelete for every MachineSet Machine. So we could consider merging it just to make it 100% robust, even though we can't point to a specific bug that makes it necessary. In any case it would simplify the reasoning about why deletion is safe in every case.

The result of the discussion was that we agreed to move forward with this change. If we observe any unexpected behavior, we now have a record of the reasoning behind this change and can consider a revert in the worst case.

Thx @enxebre for raising this point!
