
Does Karmada support removal of dangling resources? #5071

Open
mszacillo opened this issue Jun 20, 2024 · 3 comments
Labels
kind/question Indicates an issue that is a support question.

Comments

@mszacillo
Contributor

Hello!

Our general question is: Does Karmada support a way to remove dangling resources in the event of a cluster failure?

Please provide an in-depth description of the question you have:

We have been running cluster failover tests for our Flink applications and have run into a few interesting scenarios. One instance in particular involves a network partition between the Karmada control plane and the control plane of one of the member clusters. Here is an example scenario:

  1. A FlinkDeployment (and other secrets that are required) get deployed to Cluster X.

  2. We shut off the nodes for the control plane of Cluster X to simulate a network partition. Eventually Karmada will taint Cluster X as NoExecute, and attempt to reschedule the application elsewhere.

  3. The FlinkDeployment gets rescheduled to Cluster Y. The previously scheduled FlinkDeployment on Cluster X continues to run.

  4. We turn on the nodes for Cluster X once again. Karmada is able to reconnect to the cluster, and we end up with a dangling FlinkDeployment on Cluster X, since the ResourceBinding now points to Cluster Y.

Is there a way we can have Karmada reconcile these types of dangling resources and remove them from the cluster that has recovered? For example, even if the ResourceBinding only points to 1 cluster, there are still multiple Works scheduled across multiple clusters. I would assume that Karmada should be able to detect that the number of Works does not match the number of scheduled replicas, and attempt to remove the dangling Work.

Environment:

  • Karmada version: v1.9.0
  • Kubernetes version: v1.29
@mszacillo mszacillo added the kind/question Indicates an issue that is a support question. label Jun 20, 2024
@RainbowMango
Member

RainbowMango commented Jun 21, 2024

Yes, Karmada already supports this scenario now.

After the application (FlinkDeployment) is rescheduled from cluster X to cluster Y, cluster X will be removed from the relevant ResourceBinding. Something like:

apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
spec:
  clusters:
  - name: cluster-Y  # only cluster-Y will be present
    replicas: 1

When the ResourceBinding controller syncs the latest ResourceBinding, it will find and remove the dangling Work resources (called orphan Works in the code) targeted at the legacy cluster-X.

Note that the ResourceBinding controller only triggers the deletion of those orphan Works (by setting a non-nil deletionTimestamp); the Works will not actually be removed until the network recovers.
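The reconciliation described here boils down to a set difference between the clusters listed in the ResourceBinding's scheduling result and the clusters that currently hold a Work. The sketch below is a simplified illustration of that idea, not Karmada's actual controller code; the function and variable names are made up.

```go
package main

import "fmt"

// findOrphanClusters returns the clusters that still hold a Work for the
// workload but are no longer listed in the ResourceBinding's scheduling
// result. Works targeted at these clusters are the "orphan Works" the
// controller marks for deletion.
func findOrphanClusters(scheduled, withWork []string) []string {
	keep := make(map[string]bool, len(scheduled))
	for _, c := range scheduled {
		keep[c] = true
	}
	var orphans []string
	for _, c := range withWork {
		if !keep[c] {
			orphans = append(orphans, c)
		}
	}
	return orphans
}

func main() {
	// After the failover, the ResourceBinding points only at cluster-Y,
	// but Works still exist for both clusters.
	scheduled := []string{"cluster-Y"}
	withWork := []string{"cluster-X", "cluster-Y"}
	fmt.Println(findOrphanClusters(scheduled, withWork)) // [cluster-X]
}
```

In the real controller the orphan Works are only marked with a deletionTimestamp; the finalizer that actually removes the workload from cluster-X cannot complete until the network partition heals.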

@mszacillo
Contributor Author

Ah, thanks for pointing this out! I ran some tests and confirmed this does eventually happen, which is good to know.

Out of curiosity, what happens if a ResourceBinding is deleted from a cluster that is in a bad state? Does the orphan work eventually get removed as well in that case?

@RainbowMango
Member

Out of curiosity, what happens if a ResourceBinding is deleted from a cluster that is in a bad state? Does the orphan work eventually get removed as well in that case?

I guess you mean the behavior when a ResourceBinding is deleted from Karmada.
In that case, all Works propagated by this ResourceBinding would be removed, which in turn eventually removes all the workloads that the ResourceBinding propagated to member clusters.

Note that,

  1. Karmada treats the ResourceBinding as an internal resource; it is not supposed to be removed by any third-party system.
  2. The owner of the ResourceBinding is the resource template. In other words, once the resource template is gone, the ResourceBinding is removed automatically; that's the default behavior for now.
  3. We are still working on another proposal that aims to provide a mechanism to keep the resources in member clusters when the resource template is deleted. See the discussion at Does karmada can prevent removal of these managed resources #4709
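The ownership chain in point 2 can be pictured as a tiny garbage-collection model: the resource template owns the ResourceBinding, and the ResourceBinding owns the Works, so deleting the template cascades all the way down. This is an illustrative toy, not Karmada's or Kubernetes' real garbage collector; the type and object names are made up.

```go
package main

import "fmt"

// object is a toy resource with a single ownerReference.
type object struct {
	name  string
	owner string // "" means no owner (the resource template itself)
}

// cascadeDelete removes root and everything that transitively lists a
// doomed object as its owner, mimicking ownerReference-based GC.
func cascadeDelete(objs []object, root string) []object {
	doomed := map[string]bool{root: true}
	// Sweep repeatedly so transitive owners are caught (template -> binding -> Work).
	for changed := true; changed; {
		changed = false
		for _, o := range objs {
			if !doomed[o.name] && o.owner != "" && doomed[o.owner] {
				doomed[o.name] = true
				changed = true
			}
		}
	}
	var survivors []object
	for _, o := range objs {
		if !doomed[o.name] {
			survivors = append(survivors, o)
		}
	}
	return survivors
}

func main() {
	objs := []object{
		{name: "flinkdeployment"},                           // resource template
		{name: "resourcebinding", owner: "flinkdeployment"}, // owned by the template
		{name: "work-cluster-y", owner: "resourcebinding"},  // owned by the binding
	}
	// Deleting the resource template removes the binding and its Works too.
	fmt.Println(len(cascadeDelete(objs, "flinkdeployment"))) // 0
}
```

Deleting only the ResourceBinding in this model would still cascade to the Works, while the resource template survives, which matches the answer above.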
