Does Karmada support removal of dangling resources? #5071
Comments
Yes, Karmada already supports this scenario. After the application (FlinkDeployment) gets rescheduled from cluster X to cluster Y, cluster X will be removed from the relevant ResourceBinding. Something like:

```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
spec:
  clusters:
    - name: cluster-Y   # only cluster-Y will be present
      replicas: 1
```

When the ResourceBinding controller syncs the latest ResourceBinding, it will find and remove the dangling Work resources (referred to as "orphan works" in the code) that target the legacy cluster-X. Note that the ResourceBinding controller only triggers the deletion of those orphan works (by setting a non-nil deletionTimestamp); the works will not actually be removed until the network recovers.
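For reference, here is a minimal sketch of what such an orphan Work can look like while the member cluster is still unreachable, assuming Karmada's usual `karmada-es-<cluster>` execution-namespace convention; the object and application names below are hypothetical, not taken from this issue:

```yaml
# Illustrative only: an orphan Work left "Terminating" while cluster-x is unreachable.
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: example-flink-app                    # hypothetical Work name
  namespace: karmada-es-cluster-x            # execution namespace for cluster-x (assumed convention)
  deletionTimestamp: "2024-06-20T00:00:00Z"  # set when the binding controller removes the orphan
  finalizers:
    - karmada.io/execution-controller        # holds the Work until the workload is cleaned up on the member cluster
spec:
  workload:
    manifests:
      - apiVersion: flink.apache.org/v1beta1
        kind: FlinkDeployment
        metadata:
          name: example-flink-app            # hypothetical application name
          namespace: default
```

Once connectivity to cluster-x is restored, the component syncing that Work (the execution controller, or the karmada-agent for pull-mode clusters) can remove the FlinkDeployment from the member cluster and drop the finalizer, at which point the Work is finally deleted.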
Ah, thanks for pointing this out! Did some tests and confirmed this does eventually happen, which is good to know. Out of curiosity, what happens if a ResourceBinding is deleted from a cluster that is in a bad state? Does the orphan work eventually get removed as well in that case?
I guess you mean the behavior when a ResourceBinding is deleted while the cluster is in a bad state. Note that,
Hello!
Our general question is:
Does Karmada support a way to remove dangling resources in the event of a cluster failure?
Please provide an in-depth description of the question you have:
We have been running cluster failover tests for our Flink applications and have run into a few interesting scenarios. One in particular involves a network partition between the Karmada control plane and the control plane of one of the member clusters. Here is an example scenario:

1. A FlinkDeployment (and the other secrets it requires) gets deployed to Cluster X.
2. We shut off the nodes for the control plane of Cluster X to simulate a network partition. Eventually, Karmada applies a NoExecute taint to Cluster X and attempts to reschedule the application elsewhere (a sketch of the tainted Cluster object follows this list).
3. The FlinkDeployment gets rescheduled to Cluster Y. The previously scheduled FlinkDeployment on Cluster X continues to run.
4. We turn on the nodes for Cluster X once again. Karmada is able to reconnect to the cluster, and we end up with a dangling FlinkDeployment on Cluster X, since the ResourceBinding now points to Cluster Y.
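Regarding the NoExecute taint in step 2, a minimal sketch of how the Cluster object gets tainted is shown below; the taint key is Karmada's well-known `cluster.karmada.io/unreachable` constant, but the exact key, timing, and conditions depend on the Karmada version and failover configuration, so treat this as an assumption rather than the exact state from our test:

```yaml
# Illustrative only: a member cluster tainted NoExecute after the control-plane partition.
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: cluster-x
spec:
  taints:
    - key: cluster.karmada.io/unreachable    # well-known Karmada taint for unreachable clusters (assumed)
      effect: NoExecute
      timeAdded: "2024-06-20T00:00:00Z"
status:
  conditions:
    - type: Ready
      status: "Unknown"                      # the member cluster's control plane is not responding
```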
Is there a way we can have Karmada reconcile these types of dangling resources and remove them from the cluster that has recovered? For example, even if the ResourceBinding only points to 1 cluster, there are still multiple Works scheduled across multiple clusters. I would assume that Karmada should be able to reconcile that the (# of works) != (# of replicas), and attempt to remove the dangling work.
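To make the mismatch concrete, here is a minimal sketch of the state described above, again assuming the `karmada-es-<cluster>` execution namespaces and hypothetical object names:

```yaml
# Illustrative only: the ResourceBinding targets a single cluster...
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: example-flink-app-flinkdeployment   # hypothetical binding name
  namespace: default
spec:
  clusters:
    - name: cluster-y
      replicas: 1
---
# ...while a Work for the same FlinkDeployment still exists in both execution namespaces.
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: example-flink-app
  namespace: karmada-es-cluster-x           # dangling: no longer referenced by the binding
---
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: example-flink-app
  namespace: karmada-es-cluster-y           # expected: matches the binding's target cluster
```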