
Multiple cluster affinity groups not working as expected #4990

Open
vicaya opened this issue May 28, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@vicaya

vicaya commented May 28, 2024

What happened:
According to https://karmada.io/docs/userguide/scheduling/resource-propagating/#multiple-cluster-affinity-groups ,
there are 2 potential use cases: 1. local bursts to cloud; 2. primary failover to backup. I tested use case 2 with the following policy and a simple httpbin workload:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
 name: failover-test
spec:
 #...
 placement:
  clusterAffinities:
   - affinityName: primary
     clusterNames:
      - c0
   - affinityName: backup
     clusterNames:
      - c1
  #...

  1. Verified that both clusters were ready.
  2. Deployed the workload. It got scheduled to c0 as expected.
  3. Disconnected (paused) cluster c0. Verified that c0 became not ready and the workload got rescheduled to c1. So far so good.
  4. Reconnected (resumed) cluster c0. Verified that c0 became ready and the workload was running on both clusters. But after a while the workload on c0 got deleted, and it kept running on c1.
  5. Disconnected (paused) cluster c1. Verified that c1 became not ready, while c0 was ready. The workload never failed back to c0.

What you expected to happen:

  1. For step 4, I expected the workload to move back to c0, following the order in the spec.
  2. For step 5, I expected the workload to fail back to c0, as it was the only ready cluster.

How to reproduce it (as minimally and precisely as possible):

See the above steps to reproduce the problem. It's as minimal as it gets.

Anything else we need to know?:

Please provide a working example policy for the primary-to-backup failover use case, making sure the workload moves back to the primary from the backup when the primary is ready again.

Environment:

  • Karmada version: 1.9.1 (created by karmadactl init, and then edited controller timeout options to make failover happen faster)
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version): 1.9.1
  • Others:
@vicaya vicaya added the kind/bug Categorizes issue or PR as related to a bug. label May 28, 2024
@dominicqi

I think this should be a bug.

affinityIndex := getAffinityIndex(rb.Spec.Placement.ClusterAffinities, rb.Status.SchedulerObservedAffinityName)

affinityIndex does not always start from zero.
Should the design be that the workload always stays on the backup cluster and only moves to the primary when the backup has a problem, or should it move back to the original cluster as soon as the primary recovers? Or should the user be allowed to choose how to handle it?

@vicaya
Author

vicaya commented May 28, 2024

Should the design be that the workload always stays on the backup cluster and only moves to the primary when the backup has a problem, or should it move back to the original cluster as soon as the primary recovers? Or should the user be allowed to choose how to handle it?

IMO, the primary is the primary for a reason, and the backup is usually intended to hold the workload temporarily until the primary recovers. OTOH, an option such as failbackOnly: true, which would move the workload back to the primary only when the backup fails, might be useful.
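
To make that concrete, here is a purely hypothetical sketch of such an option; failbackOnly is not an existing Karmada field, and the name and placement are invented only to illustrate the suggestion:

 placement:
  clusterAffinities:
   - affinityName: primary
     clusterNames:
      - c0
   - affinityName: backup
     clusterNames:
      - c1
     failbackOnly: true  # hypothetical field: move back to the primary group only when this backup group fails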

@XiShanYongYe-Chang
Member

Hi @vicaya, what you describe is the expected behavior. When scheduling across multiple cluster affinity groups, if the current group is no longer suitable, the next group is enabled, and there is no fallback to an earlier group.

How about trying this:

 placement:
  clusterAffinities:
   - affinityName: primary
     clusterNames:
      - c0
   - affinityName: backup
     clusterNames:
      - c0
      - c1
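
Presumably the idea is that, because the scheduler never falls back to an earlier group, listing c0 in the backup group as well keeps the primary cluster eligible for scheduling after a failover. Folded back into the original policy, the suggestion would look roughly like this:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
 name: failover-test
spec:
 #...
 placement:
  clusterAffinities:
   - affinityName: primary
     clusterNames:
      - c0
   - affinityName: backup
     clusterNames:
      - c0  # keep the primary schedulable even after the backup group is activated
      - c1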

@dominicqi

Hi @XiShanYongYe-Chang,
I understand, but in a real primary/backup setup, how do we ensure the workload eventually returns to the primary cluster? Should we do this by adding and then removing taints?
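
If taints were used, a minimal sketch might be to put a NoExecute taint on the backup cluster and remove it afterwards, assuming taint-based eviction is enabled in the control plane; the key below is chosen purely for illustration:

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
 name: c1
spec:
 #...
 taints:
  - key: drain-back-to-primary  # illustrative key, not a Karmada convention
    effect: NoExecute           # assumes taint-based eviction is enabled so workloads are evicted from c1

Removing the taint once the workload is back on c0 would make c1 schedulable again.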

@XiShanYongYe-Chang
Member

Hi @dominicqi, you can try the rebalance feature. It will be released in v1.10, the day after tomorrow.
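
For anyone landing here later: based on the API proposed in #4840, a rebalancer object for this workload might look roughly like the following (the group/version, field names, and the workload name/namespace are assumptions, so double-check against the v1.10 docs once released):

apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
 name: rebalance-httpbin
spec:
 workloads:
  - apiVersion: apps/v1
    kind: Deployment
    name: httpbin      # assumed to be the httpbin test Deployment from this issue
    namespace: default # assumed namespace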

@vicaya
Author

vicaya commented May 29, 2024

you can try the rebalance feature

Are you talking about #4840? Are you saying that, with the same config as above, the rebalancer will move the workload back to the primary? If ttlSecondsAfterFinished is not specified, will the scheduler keep rebalancing indefinitely? Hopefully we don't have to apply the rebalancer CR separately to get persistent rebalancing.

I also tried using staticWeightList along with maxGroups: 1 to keep all the replicas in the primary cluster via a higher weight (roughly the configuration sketched below), but that doesn't work after failover either. Hopefully the rebalancer will make this work as well, at the expense of some verbosity and clarity.
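
For context, the attempted weighted setup was roughly the following; the weights are illustrative, and the field names are from the Karmada v1.9 placement API as I understand it:

 placement:
  #... clusterAffinities as above
  spreadConstraints:
   - spreadByField: cluster
     maxGroups: 1
     minGroups: 1
  replicaScheduling:
   replicaSchedulingType: Divided
   replicaDivisionPreference: Weighted
   weightPreference:
    staticWeightList:
     - targetCluster:
        clusterNames:
         - c0
       weight: 2  # higher weight so the single selected group prefers the primary
     - targetCluster:
        clusterNames:
         - c1
       weight: 1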

The primary/backup scenario is such a common use case that it would be great if multiple cluster affinity groups worked for it out of the box.
