Add a label/annotation to the resource being rescheduled during failover #4969

Closed
Dyex719 opened this issue May 22, 2024 · 26 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@Dyex719 commented May 22, 2024

What would you like to be added:
We propose adding a label/annotation during failover so that webhooks like Kyverno can perform the necessary checks/changes before the job is rescheduled. We are also open to discussing other ideas and contributing back to the community.

Why is this needed:
Stateful applications may need to read the last saved state to resume processing after failover. This may involve a change in the spec so that the path to read from can be specified.
It would be useful to know when a failover happened so that stateful applications can perform the necessary checks/changes before restarting.

In our particular use case, we are migrating Flink applications using Karmada. The step-by-step process would be:

  1. Karmada deploys the FlinkDeployment CR to a member cluster
  2. The CR fails over due to a cluster failover / application failover
  3. The label/annotation would get added during this process (For example: "failover" : true)
  4. When Karmada reschedules the application to a different member cluster, a webhook like Kyverno could check for this label ("failover": true) and, only if it exists, mutate the spec so that the application restarts from the last state (a rough sketch of this check follows the list)
  5. Resume processing of the stateful application from the last state
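
To make step 4 concrete, here is a minimal sketch (in Go, using unstructured objects) of the kind of check a mutating webhook could perform on the member cluster. This is only an illustration, not Karmada or Kyverno code: the label key propagation.karmada.io/failover is a made-up placeholder, and the spec.job.* field paths assume the flink-kubernetes-operator FlinkDeployment CRD.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// mutateForFailover illustrates step 4: if the (hypothetical) failover label is
// present on the incoming FlinkDeployment, point the job at the last savepoint
// so it resumes from saved state instead of starting fresh.
func mutateForFailover(obj *unstructured.Unstructured, savepointPath string) error {
	if obj.GetLabels()["propagation.karmada.io/failover"] != "true" { // assumed label key
		return nil // first-time scheduling: leave the spec untouched
	}
	if err := unstructured.SetNestedField(obj.Object, savepointPath, "spec", "job", "initialSavepointPath"); err != nil {
		return err
	}
	return unstructured.SetNestedField(obj.Object, "savepoint", "spec", "job", "upgradeMode")
}

func main() {
	deployment := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "flink.apache.org/v1beta1",
		"kind":       "FlinkDeployment",
		"metadata": map[string]interface{}{
			"name":   "example",
			"labels": map[string]interface{}{"propagation.karmada.io/failover": "true"},
		},
		"spec": map[string]interface{}{"job": map[string]interface{}{}},
	}}
	if err := mutateForFailover(deployment, "s3p://checkpoints/last"); err != nil {
		panic(err)
	}
	fmt.Println(deployment.Object["spec"])
}

In practice this decision would live in a Kyverno policy rather than Go code; the sketch only shows the shape of the check.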
@Dyex719 added the kind/feature label on May 22, 2024
@chaunceyjiang (Member)

It seems this issue is very similar to the pause feature we have been discussing before. /cc @XiShanYongYe-Chang

#4688

#4421

@XiShanYongYe-Chang (Member)

Hi @Dyex719, thanks for your feedback.

I have a few questions. Are you referring to labels being automatically added by the system? When Karmada reschedules, how does the system trigger the webhook or Kyverno to run?

@Dyex719 (Author) commented May 27, 2024

Hi @XiShanYongYe-Chang,
Thanks for getting back to us!

Are you referring to the labels being automatically added in by the system?

Yes, Karmada would add a label after failover so that the next time it is rescheduled it would have the label.

how does the system trigger the webhook or Kverno to run?

Kyverno would be deployed on the member clusters and is therefore invoked when the resource is applied to a member cluster. Kyverno would check whether the label exists on the spec and, if it does, read the last state accordingly.

@XiShanYongYe-Chang (Member)

I'm sorry, I still don't understand the whole process.

Hi @chaunceyjiang do you understand the requirement?

@chaunceyjiang (Member)

My understanding is that @Dyex719 wants a label to indicate that the current resource is undergoing failover. This is because he expects this transition state to be recognized by other third-party software.

If I remember correctly, when a resource is transitioning, there will be a GracefulEvictionTasks in the derived ResourceBinding to indicate the tasks currently being transferred.

@XiShanYongYe-Chang (Member)

If I remember correctly, when a resource is transitioning, there will be a GracefulEvictionTasks in the derived ResourceBinding to indicate the tasks currently being transferred.

Yes, the cluster to be removed will be placed here:

// GracefulEvictionTasks holds the eviction tasks that are expected to perform
// the eviction in a graceful way.
// The intended workflow is:
// 1. Once the controller(such as 'taint-manager') decided to evict the resource that
// is referenced by current ResourceBinding or ClusterResourceBinding from a target
// cluster, it removes(or scale down the replicas) the target from Clusters(.spec.Clusters)
// and builds a graceful eviction task.
// 2. The scheduler may perform a re-scheduler and probably select a substitute cluster
// to take over the evicting workload(resource).
// 3. The graceful eviction controller takes care of the graceful eviction tasks and
// performs the final removal after the workload(resource) is available on the substitute
// cluster or exceed the grace termination period(defaults to 10 minutes).
//
// +optional
GracefulEvictionTasks []GracefulEvictionTask `json:"gracefulEvictionTasks,omitempty"`

This should be more specific than a label.

@Dyex719 (Author) commented May 28, 2024

Hi @XiShanYongYe-Chang and @chaunceyjiang,
Thank you for the comments.

As far as I understand, using the GracefulEvictionTasks would not work in some scenarios, for example when the eviction mode is immediate.

To decouple this, maybe we can add a field to the ResourceBindingStatus? I am including the timestamp here as well, since it could be useful, but maybe it is not necessary.

type ResourceBindingStatus struct {
	...
	// LastFailoverTime represents the latest timestamp when a workload was failed over.
	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
	// +optional
	LastFailoverTime *metav1.Time `json:"lastFailoverTime,omitempty"`
	...
}
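
For illustration only, a failover controller could stamp such a field at the moment it triggers a failover. This small sketch uses a local stand-in struct, since the field does not exist in the Karmada API yet.

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// resourceBindingStatus is a local stand-in containing only the proposed field;
// the real ResourceBindingStatus lives in the Karmada work API.
type resourceBindingStatus struct {
	LastFailoverTime *metav1.Time `json:"lastFailoverTime,omitempty"`
}

// recordFailover stamps the proposed LastFailoverTime whenever a failover is
// triggered, so downstream webhooks can tell a rescheduled workload apart from
// a first-time scheduling.
func recordFailover(status *resourceBindingStatus) {
	now := metav1.Now()
	status.LastFailoverTime = &now
}

func main() {
	var status resourceBindingStatus
	recordFailover(&status)
	fmt.Println(status.LastFailoverTime.UTC().Format(time.RFC3339))
}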

@XiShanYongYe-Chang (Member)

Thanks, I understand what you mean by this field.

One more question,

before the job is rescheduled.

Do we need to pause and wait for the verification to complete before rescheduling? Or can the rescheduling logic proceed regardless of whether validation is still being performed?

@Dyex719 (Author) commented May 29, 2024

Verification/Mutation is done on the member clusters and is performed after Karmada has rescheduled the job. The flow is:

  1. Karmada reschedules the job from Cluster A to Cluster B
  2. Verification of label and reading from last state is done by Kyverno/Webhook on Cluster B
  3. Job is restored from last state on Cluster B.

So the pausing is done by Kyverno/the webhook on the member cluster, which I believe is blocking in nature.

@XiShanYongYe-Chang (Member)

Thanks for your explanation @Dyex719

It looks like the Karmada control plane doesn't need to be aware of the check or pause for it.

I had assumed the label would be added during rescheduling and cleaned up after rescheduling. In your description, when would the label be cleaned up?

@Dyex719 (Author) commented May 30, 2024

We didn't think about a cleanup strategy; the idea was mainly to update the label with the latest time in case of multiple failures.

The label is only added on rescheduling due to failure, and would not be added if the job is being scheduled for the first time. What are your reasons for needing a cleanup operation?

@XiShanYongYe-Chang (Member)

Thanks, I get it.
Let me invite more folks to help take a look.
/cc @chaunceyjiang @RainbowMango @whitewindmills @chaosi-zju

@mszacillo (Contributor)

Hi @XiShanYongYe-Chang, @chaunceyjiang, thanks for taking a look at this issue!

At the moment we've been able to make a quick fix for this by altering the ResourceBindingStatus as was previously suggested. We added a method updateFailoverStatus to the rb_application_failover_controller (it will also be added to the cluster failover controller), which appends a new condition to the ResourceBindingStatus indicating a previous failover.

In pkg/controllers/binding/common.go we check the failover condition, and if it is present we append a label to the Work being created during rescheduling. We would prefer using an actual field in the status rather than a condition, but for some reason I had difficulty getting the LastFailoverTime status to update correctly. We will need to investigate this.
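
This is not the actual branch linked below, just a rough sketch of that shape using the standard condition helpers; the condition type, reason, and label key here are made-up placeholders.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Placeholder names; a real controller would define its own constants.
	conditionTypeFailover = "ResourceFailover"
	failoverLabelKey      = "resourcebinding.karmada.io/failover"
)

// markFailover is roughly what an updateFailoverStatus helper could do:
// append a condition to the binding status recording that a failover happened.
func markFailover(conditions *[]metav1.Condition, reason string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    conditionTypeFailover,
		Status:  metav1.ConditionTrue,
		Reason:  reason,
		Message: "workload was rescheduled due to failover",
	})
}

// labelsForWork is roughly what the check in common.go could do: if the
// failover condition is set, propagate a label onto the Work created during
// rescheduling so the member cluster can see the transition.
func labelsForWork(conditions []metav1.Condition, labels map[string]string) map[string]string {
	if meta.IsStatusConditionTrue(conditions, conditionTypeFailover) {
		if labels == nil {
			labels = map[string]string{}
		}
		labels[failoverLabelKey] = "true"
	}
	return labels
}

func main() {
	var conditions []metav1.Condition
	markFailover(&conditions, "ApplicationFailover")
	fmt.Println(labelsForWork(conditions, nil)) // map[resourcebinding.karmada.io/failover:true]
}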

You can view the code here: master...mszacillo:karmada:failover-label, but please note this is not PR ready and was mostly just a quick fix for testing on our end!

@XiShanYongYe-Chang (Member)

Hi @mszacillo @Dyex719, thanks.

If I may ask, which company are you from? Have you started using Karmada?

@mszacillo (Contributor)

No problem. We're from Bloomberg - at the moment we aren't using Karmada in production environments, but we are investigating the work necessary to get Karmada working for stateful application failover. Perhaps this is something we can discuss more during the community meeting.

@XiShanYongYe-Chang (Member)

@mszacillo Thanks a lot~

@RainbowMango (Member)

I'm interested in this use case that helps stateful workloads, like FlinkDeployment, to resume processing.

When Karmada reschedules the application to a different member cluster, a webhook like Kyverno could mutate the spec by checking for this label ("failover" : true) and restart the application from the last state only if this label exists

@Dyex719 Can you share with us exactly which FlinkDeploymentSpec field Kyverno would mutate?

In addition, as far as I know, there will be checkpoint storage; do you need to migrate that data across clusters?

@Dyex719 (Author) commented Jun 4, 2024

Hi @RainbowMango,

Please refer to this section and the upgrade modes
There are a few fields that would be mutated by Kyverno:

initialSavepointPath: "desired checkpoint path to be resumed from (s3p://)"
upgradeMode: savepoint
state: running

The checkpoint storage will need to be replicated across data centers so that the state can be resumed from the last checkpoint. If this is supported no migration is needed. Essentially, the checkpoint storage should be independent of cluster failure.

@RainbowMango (Member)

Thanks, the idea is impressive!
Then, Kyverno won't be involved in complex logic, like checking or syncing checkpoints and so on (that's my concern).
Kyverno just needs to be aware that the FlinkDeployment being created is migrating from another cluster, and it will then adjust accordingly based on the configuration (like labels) of the FlinkDeployment itself.

@Dyex719 (Author) commented Jun 5, 2024

Yup, that's exactly correct! There are a few small issues with how Flink works, though:

Flink creates a new jobID every time a new job is scheduled, so when migration happens another new ID is created. This is a problem because the checkpoint to restore from is of the form s3p:///jobID/<checkpoint_number>, where the jobID to restore from is the previous jobID from before migration takes place.

Since this previous jobID is not stored anywhere, we will need Karmada to carry this field over.
If you think Karmada should support stateful applications like this, we can talk about how to handle such cases. I created #5006 to discuss a generic framework for such stateful applications.
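
To illustrate what carrying the field over could look like: if Karmada (or any controller) preserved the previous jobID in an annotation, the webhook on the new cluster could rebuild the savepoint path from it. The annotation key, bucket layout, and checkpoint name below are assumptions, not existing Karmada behavior.

package main

import (
	"fmt"
	"path"
)

// previousJobIDAnnotation is a hypothetical annotation that would be carried
// over from the old cluster, since Flink only exposes the jobID in
// .status.jobStatus.jobId of the job being evicted.
const previousJobIDAnnotation = "flinkdeployment.karmada.io/previous-job-id"

// savepointPathFor rebuilds the checkpoint path of the previous job, following
// the s3p://<base>/<jobID>/<checkpoint> layout described above.
func savepointPathFor(annotations map[string]string, base, checkpoint string) (string, bool) {
	jobID := annotations[previousJobIDAnnotation]
	if jobID == "" {
		return "", false // first run: nothing to resume from
	}
	return "s3p://" + path.Join(base, jobID, checkpoint), true
}

func main() {
	annotations := map[string]string{previousJobIDAnnotation: "a1b2c3d4"}
	if p, ok := savepointPathFor(annotations, "flink-checkpoints", "chk-42"); ok {
		fmt.Println(p) // s3p://flink-checkpoints/a1b2c3d4/chk-42
	}
}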

@mszacillo (Contributor)

@RainbowMango Since there are a lot of moving parts here with varying degrees of urgency, would it be recommended for us to create a proposal document with all the requirements for supporting FlinkDeployment (and other stateful) failover?

@RainbowMango (Member)

@Dyex719 Yeah, I can see the jobId (.status.jobStatus.jobId) on the FlinkDeployment, but I don't know much about it.
Each time FlinkDeployment launches a job, it will assign a unique ID for it, which happens to be used as the checkpoint storage path, resulting in this checkpoint being usable only by the current job. Please correct me if I'm wrong.
Are there any improvements that could be made to the Flink operator? Like allowing the user to specify a job ID or a specific checkpoint path. Just a guess. What do you think?

@RainbowMango (Member)

@mszacillo Sure! It would be great to have a proposal to address all these things!

I believe that the migration of a stateful application is a very challenging task, but it's something of great value. What I'm thinking is that perhaps there are tasks that can be done before the migration, such as scalable mechanisms to allow users to do some preparatory work, and some tasks that can be done after the migration, just like the approach we are talking about here.
Thank you in advance.
Here is the proposal template you might need.

@Dyex719 (Author) commented Jun 6, 2024

@RainbowMango

Each time FlinkDeployment launches a job, it will assign a unique ID for it, which happens to be used as the checkpoint storage path, resulting in this checkpoint being usable only by the current job.

This is correct.

Yeah, I can see the #4905 (comment)

As you mentioned, the job ID is only present in the status so this cannot be accessed by Kyverno as the job is not running yet. This is the main problem.

Are there any improvements that could be made to the Flink operator? Like allowing the user to specify a job ID or a specific checkpoint path?

It is technically possible to add a static job ID to create a specific checkpoint path, but this is not ideal. Any sort of user error, like creating a new job or restarting from a checkpoint, may result in overwriting previous checkpoints, which is dangerous.

There is some interest in the community in making the process of reading from a checkpoint easier, for example: https://issues.apache.org/jira/browse/FLINK-9043. However, this issue has been open for a long time.

We will work on the proposal and can then hopefully talk about this later!

@RainbowMango (Member)

/close

This requirement has been addressed by #5788 and the feature has been included in release-1.12.

@karmada-bot (Collaborator)

@RainbowMango: Closing this issue.

In response to this:

/close

This requirement has been addressed by #5788 and the feature has been included in release-1.12.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
