
[Feature] Stateful Application Failover Support #5788

Open
10 of 14 tasks
RainbowMango opened this issue Nov 5, 2024 · 29 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

@RainbowMango
Member

RainbowMango commented Nov 5, 2024

Summary
Karmada’s scheduling logic runs on the assumption that the resources it schedules and reschedules are stateless. In some cases, users may want to preserve a certain state so that applications can resume from where they left off in the previous cluster.

For CRDs dealing with data-processing (such as Flink or Spark), it can be particularly useful to restart applications from a previous checkpoint. That way applications can seamlessly resume processing data while avoiding double processing.

This feature aims to introduce a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.
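
For illustration, here is a rough sketch of the kind of PropagationPolicy configuration this feature aims to enable (field names follow the draft design; the alias label name and jsonPath values are illustrative):

spec:
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Immediately
      statePreservation:
        rules:
          # Grab a value from the old workload's status and expose it as a label
          # on the rescheduled workload; the alias label name is user-defined.
          - aliasLabelName: resourcebinding.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobId }"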

Proposal

Iteration Tasks -- Part-1: Ensure the scheduler skips the cluster that triggered the failover

Iteration Tasks -- Part-2: state preservation and feed

Iteration Tasks -- Part-3: failover history
The failover history might be optional as we don't rely on it.
TBD: based on #5251

Related issues:

@RainbowMango added the kind/feature label Nov 5, 2024
@mszacillo
Contributor

Looks great, thank you!

Could we add a checklist item to include a default failoverType label onto the resource that has been failed over?

@RainbowMango
Member Author

Could we add a checklist item to include a default failoverType label onto the resource that has been failed over?

I don't have a strong feeling that we need it, because according to the draft design you can set the label name to whatever you expect. For instance, you could declare the label name as karmada.io/failover-flink-checkpoint.
Then you can configure Kyverno with that label. Am I right?
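
For example, a rough (untested) Kyverno policy sketch that reacts to such a label could look like the following; the label name follows the example above, and the savepoint path is a placeholder:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: flink-failover-restore
spec:
  rules:
    - name: inject-initial-savepoint
      match:
        any:
          - resources:
              kinds:
                - FlinkDeployment
              selector:
                matchExpressions:
                  # Only act on workloads carrying the user-chosen failover label.
                  - key: karmada.io/failover-flink-checkpoint
                    operator: Exists
      mutate:
        patchStrategicMerge:
          spec:
            job:
              # The real policy would resolve the latest checkpoint here,
              # e.g. from the label value or an external lookup.
              initialSavepointPath: "<path-to-latest-checkpoint>"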

@RainbowMango
Member Author

@mszacillo I'm trying to split the whole feature into small pieces, hoping more people could get involved and accelerate development.

For now, it's a work in progress, but I'm glad you noticed it. Let me know if you have any comments or questions.

@mszacillo
Contributor

@RainbowMango I think that's a good idea, and having this feature available faster would be great. :)

Do you have a preference on who will be working on which task? If not I can pick up the introduction of PurgeMode to the GracefulEvictionTask today.

In addition, could we start a Slack working group channel? Given the time differences, I think being able to have more rapid conversations on Slack would improve the implementation pace.

@mszacillo
Contributor

I don't have a strong feeling that we need it, because according to the draft design you can set the label name to whatever you expect.

That's true, we can simply declare our own label name for the use-case. In the case of a failover, it might be helpful to distinguish between cluster + application failovers, and only Karmada has the context. But perhaps I'm creating a use-case before it's even appeared.

@RainbowMango
Member Author

Do you have a preference on who will be working on which task? If not I can pick up the introduction of PurgeMode to the GracefulEvictionTask today.

Sure, go for it! I've assigned this task to you.
I think you are the feature owner; it would be great if you could work on it :)
Generally speaking, anyone can take a task without an assignment by leaving a comment here. The issue owner (me, in this case) will assign it by adding the name to the end of the task.

@RainbowMango
Member Author

RainbowMango commented Nov 7, 2024

In the case of a failover, it might be helpful to distinguish between cluster + application failovers, and only Karmada has the context. But perhaps I'm creating a use-case before it's even appeared.

Yeah, the only benefit I can see is that it might help to distinguish failover types, but I think there is no rush to do it until there is a solid use case. I added a checklist item for this; we can revisit it later.

Double-confirm whether we need to introduce a default label to distinguish the failover type. (Waiting for a real-world use case.)

@RainbowMango
Member Author

Make changes to the RB application failover controller and CRB application failover controller to build eviction task for PurgeMode Immediately. (@mszacillo)

@mszacillo I assigned this task to you according to the discussion on #5821 (review).

@mszacillo
Contributor

@RainbowMango I've got some extra bandwidth, so I can also pick up "Make changes to binding controller and cluster binding controller to cleanup works from cluster in eviction task and purge mode is immediately."

@RainbowMango
Member Author

Great! Assigned that to you.

@mszacillo @XiShanYongYe-Chang
In the coming release (v1.12) we will provide support for application failover. For now, I don't see any blockers, so please let me know if I missed anything.

@mszacillo
Contributor

Hi @RainbowMango @XiShanYongYe-Chang,

I was able to do some testing of the application failover feature - although it works, I've noticed that the state preservation label is only retained on the work temporarily. This was causing some confusion for me while doing the verification. Essentially, what happens is:

  1. The work is scheduled on cluster-A. I cordoned some nodes on cluster-A and killed one of the deployment's pods so that the workload became unhealthy.
  2. After the 20-second toleration, the workload fails over. The gracefulEvictionTask is generated, and the statePreservation label is created and appended to the rescheduled work.
  3. Roughly ~1 second after, the rb_graceful_eviction_controller cleans up the gracefulEvictionTask and triggers a reconcile of the ResourceBinding.
  4. The ResourceBinding controller causes a resync, and while running ensureWork, it does not inject the PreservedLabelState since the gracefulEvictionTask no longer exists. This causes the statePreservation label to be removed from the work almost as soon as it was added...

It would be ideal if this label could be retained, since it may be confusing for users for the label to suddenly disappear right after their workloads fail over. My first thought was we could change the reconciliation for ResourceBinding to skip the sync if a gracefulEvictionTask is removed, but I'm not sure if this would have side effects.

The other option would be to retain the statePreservation label directly on the ResourceBinding, so that it will always be appended to the work in the case of resyncs. Let me know if either of you have opinions.

@XiShanYongYe-Chang
Member

Hi @mszacillo, thanks for your feedback!
It's true that this is the case.
When a new workload is started in a new cluster and its status is determined to be healthy by the resource interpreter, the eviction task processing is triggered. After that, the label is no longer injected into the new workload.

Roughly ~1 second after, the rb_graceful_eviction_controller cleans up the gracefulEvictionTask
I didn't expect the new workload to be healthy so quickly.

If we want to continue to inject these labels, we may need to think again.

@mszacillo
Contributor

If we want to continue to inject these labels, we may need to think again.

To be fair, my test was using a generic nginx deployment. I can do some e2e testing with our Flink use-case, since all that needs to occur is for our custom webhook to see the label, and append the latest checkpoint.

@mszacillo
Contributor

mszacillo commented Dec 9, 2024

Ran some tests using FlinkDeployments and it seems to work as expected! I think we can keep the label as ephemeral. :)

I think this is more so a potential issue for applications that have a very short startup time (become healthy very quickly), but I don't believe this is something that would be common in more complex CRDs.

@XiShanYongYe-Chang
Member

Thanks for your test. If there is any problem with this in the future, we can continue to discuss and optimize it.

@RainbowMango
Member Author

Ran some tests using FlinkDeployments and it seems to work as expected! I think we can keep the label as ephemeral. :)

I think this is more so a potential issue for applications that have a very short startup time (become healthy very quickly), but I don't believe this is something that would be common in more complex CRDs.

Glad to hear that!

I understand that the state preservation label should be removed once it has been consumed, as it can only be used once.
What you are concerned about is essentially observability; we can consider improving that specifically to ensure the migration process is fully observable.

@RainbowMango
Member Author

@mszacillo I see you added two agenda items to today's meeting. I wonder if we can move them to the next meeting or schedule another one? I'm not feeling well today, so I'm afraid I cannot attend this meeting. I also left a message for you on Slack.

@RainbowMango
Member Author

Ran some tests using FlinkDeployments and it seems to work as expected! I think we can keep the label as ephemeral. :)

@mszacillo
Can you share the PropagationPolicy configuration here? I'm about to write a blog for the v1.12 release, and I hope to add an example demonstrating statePreservation usage.

@mszacillo
Contributor

Can you share the PropagationPolicy configuration here? I'm about to write a blog for the v1.12 release, and I hope to add an example demonstrating statePreservation usage.

Hi @RainbowMango! Yes, here is the configuration I was using for our testing, which has been working well so far. For the jsonPath I didn't use a valid jsonPath, but I just wanted a static value for the failover key so that we can signal our custom webhook to fetch the latest checkpoint for the application to recover from.

spec:
  conflictResolution: Abort
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: resourcebinding.karmada.io/failover
            jsonPath: "true"

@RainbowMango added this to the v1.12 milestone Dec 17, 2024
@XiShanYongYe-Chang
Member

XiShanYongYe-Chang commented Dec 17, 2024

For the jsonPath I didn't use a valid jsonPath, but I just wanted a static value

Hi @mszacillo, we don't have the ability to get static values when parsing jsonPath, so I'm wondering if you've added extra extensions?

Ignore me.
It works; it looks like I've learned something new, thanks! I also need to add a new test case.

@RainbowMango
Member Author

RainbowMango commented Dec 17, 2024

@mszacillo
I'm glad to hear that it works.
However, this usage is not what we expected. The configuration appears ambiguous; to be more precise, since the jsonPath is a non-existent path, the behavior becomes unpredictable, and the controller might ignore this rule, resulting in no label being injected during the failover process.

So, could you share more details about what Kyverno does when it detects a FlinkDeployment with the label resourcebinding.karmada.io/failover: true? Let's see if we need to enhance the feature, for example by supporting injection of a fixed, configured label during failover instead of trying to grab it from the status.

@mszacillo
Contributor

mszacillo commented Dec 17, 2024

Completely fair! Let's say it's user error. :)

After double-checking, I noticed that the flink-operator should publish checkpointInfo to the status of the job, which we can reuse as part of the PropagationPolicy and remove the Kyverno hook entirely. I'll try to upgrade our operator version so that it includes these fields and let you know how the test goes.

[edit by @RainbowMango]:
See the CheckpointInfo definition here: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/#checkpointinfo

@mszacillo
Contributor

mszacillo commented Dec 17, 2024

@RainbowMango

I've updated our PropagationPolicy to instead grab the published jobID from the status of the FlinkDeployment, which is more in line with the intended usage of this feature:

spec:
  conflictResolution: Abort
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: resourcebinding.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobId }"

Our Kyverno policy reads in the jobID so that it can fetch the latest checkpoint for the job, which is saved in a shared data store under the path /<shared-path>/<job-namespace>/<jobId>/checkpoints. The policy fetches the latest valid key from the list, which we then use to update the initialSavepointPath. That way the job correctly restarts from the last saved state.

job:
   initialSavepointPath: /<shared-path>/<job-namespace>/<jobId>/checkpoints/chk-45

We were being a little clever and keeping the jobID static for the time being, which is how we were able to get away with just setting a static failover flag.


To add more context for our use-case: in order to successfully fail over, we need the latest checkpoint path. The existing checkpointInfo that the Flink Operator publishes only provides the triggerId, rather than the full path to the latest checkpoint:

jobStatus:
  checkpointInfo:
    formatType: FULL
    lastCheckpoint:
      formatType: FULL
      timeStamp: 1734462491693
      triggerType: PERIODIC
    lastPeriodicCheckpointTimestamp: 1734462491693
    triggerId: 447b790bb9c88da9ae753f00f45acb0e
    triggerTimestamp: 1734462506836
    triggerType: PERIODIC
  jobId: e6fdb5c0997c11b0c62d796b3df25e86

It would be simple enough to just augment the checkpointInfo status to add in the checkpointId, but checkpointInfo was deprecated recently in favor of using FlinkStateSnapshots, which are different CRDs that can track the latest savepoint and checkpoint information: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/snapshots/. We hadn't considered updating this status ourselves before since we don't want to maintain a separate fork of the operator.

So what's the move going forward? For now I believe we'll stick with the existing Kyverno webhook, which fetches the latest checkpoint using the previous jobID appended by Karmada via the state preservation feature. Longer term, we have two options:

  1. Augment Flink's jobStatus.checkpointInfo to include the checkpointId, which we can directly pass in using Karmada's state preservation and set the initialSavepointPath using Kyverno. I'll check with the Flink community to see how they feel.
  2. Read the status of FlinkStateSnapshot resource to fetch the checkpoint for the job failing over. I'll have to research this option as this is a separate resource, and it would require the statePreservation feature to fetch status information from a different binding, which is not currently supported.

@RainbowMango
Member Author

So what's the move going forward? For now I believe we'll stick with the existing Kyverno webhook, which fetches the latest checkpoint using the previous jobID appended by Karmada via the state preservation feature.

Yeah, I agree. That's exactly the usage that we designed.
Thanks for the clarification!
The whole process would be:

  1. Karmada grabs { .jobStatus.jobId } from the old FlinkDeployment
  2. Karmada feeds the preserved state to the new FlinkDeployment via the label resourcebinding.karmada.io/failover-jobid: <jobID>
  3. Kyverno gets the checkpoint data via the job ID and generates the initialSavepointPath: /<shared-path>/<job-namespace>/<jobId>/checkpoints/chk-45.
  4. Kyverno injects the initialSavepointPath into the new FlinkDeployment
  5. The Flink operator then restarts the Flink job from the savepoint.
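
Putting steps 2-4 together, the rescheduled FlinkDeployment would end up looking roughly like this (the jobId and path values are illustrative):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  labels:
    # Step 2: injected by Karmada from the old FlinkDeployment's status.
    resourcebinding.karmada.io/failover-jobid: e6fdb5c0997c11b0c62d796b3df25e86
spec:
  job:
    # Steps 3-4: injected by Kyverno after looking up the latest checkpoint for that job ID.
    initialSavepointPath: /<shared-path>/<job-namespace>/e6fdb5c0997c11b0c62d796b3df25e86/checkpoints/chk-45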

@mszacillo
Contributor

No problem, and yes exactly! It would be amazing if we can get the upstream flink operator community to add a checkpointPath field to their checkpointStatus. It would simplify the flow even further:

  1. Karmada grabs { .jobStatus.checkpointInfo.checkpointPath } from the old FlinkDeployment
  2. Karmada feeds the info into a corresponding label on the new FlinkDeployment
  3. Kyverno injects the value of the label into the initialSavepointPath.
  4. The Flink operator starts Flink from the checkpoint.

But for now we'll keep our logic as is.
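
If that checkpointPath field ever materializes, the Kyverno side could shrink to a single mutate rule that copies the preserved label value straight into the job spec. A hypothetical sketch (neither the checkpointPath field nor this label name exists today), reusing the rule structure sketched earlier in this thread:

mutate:
  patchStrategicMerge:
    spec:
      job:
        # Copy the checkpoint path that Karmada preserved as a label (hypothetical label name).
        initialSavepointPath: '{{ request.object.metadata.labels."resourcebinding.karmada.io/failover-checkpointpath" }}'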

@RainbowMango
Member Author

But for now we'll keep our logic as is.

Sure, that's amazing!

It would be amazing if we can get the upstream flink operator community to add a checkpointPath field to their checkpointStatus.

This sounds very challenging; just to share my two cents:

  1. The Flink operator has already deprecated checkpointInfo, and it is unlikely to iterate on this feature anymore.
  2. The checkpointPath looks like a field that changes frequently, so it would not be practical for the operator to update it regularly.

Anyway, you can talk to the community and keep me updated.

@mszacillo
Contributor

Hi @RainbowMango @XiShanYongYe-Chang,

I've been working on adding support for cluster failover by generalizing the BuildTaskOptions method to support resources both with and without FailoverBehavior defined. I performed some e2e testing with a mix of different resources, and things seem to be working well.

I haven't published a PR yet, but the changes are in my branch here. Please take a look and let me know if you have comments or concerns.

@RainbowMango
Member Author

Emm, it appears that you want to apply the application failover behavior to the process of cluster failover. That's interesting!

I've been thinking about how to iterate on the cluster failover feature since the last release. I strongly believe this feature should be cautiously evaluated before it is used in production; that's exactly why we disabled it by default in release-1.12. I will let you know if I come up with more detailed ideas or plans about the cluster failover feature. Until then, I will hesitate to push features related to cluster failover.

@mszacillo
Contributor

I strongly believe this feature should be cautiously evaluated before it is used in production; that's exactly why we disabled it by default in release-1.12

That makes sense! We're doing a bit more stress testing of application failover on our side, so having the feature off by default until it is a bit more mature is fine.

it appears that you want to apply the application failover behavior to the process of cluster failover

The thinking here was that we already generate graceful eviction tasks for all resources that are evicted from a cluster due to a taint. Obviously a cluster can have a large variety of different resources on it which may not have failover configured, so they should not have any graceful eviction options related to failover. That was why I decided to generalize the BuildTaskOptions method.

For cluster failover, is there a reason why we are apprehensive about re-using the method of how we preserve state via GracefulEvictionTasks? I may be missing context here.

Projects
Status: Planned In Release 1.12