
[Feature] Stateful Application Failover Support #5788

Open
10 of 14 tasks
RainbowMango opened this issue Nov 5, 2024 · 29 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

@RainbowMango
Member

RainbowMango commented Nov 5, 2024

Summary
Karmada’s scheduling logic runs on the assumption that the resources it schedules and reschedules are stateless. In some cases, users may want to preserve a certain state so that applications can resume from where they left off in the previous cluster.

For CRDs dealing with data-processing (such as Flink or Spark), it can be particularly useful to restart applications from a previous checkpoint. That way applications can seamlessly resume processing data while avoiding double processing.

This feature aims to introduce a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.
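
For illustration, here is a rough sketch of the kind of PropagationPolicy configuration this feature aims to enable (field names follow the draft design; the alias label name and jsonPath values are illustrative):

spec:
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Immediately
      statePreservation:
        rules:
          # Grab a value from the old workload's status and expose it as a label
          # on the rescheduled workload; the alias label name is user-defined.
          - aliasLabelName: resourcebinding.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobId }"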

Proposal

Iteration Tasks -- Part-1: Ensure the scheduler skips the cluster that triggered the failover

Iteration Tasks -- Part-2: state preservation and feed

Iteration Tasks -- Part-3: failover history
The failover history might be optional as we don't rely on it.
TBD: based on #5251

Related issues:

@RainbowMango added the kind/feature label Nov 5, 2024
@mszacillo
Contributor

Looks great, thank you!

Could we add a checklist item to include a default failoverType label onto the resource that has been failed over?

@RainbowMango
Member Author

Could we add a checklist item to include a default failoverType label onto the resource that has been failed over?

I don't have a strong feeling that we need it, because according to the draft design you can set the label name to whatever you expect. For instance, you could declare the label name as karmada.io/failover-flink-checkpoint.
Then you can configure Kyverno with that label. Am I right?
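
For example, a rough (untested) Kyverno policy sketch that reacts to such a label could look like the following; the label name follows the example above, and the savepoint path is a placeholder:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: flink-failover-restore
spec:
  rules:
    - name: inject-initial-savepoint
      match:
        any:
          - resources:
              kinds:
                - FlinkDeployment
              selector:
                matchExpressions:
                  # Only act on workloads carrying the user-chosen failover label.
                  - key: karmada.io/failover-flink-checkpoint
                    operator: Exists
      mutate:
        patchStrategicMerge:
          spec:
            job:
              # The real policy would resolve the latest checkpoint here,
              # e.g. from the label value or an external lookup.
              initialSavepointPath: "<path-to-latest-checkpoint>"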

@RainbowMango
Member Author

@mszacillo I'm trying to split the whole feature into small pieces, hoping more people could get involved and accelerate development.

For now, it's a work in progress, but I'm glad you noticed it. Let me know if you have any comments or questions.

@mszacillo
Contributor

@RainbowMango I think that's a good idea, and having this feature available faster would be great. :)

Do you have a preference on who will be working on which task? If not I can pick up the introduction of PurgeMode to the GracefulEvictionTask today.

In addition, could we start a Slack working group channel? Given the time differences, I think being able to have more rapid conversations on Slack would improve the implementation pace.

@mszacillo
Contributor

I don't have a strong feeling that we need it, because according to the draft design you can set the label name to whatever you expect.

That's true, we can simply declare our own label name for the use-case. In the case of a failover, it might be helpful to distinguish between cluster + application failovers, and only Karmada has the context. But perhaps I'm creating a use-case before it's even appeared.

@RainbowMango
Member Author

Do you have a preference on who will be working on which task? If not I can pick up the introduction of PurgeMode to the GracefulEvictionTask today.

Sure, go for it! I've assigned this task to you.
I think you are the feature owner; it would be great if you could work on it :)
Generally speaking, anyone can take a task without an assignment by leaving a comment here. The issue owner (me, in this case) will assign it by adding the name to the end of the task.

@RainbowMango
Member Author

RainbowMango commented Nov 7, 2024

In the case of a failover, it might be helpful to distinguish between cluster + application failovers, and only Karmada has the context. But perhaps I'm creating a use-case before it's even appeared.

Yeah, the only benefit I can see is that it might help to distinguish failover types, but I think there is no rush to do it until there is a solid use case. I added a checklist item for this; we can revisit it later.

Double-confirm whether we need to introduce a default label to distinguish the failover type. (Waiting for a real-world use case.)

@RainbowMango
Member Author

Make changes to the RB application failover controller and CRB application failover controller to build eviction task for PurgeMode Immediately. (@mszacillo)

@mszacillo I assigned this task to you according to the discussion on #5821 (review).

@mszacillo
Contributor

@RainbowMango I've got some extra bandwidth, so I can also pick up "Make changes to binding controller and cluster binding controller to cleanup works from cluster in eviction task and purge mode is immediately."

@RainbowMango
Member Author

Great! Assigned that to you.

@mszacillo @XiShanYongYe-Chang
In the coming release (v1.12) we will provide support for application failover. For now, I don't see any blockers, so please let me know if I missed anything.

@mszacillo
Contributor

Hi @RainbowMango @XiShanYongYe-Chang,

I was able to do some testing of the application failover feature - although it works, I've noticed that the state preservation label is only retained on the work temporarily. This was causing some confusion for me while doing the verification. Essentially, what happens is:

  1. The work is scheduled on cluster-A. I cordoned some nodes on cluster-A and killed one of the deployment's pods so that the workload became unhealthy.
  2. After the 20-second toleration, the workload fails over. The gracefulEvictionTask is generated, and the statePreservation label is created and appended to the rescheduled work.
  3. Roughly ~1 second after, the rb_graceful_eviction_controller cleans up the gracefulEvictionTask and triggers a reconcile of the ResourceBinding.
  4. The ResourceBinding controller causes a resync, and while running ensureWork, it does not inject the PreservedLabelState since the gracefulEvictionTask no longer exists. This causes the statePreservation label to be removed from the work almost as soon as it was added...

It would be ideal if this label could be retained, since it may be confusing for users for the label to suddenly disappear right after their workloads fail over. My first thought was we could change the reconciliation for ResourceBinding to skip the sync if a gracefulEvictionTask is removed, but I'm not sure if this would have side effects.

The other option would be to retain the statePreservation label directly on the ResourceBinding, so that it will always be appended to the work in the case of resyncs. Let me know if either of you have opinions.

@XiShanYongYe-Chang
Member

Hi @mszacillo, thanks for your feedback!
It's true that this is the case.
When a new workload is started in a new cluster and its status is determined to be healthy by the resource interpreter, the eviction task processing is triggered. After that, the label is no longer injected into the new workload.

Roughly ~1 second after, the rb_graceful_eviction_controller cleans up the gracefulEvictionTask
I didn't expect the new workload to be healthy so quickly.

If we want to continue to inject these labels, we may need to think again.

@mszacillo
Contributor

If we want to continue to inject these labels, we may need to think again.

To be fair, my test was using a generic nginx deployment. I can do some e2e testing with our Flink use-case, since all that needs to occur is for our custom webhook to see the label, and append the latest checkpoint.

@mszacillo
Contributor

mszacillo commented Dec 9, 2024

Ran some tests using FlinkDeployments and it seems to work as expected! I think we can keep the label as ephemeral. :)

I think this is more so a potential issue for applications that have a very short startup time (become healthy very quickly), but I don't believe this is something that would be common in more complex CRDs.

@XiShanYongYe-Chang
Member

Thanks for your test. If there is any problem with this in the future, we can continue to discuss and optimize it.

@RainbowMango
Member Author

Ran some tests using FlinkDeployments and it seems to work as expected! I think we can keep the label as ephemeral. :)

I think this is more so a potential issue for applications that have a very short startup time (become healthy very quickly), but I don't believe this is something that would be common in more complex CRDs.

Glad to hear that!

I understand that the state preservation label should be removed once it has been consumed, as it can only be used once.
What you are concerned about is essentially observability; we can consider improving that specifically to ensure the migration process is fully observable.

@RainbowMango
Member Author

@mszacillo I see you added two agenda items to today's meeting. I wonder if we can move them to the next meeting or schedule another one? I'm not feeling well today, so I'm afraid I cannot attend this meeting. I also left a message for you on Slack.

@RainbowMango
Member Author

Ran some tests using FlinkDeployments and it seems to work as expected! I think we can keep the label as ephemeral. :)

@mszacillo
Can you share the PropagationPolicy configuration here? I'm about to write a blog for the v1.12 release, and I hope to add an example demonstrating statePreservation usage.

@mszacillo
Contributor

Can you share the PropagationPolicy configuration here? I'm about to write a blog for the v1.12 release, and I hope to add an example demonstrating statePreservation usage.

Hi @RainbowMango! Yes, here is the configuration I was using for our testing, which has been working well so far. For the jsonPath I didn't use a valid jsonPath, but I just wanted a static value for the failover key so that we can signal our custom webhook to fetch the latest checkpoint for the application to recover from.

spec:
  conflictResolution: Abort
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: resourcebinding.karmada.io/failover
            jsonPath: "true"

@RainbowMango added this to the v1.12 milestone Dec 17, 2024
@XiShanYongYe-Chang
Member

XiShanYongYe-Chang commented Dec 17, 2024

For the jsonPath I didn't use a valid jsonPath, but I just wanted a static value

Hi @mszacillo, we don't have the ability to get static values when parsing jsonPath, so I'm wondering if you've added extra extensions?

Ignore me.
It works; it looks like I've learned something new, thanks! I also need to add a new test case.

@RainbowMango
Member Author

RainbowMango commented Dec 17, 2024

@mszacillo
I'm glad to hear that it works.
However, this usage is not what we expected. The configuration appears ambiguous; to be more precise, since the jsonPath is a non-existent path, the behavior becomes unpredictable, and the controller might ignore this rule, resulting in no label being injected during the failover process.

So, could you share more details about what Kyverno does when it detects a FlinkDeployment with the label resourcebinding.karmada.io/failover: true? Let's see if we need to enhance the feature, for example by supporting injection of a fixed, configured label during failover instead of trying to grab it from the status.

@mszacillo
Contributor

mszacillo commented Dec 17, 2024

Completely fair! Let's say it's user error. :)

After double-checking, I noticed that the flink-operator should publish checkpointInfo to the status of the job, which we can reuse as part of the PropagationPolicy and remove the Kyverno hook entirely. I'll try to upgrade our operator version so that it includes these fields and let you know how the test goes.

[edit by @RainbowMango]:
See the CheckpointInfo definition here: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/#checkpointinfo

@mszacillo
Contributor

mszacillo commented Dec 17, 2024

@RainbowMango

I've updated our PropagationPolicy to instead grab the published jobID from the status of the FlinkDeployment, which is more in line with the intended usage of this feature:

spec:
  conflictResolution: Abort
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: resourcebinding.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobId }"

Our Kyverno policy reads in the jobID so that it can fetch the latest checkpoint for the job, which is saved in a shared data store under the path /<shared-path>/<job-namespace>/<jobId>/checkpoints. The policy fetches the latest valid key from the list, which we then use to update the initialSavepointPath. That way the job correctly restarts from the last saved state.

job:
   initialSavepointPath: /<shared-path>/<job-namespace>/<jobId>/checkpoints/chk-45

We were being a little clever and keeping the jobID static for the time being, which is how we were able to get away with just setting a static failover flag.


To add more context for our use-case: in order to successfully fail over, we need the latest checkpoint path. The existing checkpointInfo that the Flink Operator publishes only provides the triggerId, rather than the full path to the latest checkpoint:

jobStatus:
  checkpointInfo:
    formatType: FULL
    lastCheckpoint:
      formatType: FULL
      timeStamp: 1734462491693
      triggerType: PERIODIC
    lastPeriodicCheckpointTimestamp: 1734462491693
    triggerId: 447b790bb9c88da9ae753f00f45acb0e
    triggerTimestamp: 1734462506836
    triggerType: PERIODIC
  jobId: e6fdb5c0997c11b0c62d796b3df25e86

It would be simple enough to just augment the checkpointInfo status to add in the checkpointId, but checkpointInfo was deprecated recently in favor of using FlinkStateSnapshots, which are different CRDs that can track the latest savepoint and checkpoint information: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/snapshots/. We hadn't considered updating this status ourselves before since we don't want to maintain a separate fork of the operator.

So what's the move going forward? For now I believe we'll stick with the existing Kyverno webhook, which fetches the latest checkpoint using the previous jobID appended by Karmada via the state preservation feature. Longer term, we have two options:

  1. Augment Flink's jobStatus.checkpointInfo to include the checkpointId, which we can directly pass in using Karmada's state preservation and set the initialSavepointPath using Kyverno. I'll check with the Flink community to see how they feel.
  2. Read the status of FlinkStateSnapshot resource to fetch the checkpoint for the job failing over. I'll have to research this option as this is a separate resource, and it would require the statePreservation feature to fetch status information from a different binding, which is not currently supported.

@RainbowMango
Member Author

So what's the move going forward? For now I believe we'll stick with the existing Kyverno webhook, which fetches the latest checkpoint using the previous jobID appended by Karmada via the state preservation feature.

Yeah, I agree. That's exactly the usage that we designed.
Thanks for the clarification!
The whole process would be:

  1. Karmada grabs { .jobStatus.jobId } from the old FlinkDeployment
  2. Karmada feeds the preserved state to the new FlinkDeployment via the label resourcebinding.karmada.io/failover-jobid: <jobID>
  3. Kyverno gets the checkpoint data via the job ID and generates the initialSavepointPath: /<shared-path>/<job-namespace>/<jobId>/checkpoints/chk-45.
  4. Kyverno injects the initialSavepointPath into the new FlinkDeployment
  5. The Flink operator then restarts the Flink job from the savepoint.
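
Putting steps 2-4 together, the rescheduled FlinkDeployment would end up looking roughly like this (the jobId and path values are illustrative):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  labels:
    # Step 2: injected by Karmada from the old FlinkDeployment's status.
    resourcebinding.karmada.io/failover-jobid: e6fdb5c0997c11b0c62d796b3df25e86
spec:
  job:
    # Steps 3-4: injected by Kyverno after looking up the latest checkpoint for that job ID.
    initialSavepointPath: /<shared-path>/<job-namespace>/e6fdb5c0997c11b0c62d796b3df25e86/checkpoints/chk-45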

@mszacillo
Contributor

No problem, and yes exactly! It would be amazing if we can get the upstream flink operator community to add a checkpointPath field to their checkpointStatus. It would simplify the flow even further:

  1. Karmada grabs { .jobStatus.checkpointInfo.checkpointPath } from the old FlinkDeployment
  2. Karmada feeds the info into a corresponding label on the new FlinkDeployment
  3. Kyverno injects the value of the label into the initialSavepointPath.
  4. The Flink operator starts Flink from the checkpoint.

But for now we'll keep our logic as is.
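
If that checkpointPath field ever materializes, the Kyverno side could shrink to a single mutate rule that copies the preserved label value straight into the job spec. A hypothetical sketch (neither the checkpointPath field nor this label name exists today), reusing the rule structure sketched earlier in this thread:

mutate:
  patchStrategicMerge:
    spec:
      job:
        # Copy the checkpoint path that Karmada preserved as a label (hypothetical label name).
        initialSavepointPath: '{{ request.object.metadata.labels."resourcebinding.karmada.io/failover-checkpointpath" }}'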

@RainbowMango
Member Author

But for now we'll keep our logic as is.

Sure, that's amazing!

It would be amazing if we can get the upstream flink operator community to add a checkpointPath field to their checkpointStatus.

This sounds very challenging; just to share my two cents:

  1. The Flink operator has already deprecated checkpointInfo, and it is unlikely to iterate on this feature anymore.
  2. The checkpointPath looks like a field that changes frequently, so it would not be practical for the operator to update it regularly.

Anyway, you can talk to the community and keep me updated.

@mszacillo
Contributor

Hi @RainbowMango @XiShanYongYe-Chang,

I've been working on adding support for cluster failover by generalizing the BuildTaskOptions method to support resources both with and without FailoverBehavior defined. I performed some e2e testing with a mix of different resources, and things seem to be working well.

I haven't published a PR yet, but the changes are in my branch here. Please take a look and let me know if you have comments or concerns.

@RainbowMango
Member Author

Emm, it appears that you want to apply the application failover behavior to the process of cluster failover. That's interesting!

I've been thinking about how to iterate on the cluster failover feature since the last release. I strongly believe this feature should be cautiously evaluated before it is used in production; that's exactly why we disabled it by default in release-1.12. I will let you know if I come up with more detailed ideas or plans about the cluster failover feature. Until then, I will hesitate to push features related to cluster failover.

@mszacillo
Contributor

I strongly believe this feature should be cautiously evaluated before it is used in production; that's exactly why we disabled it by default in release-1.12

That makes sense! We're doing a bit more stress testing of application failover on our side, so having the feature off by default until it is a bit more mature is fine.

it appears that you want to apply the application failover behavior to the process of cluster failover

The thinking here was that we already generate graceful eviction tasks for all resources that are evicted from a cluster due to a taint. Obviously a cluster can have a large variety of different resources on it which may not have failover configured, so they should not have any graceful eviction options related to failover. That was why I decided to generalize the BuildTaskOptions method.

For cluster failover, is there a reason why we are apprehensive about re-using the method of how we preserve state via GracefulEvictionTasks? I may be missing context here.

Projects
Status: Planned In Release 1.12