[Feature] Stateful Application Failover Support #5788
Looks great, thank you! Could we add a checklist item to include a default failoverType label onto the resource that has been failed over?
I don't have a strong feeling that we need it, because according to the draft design, you can declare the label name as whatever you expect. For instance, you can declare the label name with
@mszacillo I'm trying to split the whole feature into small pieces, hoping more people can get involved and accelerate development. For now, it's a work in progress, but glad you noticed it; let me know if you have any comments or questions.
@RainbowMango I think that's a good idea, and having this feature available faster would be great. :) Do you have a preference on who will be working on which task? If not, I can pick up the introduction of PurgeMode to the GracefulEvictionTask today. In addition, could we start a Slack working group channel? Given the time differences, I think being able to have more rapid conversations on Slack would improve the implementation pace.
That's true, we can simply declare our own label name for the use case. In the case of a failover, it might be helpful to distinguish between cluster and application failovers, and only Karmada has the context. But perhaps I'm creating a use case before it's even appeared.
Sure, go for it! Assigned this task to you.
Yeah, the only benefit I can see is that it might help to distinguish failover types, but I think there is no rush to do it until there is a solid use case. I added a checklist item for this; we can revisit it later.
@mszacillo assigned this task to you according to the discussion on #5821 (review).
@RainbowMango I've got some extra bandwidth, so I can also pick up "Make changes to binding controller and cluster binding controller to cleanup works from cluster in eviction task and purge mode is immediately."
Great! Assigned that to you. @mszacillo @XiShanYongYe-Chang
Hi @RainbowMango @XiShanYongYe-Chang, I was able to do some testing of the application failover feature - although it works, I've noticed that the state preservation label is only retained on the work temporarily. This was causing some confusion for me while doing the verification. Essentially, what happens is:
It would be ideal if this label could be retained, since it may be confusing for users for the label to suddenly disappear right after their workloads fail over. My first thought was that we could change the reconciliation for ResourceBinding to skip the sync if a gracefulEvictionTask is removed, but I'm not sure if this would have side effects. The other option would be to retain the statePreservation label directly on the ResourceBinding, so that it will always be appended to the work in the case of resyncs. Let me know if either of you has opinions.
Hi @mszacillo, thanks for your feedback!
If we want to continue to inject these labels, we may need to think again.
To be fair, my test was using a generic nginx deployment. I can do some e2e testing with our Flink use case, since all that needs to occur is for our custom webhook to see the label and append the latest checkpoint.
Ran some tests using FlinkDeployments and it seems to work as expected! I think we can keep the label as ephemeral. :) This is more of a potential issue for applications that have a very short startup time (become healthy very quickly), but I don't believe that would be common in more complex CRDs.
Thanks for your test. If there is any problem with this point in the future, we can continue to communicate and optimize it.
Glad to hear that! I understand that the state preservation label should be removed once it has been consumed, as it can only be used once.
@mszacillo I see you added two agenda items to today's meeting; I wonder if we can move them to the next meeting or schedule another one? I'm not feeling well today, and I'm afraid I cannot attend this meeting. I also left a message for you on Slack.
@mszacillo
Hi @RainbowMango! Yes, here is the configuration I was using for our testing, which has been working well so far. For the jsonPath I didn't use a valid jsonPath; I just wanted a static value for the failover key so that we can signal our custom webhook to fetch the latest checkpoint for the application to recover from.

```yaml
spec:
  conflictResolution: Abort
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 120
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: resourcebinding.karmada.io/failover
            jsonPath: "true"
```
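For context, the value resolved by the rule ends up on the workload in the failover-target cluster under the alias label name, so the member-cluster copy looks roughly like the sketch below (the resource name is illustrative, and as noted above the label is removed again once the eviction task is consumed):

```yaml
# Rough sketch of the failed-over resource in the new cluster: Karmada injects
# the preserved value as a label using the aliasLabelName from the rule above.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-flink-job   # illustrative name, not from the thread
  labels:
    resourcebinding.karmada.io/failover: "true"
```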
Ignore me.
@mszacillo So, could you share more details about what Kyverno does when it detects a FlinkDeployment with the label
Completely fair! Let's say it's user error. :) After double checking, I noticed that the flink-operator should publish checkpointInfo to the status of the job, which we can re-use as part of the PropagationPolicy and remove the Kyverno hook entirely. I'll try to upgrade our operator version so that it includes these fields and let you know how the test goes. [edit by @RainbowMango]:
I've updated our PropagationPolicy to instead grab the published jobID from the status of the FlinkDeployment, which is more in line with the intended usage of this feature:
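Not the exact config from the thread, but a minimal sketch of such a status-based rule, assuming the operator publishes the job ID under .jobStatus.jobId and using an illustrative label name:

```yaml
# Sketch only; the label name and JSONPath here are assumptions, not the actual rule.
statePreservation:
  rules:
    - aliasLabelName: resourcebinding.karmada.io/failover-jobid
      jsonPath: "{ .jobStatus.jobId }"
```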
Our kyverno policy reads in the jobID so that it can fetch the latest checkpoint for the job, which is saved on a shared data store under the path
We were being a little clever and keeping the jobID static for the time being, which is how we were able to get away with just setting a static failover flag. To add more context for our use case: in order to successfully fail over, we need the latest checkpoint path. The existing
It would be simple enough to just augment the checkpointInfo status to add in the checkpointId, but checkpointInfo was deprecated recently in favor of using
So what's the move going forward? For now I believe we'll stick with the existing Kyverno webhook, which fetches the latest checkpoint using the previous jobID that is appended by Karmada and the state preservation feature. Longer term, we have two options:
Yeah, I agree. That's exactly the usage that we designed.
No problem, and yes exactly! It would be amazing if we could get the upstream Flink operator community to add a checkpointPath field to their checkpointStatus. It would simplify the flow even further:
But for now we'll keep our logic as is.
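As a rough illustration of that hypothetical simplification (checkpointPath does not exist in the operator's status today; all field names below are assumptions):

```yaml
# Hypothetical rule that would only work if the Flink operator published a
# checkpoint path in its status; field names are assumptions for illustration.
statePreservation:
  rules:
    - aliasLabelName: resourcebinding.karmada.io/checkpoint-path
      jsonPath: "{ .jobStatus.checkpointStatus.checkpointPath }"
```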
Sure, that's amazing!
This sounds very challenging; just sharing my two cents:
Anyway, you can talk to the community and keep me updated.
Hi @RainbowMango @XiShanYongYe-Chang, I've been working on adding support for cluster failover by generalizing the BuildTaskOptions method. I haven't published a PR yet, but the changes are in my branch here. Please take a look and let me know if you have comments or concerns.
Emm, it appears that you want to apply the application failover behavior to the process of cluster failover. That's interesting! I've been thinking about how to iterate the cluster failover feature since the last release. I strongly believe this feature should be cautiously evaluated before being used in production; that's exactly why we disabled it by default in release-1.12. I will let you know if I come up with more detailed ideas or plans about the cluster failover feature. Before that, I will hesitate to push features related to cluster failover.
That makes sense! We're doing a bit more stress testing on our side in terms of the application failover, so until this feature is a bit more mature having the feature be off by default is fine.
The thinking here was that we already generate graceful eviction tasks for all resources that are evicted from a cluster due to a taint. Obviously a cluster can have a large variety of different resources on it which may not have failover configured, so they should not have any graceful eviction options related to failover. That was why I decided to generalize the BuildTaskOptions method. For cluster failover, is there a reason why we are apprehensive about re-using the method of how we preserve state via GracefulEvictionTasks? I may be missing context here.
Summary
Karmada’s scheduling logic runs on the assumption that resources that are scheduled and rescheduled are stateless. In some cases, users may desire to conserve a certain state so that applications can resume from where they left off in the previous cluster.
For CRDs dealing with data-processing (such as Flink or Spark), it can be particularly useful to restart applications from a previous checkpoint. That way applications can seamlessly resume processing data while avoiding double processing.
This feature aims to introduce a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.
Proposal
Iteration Tasks -- Part-1: Ensure the scheduler skips the clusters where the failover was triggered

- Make changes to binding controller and cluster binding controller to cleanup works from cluster in eviction task when purge mode is Immediately. (@mszacillo, Failover controllers now build eviction tasks for purgemode immediately #5881)
- … Graciously by default as a compromise. ??

Iteration Tasks -- Part-2: state preservation and feed

- Introduce StatePreservation to PropagationPolicy. (See the API design here) (@RainbowMango, Introduce StatePreservation to PropagationPolicy API #5885)
- Introduce PreservedLabelState to ResourceBinding. (See the API design here) (@RainbowMango, Introduce StatePreservation to PropagationPolicy API #5885)
- Build PreservedLabelState when triggering eviction. (@XiShanYongYe-Chang, Build PreservedLabelState when triggering evition in RB/CRB application controller #5887)
- Build PreservedLabelState when triggering eviction.
- Inject PreservedLabelState to the new clusters (failover to). (@XiShanYongYe-Chang, Inject preservedLabelState to the failover to clusters #5893)

Iteration Tasks -- Part-3: failover history

The failover history might be optional as we don't rely on it.
TBD: based on #5251
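To make the Part-2 items more concrete, here is a rough sketch of how the preserved state could surface on a ResourceBinding during eviction; the field placement is an assumption based on the task list above, not a confirmed API shape:

```yaml
# Assumed shape for illustration: the preserved key/value pairs ride along on the
# graceful eviction task and are later injected as labels onto the workload in
# the failover-target cluster.
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: example-binding
  namespace: default
spec:
  gracefulEvictionTasks:
    - fromCluster: member1
      purgeMode: Immediately
      preservedLabelState:
        resourcebinding.karmada.io/failover: "true"
```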
Related issues: