Add a label/annotation to the resource being rescheduled during failover #4969
Comments
It seems this issue is very similar to the |
Hi @Dyex719, thanks for your feedback. I have a few questions. Are you referring to labels being automatically added by the system? When Karmada reschedules, how does the system trigger the webhook or Kyverno to run? |
Hi @XiShanYongYe-Chang,
Yes, Karmada would add a label after failover so that the next time it is rescheduled it would have the label.
Kyverno would be deployed on the member clusters and is therefore called in the scheduling process. Kyverno would check if the label exists on the spec and if it does, read the last state accordingly. |
I'm sorry, I still don't understand the whole process. Hi @chaunceyjiang, do you understand the requirement? |
My understanding is that @Dyex719 wants a label to indicate that the current resource is undergoing failover. This is because he expects this transition state to be recognized by other third-party software. If I remember correctly, when a resource is transitioning, there will be a |
Yes, the cluster to be removed will be placed here: karmada/pkg/apis/work/v1alpha2/binding_types.go, lines 97 to 111 at d676996.
This should be more specific than a label. |
Hi @XiShanYongYe-Chang and @chaunceyjiang, as far as I understand, using the GracefulEvictionTasks would not work in some scenarios, for example when the eviction mode is immediate (see karmada/pkg/controllers/applicationfailover/rb_application_failover_controller.go, line 176 at d676996).
To decouple this, maybe we can add a field in the ResourceBindingStatus? I am including the time here as well, as that could be useful, but maybe it is not necessary.
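For illustration only, a field along these lines might look like the following; the type name, field names, and package are hypothetical sketches, not the actual Karmada API:

```go
package sketch // illustrative placement only, not the real Karmada API package

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// FailoverRecord is a hypothetical field that could be added to
// ResourceBindingStatus to record that (and when) a failover-driven
// reschedule happened, so that tooling such as a Kyverno policy on the
// member cluster can detect it and adjust the workload spec.
type FailoverRecord struct {
	// OriginCluster is the cluster the workload was evicted from.
	OriginCluster string `json:"originCluster"`
	// Reason records why the failover happened.
	Reason string `json:"reason,omitempty"`
	// StartTime is the time the failover was triggered; this is the
	// "time" mentioned above, which may or may not be necessary.
	StartTime metav1.Time `json:"startTime,omitempty"`
}
```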
|
Thanks, I understand what you mean by this field. One more question:
Do we need to pause and wait for the verification to complete before rescheduling? Or does it mean that the rescheduling logic can be executed synchronously, regardless of whether validation is being performed? |
Verification/Mutation is done on the member clusters and is performed after Karmada has rescheduled the job. The flow is:
So the pausing is done by Kyverno/the webhook on the member cluster, which I believe is blocking in nature. |
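To make that blocking behaviour concrete, here is a minimal sketch of the check a member-cluster mutating webhook (or an equivalent Kyverno mutate rule) would perform at admission time; the label key and spec field are hypothetical, and this is not actual Kyverno or Karmada code:

```go
package main

import "fmt"

// Hypothetical label that Karmada would stamp on a resource rescheduled
// because of failover (the key is illustrative only).
const failoverLabel = "example.karmada.io/rescheduled-on-failover"

// mutateOnFailover sketches what a member-cluster mutating webhook (or an
// equivalent Kyverno mutate rule) would do at admission time: if the
// failover label is present, patch the spec so that the application
// restores from its last saved state. Because admission is synchronous,
// the apply effectively "pauses" until this decision is made.
func mutateOnFailover(labels map[string]string, spec map[string]interface{}, lastCheckpointPath string) {
	if labels[failoverLabel] == "" {
		return // not a failover reschedule: admit the object unchanged
	}
	// The field name is a stand-in for whatever the workload expects,
	// e.g. an initial savepoint/checkpoint path for a Flink job.
	spec["restoreFrom"] = lastCheckpointPath
}

func main() {
	spec := map[string]interface{}{}
	labels := map[string]string{failoverLabel: "true"}
	mutateOnFailover(labels, spec, "s3p://checkpoints/previous-job-id/42")
	fmt.Println(spec)
}
```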
Thanks for your explanation, @Dyex719. It looks like the Karmada control plane doesn't need to be aware of the check or pause for it. I had previously thought the labels were added during rescheduling and cleaned up afterwards. In the flow you describe, when will these labels be cleaned up? |
We didn't think about a cleanup strategy; the idea was mainly to update the label with the latest time in case of multiple failures. The label is only added on rescheduling due to failure and would not be added if the job is being scheduled for the first time. What are your reasons for needing a cleanup operation? |
Thanks, I get it. |
Hi @XiShanYongYe-Chang, @chaunceyjiang, thanks for taking a look at this issue! At the moment we've been able to make a quick fix for this by altering the ResourceBindingStatus, as was previously suggested. We added a method for this. You can view the code here: master...mszacillo:karmada:failover-label, but please note it is not PR-ready and was mostly just a quick fix for testing on our end! |
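Purely as an illustration of the shape of such a quick fix (this is not the code from the linked branch; all names below are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// failoverRecord and bindingStatus are stand-ins for the hypothetical
// status field sketched earlier; they are NOT the real Karmada types.
type failoverRecord struct {
	OriginCluster string
	Reason        string
	StartTime     time.Time
}

type bindingStatus struct {
	FailoverHistory []failoverRecord
}

// recordFailover shows the kind of helper a failover controller could call
// when it evicts a workload from a failed cluster, before rescheduling it.
func recordFailover(s *bindingStatus, originCluster, reason string) {
	s.FailoverHistory = append(s.FailoverHistory, failoverRecord{
		OriginCluster: originCluster,
		Reason:        reason,
		StartTime:     time.Now().UTC(),
	})
}

func main() {
	var status bindingStatus
	recordFailover(&status, "member-cluster-1", "ApplicationFailure")
	fmt.Printf("%+v\n", status.FailoverHistory)
}
```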
Hi @mszacillo @Dyex719, thanks! If you don't mind my asking, which company are you from? Have you started using Karmada? |
No problem. We're from Bloomberg - at the moment we aren't using Karmada in production environments, but we are investigating the work necessary to get Karmada working for stateful application failover. Perhaps this is something we can discuss more during the community meeting. |
@mszacillo Thanks a lot~ |
I'm interested in this use case that helps stateful workloads, like FlinkDeployment, to resume processing.
@Dyex719 Can you share with us exactly which FlinkDeploymentSpec field Kyverno would mutate? In addition, as far as I know, there will be checkpoint storage; do you need to migrate the data across clusters? |
Hi @RainbowMango, Please refer to this section and the upgrade modes
The checkpoint storage will need to be replicated across data centers so that the state can be resumed from the last checkpoint. If this is supported no migration is needed. Essentially, the checkpoint storage should be independent of cluster failure. |
Thanks, the idea is impressive! |
Yup, that's exactly correct! There are a few small issues with how Flink works, though: Flink creates a new jobID every time a new job is scheduled, so when migration happens another new ID is created. This is a problem because the checkpoint to restore from is of the form s3p:///jobID/<checkpoint_number>. Here the jobID to restore from is the previous jobID before migration takes place. Since this previous jobID is not stored anywhere, we will need Karmada to carry this field over. |
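As a tiny illustration (not Flink or Karmada code; the path layout is simplified from the form quoted above) of why the previous jobID has to be carried over:

```go
package main

import "fmt"

// restorePath sketches how a restore path of the form
// s3p://.../<jobID>/<checkpoint_number> would be assembled. The jobID
// required here is the ID of the job *before* migration; Flink assigns a
// new jobID when the job is rescheduled, so unless Karmada preserves the
// old ID somewhere, this path cannot be reconstructed.
func restorePath(checkpointBase, previousJobID string, checkpointNumber int) string {
	return fmt.Sprintf("%s/%s/%d", checkpointBase, previousJobID, checkpointNumber)
}

func main() {
	// "s3p://checkpoints" stands in for whatever checkpoint storage is configured.
	fmt.Println(restorePath("s3p://checkpoints", "previous-job-id", 42))
}
```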
@RainbowMango Since there are a lot of moving parts here with varying degrees of urgency, would it be recommended for us to create a proposal document with all the requirements for supporting FlinkDeployment (and other stateful) failover? |
@Dyex719 Yeah, I can see the jobId (.status.jobStatus.jobId) on the FlinkDeployment, but I don't know much about it. |
@mszacillo Sure! It would be great to have a proposal to address all these things! I believe that the migration of a stateful application is a very challenging task, but it's something of great value. What I'm thinking is that perhaps there are tasks that can be done before the migration, such as scalable mechanisms to allow users to do some preparatory work, and some tasks that can be done after the migration, just like the approach we are talking about here. |
This is correct.
As you mentioned, the job ID is only present in the status, so it cannot be accessed by Kyverno because the job is not running yet. This is the main problem.
It is technically possible to add a static job ID to create a specific checkpoint path, but this is not ideal. Any sort of user error, like creating a new job or restarting from a checkpoint, may result in overwriting previous checkpoints, which is dangerous. There is some interest in the community in making the process of reading from a checkpoint easier, for example: https://issues.apache.org/jira/browse/FLINK-9043. However, this issue has been open for a long time. We will work on the proposal and can then hopefully talk about this later! |
/close This requirement has been addressed by #5788 and the feature has been included in release-1.12. |
@RainbowMango: Closing this issue. |
What would you like to be added:
We propose adding a label/annotation during failover so that webhooks like Kyverno can perform the necessary checks/changes before the job is rescheduled. We are also open to discussing other ideas and contributing back to the community.
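As a rough sketch of the idea (the label key, value format, and helper are hypothetical, not an existing Karmada API):

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Hypothetical label key; the actual key/annotation would be decided in the proposal.
const failoverTimestampLabel = "example.karmada.io/failover-timestamp"

// markFailover sketches the proposal: when Karmada reschedules a resource
// because of failover, it stamps a label on the resource so that webhooks
// such as Kyverno on the member cluster can detect the failover and adjust
// the spec (e.g. point a stateful job at its last saved state).
func markFailover(labels map[string]string) map[string]string {
	if labels == nil {
		labels = map[string]string{}
	}
	// A Unix timestamp is used because label values cannot contain colons.
	labels[failoverTimestampLabel] = strconv.FormatInt(time.Now().UTC().Unix(), 10)
	return labels
}

func main() {
	fmt.Println(markFailover(nil))
}
```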
Why is this needed:
Stateful applications may need to read the last saved state to resume processing after failover. This may involve a change in the spec so that the path to read from can be specified.
It would be useful to know when a failover happened so that stateful applications can perform the necessary checks/changes before restarting.
In our particular use case, we are migrating Flink applications using Karmada. The step-by-step process would be: