Add a label/annotation to the resource being rescheduled during failover #4969

Closed
Dyex719 opened this issue May 22, 2024 · 26 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@Dyex719 commented May 22, 2024

What would you like to be added:
We propose adding a label/annotation during failover so that webhooks like Kyverno can perform the necessary checks/changes before the job is rescheduled. We are also open to discussing other ideas and contributing back to the community.

Why is this needed:
Stateful applications may need to read the last saved state to resume processing after failover. This may involve a change in the spec so that the path to read from can be specified.
It would be useful to know when a failover happened so that stateful applications can perform the necessary checks/changes before restarting.

In our particular use case, we are migrating Flink applications using Karmada. The step-by-step process would be:

  1. Karmada deploys the FlinkDeployment CR to a member cluster
  2. The CR fails over due to a cluster failover / application failover
  3. The label/annotation would get added during this process (For example: "failover" : true)
  4. When Karmada reschedules the application to a different member cluster, a webhook like Kyverno could check for this label ("failover": true) and, only if it exists, mutate the spec so that the application restarts from the last state (a rough sketch of this check follows the list)
  5. Resume processing of the stateful application from the last state
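
To make step 4 concrete, here is a minimal sketch (in Go, using unstructured objects) of the kind of check a mutating webhook could perform on the member cluster. This is only an illustration, not Karmada or Kyverno code: the label key propagation.karmada.io/failover is a made-up placeholder, and the spec.job.* field paths assume the flink-kubernetes-operator FlinkDeployment CRD.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// mutateForFailover illustrates step 4: if the (hypothetical) failover label is
// present on the incoming FlinkDeployment, point the job at the last savepoint
// so it resumes from saved state instead of starting fresh.
func mutateForFailover(obj *unstructured.Unstructured, savepointPath string) error {
	if obj.GetLabels()["propagation.karmada.io/failover"] != "true" { // assumed label key
		return nil // first-time scheduling: leave the spec untouched
	}
	if err := unstructured.SetNestedField(obj.Object, savepointPath, "spec", "job", "initialSavepointPath"); err != nil {
		return err
	}
	return unstructured.SetNestedField(obj.Object, "savepoint", "spec", "job", "upgradeMode")
}

func main() {
	deployment := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "flink.apache.org/v1beta1",
		"kind":       "FlinkDeployment",
		"metadata": map[string]interface{}{
			"name":   "example",
			"labels": map[string]interface{}{"propagation.karmada.io/failover": "true"},
		},
		"spec": map[string]interface{}{"job": map[string]interface{}{}},
	}}
	if err := mutateForFailover(deployment, "s3p://checkpoints/last"); err != nil {
		panic(err)
	}
	fmt.Println(deployment.Object["spec"])
}

In practice this decision would live in a Kyverno policy rather than Go code; the sketch only shows the shape of the check.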
@Dyex719 added the kind/feature label on May 22, 2024
@chaunceyjiang (Member)

It seems this issue is very similar to the pause feature we have been discussing before. /cc @XiShanYongYe-Chang

#4688

#4421

@XiShanYongYe-Chang (Member)

Hi @Dyex719, thanks for your feedback.

I have a few questions. Are you referring to labels being automatically added by the system? When Karmada reschedules, how does the system trigger the webhook or Kyverno to run?

@Dyex719 (Author) commented May 27, 2024

Hi @XiShanYongYe-Chang,
Thanks for getting back to us!

Are you referring to the labels being automatically added in by the system?

Yes, Karmada would add a label after failover so that the next time it is rescheduled it would have the label.

how does the system trigger the webhook or Kverno to run?

Kyverno would be deployed on the member clusters and is therefore invoked when the resource is applied to a member cluster. Kyverno would check whether the label exists on the spec and, if it does, read the last state accordingly.

@XiShanYongYe-Chang (Member)

I'm sorry, I still don't understand the whole process.

Hi @chaunceyjiang do you understand the requirement?

@chaunceyjiang (Member)

My understanding is that @Dyex719 wants a label to indicate that the current resource is undergoing failover. This is because he expects this transition state to be recognized by other third-party software.

If I remember correctly, when a resource is transitioning, there will be a GracefulEvictionTasks in the derived ResourceBinding to indicate the tasks currently being transferred.

@XiShanYongYe-Chang (Member)

If I remember correctly, when a resource is transitioning, there will be a GracefulEvictionTasks in the derived ResourceBinding to indicate the tasks currently being transferred.

Yes, the cluster to be removed will be placed here:

// GracefulEvictionTasks holds the eviction tasks that are expected to perform
// the eviction in a graceful way.
// The intended workflow is:
// 1. Once the controller(such as 'taint-manager') decided to evict the resource that
// is referenced by current ResourceBinding or ClusterResourceBinding from a target
// cluster, it removes(or scale down the replicas) the target from Clusters(.spec.Clusters)
// and builds a graceful eviction task.
// 2. The scheduler may perform a re-scheduler and probably select a substitute cluster
// to take over the evicting workload(resource).
// 3. The graceful eviction controller takes care of the graceful eviction tasks and
// performs the final removal after the workload(resource) is available on the substitute
// cluster or exceed the grace termination period(defaults to 10 minutes).
//
// +optional
GracefulEvictionTasks []GracefulEvictionTask `json:"gracefulEvictionTasks,omitempty"`

This should be more specific than a label.

@Dyex719 (Author) commented May 28, 2024

Hi @XiShanYongYe-Chang and @chaunceyjiang,
Thank you for the comments.

As far as I understand, using the GracefulEvictionTasks would not work in some scenarios, for example when the eviction mode is immediate.

To decouple this, maybe we can add a field to the ResourceBindingStatus? I am including the timestamp here as well, since it could be useful, but maybe it is not necessary.

type ResourceBindingStatus struct {
	...
	// LastFailoverTime represents the latest timestamp when a workload was failed over.
	// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
	// +optional
	LastFailoverTime *metav1.Time `json:"lastFailoverTime,omitempty"`
	...
}
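
For illustration only, a failover controller could stamp such a field at the moment it triggers a failover. This small sketch uses a local stand-in struct, since the field does not exist in the Karmada API yet.

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// resourceBindingStatus is a local stand-in containing only the proposed field;
// the real ResourceBindingStatus lives in the Karmada work API.
type resourceBindingStatus struct {
	LastFailoverTime *metav1.Time `json:"lastFailoverTime,omitempty"`
}

// recordFailover stamps the proposed LastFailoverTime whenever a failover is
// triggered, so downstream webhooks can tell a rescheduled workload apart from
// a first-time scheduling.
func recordFailover(status *resourceBindingStatus) {
	now := metav1.Now()
	status.LastFailoverTime = &now
}

func main() {
	var status resourceBindingStatus
	recordFailover(&status)
	fmt.Println(status.LastFailoverTime.UTC().Format(time.RFC3339))
}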

@XiShanYongYe-Chang (Member)

Thanks, I understand what you mean by this field.

One more question,

before the job is rescheduled.

Do we need to pause and wait for the verification to complete before rescheduling? Or can the rescheduling logic proceed regardless of whether validation is still being performed?

@Dyex719 (Author) commented May 29, 2024

Verification/Mutation is done on the member clusters and is performed after Karmada has rescheduled the job. The flow is:

  1. Karmada reschedules the job from Cluster A to Cluster B
  2. Verification of label and reading from last state is done by Kyverno/Webhook on Cluster B
  3. Job is restored from last state on Cluster B.

So the pausing is done by Kyverno/the webhook on the member cluster, which I believe is blocking in nature.

@XiShanYongYe-Chang (Member)

Thanks for your explanation @Dyex719

It looks like the Karmada control plane doesn't need to be aware of the check or pause for it.

I had assumed the label would be added during rescheduling and cleaned up after rescheduling. In your description, when would the label be cleaned up?

@Dyex719 (Author) commented May 30, 2024

We didn't think about a cleanup strategy; the idea was mainly to update the label with the latest time in case of multiple failures.

The label is only added on rescheduling due to failure, and would not be added if the job is being scheduled for the first time. What are your reasons for needing a cleanup operation?

@XiShanYongYe-Chang (Member)

Thanks, I get it.
Let me invite more folks to help take a look.
/cc @chaunceyjiang @RainbowMango @whitewindmills @chaosi-zju

@mszacillo (Contributor)

Hi @XiShanYongYe-Chang, @chaunceyjiang, thanks for taking a look at this issue!

At the moment we've been able to make a quick fix for this by altering the ResourceBindingStatus as was previously suggested. We added a method updateFailoverStatus to the rb_application_failover_controller (it will also be added to the cluster failover controller), which appends a new condition to the ResourceBindingStatus indicating a previous failover.

In pkg/controllers/binding/common.go we check the failover condition, and if it is present we append a label to the Work being created during rescheduling. We would prefer using an actual field in the status rather than a condition, but for some reason I had difficulty getting the LastFailoverTime status to update correctly. We will need to investigate this.
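
This is not the actual branch linked below, just a rough sketch of that shape using the standard condition helpers; the condition type, reason, and label key here are made-up placeholders.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Placeholder names; a real controller would define its own constants.
	conditionTypeFailover = "ResourceFailover"
	failoverLabelKey      = "resourcebinding.karmada.io/failover"
)

// markFailover is roughly what an updateFailoverStatus helper could do:
// append a condition to the binding status recording that a failover happened.
func markFailover(conditions *[]metav1.Condition, reason string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    conditionTypeFailover,
		Status:  metav1.ConditionTrue,
		Reason:  reason,
		Message: "workload was rescheduled due to failover",
	})
}

// labelsForWork is roughly what the check in common.go could do: if the
// failover condition is set, propagate a label onto the Work created during
// rescheduling so the member cluster can see the transition.
func labelsForWork(conditions []metav1.Condition, labels map[string]string) map[string]string {
	if meta.IsStatusConditionTrue(conditions, conditionTypeFailover) {
		if labels == nil {
			labels = map[string]string{}
		}
		labels[failoverLabelKey] = "true"
	}
	return labels
}

func main() {
	var conditions []metav1.Condition
	markFailover(&conditions, "ApplicationFailover")
	fmt.Println(labelsForWork(conditions, nil)) // map[resourcebinding.karmada.io/failover:true]
}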

You can view the code here: master...mszacillo:karmada:failover-label, but please note this is not PR ready and was mostly just a quick fix for testing on our end!

@XiShanYongYe-Chang (Member)

Hi @mszacillo @Dyex719, thanks.

If I may ask, which company are you from? Have you started using Karmada?

@mszacillo (Contributor)

No problem. We're from Bloomberg - at the moment we aren't using Karmada in production environments, but we are investigating the work necessary to get Karmada working for stateful application failover. Perhaps this is something we can discuss more during the community meeting.

@XiShanYongYe-Chang (Member)

@mszacillo Thanks a lot~

@RainbowMango (Member)

I'm interested in this use case that helps stateful workloads, like FlinkDeployment, to resume processing.

When Karmada reschedules the application to a different member cluster, a webhook like Kyverno could mutate the spec by checking for this label ("failover" : true) and restart the application from the last state only if this label exists

@Dyex719 Can you share with us exactly which FlinkDeploymentSpec field Kyverno would mutate?

In addition, as far as I know, there will be checkpoint storage; do you need to migrate that data across clusters?

@Dyex719 (Author) commented Jun 4, 2024

Hi @RainbowMango,

Please refer to this section and the upgrade modes
There are a few fields that would be mutated by Kyverno:

initialSavepointPath: "desired checkpoint path to be resumed from (s3p://)"
upgradeMode: savepoint
state: running

The checkpoint storage will need to be replicated across data centers so that the state can be resumed from the last checkpoint. If this is supported no migration is needed. Essentially, the checkpoint storage should be independent of cluster failure.

@RainbowMango (Member)

Thanks, the idea is impressive!
Then, Kyverno won't be involved in complex logic, like checking or syncing checkpoints and so on (that's my concern).
Kyverno just needs to be aware that the FlinkDeployment being created is migrating from another cluster, and it will then adjust accordingly based on the configuration (like labels) of the FlinkDeployment itself.

@Dyex719 (Author) commented Jun 5, 2024

Yup, that's exactly correct! There are a few small issues with how Flink works, though:

Flink creates a new jobID every time a new job is scheduled, so when migration happens another new ID is created. This is a problem because the checkpoint to restore from is of the form s3p:///jobID/<checkpoint_number>, where the jobID to restore from is the previous jobID from before migration takes place.

Since this previous jobID is not stored anywhere, we will need Karmada to carry this field over.
If you think Karmada should support stateful applications like this, we can talk about how to handle such cases. I created #5006 to discuss a generic framework for such stateful applications.
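
To illustrate what carrying the field over could look like: if Karmada (or any controller) preserved the previous jobID in an annotation, the webhook on the new cluster could rebuild the savepoint path from it. The annotation key, bucket layout, and checkpoint name below are assumptions, not existing Karmada behavior.

package main

import (
	"fmt"
	"path"
)

// previousJobIDAnnotation is a hypothetical annotation that would be carried
// over from the old cluster, since Flink only exposes the jobID in
// .status.jobStatus.jobId of the job being evicted.
const previousJobIDAnnotation = "flinkdeployment.karmada.io/previous-job-id"

// savepointPathFor rebuilds the checkpoint path of the previous job, following
// the s3p://<base>/<jobID>/<checkpoint> layout described above.
func savepointPathFor(annotations map[string]string, base, checkpoint string) (string, bool) {
	jobID := annotations[previousJobIDAnnotation]
	if jobID == "" {
		return "", false // first run: nothing to resume from
	}
	return "s3p://" + path.Join(base, jobID, checkpoint), true
}

func main() {
	annotations := map[string]string{previousJobIDAnnotation: "a1b2c3d4"}
	if p, ok := savepointPathFor(annotations, "flink-checkpoints", "chk-42"); ok {
		fmt.Println(p) // s3p://flink-checkpoints/a1b2c3d4/chk-42
	}
}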

@mszacillo (Contributor)

@RainbowMango Since there are a lot of moving parts here with varying degrees of urgency, would it be recommended for us to create a proposal document with all the requirements for supporting FlinkDeployment (and other stateful) failover?

@RainbowMango (Member)

@Dyex719 Yeah, I can see the jobId (.status.jobStatus.jobId) on the FlinkDeployment, but I don't know much about it.
Each time FlinkDeployment launches a job, it will assign a unique ID for it, which happens to be used as the checkpoint storage path, resulting in this checkpoint being usable only by the current job. Please correct me if I'm wrong.
Are there any improvements that could be made to the Flink operator? Like allowing the user to specify a job ID or a specific checkpoint path. Just a guess. What do you think?

@RainbowMango (Member)

@mszacillo Sure! It would be great to have a proposal to address all these things!

I believe that the migration of a stateful application is a very challenging task, but it's something of great value. What I'm thinking is that perhaps there are tasks that can be done before the migration, such as scalable mechanisms to allow users to do some preparatory work, and some tasks that can be done after the migration, just like the approach we are talking about here.
Thank you in advance.
Here is the proposal template you might need.

@Dyex719 (Author) commented Jun 6, 2024

@RainbowMango

Each time FlinkDeployment launches a job, it will assign a unique ID for it, which happens to be used as the checkpoint storage path, resulting in this checkpoint being usable only by the current job.

This is correct.

Yeah, I can see the #4905 (comment)

As you mentioned, the job ID is only present in the status so this cannot be accessed by Kyverno as the job is not running yet. This is the main problem.

Are there any improvements that could be made to the Flink operator? Like allowing the user to specify a job ID or a specific checkpoint path?

It is technically possible to add a static job ID to create a specific checkpoint path, but this is not ideal. Any sort of user error, like creating a new job or restarting from a checkpoint, may result in overwriting previous checkpoints, which is dangerous.

There is some interest in the community in making the process of reading from a checkpoint easier, for example: https://issues.apache.org/jira/browse/FLINK-9043. However, this issue has been open for a long time.

We will work on the proposal and can then hopefully talk about this later!

@RainbowMango (Member)

/close

This requirement has been addressed by #5788 and the feature has been included in release-1.12.

@karmada-bot (Collaborator)

@RainbowMango: Closing this issue.

In response to this:

/close

This requirement has been addressed by #5788 and the feature has been included in release-1.12.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
