
Proposal for multiple pod template support #5085

Open
mszacillo wants to merge 1 commit into master from the crd-scheduling-proposal branch

Conversation

mszacillo
Contributor

What type of PR is this?

/kind design

What this PR does / why we need it:
Described in document.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Proposal doc for CRD scheduling improvements. Posting proposal following discussion in community meeting.

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/design Categorizes issue or PR as related to design. label Jun 22, 2024
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign garrybest for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 22, 2024
@codecov-commenter

codecov-commenter commented Jun 23, 2024

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 28.29%. Comparing base (c8acebc) to head (c026200).
Report is 34 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5085      +/-   ##
==========================================
+ Coverage   28.21%   28.29%   +0.08%     
==========================================
  Files         632      632              
  Lines       43568    43635      +67     
==========================================
+ Hits        12291    12345      +54     
- Misses      30381    30388       +7     
- Partials      896      902       +6     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 28.29% <ø> (+0.08%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@mszacillo
Contributor Author

@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.

@RainbowMango
Member

@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.

I'm OK with it, since this is the only topic for this meeting. I'll send a notice to the mailing list and Slack channel and gather feedback.

@mszacillo
Contributor Author

mszacillo commented Jun 25, 2024

@RainbowMango Thanks so much for reviewing this with me during the community meeting!

Just to add more context here: I heard that there is also work being done to support CRDs with multiple pod templates (like FlinkDeployment or TensorFlow jobs, for instance). For a FlinkDeployment, we cannot have replicas for the same job scheduled on different clusters - meaning we either schedule all pods on one cluster or do not schedule at all. Once we schedule the CRD to a member cluster, all pod scheduling will be taken care of by the Flink operator.

Thinking about this more, I think it makes more sense to approach this by using one of your suggestions, which was to make Components the top-level API definition and have replicas defined within each individual component. If we need all replicas to be scheduled on one cluster, we can set the spreadConstraints on the related PropagationPolicy.
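Purely as an illustration of that suggestion (the names below are hypothetical, not the actual Karmada API), a component-level definition with per-component replicas might look something like this:

```go
// Hypothetical sketch only; field names are illustrative, not the actual Karmada API.
package v1alpha1

import corev1 "k8s.io/api/core/v1"

// Component describes a single pod template of a multi-template resource,
// for example the JobManager or TaskManager of a FlinkDeployment.
type Component struct {
	// Name identifies the component within the resource (e.g. "jobmanager").
	Name string `json:"name"`

	// Replicas is the number of replicas requested for this component.
	Replicas int32 `json:"replicas"`

	// ReplicaRequirements is the per-replica resource request (CPU, memory, ...)
	// that the scheduler and accurate estimator would consume.
	ReplicaRequirements corev1.ResourceList `json:"replicaRequirements,omitempty"`
}
```

Keeping all components of a resource on one member cluster would then be expressed through the PropagationPolicy's spreadConstraints (for example, a cluster-level constraint with maxGroups set to 1).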

@whitewindmills
Member

/assign

@mszacillo mszacillo force-pushed the crd-scheduling-proposal branch 3 times, most recently from 646c98e to 3f2e4c2 on July 9, 2024 15:03
…uling

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@mszacillo mszacillo changed the title Proposal for accurate estimator improvement for CRD scheduling Proposal for multiple pod template support Jul 12, 2024
. . .

// The total number of replicas scheduled by this resource. Each replica will be represented by exactly one component of the resource.
TotalReplicas int32 `json:"totalReplicas,omitempty"`
Contributor Author

I've included this field as a replacement for the existing Replicas field, which is used very frequently within the Karmada codebase. Even though we are introducing the concept of components, Karmada will still ultimately be scheduling replicas - so I believe this slight refactor will make the implementation of this change less complex. This again assumes that resources with more than one component will be scheduled on the same cluster.
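To illustrate the intent (a hypothetical helper, building on the Component sketch above, not part of the proposal itself), TotalReplicas would simply be the sum of the per-component replica counts:

```go
// totalReplicas sums the replica counts of every component; each replica
// is represented by exactly one component instance of the resource.
// (Illustrative helper, assuming the hypothetical Component type sketched earlier.)
func totalReplicas(components []Component) int32 {
	var total int32
	for _, c := range components {
		total += c.Replicas
	}
	return total
}
```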


For the proposed implementation, please refer to the next section.

### Accurate Estimator Changes
Contributor Author

Apologies if the explanation of the implementation proposal is a little dense. I can format this to be a little clearer, but otherwise please let me know if you have any questions that I can clarify.

@mszacillo
Contributor Author

mszacillo commented Jul 12, 2024

Hi @RainbowMango @whitewindmills, I've gone ahead and made an update to the proposal doc with a more precise explanation of the implementation details. Please let me know if you have any comments or questions - and apologies in advance if this is a little dense; perhaps I can type this in LaTeX and attach an image of the algorithm-specific sections. :)

Quick note - there still needs to be some work done on how multiple components would work from a customized resource modeling perspective. I'll try to add a section on that this weekend.

During maxReplica estimation, we will take the sum of all resource requirements for the CRD.

Total_CPU = component_1.replicas * (component_1.cpu) + component_2.replicas * (component_2.cpu) = (1 * 1) + (2 * 1) = 3 CPU.
Total_Memory = component_1.replicas * (component_1.memory) + component_2.replicas * (component_2.memory) = (1 * 2GB) + (2 * 1GB) = 8GB.


Isn't 4G?

Contributor Author

Oops, yes :)
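For concreteness, a minimal, self-contained sketch of that aggregation (illustrative struct and values only, using the corrected 4GB figure):

```go
// Illustrative only: aggregate the CRD's total resource requirements by
// summing replicas * per-replica request over all components.
package main

import "fmt"

// componentRequest is a hypothetical, simplified shape used just for this example.
type componentRequest struct {
	replicas int64
	cpuCores int64 // cores requested per replica
	memGB    int64 // GB requested per replica
}

func main() {
	components := []componentRequest{
		{replicas: 1, cpuCores: 1, memGB: 2}, // e.g. a JobManager-like component
		{replicas: 2, cpuCores: 1, memGB: 1}, // e.g. TaskManager-like components
	}
	var totalCPU, totalMemGB int64
	for _, c := range components {
		totalCPU += c.replicas * c.cpuCores
		totalMemGB += c.replicas * c.memGB
	}
	// Prints: Total_CPU=3 cores, Total_Memory=4GB
	fmt.Printf("Total_CPU=%d cores, Total_Memory=%dGB\n", totalCPU, totalMemGB)
}
```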

@Monokaix

So this design doesn't consider replica division, and it is also not a completely accurate calculation because the fragmentation issue is unavoidable - am I right?

@mszacillo
Contributor Author

So this design doesn't consider replica division, and it is also not a completely accurate calculation because the fragmentation issue is unavoidable - am I right?

Yes, this approach would not consider divided replicas, based on the use-cases we've compiled (#5115). At the moment, most CRDs get scheduled to a single cluster rather than spread across multiple.

In terms of the precision of the calculation, you're right that it's not completely accurate. It's an estimate of how many CRDs could be fully packed on the destination cluster. However, we do guarantee that at least 1 CRD can be scheduled on a member, so there should never be a scenario where we schedule a CRD to a member cluster that does not have sufficient resources to hold it.
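As a rough sketch of what "how many CRDs could be fully packed" means in practice (a hypothetical helper using whole cores/GB for simplicity, not the actual estimator code):

```go
// maxFullInstances estimates how many complete CRD instances (all components,
// all replicas) fit on a cluster, given the aggregated totals from above and
// the cluster's allocatable resources. Scheduling is only considered viable
// when the result is at least 1.
func maxFullInstances(allocCPU, allocMemGB, totalCPU, totalMemGB int64) int64 {
	if totalCPU <= 0 || totalMemGB <= 0 {
		return 0
	}
	byCPU := allocCPU / totalCPU
	byMem := allocMemGB / totalMemGB
	if byCPU < byMem {
		return byCPU
	}
	return byMem
}
```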

@RainbowMango
Member

Hi @mszacillo I'm going to take two weeks off and might be slow to respond during this time.

I believe this feature is extremely important, and I'll be focusing on it once I get back. Given that this feature involves both controllers and schedulers, it's not easy to come up with an ideal solution in a short time.

By the way, I guess that, with the help of the default resource interpreter for FlinkDeployment (thanks for your contribution, by the way), this is probably not a blocker for you - am I right? Do you think the Application Failover feature has a higher priority than this?

@mszacillo
Contributor Author

Hi @RainbowMango, thanks for the heads up and enjoy your time off!

this is probably not a blocker for you - am I right? Do you think the Application Failover feature has a higher priority than this?

Yes, this is currently not a blocker for us. We can get by with the existing maxReplica estimation while we determine a solution for multiple podTemplate support, and we are instead focusing on the failover feature enhancements. For our MVP using Karmada we need two things:

  1. Karmada should attach a label / annotation to the workload if it fails over to another cluster: Add a label/annotation to the resource being rescheduled during failover #4969. This is currently implemented, and we're doing some testing and internal code review before opening a PR upstream. This change will also address an existing bug that was detailed in fix: do not reschedules the workload to the same unhealthy cluster when application failover enabled #4215, by guaranteeing that Karmada will not reschedule workloads to the cluster they failed over from.

  2. Address the limitation in the FederatedResourceQuota discussed in FederatedResourceQuota should be failover friendly #5179, for which I am currently implementing a workaround.

After we've completed the above tickets, our priorities will be publishing the implementation for the later steps of the failover history proposal, and then working on multiple pod template support.
