
Proposal for multiple pod template support #5085

Open
mszacillo wants to merge 1 commit into master from the crd-scheduling-proposal branch

Conversation

mszacillo
Contributor

What type of PR is this?

/kind design

What this PR does / why we need it:
Described in document.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Proposal doc for CRD scheduling improvements. Posting proposal following discussion in community meeting.

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/design Categorizes issue or PR as related to design. label Jun 22, 2024
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign garrybest for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 22, 2024
@codecov-commenter

codecov-commenter commented Jun 23, 2024

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 28.29%. Comparing base (c8acebc) to head (c026200).
Report is 34 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5085      +/-   ##
==========================================
+ Coverage   28.21%   28.29%   +0.08%     
==========================================
  Files         632      632              
  Lines       43568    43635      +67     
==========================================
+ Hits        12291    12345      +54     
- Misses      30381    30388       +7     
- Partials      896      902       +6     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 28.29% <ø> (+0.08%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@mszacillo
Contributor Author

@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.

@RainbowMango
Member

@RainbowMango - I've added this proposal to the discussion section of the community meeting tomorrow. By chance, would it be possible to move the meeting 30 minutes earlier? I've got a conflict at the moment.

I'm OK with it, since this is the only topic for this meeting. I'll send a notice to the mailing list and Slack channel and gather feedback.

@mszacillo
Contributor Author

mszacillo commented Jun 25, 2024

@RainbowMango Thanks so much for reviewing this with me during the community meeting!

Just to add more context here: I heard that there is also work being done to support CRDs with multiple pod templates (like FlinkDeployment or TensorFlow jobs, for instance). For a FlinkDeployment, we cannot have replicas for the same job scheduled on different clusters - meaning we either schedule all pods on one cluster or do not schedule at all. Once we schedule the CRD to a member cluster, all pod scheduling will be taken care of by the Flink operator.

Thinking about this more, I think it makes more sense to approach this by using one of your suggestions, which was to make Components the top-level API definition and have replicas defined within each individual component. If we need all replicas to be scheduled on one cluster, we can set the spreadConstraints on the related PropagationPolicy.
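Purely as an illustration of that suggestion (the names below are hypothetical, not the actual Karmada API), a component-level definition with per-component replicas might look something like this:

```go
// Hypothetical sketch only; field names are illustrative, not the actual Karmada API.
package v1alpha1

import corev1 "k8s.io/api/core/v1"

// Component describes a single pod template of a multi-template resource,
// for example the JobManager or TaskManager of a FlinkDeployment.
type Component struct {
	// Name identifies the component within the resource (e.g. "jobmanager").
	Name string `json:"name"`

	// Replicas is the number of replicas requested for this component.
	Replicas int32 `json:"replicas"`

	// ReplicaRequirements is the per-replica resource request (CPU, memory, ...)
	// that the scheduler and accurate estimator would consume.
	ReplicaRequirements corev1.ResourceList `json:"replicaRequirements,omitempty"`
}
```

Keeping all components of a resource on one member cluster would then be expressed through the PropagationPolicy's spreadConstraints (for example, a cluster-level constraint with maxGroups set to 1).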

@whitewindmills
Member

/assign

@mszacillo mszacillo force-pushed the crd-scheduling-proposal branch 3 times, most recently from 646c98e to 3f2e4c2 on July 9, 2024 15:03
…uling

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@mszacillo mszacillo changed the title Proposal for accurate estimator improvement for CRD scheduling Proposal for multiple pod template support Jul 12, 2024
. . .

// The total number of replicas scheduled by this resource. Each replica will be represented by exactly one component of the resource.
TotalReplicas int32 `json:"totalReplicas,omitempty"`
Contributor Author

I've included this field as a replacement for the existing Replicas field, which is used very frequently within the Karmada codebase. Even though we are introducing the concept of components, Karmada will still ultimately be scheduling replicas - so I believe this slight refactor will make the implementation of this change less complex. This again assumes that resources with more than one component will be scheduled on the same cluster.
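To illustrate the intent (a hypothetical helper, building on the Component sketch above, not part of the proposal itself), TotalReplicas would simply be the sum of the per-component replica counts:

```go
// totalReplicas sums the replica counts of every component; each replica
// is represented by exactly one component instance of the resource.
// (Illustrative helper, assuming the hypothetical Component type sketched earlier.)
func totalReplicas(components []Component) int32 {
	var total int32
	for _, c := range components {
		total += c.Replicas
	}
	return total
}
```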


For the proposed implementation, please refer to the next section.

### Accurate Estimator Changes
Contributor Author

Apologies if the explanation of the implementation proposal is a little dense. I can format this to be a little clearer, but otherwise please let me know if you have any questions that I can clarify.

@mszacillo
Contributor Author

mszacillo commented Jul 12, 2024

Hi @RainbowMango @whitewindmills, I've gone ahead and made an update to the proposal doc with a more precise explanation of the implementation details. Please let me know if you have any comments or questions - and apologies in advance if this is a little dense; perhaps I can type this in LaTeX and attach an image of the algorithm-specific sections. :)

Quick note - there still needs to be some work done on how multiple components would work from a customized resource modeling perspective. I'll try to add a section on that this weekend.

During maxReplica estimation, we will take the sum of all resource requirements for the CRD.

Total_CPU = component_1.replicas * (component_1.cpu) + component_2.replicas * (component_2.cpu) = (1 * 1) + (2 * 1) = 3 CPU.
Total_Memory = component_1.replicas * (component_1.memory) + component_2.replicas * (component_2.memory) = (1 * 2GB) + (2 * 1GB) = 8GB.


Isn't 4G?

Contributor Author

Oops, yes :)
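For concreteness, a minimal, self-contained sketch of that aggregation (illustrative struct and values only, using the corrected 4GB figure):

```go
// Illustrative only: aggregate the CRD's total resource requirements by
// summing replicas * per-replica request over all components.
package main

import "fmt"

// componentRequest is a hypothetical, simplified shape used just for this example.
type componentRequest struct {
	replicas int64
	cpuCores int64 // cores requested per replica
	memGB    int64 // GB requested per replica
}

func main() {
	components := []componentRequest{
		{replicas: 1, cpuCores: 1, memGB: 2}, // e.g. a JobManager-like component
		{replicas: 2, cpuCores: 1, memGB: 1}, // e.g. TaskManager-like components
	}
	var totalCPU, totalMemGB int64
	for _, c := range components {
		totalCPU += c.replicas * c.cpuCores
		totalMemGB += c.replicas * c.memGB
	}
	// Prints: Total_CPU=3 cores, Total_Memory=4GB
	fmt.Printf("Total_CPU=%d cores, Total_Memory=%dGB\n", totalCPU, totalMemGB)
}
```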

@Monokaix

So this design doesn't consider replica division, and it is also not a completely accurate calculation because the fragmentation issue is unavoidable - am I right?

@mszacillo
Contributor Author

So this design doesn't consider replica division, and it is also not a completely accurate calculation because the fragmentation issue is unavoidable - am I right?

Yes, this approach would not consider divided replicas, based on the use-cases we've compiled (#5115). At the moment, most CRDs get scheduled to a single cluster rather than spread across multiple.

In terms of the precision of the calculation, you're right that it's not completely accurate. It's an estimate of how many CRDs could be fully packed on the destination cluster. However, we do guarantee that at least 1 CRD can be scheduled on a member, so there should never be a scenario where we schedule a CRD to a member cluster that does not have sufficient resources to hold it.
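As a rough sketch of what "how many CRDs could be fully packed" means in practice (a hypothetical helper using whole cores/GB for simplicity, not the actual estimator code):

```go
// maxFullInstances estimates how many complete CRD instances (all components,
// all replicas) fit on a cluster, given the aggregated totals from above and
// the cluster's allocatable resources. Scheduling is only considered viable
// when the result is at least 1.
func maxFullInstances(allocCPU, allocMemGB, totalCPU, totalMemGB int64) int64 {
	if totalCPU <= 0 || totalMemGB <= 0 {
		return 0
	}
	byCPU := allocCPU / totalCPU
	byMem := allocMemGB / totalMemGB
	if byCPU < byMem {
		return byCPU
	}
	return byMem
}
```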

@RainbowMango
Member

Hi @mszacillo I'm going to take two weeks off and might be slow to respond during this time.

I believe this feature is extremely important, and I'll be focusing on it once I get back. Given that this feature involves both controllers and schedulers, it's not easy to come up with an ideal solution in a short time.

By the way, I guess that, with the help of the default resource interpreter for FlinkDeployment (thanks for your contribution, by the way), this is probably not a blocker for you - am I right? Do you think the Application Failover feature has a higher priority than this?

@mszacillo
Contributor Author

Hi @RainbowMango, thanks for the heads up and enjoy your time off!

this is probably not a blocker for you - am I right? Do you think the Application Failover feature has a higher priority than this?

Yes, this is currently not a blocker for us. We can get by with the existing maxReplica estimation while we determine a solution for multiple podTemplate support, and we are instead focusing on the failover feature enhancements. For our MVP using Karmada we need two things:

  1. Karmada should attach a label / annotation to the workload if it fails over to another cluster: Add a label/annotation to the resource being rescheduled during failover #4969. This is currently implemented, and we're doing some testing and internal code review before opening a PR upstream. This change will also address an existing bug that was detailed in fix: do not reschedules the workload to the same unhealthy cluster when application failover enabled #4215, by guaranteeing that Karmada will not reschedule workloads to the cluster they failed over from.

  2. Address the limitation in the FederatedResourceQuota discussed in FederatedResourceQuota should be failover friendly #5179, for which I am currently implementing a workaround.

After we've completed the above tickets, our priorities will be publishing the implementation for the later steps of the failover history proposal, and then working on multiple pod template support.
