Define FairShare Score for Cohorts; Generalize to Hierarchical Case #4313
✅ Deploy Preview for kubernetes-sigs-kueue ready!
ack |
The code itself lgtm, but the impact of the change is non-obvious to me. Let me ask some clarifying questions.
Is this change transparent to flat cohorts? If so, why? I see all tests passing, but I wonder whether there are corner cases where we change behavior relative to the current logic, impacting users of FairSharing with flat cohorts.
What will remain missing to complete fair sharing with hierarchical cohorts? Is this PR adding some functionality which could already be tested in scheduler_test.go? If so then I believe it would nicely demonstrate the impact of the change. |
Yes, it does not change behavior in flat Cohorts. This is because we're solving for the case where there may be some lendable resource available, but not in the direct parent. In the flat case, it has to be available in the direct parent; otherwise the CQ should not be above its nominal quota anyway.
We need to use this function to help with admission and during preemption. The algorithm is more involved than described in the KEP: it is not as simple as finding the lowest DominantResource score and admitting that CQ, because the Cohort that the CQ is part of could already be out of balance, and admission to that CQ may make the situation worse.
Not yet. The scheduling and preemption changes are rather involved, and I plan to send them out as separate PRs. |
This is my focus for this PR: to make sure we don't break what is already working, or, if we fix something, that we have a release note. I think this is subtle. Looking at the old code:
func (r ResourceNode) calculateLendable() map[corev1.ResourceName]int64 {
	// Sum the subtree quota of every flavor, keyed by resource name.
	lendable := make(map[corev1.ResourceName]int64, len(r.SubtreeQuota))
	for fr, q := range r.SubtreeQuota {
		lendable[fr.Resource] += q
	}
	return lendable
}
So, IIUC, for a CQ this is equal to Nominal, based on kueue/pkg/cache/resource_node.go Line 163 in 260070f
In the new code we add:
func potentialAvailable(node hierarchicalResourceNode, fr resources.FlavorResource) int64 {
	r := node.getResourceNode()
	if !node.HasParent() {
		return r.SubtreeQuota[fr] // returned for cohort
	}
	// Own guaranteed quota plus whatever the parent could make available.
	available := r.guaranteedQuota(fr) + potentialAvailable(node.parentHRN(), fr)
	// A borrowing limit caps how far above its subtree quota the node can go.
	if borrowingLimit := r.Quotas[fr].BorrowingLimit; borrowingLimit != nil {
		maxWithBorrowing := r.SubtreeQuota[fr] + *borrowingLimit
		available = min(maxWithBorrowing, available)
	}
	return available
}
Now, assuming no borrowing limit, this equals
So, for the "cohort", the value was essentially
Code-wise the PR looks OK; I see no problem for the current semantics.
@gabesaba please double-check whether my reasoning that the results stay the same is correct. If I missed something, please flag it and follow up. |
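To make the flat-versus-hierarchical comparison concrete, here is a minimal sketch of the recursion quoted above. It is not the Kueue code: the `node` type, its field names, and all quota numbers are hypothetical, and it tracks a single flavor-resource so the maps collapse to plain integers.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a ClusterQueue or Cohort.
// It models one flavor-resource only.
type node struct {
	parent         *node
	subtreeQuota   int64  // quota available within this node's subtree
	guaranteed     int64  // part of subtreeQuota that is never lent to the parent
	borrowingLimit *int64 // nil means borrowing is not capped
}

// potentialAvailable follows the same shape as the function quoted in the
// review: the root returns its subtree quota, every other node adds its
// guaranteed quota to what its parent could provide, capped by the
// borrowing limit when one is set.
func potentialAvailable(n *node) int64 {
	if n.parent == nil {
		return n.subtreeQuota
	}
	available := n.guaranteed + potentialAvailable(n.parent)
	if n.borrowingLimit != nil {
		if maxWithBorrowing := n.subtreeQuota + *n.borrowingLimit; maxWithBorrowing < available {
			available = maxWithBorrowing
		}
	}
	return available
}

func main() {
	// Flat case: one Cohort whose subtree quota is the sum of its CQs' quotas.
	flatCohort := &node{subtreeQuota: 10}
	cqA := &node{parent: flatCohort, subtreeQuota: 4}
	fmt.Println(potentialAvailable(cqA)) // 10, same as the Cohort's subtree quota

	// Same CQ with a borrowing limit of 3: capped at 4 + 3.
	limit := int64(3)
	cqLimited := &node{parent: flatCohort, subtreeQuota: 4, borrowingLimit: &limit}
	fmt.Println(potentialAvailable(cqLimited)) // 7

	// Hierarchical case: capacity exists at the root but not in the direct parent.
	root := &node{subtreeQuota: 20}
	mid := &node{parent: root, subtreeQuota: 10}
	cqDeep := &node{parent: mid, subtreeQuota: 4}
	fmt.Println(potentialAvailable(cqDeep)) // 20, not the direct parent's 10
}
```

In this toy model, with no guaranteed quota and no borrowing limit, a CQ in a flat Cohort sees exactly the Cohort's subtree quota, which is consistent with the claim that flat Cohorts are unaffected. The values only diverge once there is a grandparent.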
LGTM label has been added. Git tree hash: 30c980145862c47284fe1bbcd8ffec7124ca5f68
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gabesaba, mimowo
root := node
for root.HasParent() {
	root = root.parentHRN()
}
nit: this could be simplified with getRootUnsafe, but feel free to leave it for a follow-up.
Ah no, this is the first-level parent; forget it.
What type of PR is this?
/kind feature
What this PR does / why we need it:
We define FairShare (DominantResourceShare) score for Cohorts.
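As a rough illustration of what a DominantResourceShare-style score measures, the sketch below takes, per resource, the ratio of borrowed usage to a lendable denominator and returns the largest one. This is an approximation for readers, not the code in this PR: the function name, the per-mille scaling, the string resource names, and the sample numbers are all assumptions.

```go
package main

import "fmt"

// dominantResourceShare returns the largest borrowed/lendable ratio
// (scaled by 1000) across resources, together with the resource that
// produced it. Resources with nothing borrowed or nothing lendable are
// skipped. Ties are broken by resource name so the result is deterministic.
func dominantResourceShare(borrowed, lendable map[string]int64) (int64, string) {
	var best int64
	var dominant string
	for name, b := range borrowed {
		l := lendable[name]
		if b <= 0 || l <= 0 {
			continue
		}
		ratio := b * 1000 / l
		if ratio > best || (ratio == best && name < dominant) {
			best = ratio
			dominant = name
		}
	}
	return best, dominant
}

func main() {
	// Hypothetical numbers: usage above nominal quota, and the capacity
	// used as the denominator for each resource.
	borrowed := map[string]int64{"cpu": 2, "memory": 8}
	lendable := map[string]int64{"cpu": 10, "memory": 16}

	share, resource := dominantResourceShare(borrowed, lendable)
	fmt.Println(share, resource) // 500 memory
}
```

The choice of denominator is what the rest of this description is about: the score is only meaningful when the two nodes being compared use the same notion of available capacity.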
We additionally support the case where a CQ or Cohort has some lendable resource available to it, but not in its direct parent. Rather than counting only lendable resources in the parent Cohort, we use the potentiallyAvailable function to see how much capacity is available to that Cohort, when it borrows from its parent. While this doesn't differentiate capacity available in the parent Cohort's subtree, versus capacity that the parent Cohort borrows, the FairShare score should only be used for local comparisons: comparing two children of the same node. Therefore, this will serve as a valid denominator.
Which issue(s) this PR fixes:
Part of #3759
Special notes for your reviewer:
No release note for this change, as we will do a single release note for the entire feature.
Does this PR introduce a user-facing change?