Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define FairShare Score for Cohorts; Generalize to Hierarchical Case #4313

Merged
merged 1 commit into from
Feb 21, 2025

Conversation

gabesaba
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

We define FairShare (DominantResourceShare) score for Cohorts.

We additionally support the case where a CQ or Cohort has some lendable resource available to it, but not in its direct parent. Rather than counting only lendable resources in the parent Cohort, we use the potentiallyAvailable function to see how much capacity is available to that Cohort, when it borrows from its parent. While this doesn't differentiate capacity available in the parent Cohort's subtree, versus capacity that the parent Cohort borrows, the FairShare score should only be used for local comparisons: comparing two children of the same node. Therefore, this will serve as a valid denominator.

Which issue(s) this PR fixes:

Part of #3759

Special notes for your reviewer:

No release note for this change, as we will do a single release note for the entire feature.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 19, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 19, 2025
Copy link

netlify bot commented Feb 19, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit ca44b8a
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67b5b47728e7b80008893169
😎 Deploy Preview https://deploy-preview-4313--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mimowo
Copy link
Contributor

mimowo commented Feb 19, 2025

ack

@mimowo
Copy link
Contributor

mimowo commented Feb 19, 2025

The code itself lgtm, but the impact of the change is non-obvious to me. Let me ask some clarifying questions.

Rather than counting only lendable resources in the parent Cohort, we use the potentiallyAvailable function

Is this change transparent to the flat cohorts? If so, why? I see all tests passing , but wondering if there might be some corner cases we change relative to the current logic, impacting users of FairSharing with flat cohorts.

No release note for this change, as we will do a single release note for the entire feature.

What will remain missing to complete fair sharing with hierarchical cohorts?

Is this PR adding some functionality which could already be tested in scheduler_test.go? If so then I believe it would nicely demonstrate the impact of the change.

@gabesaba
Copy link
Contributor Author

The code itself lgtm, but the impact of the change is non-obvious to me. Let me ask some clarifying questions.

Rather than counting only lendable resources in the parent Cohort, we use the potentiallyAvailable function

Is this change transparent to the flat cohorts? If so, why? I see all tests passing , but wondering if there might be some corner cases we change relative to the current logic, impacting users of FairSharing with flat cohorts.

Yes, it does not change behavior in flat Cohorts. This is because we're solving for the case where there may be some lendable resource available, but not in the direct parent. In the flat case - it has to be available in the direct parent, or the CQ should not be above its nominal quota anyway

No release note for this change, as we will do a single release note for the entire feature.

What will remain missing to complete fair sharing with hierarchical cohorts?

We need to use this function to help with admission, and during preemption. The algorithm is more involved than described in the KEP - it is not as simple as finding the lowest DominantResource score, and admitting that CQ - as in that case the Cohort that the CQ is part of could already be out of balance, and admission to that CQ may make the situation worse.

Is this PR adding some functionality which could already be tested in scheduler_test.go? If so then I believe it would nicely demonstrate the impact of the change.

Not yet. The scheduling and preemption changes are rather involved, and I plan to send them out as separate PRs

@mimowo
Copy link
Contributor

mimowo commented Feb 21, 2025

but not in the direct parent. In the flat case - it has to be available in the direct parent, or the CQ should not be above its nominal quota anyway

This is my focus for this PR: to make sure we don't break what is already working, or if we fix, then we have a release not. I think this is subtle. Looking at the old code:

func (r ResourceNode) calculateLendable() map[corev1.ResourceName]int64 {
        lendable := make(map[corev1.ResourceName]int64, len(r.SubtreeQuota))
        for fr, q := range r.SubtreeQuota {
                lendable[fr.Resource] += q
        }
        return lendable
}

So, IIUC for a CQ this is equal to Nominal based on

cq.resourceNode.SubtreeQuota[fr] = quota.Nominal

In the new code we have we add potentialAvailable, so :

func potentialAvailable(node hierarchicalResourceNode, fr resources.FlavorResource) int64 {
	r := node.getResourceNode()
	if !node.HasParent() {
		return r.SubtreeQuota[fr]    // returned for cohort
	}
	available := r.guaranteedQuota(fr) + potentialAvailable(node.parentHRN(), fr)
	if borrowingLimit := r.Quotas[fr].BorrowingLimit; borrowingLimit != nil {
		maxWithBorrowing := r.SubtreeQuota[fr] + *borrowingLimit
		available = min(maxWithBorrowing, available)
	}
	return available
}

now assuming no borrowing limit this equals guaranteedQuota + potentialAvailable in the whole subtree, so it may be way above nominal. So, IIUC calculateLendable changes its semantic for CQ (leaf nodes), but this is not that relevant because for DRF we always use calculateLendable on the parent: https://github.com/kubernetes-sigs/kueue/pull/4313/files#diff-e822642151fb9037f5ada06309828ea5c2ecb4a3cb9619ae445ce138622511a4L64

So, for the "cohort" the values was essentially SubtreeQuota which is by definition the "lendable" quota in the subtree. Similarly, in that case the potentialAvailable returns SubtreeQuota for the cohort in this line with "returned for cohort" comment. So, I think this is ok.

The code-wise the PR looks ok, seeing no problem for the current semantics:
/lgtm
/approve

@gabesaba please double check if my reasoning about making sure the results are the same is correct. If I missed something please flag and follow up.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 30c980145862c47284fe1bbcd8ffec7124ca5f68

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabesaba, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2025
Comment on lines +88 to +91
root := node
for root.HasParent() {
root = root.parentHRN()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could be simplified with getRootUnsafe, but feel free for follow up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah no this is first level parent, forget

@k8s-ci-robot k8s-ci-robot merged commit 610b247 into kubernetes-sigs:main Feb 21, 2025
19 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.11 milestone Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants