From e28793ebb8219e57536315521ef978ea26764fd6 Mon Sep 17 00:00:00 2001 From: B1F030 <646337422@qq.com> Date: Fri, 24 Nov 2023 14:58:20 +0800 Subject: [PATCH 1/2] KEP-1224: Lending Limit to the cohort Co-authored-by: kerthcet Signed-off-by: B1F030 <646337422@qq.com> --- keps/1224-lending-limit/README.md | 162 ++++++++++++++++++++++++++++++ keps/1224-lending-limit/kep.yaml | 33 ++++++ 2 files changed, 195 insertions(+) create mode 100644 keps/1224-lending-limit/README.md create mode 100644 keps/1224-lending-limit/kep.yaml diff --git a/keps/1224-lending-limit/README.md b/keps/1224-lending-limit/README.md new file mode 100644 index 0000000000..805e37294f --- /dev/null +++ b/keps/1224-lending-limit/README.md @@ -0,0 +1,162 @@ +# KEP-1224: Introducing lendingLimit to help reserve guaranteed resources + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Kueue LendingLimit API](#kueue-lendinglimit-api) + - [Note](#note) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit Tests](#unit-tests) + - [Integration tests](#integration-tests) + - [Graduation Criteria](#graduation-criteria) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Summary + +Under the current implementation, a ClusterQueue's resources can be borrowed completely by other ClusterQueues in the same cohort. This improves resource utilization to some extent, but sometimes users want to reserve some resources for their own exclusive use. + +This proposal provides a guarantee mechanism for users to solve this problem. 
They can reserve resource quota that will never be borrowed by other ClusterQueues in the same cohort. + +## Motivation + +Sometimes we want to keep some resources for guaranteed usage, so that when new jobs enter the queue, they can be admitted immediately. + +Under the current implementation, `BorrowingLimit` defines the maximum amount of quota that a ClusterQueue is allowed to borrow. But borrowing may cause another ClusterQueue in the same cohort to run out of resources. + +Even if `Preemption` is configured, reclaiming the borrowed resources takes time and incurs unnecessary cost. + +So we need a reservation mechanism, for resource requests and security reasons: `LendingLimit`, which declares the quota that may be lent and reserves the rest so that it will never be borrowed. + +### Goals + +- Implement `LendingLimit`, so that users can reserve guaranteed resources by setting `LendingLimit`. + +### Non-Goals + +- Replace `BorrowingLimit` to some extent in the future. + +## Proposal + +This proposal defines `LendingLimit`. A `ClusterQueue` can lend at most the specified quota to other ClusterQueues in the same cohort. + +### User Stories (Optional) + +#### Story 1 + +In order to ensure the full utilization of resources, we generally set `BorrowingLimit` to max, but this may cause a `ClusterQueue` to run out of all its resources, leaving newly incoming jobs slow to be admitted because preemption is slow. This can be worse in a competitive cluster, where jobs borrow resources and are reclaimed over and over. + +So we want to reserve some resources for a `ClusterQueue`, so that any incoming jobs in the `ClusterQueue` can get admitted immediately. + +### Notes/Constraints/Caveats (Optional) + +With both BorrowingLimit and LendingLimit configured, one clusterQueue may not be able to borrow up to the limit just because we reserved the lending limit quota of resource. 
+ +To reduce confusion, we will recommend to users to only set borrowingLimit or lendingLimit, but not both, even though both will be supported at the same time. + +### Risks and Mitigations + +None. + +## Design Details + +### Kueue LendingLimit API + +Modify the `ResourceQuota` API object: + +```go +type ResourceQuota struct { + [...] + + // lendingLimit is the maximum amount of unused quota for the [flavor, resource] + // combination that this ClusterQueue can lend to other ClusterQueues in the same cohort. + // In total, at a given time, ClusterQueue reserves for its exclusive use + // a quantity of quota equal to nominalQuota - lendingLimit. + // If null, it means that there is no lending limit. + // If not null, it must be non-negative. + // lendingLimit must be null if spec.cohort is empty. + // +optional + LendingLimit *resource.Quantity `json:"lendingLimit,omitempty"` +} +``` + +#### Note + +We have considered adding this status field, but deprecated it in the end. Because unused resources from multiple CQs compose a single pool of shareable resources, and we cannot precisely calculate this value. + +So there is no concept of A is borrowing from B. A is borrowing from all the unused resource of B, C and any other CQs in the cohort. + +```go +type ResourceUsage struct { + [...] + + // Lended is a quantity of quota that is lent to other ClusterQueues in the cohort. + Lended resource.Quantity `json:"lended,omitempty"` +} +``` + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +None. 
+ +#### Unit Tests + +- `pkg/cache`: `2023-11-15` - `86.5%` +- `pkg/controller/core/`: `2023-11-15` - `16.5%` +- `pkg/metrics/`: `2023-11-15` - `45.7%` +- `pkg/scheduler/`: `2023-11-15` - `80.6%` +- `pkg/scheduler/flavorassigner`: `2023-11-15` - `80.5%` +- `pkg/scheduler/preemption`: `2023-11-15` - `94.0%` + +#### Integration tests + + + +- No new workloads can be admitted when the `LendingLimit` is greater than `NominalQuota` or less than `0`. +- In a cohort with 2 ClusterQueues a, b and a single ResourceFlavor: + - When cq-b's LendingLimit set: + - When cq-a's BorrowingLimit unset, cq-a can borrow as much as `cq-b's LendingLimit`. + - When cq-a's BorrowingLimit set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`. +- In a cohort with 3 ClusterQueues a, b, c and a single ResourceFlavor: + - When cq-b's LendingLimit set, cq-c's LendingLimit unset: + - When cq-a's BorrowingLimit unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's NominalQuota)`. + - When cq-a's BorrowingLimit set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's NominalQuota), cq-a's BorrowingLimit)`. + - When cq-b and cq-c's LendingLimit both set: + - When cq-a's BorrowingLimit unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's LendingLimit)`. + - When cq-a's BorrowingLimit set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's LendingLimit), cq-a's BorrowingLimit)`. +- In a ClusterQueue with 2 ResourceFlavors a, b: + - When rf-b's LendingLimit set, and FlavorFungibility set to `WhenCanBorrow: Borrow`: + - When rf-b's BorrowingLimit unset, cq-a can borrow as much as `cq-b's LendingLimit`. + - When rf-b's BorrowingLimit set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`. 
+ +### Graduation Criteria + +## Implementation History + +## Drawbacks + +## Alternatives + +- `GuaranteedQuota`, which defines the quota for reservation, is functionally similar to `LendingLimit`, but we chose `LendingLimit` to align with `BorrowingLimit`. + diff --git a/keps/1224-lending-limit/kep.yaml b/keps/1224-lending-limit/kep.yaml new file mode 100644 index 0000000000..d6dab3fa55 --- /dev/null +++ b/keps/1224-lending-limit/kep.yaml @@ -0,0 +1,33 @@ +title: Introducing lendingLimit to help reserve guaranteed resources +kep-number: 1224 +authors: + - "@B1F030" + - "@kerthcet" +status: implementable +creation-date: 2023-11-13 +reviewers: + - "@alculquicondor" + - "@kerthcet" + - "@tenzen-y" +approvers: + - "@alculquicondor" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v0.6" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v0.6" + beta: "v0.7" + stable: "v0.8" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: LendingLimit +disable-supported: true From b2ae68958931fe3df00e1fd82307fd8b317233c2 Mon Sep 17 00:00:00 2001 From: B1F030 <646337422@qq.com> Date: Fri, 1 Dec 2023 19:06:19 +0800 Subject: [PATCH 2/2] add test for whenCanBorrow: TryNextFlavor Co-authored-by: kerthcet Signed-off-by: B1F030 <646337422@qq.com> --- keps/1224-lending-limit/README.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/keps/1224-lending-limit/README.md b/keps/1224-lending-limit/README.md index 805e37294f..d3ae607ffb 100644 --- a/keps/1224-lending-limit/README.md +++ b/keps/1224-lending-limit/README.md @@ -63,8 +63,6 @@ So we want to reserve some resources for a `ClusterQueue`, so that any incoming With both BorrowingLimit and LendingLimit configured, one clusterQueue may not be able to borrow up to the limit just because we reserved the lending limit quota of resource. -To reduce confusion, we will recommend to users to only set borrowingLimit or lendingLimit, but not both, even though both will be supported at the same time. - ### Risks and Mitigations None. @@ -93,7 +91,7 @@ type ResourceQuota struct { #### Note -We have considered adding this status field, but deprecated it in the end. Because unused resources from multiple CQs compose a single pool of shareable resources, and we cannot precisely calculate this value. +We have considered adding this status field, but discarded it. Because unused resources from multiple CQs compose a single pool of shareable resources, we cannot precisely calculate this value. So there is no concept of A is borrowing from B. A is borrowing from all the unused resource of B, C and any other CQs in the cohort. @@ -145,10 +143,10 @@ After the implementation PR is merged, add the names of the tests here. 
- When cq-b and cq-c's LendingLimit both set: - When cq-a's BorrowingLimit unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's LendingLimit)`. - When cq-a's BorrowingLimit set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's LendingLimit), cq-a's BorrowingLimit)`. -- In a ClusterQueue with 2 ResourceFlavors a, b: - - When rf-b's LendingLimit set, and FlavorFungibility set to `WhenCanBorrow: Borrow`: - - When rf-b's BorrowingLimit unset, cq-a can borrow as much as `cq-b's LendingLimit`. - - When rf-b's BorrowingLimit set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`. +- In a cohort with 2 ClusterQueues cq-a, cq-b and 2 ResourceFlavors rf-a, rf-b: + - When rf-b's LendingLimit set, and cq-a's FlavorFungibility set to `WhenCanBorrow: TryNextFlavor`: + - When rf-a's BorrowingLimit unset, cq-a can borrow as much as `rf-b's LendingLimit`. + - When rf-a's BorrowingLimit set, cq-a can borrow as much as `min(rf-b's LendingLimit, rf-a's BorrowingLimit)`. ### Graduation Criteria