-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-1224: Lending Limit to the cohort #1331
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
# KEP-1224: Introducing lendingLimit to help reserve guaranteed resources | ||
|
||
<!-- toc --> | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [User Stories (Optional)](#user-stories-optional) | ||
- [Story 1](#story-1) | ||
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) | ||
- [Risks and Mitigations](#risks-and-mitigations) | ||
- [Design Details](#design-details) | ||
- [Kueue LendingLimit API](#kueue-lendinglimit-api) | ||
- [Note](#note) | ||
- [Test Plan](#test-plan) | ||
- [Prerequisite testing updates](#prerequisite-testing-updates) | ||
- [Unit Tests](#unit-tests) | ||
- [Integration tests](#integration-tests) | ||
- [Graduation Criteria](#graduation-criteria) | ||
- [Implementation History](#implementation-history) | ||
- [Drawbacks](#drawbacks) | ||
- [Alternatives](#alternatives) | ||
<!-- /toc --> | ||
|
||
## Summary | ||
|
||
Under the current implementation, one ClusterQueue's resources could be borrowed completely by others in the same cohort, this improves the resource utilization to some extent, but sometimes, user wants to reserve some resources only for private usage. | ||
|
||
This proposal provides a guarantee mechanism for users to solve this problem. They can have a reservation of resource quota that will never be borrowed by other clusterqueues in the same cohort. | ||
|
||
## Motivation | ||
|
||
Sometimes we want to keep some resources for guaranteed usage, so that when new jobs come into queue, they can be admitted immediately. | ||
|
||
Under the current implementation, we are using `BorrowingLimit` to define the maximum amount of quota that this ClusterQueue is allowed to borrow. But this may cause another ClusterQueue in the same cohort to run out of resources. | ||
|
||
Even if we set the `Preemption`, it still needs some time and spends a lot of unnecessary cost. | ||
|
||
So we need a reservation design for resource requests and security reasons: `LendingLimit`, to claim the quota allowed to lend, reserve a certain amount of resources to ensure that they will never be borrowed. | ||
|
||
### Goals | ||
|
||
- Implement `LendingLimit`, users can have a reservation of guaranteed resource by claiming the `LendingLimit`. | ||
|
||
### Non-Goals | ||
|
||
- Replace `BorrowingLimit` to some extent in the future. | ||
|
||
## Proposal | ||
|
||
In this proposal, `LendingLimit` is defined. The `ClusterQueue` will be limited to lend the specified quota to other ClusterQueues in the same cohort. | ||
|
||
### User Stories (Optional) | ||
|
||
#### Story 1 | ||
|
||
In order to ensure the full utilization of resources, we generally set `BorrowingLimit` to max, but this may cause a `ClusterQueue` to run out of its all resources, and make the new incoming job slow to response for the slow preemption. This could be worse in a competitive cluster, jobs will borrow resources and be reclaimed over and over. | ||
|
||
So we want to reserve some resources for a `ClusterQueue`, so that any incoming jobs in the `ClusterQueue` can get admitted immediately. | ||
|
||
### Notes/Constraints/Caveats (Optional) | ||
|
||
With both BorrowingLimit and LendingLimit configured, one clusterQueue may not be able to borrow up to the limit just because we reserved the lending limit quota of resource. | ||
|
||
To reduce confusion, we will recommend to users to only set borrowingLimit or lendingLimit, but not both, even though both will be supported at the same time. | ||
|
||
### Risks and Mitigations | ||
B1F030 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
None. | ||
|
||
## Design Details | ||
|
||
### Kueue LendingLimit API | ||
|
||
Modify ResourceQuota API object: | ||
|
||
```go | ||
type ResourceQuota struct { | ||
[...] | ||
|
||
// lendingLimit is the maximum amount of unused quota for the [flavor, resource] | ||
// combination that this ClusterQueue can lend to other ClusterQueues in the same cohort. | ||
// In total, at a given time, ClusterQueue reserves for its exclusive use | ||
// a quantity of quota equals to nominalQuota - lendingLimit. | ||
// If null, it means that there is no lending limit. | ||
// If not null, it must be non-negative. | ||
// lendingLimit must be null if spec.cohort is empty. | ||
// +optional | ||
LendingLimit *resource.Quantity `json:"lendingLimit,omitempty"` | ||
} | ||
``` | ||
|
||
#### Note | ||
|
||
We have considered adding this status field, but deprecated it in the end. Because unused resources from multiple CQs compose a single pool of shareable resources, and we cannot precisely calculate this value. | ||
B1F030 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
So there is no concept of A is borrowing from B. A is borrowing from all the unused resource of B, C and any other CQs in the cohort. | ||
|
||
```go | ||
type ResourceUsage struct { | ||
[...] | ||
|
||
// Lended is a quantity of quota that is lended to other ClusterQueues in the cohort. | ||
Lended resource.Quantity `json:"lended,omitempty"` | ||
alculquicondor marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
``` | ||
|
||
### Test Plan | ||
|
||
[x] I/we understand the owners of the involved components may require updates to | ||
existing tests to make this code solid enough prior to committing the changes necessary | ||
to implement this enhancement. | ||
|
||
##### Prerequisite testing updates | ||
|
||
None. | ||
|
||
#### Unit Tests | ||
|
||
- `pkg/cache`: `2023-11-15` - `86.5%` | ||
- `pkg/controller/core/`: `2023-11-15` - `16.5%` | ||
- `pkg/metrics/`: `2023-11-15` - `45.7%` | ||
- `pkg/scheduler/`: `2023-11-15` - `80.6%` | ||
B1F030 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- `pkg/scheduler/flavorassigner`: `2023-11-15` - `80.5%` | ||
- `pkg/scheduler/preemption`: `2023-11-15` - `94.0%` | ||
|
||
#### Integration tests | ||
|
||
<!-- | ||
Describe what tests will be added to ensure proper quality of the enhancement. | ||
|
||
After the implementation PR is merged, add the names of the tests here. | ||
--> | ||
|
||
- No new workloads can be admitted when the `LendingLimit` greater than `NominalQuota` or less than `0`. | ||
- In a cohort with 2 ClusterQueues a, b and single ResourceFlavor: | ||
- When cq-b's LendingLimit set: | ||
- When cq-a's BorrowingLimit unset, cq-a can borrow as much as `cq-b's LendingLimit`. | ||
- When cq-a's BorrowingLimit set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`. | ||
B1F030 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- In a cohort with 3 ClusterQueues a, b, c and single ResourceFlavor: | ||
- When cq-b's LendingLimit set, cq-c's LendingLimit unset: | ||
- When cq-a's BorrowingLimit unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's NominalQuota)`. | ||
- When cq-a's BorrowingLimit set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's NominalQuota), cq-a's BorrowingLimit)`. | ||
- When cq-b and cq-c's LendingLimit both set: | ||
- When cq-a's BorrowingLimit unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's LendingLimit)`. | ||
- When cq-a's BorrowingLimit set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's LendingLimit), cq-a's BorrowingLimit)`. | ||
- In a ClusterQueue with 2 ResourceFlavors a, b: | ||
- When rf-b's LendingLimit set, and FlavorFungibility set to `WhenCanBorrow: Borrow`: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is the default fungibility policy. Is the feature really impacted by different policies? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just in case we missed something, this can be intersected with borrowingLimit on exist testcases. I'm ok to miss this case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would suggest adding tests for the However, we can confirm in the implementation PR if it's case is valuable. It means that I'm ok without updating here now. |
||
- When rf-b's BorrowingLimit unset, cq-a can borrow as much as `cq-b's LendingLimit`. | ||
- When rf-b's BorrowingLimit set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`. | ||
|
||
### Graduation Criteria | ||
|
||
## Implementation History | ||
|
||
## Drawbacks | ||
|
||
## Alternatives | ||
|
||
- `GuaranteedQuota` which defines the quota for reservation is functionally similar to `LendingLimit`, but to align with `BorrowingLimit`, we chose the latter. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
title: Introducing lendingLimit to help reserve guaranteed resources | ||
kep-number: 1224 | ||
authors: | ||
- "@B1F030" | ||
B1F030 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- "@kerthcet" | ||
status: implementable | ||
creation-date: 2023-11-13 | ||
reviewers: | ||
- "@alculquicondor" | ||
- "@kerthcet" | ||
- "@tenzen-y" | ||
approvers: | ||
- "@alculquicondor" | ||
|
||
# The target maturity stage in the current dev cycle for this KEP. | ||
stage: alpha | ||
|
||
# The most recent milestone for which work toward delivery of this KEP has been | ||
# done. This can be the current (upcoming) milestone, if it is being actively | ||
# worked on. | ||
latest-milestone: "v0.6" | ||
|
||
# The milestone at which this feature was, or is targeted to be, at each stage. | ||
milestone: | ||
alpha: "v0.6" | ||
beta: "v0.7" | ||
stable: "v0.8" | ||
|
||
# The following PRR answers are required at alpha release | ||
# List the feature gate name and the components for which it must be enabled | ||
feature-gates: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We add a feature gate here for experiment. |
||
- name: LendingLimit | ||
disable-supported: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, there is such a confusion. However, I don't think that we should recommend users not to use both
BorrowingLimit
andLendingLimit
. Because using bothBorrowingLimit
andLendingLimit
would be helpful for jobs with autoscale / elastic semantics.I would suggest to remove Line 66.
@alculquicondor @B1F030 WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok either way. I remember seeing a similar recommendation in YARN not to use borrowingLimit and weight at the same time.
Another alternative would be some form of troubleshooting, for example:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to take this case (
BorrowingLimit
andLendingLimit
) as a troubleshooting technique.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable. I agree to remove Line 66.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not about we should only set
borrowingLimit
orlendingLimit
or both, but a practice that if we configured thelendingLimit
for demand, thenborrowingLimit
could be omitted for the maximum resource utilization as there's no borrowingLimit.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plz update the words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kerthcet Does this mean the following situation?
cq-a and cq-b belong to the same cohort.
cq-a:
cpu: 10
borrowingLimit: 5
cq-b:
cpu: 5
lendingLimit: 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can set the borrowingLimit or lendingLimit by your demands, but if we set the lendingLimit in cq-b, then cq-a can leave the borrowingLimit unset to consume the resources as much as possible, as
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it makes sense. Thanks!