Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-1224: Lending Limit to the cohort #1331

Merged
merged 3 commits into from
Dec 4, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
162 changes: 162 additions & 0 deletions keps/1224-lending-limit/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
# KEP-1224: Introducing lendingLimit to help reserve guaranteed resources

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Kueue LendingLimit API](#kueue-lendinglimit-api)
- [Note](#note)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit Tests](#unit-tests)
- [Integration tests](#integration-tests)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

Under the current implementation, one ClusterQueue's resources could be borrowed completely by others in the same cohort, this improves the resource utilization to some extent, but sometimes, user wants to reserve some resources only for private usage.

This proposal provides a guarantee mechanism for users to solve this problem. They can have a reservation of resource quota that will never be borrowed by other clusterqueues in the same cohort.

## Motivation

Sometimes we want to keep some resources for guaranteed usage, so that when new jobs come into queue, they can be admitted immediately.

Under the current implementation, we are using `BorrowingLimit` to define the maximum amount of quota that this ClusterQueue is allowed to borrow. But this may cause another ClusterQueue in the same cohort to run out of resources.

Even if we set the `Preemption`, it still needs some time and spends a lot of unnecessary cost.

So we need a reservation design for resource requests and security reasons: `LendingLimit`, to claim the quota allowed to lend, reserve a certain amount of resources to ensure that they will never be borrowed.

### Goals

- Implement `LendingLimit`, users can have a reservation of guaranteed resource by claiming the `LendingLimit`.

### Non-Goals

- Replace `BorrowingLimit` to some extent in the future.

## Proposal

In this proposal, `LendingLimit` is defined. The `ClusterQueue` will be limited to lend the specified quota to other ClusterQueues in the same cohort.

### User Stories (Optional)

#### Story 1

In order to ensure the full utilization of resources, we generally set `BorrowingLimit` to max, but this may cause a `ClusterQueue` to run out of its all resources, and make the new incoming job slow to response for the slow preemption. This could be worse in a competitive cluster, jobs will borrow resources and be reclaimed over and over.

So we want to reserve some resources for a `ClusterQueue`, so that any incoming jobs in the `ClusterQueue` can get admitted immediately.

### Notes/Constraints/Caveats (Optional)

With both BorrowingLimit and LendingLimit configured, one clusterQueue may not be able to borrow up to the limit just because we reserved the lending limit quota of resource.

To reduce confusion, we will recommend to users to only set borrowingLimit or lendingLimit, but not both, even though both will be supported at the same time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, there is such a confusion. However, I don't think that we should recommend users not to use both BorrowingLimit and LendingLimit. Because using both BorrowingLimit and LendingLimit would be helpful for jobs with autoscale / elastic semantics.

I would suggest to remove Line 66.
@alculquicondor @B1F030 WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok either way. I remember seeing a similar recommendation in YARN not to use borrowingLimit and weight at the same time.

Another alternative would be some form of troubleshooting, for example:

Why is my CQ not borrowing?
Check borrowingLimit of your CQ and lendingLimit of other CQs in the cohort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok either way. I remember seeing a similar recommendation in YARN not to use borrowingLimit and weight at the same time.

Another alternative would be some form of troubleshooting, for example:

Why is my CQ not borrowing?
Check borrowingLimit of your CQ and lendingLimit of other CQs in the cohort.

I prefer to take this case (BorrowingLimit and LendingLimit) as a troubleshooting technique.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. I agree to remove Line 66.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about we should only set borrowingLimit or lendingLimit or both, but a practice that if we configured the lendingLimit for demand, then borrowingLimit could be omitted for the maximum resource utilization as there's no borrowingLimit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plz update the words.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not about we should only set borrowingLimit or lendingLimit or both, but a practice that if we configured the lendingLimit for demand, then borrowingLimit could be omitted for the maximum resource utilization as there's no borrowingLimit.

@kerthcet Does this mean the following situation?

cq-a and cq-b belong to the same cohort.

  • cq-a:
    cpu: 10
    borrowingLimit: 5

  • cq-b:
    cpu: 5
    lendingLimit: 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can set the borrowingLimit or lendingLimit by your demands, but if we set the lendingLimit in cq-b, then cq-a can leave the borrowingLimit unset to consume the resources as much as possible, as

- cq-a:
  cpu: 10
  # borrowingLimit: 5 # no need to set this if you want to maximum the resource utilization 
  # and don't need to worry about use too much resources because other clusterQueues can 
  # constrain this by setting lendingLimit.

- cq-b:
  cpu: 5
  lendingLimit: X

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it makes sense. Thanks!


### Risks and Mitigations
B1F030 marked this conversation as resolved.
Show resolved Hide resolved

None.

## Design Details

### Kueue LendingLimit API

Modify ResourceQuota API object:

```go
type ResourceQuota struct {
[...]

// lendingLimit is the maximum amount of unused quota for the [flavor, resource]
// combination that this ClusterQueue can lend to other ClusterQueues in the same cohort.
// In total, at a given time, ClusterQueue reserves for its exclusive use
// a quantity of quota equals to nominalQuota - lendingLimit.
// If null, it means that there is no lending limit.
// If not null, it must be non-negative.
// lendingLimit must be null if spec.cohort is empty.
// +optional
LendingLimit *resource.Quantity `json:"lendingLimit,omitempty"`
}
```

#### Note

We have considered adding this status field, but deprecated it in the end. Because unused resources from multiple CQs compose a single pool of shareable resources, and we cannot precisely calculate this value.
B1F030 marked this conversation as resolved.
Show resolved Hide resolved

So there is no concept of A is borrowing from B. A is borrowing from all the unused resource of B, C and any other CQs in the cohort.

```go
type ResourceUsage struct {
[...]

// Lended is a quantity of quota that is lended to other ClusterQueues in the cohort.
Lended resource.Quantity `json:"lended,omitempty"`
alculquicondor marked this conversation as resolved.
Show resolved Hide resolved
}
```

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

None.

#### Unit Tests

- `pkg/cache`: `2023-11-15` - `86.5%`
- `pkg/controller/core/`: `2023-11-15` - `16.5%`
- `pkg/metrics/`: `2023-11-15` - `45.7%`
- `pkg/scheduler/`: `2023-11-15` - `80.6%`
B1F030 marked this conversation as resolved.
Show resolved Hide resolved
- `pkg/scheduler/flavorassigner`: `2023-11-15` - `80.5%`
- `pkg/scheduler/preemption`: `2023-11-15` - `94.0%`

#### Integration tests

<!--
Describe what tests will be added to ensure proper quality of the enhancement.

After the implementation PR is merged, add the names of the tests here.
-->

- No new workloads can be admitted when the `LendingLimit` greater than `NominalQuota` or less than `0`.
- In a cohort with 2 ClusterQueues a, b and single ResourceFlavor:
- When cq-b's LendingLimit set:
- When cq-a's BorrowingLimit unset, cq-a can borrow as much as `cq-b's LendingLimit`.
- When cq-a's BorrowingLimit set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`.
B1F030 marked this conversation as resolved.
Show resolved Hide resolved
- In a cohort with 3 ClusterQueues a, b, c and single ResourceFlavor:
- When cq-b's LendingLimit set, cq-c's LendingLimit unset:
- When cq-a's BorrowingLimit unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's NominalQuota)`.
- When cq-a's BorrowingLimit set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's NominalQuota), cq-a's BorrowingLimit)`.
- When cq-b and cq-c's LendingLimit both set:
- When cq-a's BorrowingLimit unset, cq-a can borrow as much as `(cq-b's LendingLimit + cq-c's LendingLimit)`.
- When cq-a's BorrowingLimit set, cq-a can borrow as much as `min((cq-b's LendingLimit + cq-c's LendingLimit), cq-a's BorrowingLimit)`.
- In a ClusterQueue with 2 ResourceFlavors a, b:
- When rf-b's LendingLimit set, and FlavorFungibility set to `WhenCanBorrow: Borrow`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the default fungibility policy.

Is the feature really impacted by different policies?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case we missed something, this can be intersected with borrowingLimit on exist testcases. I'm ok to miss this case.

Copy link
Member

@tenzen-y tenzen-y Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest adding tests for the whenCanBorrow: TryNextFlavor case since we recently found some hidden bugs related to fungibility. So, to avoid such a case, I want to add tests for the whenCanBorrow: TryNextFlavor.

However, we can confirm in the implementation PR if it's case is valuable.

It means that I'm ok without updating here now.

- When rf-b's BorrowingLimit unset, cq-a can borrow as much as `cq-b's LendingLimit`.
- When rf-b's BorrowingLimit set, cq-a can borrow as much as `min(cq-b's LendingLimit, cq-a's BorrowingLimit)`.

### Graduation Criteria

## Implementation History

## Drawbacks

## Alternatives

- `GuaranteedQuota` which defines the quota for reservation is functionally similar to `LendingLimit`, but to align with `BorrowingLimit`, we chose the latter.

33 changes: 33 additions & 0 deletions keps/1224-lending-limit/kep.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
title: Introducing lendingLimit to help reserve guaranteed resources
kep-number: 1224
authors:
- "@B1F030"
B1F030 marked this conversation as resolved.
Show resolved Hide resolved
- "@kerthcet"
status: implementable
creation-date: 2023-11-13
reviewers:
- "@alculquicondor"
- "@kerthcet"
- "@tenzen-y"
approvers:
- "@alculquicondor"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v0.6"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v0.6"
beta: "v0.7"
stable: "v0.8"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add a feature gate here for experiment.

- name: LendingLimit
disable-supported: true