---
title: Support-Job-Level-Preemption
authors:
- "@xulinfei1996"
reviewers:
- "@buptcozy"
- "@eahydra"
- "@hormes"
creation-date: 2024-01-30
last-updated: 2024-01-30
status: provisional

---

# Support Job Level Preemption

<!-- TOC -->

- [Support Job Level Preemption](#support-job-level-preemption)
    - [Summary](#summary)
    - [Motivation](#motivation)
        - [Goals](#goals)
        - [Non-goals/Future work](#non-goalsfuture-work)
    - [Proposal](#proposal)
        - [User Stories](#user-stories)
        - [Key Concept](#key-concept)
        - [Implementation Details](#implementation-details)
            - [Non-Preemptible and Preemptible](#non-preemptible-and-preemptible)
            - [Quota Assignment](#quota-assignment)
            - [Preemption](#preemption)
                - [Job Level Preemption](#job-level-preemption)
            - [Adaptive Eviction Approach](#adaptive-eviction-approach)
            - [Extension Point](#extension-point)
                - [Over All](#over-all)
                - [PreFilter](#prefilter)
                - [PostFilter](#postfilter)
        - [API](#api)
            - [Pod](#pod)
            - [Plugin Args](#plugin-args)
        - [Compatibility](#compatibility)
    - [Alternatives](#alternatives)
    - [Unsolved Problems](#unsolved-problems)
    - [Implementation History](#implementation-history)
    - [References](#references)

<!-- /TOC -->

## Summary
This proposal provides a job-level resource assignment and preemption mechanism for the scheduler.

## Motivation
In a large-scale shared cluster, some quotas may be very busy while others are idle. The ElasticQuota plugin already
supports borrowing resources from idle ElasticQuotas. However, when a lender ElasticQuota needs to reclaim its
resources, Job granularity is not considered. For Pods belonging to the same Job, we need to perform preemption at Job
granularity to ensure that the Job obtains enough resources to run, which also improves resource delivery efficiency.

### Goals

1. Supplement the necessary APIs, improve the preemption-related APIs, and support a job-granularity preemption mechanism.

2. Clearly define the job-granularity preemption mechanism.

### Non-goals/Future work

## Proposal
In this proposal, we use ElasticQuota to implement job-level resource assignment and preemption. Each ElasticQuota declares
its "min" and "max". The semantics of "min" is the ElasticQuota's non-preemptible resources: if the ElasticQuota's "request"
is less than or equal to "min", the ElasticQuota can obtain resources equivalent to its "request". The semantics of "max"
is the ElasticQuota's upper resource limit. We require "min" to be less than or equal to "max".

"Non-preemptible" means a pod cannot be preempted during preemption, so we use "min" as the limit on the resources used by
non-preemptible pods. "Preemptible" means a pod can be preempted during preemption, so we use "max" minus the request of the
non-preemptible pods as the limit on preemptible resource usage. For example, with min=10 CPUs, max=30 CPUs, and
non-preemptible pods requesting 8 CPUs, preemptible pods may use at most 30 - 8 = 22 CPUs.

### User Stories
#### Story1
If a job contains multiple pods and only some of them can be assigned successfully, we expect the scheduler to
trigger preemption for the remaining unassigned pods.

#### Story2
If only some of the unassigned pods can be placed during preemption, the preemption should not actually happen;
that is, no victim pods are evicted.

### Key Concept
1. There are two kinds of pods: non-preemptible and preemptible. A non-preemptible pod is limited by the quota's min; a
preemptible pod is limited by the quota's max. If a job is preemptible, label its pods with
`quota.scheduling.koordinator.sh/preemptible=true`; if a job is non-preemptible, label its pods with
`quota.scheduling.koordinator.sh/preemptible=false`.

2. Typically, we use pod.spec.preemptionPolicy to judge whether a pod can preempt preemptible pods: if the pod's policy is
PreemptLowerPriority, it can trigger preemption, and higher-priority pods can preempt lower-priority pods.
pod.spec.preemptionPolicy is the commonly used parameter for determining whether a Pod can trigger preemption eviction.
However, users may not have been aware of this, so all existing pods could potentially trigger preemption, which
is not our intention. Therefore, we introduce a new label `quota.scheduling.koordinator.sh/can-preempt` to address this.

3. In AI scenarios, a job has the "AllOrNothing" property, which means resource allocation and preemption must
be executed at the job level, not the pod level. In koordinator, not only `pod-group.scheduling.sigs.k8s.io` but
also other labels can identify Pods that belong to the same Job. For compatibility, we continue
to use the label `pod-group.scheduling.sigs.k8s.io`.

### Implementation Details

#### Non-Preemptible and Preemptible
When users submit pods, they should mark the pods as non-preemptible or preemptible via the label above. In koordinator, if
a pod does not declare `quota.scheduling.koordinator.sh/preemptible`, the pod can be preempted. However, in some scenarios
the expectation is the opposite: the existing running pods do not have this label and must not be preempted. So we add a
new plugin arg to control this behaviour; see the API section.

#### Quota Assignment
When a user submits a job, a job ID should be filled into the pods' labels. When we try to assign a pod, we first list all
pods of the same job and sum all the pods' requests as totalRequest. For a non-preemptible job, we check that
nonPreemptibleTotalUsed + newNonPreemptibleJobTotalRequest <= minQuota and totalUsed + newNonPreemptibleJobTotalRequest <= maxQuota.
For a preemptible job, we check that math.Min(minQuota, nonPreemptibleTotalRequest) + preemptibleTotalUsed + newPreemptibleJobTotalRequest <= maxQuota.
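Below is a minimal sketch of this admission check, simplified to a single scalar resource; `QuotaState` and `CanAdmitJob`
are hypothetical names, and the real plugin evaluates multi-dimensional resource lists:

```go
// QuotaState is a hypothetical snapshot of one ElasticQuota's accounting.
type QuotaState struct {
	Min                     int64 // "min": the non-preemptible guarantee
	Max                     int64 // "max": the upper resource limit
	NonPreemptibleUsed      int64 // used by non-preemptible pods
	NonPreemptibleRequested int64 // requested by non-preemptible pods
	PreemptibleUsed         int64 // used by preemptible pods
	TotalUsed               int64 // NonPreemptibleUsed + PreemptibleUsed
}

// CanAdmitJob applies the job-level checks above to the summed request of
// all pods in the job.
func (q *QuotaState) CanAdmitJob(jobTotalRequest int64, preemptible bool) bool {
	if preemptible {
		// Preemptible jobs fit into what remains of "max" after the
		// non-preemptible demand (capped at "min") is reserved.
		reserved := min64(q.Min, q.NonPreemptibleRequested)
		return reserved+q.PreemptibleUsed+jobTotalRequest <= q.Max
	}
	// Non-preemptible jobs must fit under both "min" and "max".
	return q.NonPreemptibleUsed+jobTotalRequest <= q.Min &&
		q.TotalUsed+jobTotalRequest <= q.Max
}

func min64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}
```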

[Koordinator Quota Management](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220722-multi-hierarchy-elastic-quota-management.md)
provides a mechanism to share idle resources among quotas, called "runtime quota". Generally it distributes idle
resources fairly to different quotas according to a certain proportion. However, in AI training scenarios, due to a job's
"AllOrNothing" property, the runtime-quota mechanism may prevent a preemptible job from being assigned successfully because
each quota only obtains part of the idle resources. So our strategy is to assign the idle resources to quotas in a greedy
way: if one quota's preemptible job comes first, it can use all of the idle resources until it reaches its max.
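A minimal sketch of this greedy hand-out, again assuming a single scalar resource; `quotaDemand` and `grantIdleGreedily`
are hypothetical names:

```go
// quotaDemand is a hypothetical view of one quota's preemptible demand.
type quotaDemand struct {
	need          int64 // outstanding preemptible job request
	headroomToMax int64 // quota max minus current total usage
}

// grantIdleGreedily hands out the shared idle pool in arrival order rather
// than proportionally: the first quota takes as much as it can (bounded by
// its need, its max headroom, and the remaining pool), then the next, etc.
func grantIdleGreedily(idle int64, queue []quotaDemand) []int64 {
	grants := make([]int64, len(queue))
	for i, q := range queue {
		g := q.need
		if q.headroomToMax < g {
			g = q.headroomToMax
		}
		if idle < g {
			g = idle
		}
		grants[i] = g
		idle -= g
	}
	return grants
}
```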

#### Preemption
##### Job Level Preemption
The new job-granularity preemption mechanism follows the existing scheduling and orchestration constraints, and
re-runs Filter during the preemption process to ensure that the candidate nodes meet the requirements of the
preempting Pods.

#### Adaptive Eviction Approach
The plugin patches the preempted pod with the label `scheduling.koordinator.sh/soft-eviction` to indicate that the pod has
been preempted. We expect a third-party operator to handle this label and actually delete the pod, which gives users
room to do some cleanup before the pod is really deleted. If multiple operators need to do cleanup, the user can also
add pod finalizers to achieve this. Finally, we also record a preemption event for the preempted pod.
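A sketch of how the scheduler might apply this label with client-go; `markSoftEviction` is a hypothetical helper, and the
label value here is a simplified placeholder for the structured message shown in the API section:

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markSoftEviction patches the soft-eviction label onto a victim pod instead
// of deleting it. A third-party operator is expected to watch for the label,
// run its cleanup, and then perform the actual deletion.
func markSoftEviction(ctx context.Context, client kubernetes.Interface, victim *corev1.Pod, preemptorJob string) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"labels":{"scheduling.koordinator.sh/soft-eviction":%q}}}`,
		preemptorJob))
	_, err := client.CoreV1().Pods(victim.Namespace).Patch(
		ctx, victim.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```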

#### Extension Point

##### Over All
The new or changed parts are:
1. Non-preemptible and preemptible resource judgement.
2. Job level preemption logic.
3. Quota resource recycling logic.

##### PreFilter
In PreFilter, we first list all pods of the same job and sum all the pods' requests as totalRequest, then apply the same
checks as in Quota Assignment: for a non-preemptible job, nonPreemptibleTotalUsed + newNonPreemptibleJobTotalRequest <= minQuota and totalUsed + newNonPreemptibleJobTotalRequest <= maxQuota;
for a preemptible job, math.Min(minQuota, nonPreemptibleTotalRequest) + preemptibleTotalUsed + newPreemptibleJobTotalRequest <= maxQuota (see the sketch in [Quota Assignment](#quota-assignment)).

##### PostFilter
We need job-level preemption, which means we can't execute preemption pod by pod as usual: if one pod's preemption failed,
the preceding pods' successful preemptions would be meaningless and harmful. So the process is as follows (a sketch of
steps 3-5 appears after this list):

1. Judge whether the pod can trigger preemption, based on the plugin args and pod labels. If the job is non-preemptible and
nonPreemptibleTotalUsed + nonPreemptibleJobRequest <= minQuota, the job can trigger preemption. If the job is preemptible and
math.Min(minQuota, nonPreemptibleTotalRequest) + preemptibleTotalUsed + newPreemptibleJobTotalRequest <= maxQuota, the
job can trigger preemption.
2. List all pods of the job, and iterate over the framework's waiting pods to get the list of not-assumed pods
(some pods may have been assumed successfully in a previous assign cycle and be waiting at Permit).
3. Iterate over the entries of PostFilter's filteredNodeStatusMap whose status code is not UnschedulableAndUnresolvable, and
try to remove all preemptible pods from a copy of each nodeInfo. To avoid nonPreemptibleUsed + preemptibleUsed > maxQuota
after preemption, we also list all running preemptible pods of this quota and try to remove them from the nodeInfo copies too.
4. For each not-assumed pod of the job, try to execute RunFilterPlugins against the nodeInfo copy; on success, add the pod into the nodeInfo.
5. If the preemption process succeeds for all not-assumed pods, the job preemption succeeds. We then run try-assign-back
logic for the preempted pods and finally pick the real victim pods and nodes. If any not-assumed pod's preemption fails,
this round of preemption fails. During the assign-back process, we check the quota limit of the preempting quota.
6. For all preempted pods, we patch a preemption label so that an upper-level operator can handle cleanup logic and delete the pod.
7. For all preemptor pods, we first execute framework.AddNominatedPods() to hold the resources in memory, and we
store the preemptor pod, the preempted pods, and the preempted nodes in a local cache. Because preempted pods are deleted by
another operator, if the deletion is slow and the resources have not yet been returned when the next round of scheduling
comes, preemption would be triggered again, generating new victims, and this could repeat until all preemptible pods are
preempted. So at the beginning of PostFilter, we check the cache to see whether the last round of preemption has finished.
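A minimal sketch of the all-or-nothing dry run in steps 3-5, simplified to a single scalar resource; `nodeSnapshot` and
`dryRunJobPreemption` are hypothetical names, and the real implementation clones the framework's NodeInfo and re-runs the
full Filter plugin chain:

```go
// nodeSnapshot is a simplified copy of a node's state: the milli-CPU that
// would be free once all preemptible pods on the node are removed (step 3).
type nodeSnapshot struct {
	freeMilliCPU int64
}

// dryRunJobPreemption simulates placing every not-assumed pod of the job
// (each represented by its milli-CPU request) onto the snapshots. It is
// all-or-nothing: if any pod cannot be placed, no victim is evicted.
func dryRunJobPreemption(podRequests []int64, nodes []*nodeSnapshot) bool {
	for _, req := range podRequests { // step 4: try each pod in turn
		placed := false
		for _, n := range nodes {
			if req <= n.freeMilliCPU {
				n.freeMilliCPU -= req // add the pod into the node copy
				placed = true
				break
			}
		}
		if !placed {
			return false // step 5: one failure fails the whole round
		}
	}
	return true // only now are real victims patched for soft eviction
}
```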


### API
#### Pod
We introduce some labels (plus the existing `pod.spec.preemptionPolicy` field) to describe pod behaviour.
- `pod-group.scheduling.sigs.k8s.io` is filled by the user and describes the pod's job name.
- `quota.scheduling.koordinator.sh/preemptible` is filled by the user and indicates whether the pod is non-preemptible or preemptible.
- `scheduling.koordinator.sh/soft-eviction` is filled by the scheduler and indicates that the pod has been preempted.
- `pod.spec.preemptionPolicy` is filled by the user and describes whether a pod can trigger preemption of preemptible pods.
- `quota.scheduling.koordinator.sh/can-preempt` is filled by the user and also describes whether a pod can trigger preemption of preemptible pods.

If a pod does not have the `quota.scheduling.koordinator.sh/preemptible` label, we treat the pod as non-preemptible.

If you want to declare that a pod belongs to a job, use the following:
```yaml
labels:
  pod-group.scheduling.sigs.k8s.io: "job1"
```

If you want to declare that a pod is non-preemptible or preemptible, use the following (all pods of one job should be homogeneous):
```yaml
labels:
  quota.scheduling.koordinator.sh/preemptible: "false"
```
or
```yaml
labels:
  quota.scheduling.koordinator.sh/preemptible: "true"
```

If you want to declare that a pod can trigger preemption of preemptible pods, use the following:
```yaml
metadata:
  labels:
    quota.scheduling.koordinator.sh/can-preempt: "true"
spec:
  preemptionPolicy: PreemptLowerPriority
```

If you want to check the preemption message of a preempted pod, look at the label:
```yaml
labels:
  scheduling.koordinator.sh/soft-eviction: "{preemptJob:job2...}"
```

#### Plugin Args
We introduce the following plugin args:
- `disableDefaultPreemptiable` controls whether a pod without the label `quota.scheduling.koordinator.sh/preemptible` can be preempted.
- `disableDefaultTriggerPreemption` controls whether a pod triggers preemption via the label `quota.scheduling.koordinator.sh/can-preempt` or via the preemption policy PreemptLowerPriority.
- `podShouldBeDeletedWhenPreempted` controls whether a preempted pod is deleted directly by the scheduler.

```yaml
args:
  disableDefaultPreemptiable: true/false
  disableDefaultTriggerPreemption: true/false
  podShouldBeDeletedWhenPreempted: true/false
```
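As a sketch, these args could be carried by a Go arguments struct like the one below; the struct name and JSON packaging
are assumptions, not the final API:

```go
// JobPreemptionArgs mirrors the YAML plugin args above.
type JobPreemptionArgs struct {
	// When true, pods without the quota.scheduling.koordinator.sh/preemptible
	// label cannot be preempted.
	DisableDefaultPreemptiable bool `json:"disableDefaultPreemptiable,omitempty"`
	// When true, a pod must carry the quota.scheduling.koordinator.sh/can-preempt
	// label to trigger preemption; preemptionPolicy alone is not enough.
	DisableDefaultTriggerPreemption bool `json:"disableDefaultTriggerPreemption,omitempty"`
	// When true, the scheduler deletes victims directly instead of only
	// patching the soft-eviction label.
	PodShouldBeDeletedWhenPreempted bool `json:"podShouldBeDeletedWhenPreempted,omitempty"`
}
```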

### Compatibility
In some scenarios, users may expect the pod to be deleted directly by the scheduler; others want another operator to
do some cleanup before the pod is really deleted. So we introduce a new plugin arg to control this behaviour.

In koordinator, if a pod does not declare `quota.scheduling.koordinator.sh/preemptible`, the pod can be preempted. However,
in some scenarios the expectation is the opposite: the existing running pods do not have this label and must not be
preempted. So we add a new plugin arg to control this behaviour.

In koordinator, if a pod declares `preemptionPolicy: PreemptLowerPriority`, the pod can trigger preemption. However, in some
scenarios the expectation is the opposite: the existing running pods' preemptionPolicy all equal PreemptLowerPriority, yet
they should not all trigger preemption. So we introduce a new label and a new plugin arg to control this behaviour.

We use `pod-group.scheduling.sigs.k8s.io` to declare the job, and this label is already used by Coscheduling.
In Coscheduling, the user can declare minimumNumber and totalNumber. For now, we only support the minimumNumber=totalNumber
scenario because of the AllOrNothing property of AI training jobs.

## Alternatives

## Unsolved Problems

## Implementation History

## References