diff --git a/docs/proposals/scheduling/20240115-support-job-level-preemption.md b/docs/proposals/scheduling/20240115-support-job-level-preemption.md
new file mode 100644
index 000000000..f9c68c098
--- /dev/null
+++ b/docs/proposals/scheduling/20240115-support-job-level-preemption.md
@@ -0,0 +1,251 @@
+---
+title: Support-Job-Level-Preemption
+authors:
+- "@xulinfei1996"
+reviewers:
+- "@buptcozy"
+- "@eahydra"
+- "@hormes"
+creation-date: 2024-01-30
+last-updated: 2024-01-30
+status: provisional
+
+---
+
+# Support Job Level Preemption
+
+- [Support Job Level Preemption](#support-job-level-preemption)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-goals/Future work](#non-goalsfuture-work)
+  - [Proposal](#proposal)
+    - [Key Concept/User Stories](#key-conceptuser-stories)
+    - [Implementation Details](#implementation-details)
+      - [Non-Preemptible and Preemptible](#non-preemptible-and-preemptible)
+      - [Quota Assignment](#quota-assignment)
+      - [Preemption](#preemption)
+        - [Job Level Preemption](#job-level-preemption)
+        - [Adaptive Eviction Approach](#adaptive-eviction-approach)
+      - [Extension Point](#extension-point)
+        - [Overall](#overall)
+        - [PreFilter](#prefilter)
+        - [PostFilter](#postfilter)
+    - [API](#api)
+      - [Pod](#pod)
+      - [Plugin Args](#plugin-args)
+    - [Compatibility](#compatibility)
+  - [Unsolved Problems](#unsolved-problems)
+  - [Alternatives](#alternatives)
+  - [Implementation History](#implementation-history)
+  - [References](#references)
+
+## Summary
+This proposal introduces a job-level resource assignment and preemption mechanism for the scheduler.
+
+## Motivation
+In a large-scale shared cluster, some quotas may be very busy while others are idle. The ElasticQuota plugin already
+supports borrowing resources from idle ElasticQuotas. However, when an ElasticQuota needs to reclaim the resources it
+has lent out, Job granularity is not considered. For Pods belonging to the same Job, we need to perform preemption at
+Job granularity to ensure that the Job obtains enough resources to meet its requirements and to improve resource
+delivery efficiency.
+
+### Goals
+
+1. Supplement the necessary APIs, improve the preemption-related APIs, and support a job-granularity preemption mechanism.
+
+2. Clearly define the job-granularity preemption mechanism.
+
+### Non-goals/Future work
+
+## Proposal
+
+### Key Concept/User Stories
+#### Story1
+If a job contains multiple pods and only some of them can be assigned successfully, we expect the scheduler to
+trigger preemption for the remaining unassigned pods.
+
+#### Story2
+If only some of the unassigned pods can be placed during preemption, the preemption should not actually take
+effect; that is, no victim pods are evicted.
+
+1. Each quota group declares its own "min" and "max". "min" is the quota group's non-preemptible resources:
+   if the quota group's "request" is less than or equal to "min", the quota group is guaranteed resources equal to
+   its "request". "max" is the quota group's upper limit of resources. We require that "min" be less than or equal
+   to "max" (see the example ElasticQuota after this list).
+
+2. "Non-preemptible" means a pod cannot be preempted during preemption, so the usage of non-preemptible pods is
+   limited by "min". "Preemptible" means a pod can be preempted during preemption, so the usage of preemptible pods
+   is limited by "max" minus the requests of the non-preemptible pods.
+
+3. There are two kinds of pods: non-preemptible and preemptible. Non-preemptible pods are limited by the quota's
+   "min"; preemptible pods are limited by the quota's "max". If a job is preemptible, label its pods with
+   `quota.scheduling.koordinator.sh/preemptible=true`; if a job is non-preemptible, label its pods with
+   `quota.scheduling.koordinator.sh/preemptible=false`.
+
+4. Typically, `pod.spec.preemptionPolicy` is used to judge whether a pod may preempt preemptible pods: if the pod's
+   policy is PreemptLowerPriority, it can trigger preemption, and higher-priority pods can preempt lower-priority
+   ones. `pod.spec.preemptionPolicy` is a commonly used parameter for deciding whether a Pod can trigger preemption
+   eviction. However, users may not have set it deliberately, so all existing pods could potentially trigger
+   preemption, which is not our intention. Therefore, we introduce a new label
+   `quota.scheduling.koordinator.sh/can-preempt` to address this.
+
+5. In AI scenarios, a job has the "AllOrNothing" property, which means resource allocation and preemption must be
+   executed at the job level, not the pod level. In koordinator, besides `pod-group.scheduling.sigs.k8s.io`, other
+   labels are also supported to identify Pods that belong to the same Job. For compatibility, we continue to use
+   the label `pod-group.scheduling.sigs.k8s.io`.
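+
+As an illustration of the "min"/"max" semantics in item 1, an ElasticQuota could look roughly like the following
+(a sketch only; the name and resource values are made up for illustration):
+
+```yaml
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: quota-example
+  namespace: default
+spec:
+  min:          # guaranteed, non-preemptible resources
+    cpu: 10
+    memory: 20Gi
+  max:          # upper limit of resources, must be >= min
+    cpu: 40
+    memory: 80Gi
+```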
+
+### Implementation Details
+
+#### Non-Preemptible and Preemptible
+When users submit pods, they should mark the pods as non-preemptible or preemptible via the label configuration
+above. In koordinator, if a pod does not declare `quota.scheduling.koordinator.sh/preemptible`, the pod can be
+preempted. However, in some scenarios the expectation is the opposite: the existing running pods do not carry this
+label and must not be preempted. So we add a new plugin arg to control this behaviour; see the API section.
+
+#### Quota Assignment
+When a user submits a job, a job ID should be filled into each pod's labels. When we try to assign a pod, we first
+list all pods of the same job and sum their requests as totalRequest. For a non-preemptible job, we check whether
+nonPreemptibleTotalUsed + newNonPreemptibleJobTotalRequest <= minQuota and totalUsed + newNonPreemptibleJobTotalRequest <= maxQuota.
+For a preemptible job, we check whether math.Min(minQuota, nonPreemptibleTotalRequest) + preemptibleTotalUsed + newPreemptibleJobTotalRequest <= maxQuota.
+
+[Koordinator Quota Management](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220722-multi-hierarchy-elastic-quota-management.md)
+provides a mechanism, called "runtime quota", to share idle resources among quotas. Generally it distributes idle
+resources fairly to the quotas according to a certain proportion. However, in AI training scenarios, due to the
+job's "AllOrNothing" property, the runtime quota mechanism may prevent a preemptible job from being assigned
+successfully, because each quota only obtains part of the idle resources. Our strategy is therefore to assign idle
+resources to the quotas greedily: if one quota's preemptible job comes first, it can use all the idle resources,
+up to the quota's max.
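+
+A minimal Go sketch of the two checks above (the types and helpers here are illustrative, not the actual plugin code):
+
+```go
+package quota
+
+// Resources is a simplified resource vector used only for this sketch.
+type Resources struct {
+	MilliCPU int64
+	Memory   int64
+}
+
+func add(a, b Resources) Resources {
+	return Resources{a.MilliCPU + b.MilliCPU, a.Memory + b.Memory}
+}
+
+// minRes takes the element-wise minimum, mirroring math.Min in the text.
+func minRes(a, b Resources) Resources {
+	r := a
+	if b.MilliCPU < r.MilliCPU {
+		r.MilliCPU = b.MilliCPU
+	}
+	if b.Memory < r.Memory {
+		r.Memory = b.Memory
+	}
+	return r
+}
+
+func lessEqual(a, b Resources) bool {
+	return a.MilliCPU <= b.MilliCPU && a.Memory <= b.Memory
+}
+
+// QuotaState holds the quantities the checks need (hypothetical shape).
+type QuotaState struct {
+	Min, Max              Resources
+	TotalUsed             Resources
+	NonPreemptibleUsed    Resources
+	NonPreemptibleRequest Resources
+	PreemptibleUsed       Resources
+}
+
+// JobFitsQuota mirrors the two admission conditions above.
+func JobFitsQuota(q QuotaState, jobTotalRequest Resources, preemptible bool) bool {
+	if !preemptible {
+		// Non-preemptible jobs consume guaranteed resources under min,
+		// and total usage must still stay under max.
+		return lessEqual(add(q.NonPreemptibleUsed, jobTotalRequest), q.Min) &&
+			lessEqual(add(q.TotalUsed, jobTotalRequest), q.Max)
+	}
+	// Preemptible jobs may only use what remains above the guaranteed part.
+	base := minRes(q.Min, q.NonPreemptibleRequest)
+	return lessEqual(add(add(base, q.PreemptibleUsed), jobTotalRequest), q.Max)
+}
+```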
+
+#### Preemption
+##### Job Level Preemption
+The new job-granularity preemption mechanism follows the existing scheduling and orchestration constraint strategy,
+and re-runs the Filter plugins during the preemption process to ensure that the candidate nodes meet the
+requirements of the preempting Pods.
+
+##### Adaptive Eviction Approach
+The plugin patches the label `scheduling.koordinator.sh/soft-eviction` onto each preempted pod to mark it as
+preempted. We expect a third-party operator to handle this label and perform the actual pod deletion, which gives
+users room to do cleanup work before the pod is really deleted. If multiple operators need to do cleanup work, the
+user can also add pod finalizers to achieve this. Finally, we also record a preemption event for the preempted pod.
+
+#### Extension Point
+
+##### Overall
+The new/changed parts are:
+1. Non-preemptible and preemptible resource judgement.
+2. Job-level preemption logic.
+3. Quota resource recycling logic.
+
+##### PreFilter
+We first list all pods of the same job and sum their requests as totalRequest. For a non-preemptible job, we check
+whether nonPreemptibleTotalUsed + newNonPreemptibleJobTotalRequest <= minQuota and totalUsed + newNonPreemptibleJobTotalRequest <= maxQuota;
+for a preemptible job, we check whether math.Min(minQuota, nonPreemptibleTotalRequest) + preemptibleTotalUsed + newPreemptibleJobTotalRequest <= maxQuota.
+
+##### PostFilter
+We need job-level preemption, which means we cannot execute preemption pod by pod as usual: if preemption failed
+for one pod, the preemptions that already succeeded for the other pods would be meaningless and even harmful. The
+process is as follows (a sketch appears after this list):
+
+1. Judge whether the pod can trigger preemption, based on the plugin args and the pod's labels. A non-preemptible
+   job can trigger preemption if nonPreemptibleTotalUsed + nonPreemptibleJobRequest <= minQuota; a preemptible job
+   can trigger preemption if math.Min(minQuota, nonPreemptibleTotalRequest) + preemptibleTotalUsed + newPreemptibleJobTotalRequest <= maxQuota.
+2. List all pods of the job, then iterate over the framework's waiting pods to get the list of not-yet-assumed pods
+   (some pods may have been assumed successfully in a previous assign cycle and be waiting at the Permit stage).
+3. Iterate over the entries of PostFilter's filteredNodeStatusMap whose status code is not
+   UnschedulableAndUnresolvable, and tentatively remove all preemptible pods from a copy of each nodeInfo. To avoid
+   ending up with nonPreemptibleUsed + preemptibleUsed > maxQuota after preemption, we also list all running
+   preemptible pods of this quota and tentatively remove them from the nodeInfo copies as well.
+4. For each not-yet-assumed pod of the job, run RunFilterPlugins against the nodeInfo copy; on success, add the pod
+   into that nodeInfo copy.
+5. If the preemption process succeeds for all not-yet-assumed pods, the job preemption succeeds. We then run the
+   try-assign-back logic for the tentatively removed pods and finally pick the pods and nodes that are actually
+   preempted. If preemption fails for any not-yet-assumed pod, this round of preemption fails as a whole. During
+   the assign-back process, we must check the quota limit of the preempting quota.
+6. For all preempted pods, we patch the preemption label so that an upper-layer operator can handle cleanup logic
+   and delete the pod.
+7. For all preemptor pods, we first call framework.AddNominatedPods() to hold the resources in memory, and we store
+   the preemptor pod, the preempted pods, and the preempted nodes in a local cache. Because preempted pods are
+   deleted by another operator, if the deletion is slow and the resources have not yet been returned, the next
+   round of scheduling would trigger preemption again, possibly generating new victims, and this process might not
+   stop until all preemptible pods are preempted. Therefore, at the beginning of PostFilter, we check the cache to
+   see whether the last round of preemption has finished.
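+
+The all-or-nothing dry run in steps 3-5 can be sketched as follows (illustrative Go; the node/pod bookkeeping is
+simplified, and `isPreemptible`/`fits` stand in for the label check and the RunFilterPlugins call respectively):
+
+```go
+package preemption
+
+import corev1 "k8s.io/api/core/v1"
+
+// nodeState is a simplified stand-in for the scheduler's NodeInfo copy.
+type nodeState struct {
+	pods []*corev1.Pod
+}
+
+func (n *nodeState) clone() *nodeState {
+	c := &nodeState{pods: make([]*corev1.Pod, len(n.pods))}
+	copy(c.pods, n.pods)
+	return c
+}
+
+func (n *nodeState) removePod(p *corev1.Pod) {
+	for i, q := range n.pods {
+		if q.UID == p.UID {
+			n.pods = append(n.pods[:i], n.pods[i+1:]...)
+			return
+		}
+	}
+}
+
+func (n *nodeState) addPod(p *corev1.Pod) { n.pods = append(n.pods, p) }
+
+// dryRunJobPreemption simulates removing every preemptible pod from the
+// candidate nodes, then checks that each not-yet-assumed pod of the job
+// fits somewhere. If any pod cannot be placed, no victims are returned,
+// so nothing is actually evicted (all-or-nothing).
+func dryRunJobPreemption(
+	jobPods []*corev1.Pod,
+	candidates []*nodeState,
+	isPreemptible func(*corev1.Pod) bool,
+	fits func(*corev1.Pod, *nodeState) bool,
+) (victims []*corev1.Pod, ok bool) {
+	copies := make([]*nodeState, 0, len(candidates))
+	for _, node := range candidates {
+		c := node.clone()
+		for _, p := range node.pods {
+			if isPreemptible(p) {
+				c.removePod(p)
+				victims = append(victims, p)
+			}
+		}
+		copies = append(copies, c)
+	}
+	for _, pod := range jobPods {
+		placed := false
+		for _, c := range copies {
+			if fits(pod, c) {
+				c.addPod(pod)
+				placed = true
+				break
+			}
+		}
+		if !placed {
+			return nil, false
+		}
+	}
+	// The caller would next try to assign victims back (step 5) to keep
+	// the real eviction set minimal.
+	return victims, true
+}
+```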
+
+### API
+#### Pod
+We introduce some labels to describe pod behaviour:
+- `pod-group.scheduling.sigs.k8s.io` is filled by the user and describes the pod's job name.
+- `quota.scheduling.koordinator.sh/preemptible` is filled by the user and indicates whether the pod is non-preemptible or preemptible.
+- `scheduling.koordinator.sh/soft-eviction` is filled by the scheduler and indicates that the pod has been preempted.
+- `pod.spec.preemptionPolicy` (a spec field rather than a label) is filled by the user and describes whether a pod can trigger preemption of preemptible pods.
+- `quota.scheduling.koordinator.sh/can-preempt` is filled by the user and also describes whether a pod can trigger preemption of preemptible pods.
+
+If a pod does not carry the `quota.scheduling.koordinator.sh/preemptible` label, the pod is treated as preemptible
+by default (this default can be changed via the plugin args below).
+
+If you want to declare that a pod belongs to a job, use:
+```yaml
+labels:
+  pod-group.scheduling.sigs.k8s.io: "job1"
+```
+
+If you want to declare that a pod is non-preemptible or preemptible, use the following (all pods of one job should be homogeneous):
+```yaml
+labels:
+  quota.scheduling.koordinator.sh/preemptible: "false"
+```
+or
+```yaml
+labels:
+  quota.scheduling.koordinator.sh/preemptible: "true"
+```
+
+If you want to declare that a pod can trigger preemption of preemptible pods, use:
+```yaml
+spec:
+  preemptionPolicy: PreemptLowerPriority
+labels:
+  quota.scheduling.koordinator.sh/can-preempt: "true"
+```
+
+If you want to check the preemption message of a preempted pod, look at the label:
+```yaml
+labels:
+  scheduling.koordinator.sh/soft-eviction: "{preemptJob:job2...}"
+```
+
+#### Plugin Args
+We introduce the following plugin args:
+- `disableDefaultPreemptiable` controls whether a pod without the label `quota.scheduling.koordinator.sh/preemptible` can be preempted.
+- `disableDefaultTriggerPreemption` controls whether a pod triggers preemption via the label `quota.scheduling.koordinator.sh/can-preempt` rather than via the preemption policy PreemptLowerPriority.
+- `podShouldBeDeletedWhenPreempted` controls whether a preempted pod is deleted directly by the scheduler.
+
+```
+args:
+  disableDefaultPreemptiable: true|false
+  disableDefaultTriggerPreemption: true|false
+  podShouldBeDeletedWhenPreempted: true|false
+```
+
+### Compatibility
+In some scenarios, users may expect the pod to be deleted directly by the scheduler, while others want other
+operators to do cleanup work before the pod is really deleted. So we introduce a new plugin arg
+(`podShouldBeDeletedWhenPreempted`) to control this behaviour.
+
+In koordinator, if a pod does not declare `quota.scheduling.koordinator.sh/preemptible`, the pod can be preempted.
+However, in some scenarios the expectation is the opposite: the existing running pods do not carry this label and
+must not be preempted. So we add a new plugin arg (`disableDefaultPreemptiable`) to control this behaviour.
+
+In koordinator, if a pod declares `preemptionPolicy: PreemptLowerPriority`, the pod can trigger preemption.
+However, in some scenarios the expectation is the opposite: all existing running pods' preemptionPolicy equals
+PreemptLowerPriority, yet they are not meant to trigger preemption. So we introduce a new label and a new plugin
+arg (`disableDefaultTriggerPreemption`) to control this behaviour.
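+
+For illustration, these args might be wired into the scheduler configuration roughly as follows (a sketch only: the
+profile name, plugin name, and exact args schema are assumptions, not a final API):
+
+```yaml
+apiVersion: kubescheduler.config.k8s.io/v1
+kind: KubeSchedulerConfiguration
+profiles:
+- schedulerName: koord-scheduler
+  pluginConfig:
+  - name: ElasticQuota   # plugin name assumed for this sketch
+    args:
+      disableDefaultPreemptiable: true
+      disableDefaultTriggerPreemption: true
+      podShouldBeDeletedWhenPreempted: false
+```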
+
+We use `pod-group.scheduling.sigs.k8s.io` to declare the job; this label is already used by Coscheduling. In
+Coscheduling, the user can declare minimumNumber and totalNumber. For now, we only support the
+minimumNumber=totalNumber scenario, because of the properties of AI training jobs.
+
+## Unsolved Problems
+
+## Alternatives
+
+## Implementation History
+
+## References
\ No newline at end of file