diff --git a/docs/proposals/scheduling/20240201-enbale-reservation-preempt.md b/docs/proposals/scheduling/20240201-enbale-reservation-preempt.md
new file mode 100644
index 000000000..cd16b8dce
--- /dev/null
+++ b/docs/proposals/scheduling/20240201-enbale-reservation-preempt.md
@@ -0,0 +1,181 @@
+---
+title: Enable-Reservation-Preempt
+authors:
+- "@xulinfei1996"
+reviewers:
+- "@buptcozy"
+- "@eahydra"
+- "@hormes"
+creation-date: 2024-02-01
+last-updated: 2024-02-01
+status: provisional
+
+---
+
+# Enable Reservation Preempt
+
+
+
+- [Enable Reservation Preempt](#enable-reservation-preempt)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-goals/Future work](#non-goalsfuture-work)
+  - [Proposal](#proposal)
+    - [Key Concept/User Stories](#key-conceptuser-stories)
+    - [Implementation Details](#implementation-details)
+      - [Priority](#priority)
+      - [Preemption](#preemption)
+      - [Response To Preempted](#response-to-preempted)
+      - [Extension Point](#extension-point)
+        - [Over All](#over-all)
+        - [PreFilter](#prefilter)
+        - [PostFilter](#postfilter)
+    - [API](#api)
+      - [Reservation](#reservation)
+    - [Compatibility](#compatibility)
+  - [Alternatives](#alternatives)
+  - [Unsolved Problems](#unsolved-problems)
+  - [Implementation History](#implementation-history)
+  - [References](#references)
+
+
+
+## Summary
+This proposal adds a reservation preemption mechanism to the scheduler.
+
+## Motivation
+In business scenarios, more reservations may be created than the cluster can accommodate. In such cases, some reservations
+cannot be scheduled due to insufficient resources. In addition, reservations can set different priorities to declare resources with different SLA levels.
+Reservation preemption is not supported today, so users may want the scheduler to support it.
+Additionally, reservations may be created in batches, so reservation preemption is expected to support batch preemption.
+
+### Goals
+
+1. 
Define an API to declare that a reservation can trigger preemption and can be preempted.
+
+2. Define an API to set the reservation priority.
+
+### Non-goals/Future work
+
+## Proposal
+
+### Key Concept/User Stories
+Story 1: A reservation fails to schedule because cluster resources are insufficient, so it needs to preempt other reservations.
+In some scenarios, users create reservations and schedule them to nodes for a tenant, which effectively binds those nodes to
+that tenant. This way, other tenants' workloads cannot use the resources this tenant has purchased. However, as users expand the ways
+in which they sell reservations, they will also offer some lower-priority reservations. Hence, when higher-priority
+reservations arrive, the cluster may not have enough resources to schedule them. For now, we assume that the pods
+bound to the preempted reservations still need to be evicted.
+Typically, the reservations purchased by a tenant are homogeneous, and the tenant's training workloads benefit if
+the reservations are scheduled in the same topology unit. Therefore, we aim to do preemption in batches. Typically, users
+use preemptionPolicy to declare whether a reservation can trigger preemption or not. If the preemptionPolicy is
+PreemptLowerPriority, the reservation can trigger preemption. Hence, the implementation needs to be extended so that users can set
+a reservation's priority.
+
+Story 2: Only higher-priority reservations are allowed to preempt lower-priority reservations when they fail to schedule. Pods are
+not allowed to preempt reservations.
+
+
+### Implementation Details
+If a reservation fails to schedule after PreFilter and Filter, and the reservation can trigger preemption, the scheduler
+can trigger preemption in PostFilter. During the preemption, the reservation can preempt lower-priority reservations.
+
+By default, however, koord-scheduler sets the priority of a reserve pod (constructed from the Reservation) to Int32Max, which disables
+preempting a reservation even if the preemptor is also a Reservation. So we introduce a new label,
+`scheduling.koordinator.sh/reservation-priority`, to set the priority as needed. If koord-scheduler finds this label in
+Reservation.Labels, it uses the label value as the priority; otherwise it keeps the default behavior.
+`scheduling.koordinator.sh/reservation-priority` should be defined in Metadata.Labels of the Reservation. The higher the value,
+the higher the priority.
+
+To enable a Reservation to trigger preemption, users **MUST** set Reservation.Spec.Template.Spec.PreemptionPolicy
+to `PreemptLowerPriority`. Koord-scheduler does not set the preemptionPolicy of the reserve pod, so none of the existing
+reserve pods can trigger preemption.
+
+#### Priority
+When a reservation is created, its priority should be set as follows.
+- `spec.preemptionPolicy` is filled in by the user and describes whether a reservation can trigger preemption.
+- `scheduling.koordinator.sh/reservation-priority` is filled in by the user and describes the reservation's priority; only higher
+  priority reservations can preempt lower priority reservations.
+
+#### Preemption
+Reservation preemption still follows the existing Filter/PostFilter procedure and can be combined with the job-level
+preemption mechanism.
+
+The preemption strategy for reservations is as follows:
+1. Only a reservation with preemptionPolicy=PreemptLowerPriority can trigger preemption, and only lower priority reservations can be preempted.
+
+2. Reservations can only preempt reservations.
+
+3. If only some of the Reservations can be assigned successfully in the preemption dry-run process, the preemption will not
+   actually happen.
+
+4. If a reservation is preempted, the pods bound to it should also be evicted.
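
The priority label and preemption rules 1 to 3 above can be sketched in Go as follows. This is only an illustration under stated assumptions, not koord-scheduler code: the `reservation` struct and the helpers `priorityOf`, `canPreempt`, and `allAssigned` are hypothetical names, and falling back to the default priority when the label value cannot be parsed is an assumption the proposal does not specify.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

const (
	// Label key and policy value as named in the proposal.
	labelReservationPriority = "scheduling.koordinator.sh/reservation-priority"
	preemptLowerPriority     = "PreemptLowerPriority"
	// Default priority of a reserve pod when the label is absent (Int32Max).
	defaultPriority int32 = math.MaxInt32
)

// reservation is a hypothetical, simplified stand-in for the Reservation object.
type reservation struct {
	name             string
	labels           map[string]string // stands in for Metadata.Labels
	preemptionPolicy string            // stands in for Spec.Template.Spec.PreemptionPolicy
}

// priorityOf returns the label value as the priority. It keeps the default
// when the label is missing; doing the same on a parse error is an assumption.
func priorityOf(r reservation) int32 {
	v, ok := r.labels[labelReservationPriority]
	if !ok {
		return defaultPriority
	}
	p, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return defaultPriority
	}
	return int32(p)
}

// canPreempt applies rule 1: the preemptor must opt in through its
// preemptionPolicy, and the victim must have a strictly lower priority.
// Rule 2 is expressed by the types: both sides are reservations.
func canPreempt(preemptor, victim reservation) bool {
	if preemptor.preemptionPolicy != preemptLowerPriority {
		return false
	}
	return priorityOf(victim) < priorityOf(preemptor)
}

// allAssigned applies rule 3: a dry-run result is accepted only when every
// reservation in the batch was assigned.
func allAssigned(batch []reservation, assigned map[string]bool) bool {
	for _, r := range batch {
		if !assigned[r.name] {
			return false
		}
	}
	return true
}

func main() {
	r1 := reservation{"r1", map[string]string{labelReservationPriority: "9900"}, preemptLowerPriority}
	r2 := reservation{"r2", map[string]string{labelReservationPriority: "9800"}, preemptLowerPriority}
	legacy := reservation{name: "legacy"} // no label, no preemptionPolicy

	fmt.Println(canPreempt(r1, r2))     // true: r1 opts in and has higher priority
	fmt.Println(canPreempt(r2, r1))     // false: r1 has higher priority than r2
	fmt.Println(canPreempt(r1, legacy)) // false: legacy keeps the Int32Max default
	fmt.Println(allAssigned([]reservation{r1, r2}, map[string]bool{"r1": true}))
}
```

Because an unlabeled reservation keeps the Int32Max default and an unset preemptionPolicy never matches `PreemptLowerPriority`, existing reservations in this sketch can neither be preempted nor trigger preemption, which matches the default behavior described above.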
+
+#### Response To Preempted
+Once a reservation is chosen to be evicted, it follows the eviction mechanism implemented by the scheduler, which supports soft eviction
+or deletion depending on the implementation. The only difference is that we need to check whether the evicted object is a pod or a reservation.
+
+#### Extension Point
+
+##### Over All
+Generally, we will extend the elastic quota plugin and modify other plugins to support reservation preemption.
+The new/delta parts are:
+1. Enable reserve pods to preempt.
+2. Enable reserve pods to be preempted.
+3. Patch the evict label/annotation to the Reservation.
+4. Register a Reservation eventHandler for the job-oversold plugin.
+
+##### PreFilter
+If the pod is a reserve pod, no quota check is needed, because reservations are not associated with quota yet.
+
+##### PostFilter
+We will keep the existing implementation process, with the following differences.
+
+1. If the pod is a reserve pod, no quota check is needed, because reservations are not associated with quota yet.
+2. Reserve pods only preempt reserve pods.
+
+### API
+#### Reservation
+We introduce the following labels and fields to describe reservation behaviour.
+- `pod-group.scheduling.sigs.k8s.io` is filled in by the user and declares that the reservations should be scheduled as a batch.
+- `scheduling.koordinator.sh/soft-eviction` is filled in by the scheduler and indicates that the reservation has been preempted.
+- `spec.preemptionPolicy` is filled in by the user and describes whether a reservation can trigger preemption.
+- `scheduling.koordinator.sh/reservation-priority` is filled in by the user and describes the reservation's priority; only higher
+  priority reservations can preempt lower priority reservations.
+
+For example, consider the following two Reservations. If Reservation1 fails to schedule because resources are insufficient,
+Reservation1 can preempt Reservation2, because Reservation1's spec.PreemptionPolicy equals PreemptLowerPriority and its
+priority is higher than Reservation2's priority.
+
+Reservation1
+```yaml
+spec:
+  preemptionPolicy: PreemptLowerPriority
+labels:
+  scheduling.koordinator.sh/reservation-priority: "9900"
+```
+
+Reservation2
+```yaml
+spec:
+  preemptionPolicy: PreemptLowerPriority
+labels:
+  scheduling.koordinator.sh/reservation-priority: "9800"
+```
+
+
+### Compatibility
+We use `pod-group.scheduling.sigs.k8s.io` to declare that reservations need to be scheduled as a batch; this label is
+already used in CoScheduling.
+
+## Alternatives
+
+## Unsolved Problems
+In CoScheduling, users can declare minimumNumber and totalNumber. For now, we only support the minimumNumber=totalNumber scenario.
+
+For now, we only support reservations that are not associated with quota.
+
+## Implementation History
+
+## References
\ No newline at end of file