diff --git a/docs/proposals/scheduling/20240201-enbale-reservation-preempt.md b/docs/proposals/scheduling/20240201-enbale-reservation-preempt.md
new file mode 100644
index 000000000..584450d33
--- /dev/null
+++ b/docs/proposals/scheduling/20240201-enbale-reservation-preempt.md
@@ -0,0 +1,171 @@
+---
+title: Enable-Reservation-Preempt
+authors:
+- "@xulinfei1996"
+reviewers:
+- "@buptcozy"
+- "@eahydra"
+- "@hormes"
+creation-date: 2024-02-01
+last-updated: 2024-02-01
+status: provisional
+
+---
+
+# Enable Reservation Preempt
+
+
+
+- [Enable Reservation Preempt](#enable-reservation-preempt)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-goals/Future work](#non-goalsfuture-work)
+  - [Proposal](#proposal)
+    - [Key Concept/User Stories](#key-conceptuser-stories)
+    - [Implementation Details](#implementation-details)
+      - [Priority](#priority)
+      - [Preemption](#preemption)
+      - [Response To Preempted](#response-to-preempted)
+      - [Extension Point](#extension-point)
+        - [Over All](#over-all)
+        - [PreFilter](#prefilter)
+        - [PostFilter](#postfilter)
+    - [API](#api)
+      - [Reservation](#reservation)
+    - [Compatibility](#compatibility)
+  - [Alternatives](#alternatives)
+  - [Unsolved Problems](#unsolved-problems)
+  - [Implementation History](#implementation-history)
+  - [References](#references)
+
+
+
+## Summary
+This proposal provides a reservation preemption mechanism for the scheduler.
+
+## Motivation
+In business scenarios, too many reservations may be created. In such cases, some reservations cannot be scheduled
+due to insufficient resources. In addition, reservations can be given different priorities to declare resources with different SLA levels.
+Reservation preemption is not supported today, so users want the scheduler to support it.
+Reservations may also be created in batches, so reservation preemption is expected to support batch preemption.
+
+### Goals
+
+1. Define an API to declare that a reservation can trigger preemption and can be preempted.
+
+2. Define an API to set the reservation priority.
+
+### Non-goals/Future work
+
+## Proposal
+
+### Key Concept/User Stories
+Story 1: A reservation fails to schedule due to insufficient cluster resources, so it needs to preempt other reservations.
+In some scenarios, users create reservations and schedule them to nodes on behalf of tenants, which binds those nodes to
+the tenant. In this way, other tenants' workloads cannot use the resources this tenant purchased. However, as users expand the ways
+in which they sell reservations, they also offer some lower-priority reservations. Hence, when higher-priority
+reservations arrive, the cluster may not have enough resources to schedule them. For now, we assume the pods
+bound to the preempted reservations still need to be evicted.
+
+Story 2: Only higher-priority reservations may preempt lower-priority reservations when they fail to schedule. Pods are
+not allowed to preempt reservations. Typically, the reservations purchased by a tenant are homogeneous, and the tenant's
+training workloads benefit if the reservations are scheduled in the same topology unit. Therefore, we aim to do
+preemption in batches. Users declare whether a reservation can trigger preemption through `preemptionPolicy`.
+If `preemptionPolicy` is `PreemptLowerPriority`, the reservation can trigger preemption. Hence, the implementation needs to be
+extended so that a reservation's priority can be set.
+
+### Implementation Details
+If a reservation fails to schedule after PreFilter and Filter, and the reservation can trigger preemption, the scheduler
+can trigger preemption in PostFilter. During preemption, the reservation can preempt lower-priority reservations.
+However, in the current koordinator scheduler implementation, reservations are not allowed to preempt, and a reservation's priority
+is set to the highest by default.
+
+Now we want to introduce a priority-based preemption mechanism for reservations, so it is necessary to set a reservation's
+priority.
+#### Priority
+When a reservation is created, its priority and preemption behavior should be set as follows.
+- `spec.preemptionPolicy` is filled in by the user and describes whether a reservation can trigger preemption.
+- `scheduling.koordinator.sh/reservation-priority` is filled in by the user and describes the reservation's priority. Only reservations
+  with higher priority can preempt reservations with lower priority.
+
+#### Preemption
+Reservation preemption still follows the existing Filter/PostFilter procedure and can be combined with the job-level
+preemption mechanism.
+
+The preemption strategy for reservations is as follows:
+1. Only a reservation with `preemptionPolicy=PreemptLowerPriority` can trigger preemption, and only lower-priority reservations can be preempted.
+
+2. Reservations can only preempt reservations.
+
+3. If only part of the reservations can be assigned successfully in the preemption dry-run, the preemption will not
+   actually happen.
+
+#### Response To Preempted
+Once a reservation is chosen to be evicted, it follows the eviction mechanism implemented by the scheduler, which supports soft eviction
+or deletion depending on the implementation. The only difference is that we need to check whether the evicted object is a pod or a reservation.
+
+#### Extension Point
+
+##### Over All
+In general, we will extend the job-oversold plugin and modify other plugins to support reservation preemption.
+The new or changed parts are:
+1. Enable reservation pods to preempt.
+2. Enable reservation pods to be preempted.
+3. Patch the evict label/annotation to the Reservation.
+4. Register a Reservation eventHandler for the job-oversold plugin.
+
+##### PreFilter
+If the pod is a reservation pod, no quota check is needed, because reservations are not associated with a quota yet.
+
+##### PostFilter
+We will keep the existing implementation process, with the following differences.
+
+1. Skip the quota check.
+2. Reservation pods only preempt reservation pods.
+
+### API
+#### Reservation
+We introduce the following labels and fields to describe reservation behaviour.
+- `pod-group.scheduling.sigs.k8s.io` is filled in by the user and declares that the reservation should be scheduled in a batch.
+- `scheduling.koordinator.sh/soft-eviction` is filled in by the scheduler and indicates that the reservation has been preempted.
+- `spec.preemptionPolicy` is filled in by the user and describes whether a reservation can trigger preemption.
+- `scheduling.koordinator.sh/reservation-priority` is filled in by the user and describes the reservation's priority. Only reservations
+  with higher priority can preempt reservations with lower priority.
+
+To declare that reservations belong to the same batch, use the following label:
+```yaml
+labels:
+  pod-group.scheduling.sigs.k8s.io: "reservation-batch-1"
+```
+
+To declare that a reservation can trigger preemption, set the following field:
+```yaml
+spec:
+  preemptionPolicy: PreemptLowerPriority
+```
+
+To check the preemption message of a preempted reservation, look at the following label:
+```yaml
+labels:
+  scheduling.koordinator.sh/soft-eviction: "{preemptReservation:reservation2...}"
+```
+
+To set the reservation priority, use the following label:
+```yaml
+labels:
+  scheduling.koordinator.sh/reservation-priority: "9900"
+```
+
+### Compatibility
+We use `pod-group.scheduling.sigs.k8s.io` to declare that reservations need to be scheduled in a batch. This label is
+already used in CoScheduling, where the user can declare minimumNumber and totalNumber. (todo) For now, we only support
+the minimumNumber=totalNumber scenario.
+
+## Alternatives
+
+## Unsolved Problems
+
+## Implementation History
+
+## References
\ No newline at end of file