proposal: support Reservation preemption
Signed-off-by: xulinfei.xlf <xulinfei.xlf@alibaba-inc.com>
xulinfei.xlf committed Feb 1, 2024
1 parent ad36a0b commit 3fefae6
Showing 1 changed file with 172 additions and 0 deletions.
172 changes: 172 additions & 0 deletions docs/proposals/scheduling/20240201-enbale-reservation-preempt.md
@@ -0,0 +1,172 @@
---
title: Enable-Reservation-Preempt
authors:
- "@xulinfei1996"
reviewers:
- "@buptcozy"
- "@eahydra"
- "@hormes"
- "@yihuifeng"
creation-date: 2024-02-01
last-updated: 2024-02-01
status: provisional

---

# Enable-Reservation-Preempt

<!-- TOC -->

- [Enable-Reservation-Preempt](#enable-reservation-preempt)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-goals/Future work](#non-goalsfuture-work)
- [Proposal](#proposal)
- [Key Concept/User Stories](#key-conceptuser-stories)
- [Implementation Details](#implementation-details)
- [Priority](#priority)
- [Preemption](#preemption)
- [Response To Preempted](#response-to-preempted)
- [Extension Point](#extension-point)
- [Over All](#over-all)
- [PreFilter](#prefilter)
- [PostFilter](#postfilter)
- [API](#api)
- [Reservation](#reservation)
- [Compatibility](#compatibility)
- [Unsolved Problems](#unsolved-problems)
- [Alternatives](#alternatives)
- [Implementation History](#implementation-history)
- [References](#references)

<!-- /TOC -->

## Summary
This proposal introduces a reservation preemption mechanism for the scheduler.

## Motivation
In business scenarios, users may create more reservations than the cluster can hold, so some reservations fail to
schedule because of insufficient resources. Reservations can also carry different priorities to represent resources with
different SLA levels. However, reservation preemption is not supported today, so a higher-priority reservation cannot
reclaim resources held by lower-priority reservations. In addition, reservations are often created in batches, so
reservation preemption is expected to support batch preemption.

### Goals

1. Define an API to declare that a reservation can trigger preemption and can be preempted.

2. Define an API to set a reservation's priority.

### Non-goals/Future work

## Proposal

### Key Concept/User Stories
Story 1: A reservation fails to schedule because cluster resources are insufficient, so it needs to preempt other
reservations. In some scenarios, users create reservations for a tenant and schedule them to nodes, which effectively
binds those nodes to the tenant. Other tenants' workloads then cannot use the resources this tenant has purchased.
However, as users expand the ways in which they sell reservations, they also offer lower-priority reservations. When
higher-priority reservations arrive later, the cluster may not have enough resources to schedule them. For now, we
assume that the pods bound to the preempted reservations still need to be evicted.

Story 2: Only higher-priority reservations are allowed to preempt lower-priority reservations when they fail to
schedule. Pods are not allowed to preempt reservations. Typically, the reservations purchased by a tenant are
homogeneous, and the tenant's training workloads benefit when the reservations are scheduled in the same topology unit.
Therefore, we aim to do preemption in batches. Users declare whether a reservation can trigger preemption through
preemptionPolicy: if the preemptionPolicy is PreemptLowerPriority, the reservation can trigger preemption. Hence, the
implementation needs to be extended so that users can set a reservation's priority.

### Implementation Details
If a reservation fails to schedule after PreFilter and Filter, and the reservation is allowed to trigger preemption, the
scheduler can trigger preemption in PostFilter. During preemption, the reservation can preempt lower-priority
reservations. However, in the current koordinator scheduler implementation, reservations are not allowed to preempt, and
a reservation's priority is set to the highest value by default.

We now want to introduce a priority-based preemption mechanism for reservations, so it is necessary to set each
reservation's priority.

#### Priority
When a reservation is created, its priority should be set as follows.
- `spec.preemptionPolicy` is filled in by the user and describes whether the reservation can trigger preemption.
- `scheduling.koordinator.sh/reservation-priority` is filled in by the user and describes the reservation's priority.
  Only higher-priority reservations can preempt lower-priority reservations.
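
As a rough illustration of how the priority label could be read, here is a minimal Go sketch. The label key comes from this proposal; the helper name `reservationPriority`, the plain `map[string]string` input, and the choice to treat a missing or non-integer value as "no priority declared" are assumptions made for the example, not the koordinator implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// LabelReservationPriority is the label key named in this proposal.
const LabelReservationPriority = "scheduling.koordinator.sh/reservation-priority"

// reservationPriority is a hypothetical helper. It returns the priority
// declared in the label, and false when the label is absent or is not a
// 32-bit integer. How the real scheduler handles those cases is not
// specified by the proposal.
func reservationPriority(labels map[string]string) (int32, bool) {
	raw, ok := labels[LabelReservationPriority]
	if !ok {
		return 0, false
	}
	v, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return 0, false
	}
	return int32(v), true
}

func main() {
	labels := map[string]string{LabelReservationPriority: "9900"}
	if p, ok := reservationPriority(labels); ok {
		fmt.Println("priority:", p)
	}
}
```

Returning a boolean alongside the value keeps the "not declared" case distinct from an explicit priority of 0, which matters if the default priority is supposed to stay at the highest value.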

#### Preemption
Reservation preemption still follows the existing Filter/PostFilter procedure and can be combined with the job-level
preemption mechanism.

The preemption strategy for reservations is as follows:
1. Only a reservation with preemptionPolicy=PreemptLowerPriority can trigger preemption, and only lower-priority
   reservations can be preempted.

2. Reservations can only preempt reservations.

3. If only part of a batch of reservations can be assigned successfully in the preemption dry-run, the preemption does
   not actually happen.
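
The three rules above can be sketched in Go as follows. This is a simplified model under stated assumptions, not the plugin code: the `Reservation` struct, `canPreempt`, and `dryRunBatch` are hypothetical names, resources are a single abstract integer, and all reservations are assumed to compete for one shared pool rather than per-node capacity.

```go
package main

import "fmt"

// Reservation is a simplified, hypothetical model, not the koordinator CRD type.
type Reservation struct {
	Name             string
	Priority         int32
	PreemptionPolicy string // "PreemptLowerPriority" or "Never"
	Request          int64  // resource request in abstract units
}

// canPreempt applies rule 1: the preemptor must declare PreemptLowerPriority
// and the victim must have strictly lower priority. Rule 2 (reservations only
// preempt reservations) is implied here because both arguments are reservations.
func canPreempt(preemptor, victim Reservation) bool {
	return preemptor.PreemptionPolicy == "PreemptLowerPriority" &&
		victim.Priority < preemptor.Priority
}

// dryRunBatch applies rule 3: it simulates placing every member of the batch,
// evicting eligible victims as needed, and returns the victim names only if
// the whole batch fits. On partial success it returns (nil, false), so the
// caller evicts nothing.
func dryRunBatch(batch, running []Reservation, free int64) ([]string, bool) {
	evicted := make(map[string]bool)
	var victims []string
	for _, r := range batch {
		for _, v := range running {
			if free >= r.Request {
				break
			}
			if evicted[v.Name] || !canPreempt(r, v) {
				continue
			}
			evicted[v.Name] = true
			victims = append(victims, v.Name)
			free += v.Request
		}
		if free < r.Request {
			return nil, false
		}
		free -= r.Request
	}
	return victims, true
}

func main() {
	running := []Reservation{
		{Name: "low-1", Priority: 100, Request: 4},
		{Name: "low-2", Priority: 100, Request: 4},
	}
	batch := []Reservation{
		{Name: "high-1", Priority: 9900, PreemptionPolicy: "PreemptLowerPriority", Request: 4},
		{Name: "high-2", Priority: 9900, PreemptionPolicy: "PreemptLowerPriority", Request: 4},
	}
	victims, ok := dryRunBatch(batch, running, 0)
	fmt.Println(victims, ok)
}
```

Because the victim list is only returned when every batch member fits, a caller that evicts strictly from that list never performs a partial preemption.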

#### Response To Preempted
Once a reservation is chosen to be evicted, it follows the eviction mechanism implemented by the scheduler, which
supports soft eviction or deletion depending on the implementation. The only difference is that we need to check whether
the evicted object is a pod or a reservation.

#### Extension Point

##### Over All
In general, we will extend the job-oversold plugin and modify other plugins to support reservation preemption.
The new (delta) parts are:
1. Enable reservation pods to preempt.
2. Enable reservation pods to be preempted.
3. Patch the evict label/annotation to the Reservation.
4. Register a Reservation event handler for the job-oversold plugin.

##### PreFilter
If the pod is a reservation pod, no quota check is needed, because reservations are not associated with a quota yet.

##### PostFilter
We will keep the existing implementation process, with the following differences for reservation pods:

1. The quota check is skipped.
2. Reservation pods only preempt reservation pods.

### API
#### Reservation
We use the following labels and fields to describe reservation behaviour.
- `pod-group.scheduling.sigs.k8s.io` is filled in by the user and declares that the reservation should be scheduled as
  part of a batch.
- `scheduling.koordinator.sh/soft-eviction` is filled in by the scheduler and indicates that the reservation has been
  preempted.
- `spec.preemptionPolicy` is filled in by the user and describes whether the reservation can trigger preemption.
- `scheduling.koordinator.sh/reservation-priority` is filled in by the user and describes the reservation's priority.
  Only higher-priority reservations can preempt lower-priority reservations.

To declare that reservations belong to the same batch, set the following label:
```yaml
labels:
  pod-group.scheduling.sigs.k8s.io: "reservation-batch-1"
```

To declare that a reservation can trigger preemption, set the following field:
```yaml
spec:
  preemptionPolicy: PreemptLowerPriority
```

To check the preemption message of a preempted reservation, look at the following label:
```yaml
labels:
  scheduling.koordinator.sh/soft-eviction: "{preemptReservation:reservation2...}"
```

To set a reservation's priority, set the following label:
```yaml
labels:
  scheduling.koordinator.sh/reservation-priority: "9900"
```

### Compatibility
We use `pod-group.scheduling.sigs.k8s.io` to declare that reservations need to be scheduled as a batch. This label is
already used by CoScheduling, where the user can declare minimumNumber and totalNumber. (todo) For now, we only support
the minimumNumber=totalNumber scenario.
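
To make the batch constraint concrete, the following Go sketch groups reservations by the batch label and checks that a batch is complete. The label key is taken from this proposal; the `Reservation` struct, `groupByBatch`, and `batchComplete` are hypothetical names for illustration, and passing `totalNumber` as a plain integer is an assumption about how the value would be obtained.

```go
package main

import "fmt"

// LabelPodGroup is the batch label key named in this proposal.
const LabelPodGroup = "pod-group.scheduling.sigs.k8s.io"

// Reservation is a simplified, hypothetical model, not the koordinator CRD type.
type Reservation struct {
	Name   string
	Labels map[string]string
}

// groupByBatch collects reservation names per batch, in input order.
// Reservations without the batch label are left out of every group.
func groupByBatch(rs []Reservation) map[string][]string {
	groups := make(map[string][]string)
	for _, r := range rs {
		if batch, ok := r.Labels[LabelPodGroup]; ok {
			groups[batch] = append(groups[batch], r.Name)
		}
	}
	return groups
}

// batchComplete reflects the minimumNumber=totalNumber restriction:
// a batch is only usable when every one of its members is present.
func batchComplete(members []string, totalNumber int) bool {
	return len(members) == totalNumber
}

func main() {
	rs := []Reservation{
		{Name: "r1", Labels: map[string]string{LabelPodGroup: "reservation-batch-1"}},
		{Name: "r2", Labels: map[string]string{LabelPodGroup: "reservation-batch-1"}},
		{Name: "solo"},
	}
	members := groupByBatch(rs)["reservation-batch-1"]
	fmt.Println(members, batchComplete(members, 2))
}
```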

## Alternatives

## Unsolved Problems

## Implementation History

## References
