From 1ecd79f748af2617111d0bfa1ca848f9e172f4d7 Mon Sep 17 00:00:00 2001
From: Traian Schiau <55734665+trasc@users.noreply.github.com>
Date: Thu, 23 Nov 2023 16:57:43 +0200
Subject: [PATCH] [KEP] A mechanism to stop a ClusterQueue (#1288)

* [KEP] A mechanism to stop a ClusterQueue

* Update keps/1284-cluster-queue-stop/README.md

Co-authored-by: Yaroslava Serdiuk

* update kep

* review fix

* added suggested integration test

---------

Co-authored-by: Yaroslava Serdiuk
Co-authored-by: Anton Stuchinskii
---
 keps/1284-cluster-queue-stop/README.md | 134 +++++++++++++++++++++++++
 keps/1284-cluster-queue-stop/kep.yaml  |  26 +++++
 2 files changed, 160 insertions(+)
 create mode 100644 keps/1284-cluster-queue-stop/README.md
 create mode 100644 keps/1284-cluster-queue-stop/kep.yaml

diff --git a/keps/1284-cluster-queue-stop/README.md b/keps/1284-cluster-queue-stop/README.md
new file mode 100644
index 0000000000..4a876a51b1
--- /dev/null
+++ b/keps/1284-cluster-queue-stop/README.md
@@ -0,0 +1,134 @@
+# KEP-1284: Add a mechanism to stop a ClusterQueue.
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Story 1](#story-1)
+  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [API/ClusterQueue](#apiclusterqueue)
+  - [Controllers](#controllers)
+    - [ClusterQueue](#clusterqueue)
+    - [Workload](#workload)
+  - [Test Plan](#test-plan)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit Tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+  - [Graduation Criteria](#graduation-criteria)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+
+## Summary
+Add a setting to a ClusterQueue that an administrator can use to pause new admissions, with the option to cancel current QuotaReservations and evict admitted workloads.
+
+## Motivation
+
+Pausing a queue is a common administrative task for controlling a user's resource usage.
+
+### Goals
+
+Add a setting to a ClusterQueue that an administrator can use to pause new admissions, with the option to cancel current QuotaReservations and evict admitted workloads.
+
+### Non-Goals
+
+Manage the QuotaReservation and Admission of workloads from the same cohort that might borrow resources from the ClusterQueue in question.
+
+## Proposal
+
+Add a new member, `stopPolicy`, to the ClusterQueue spec. Its presence marks the ClusterQueue as Inactive, and its value controls how `Admitted` or `Reserving` workloads are affected.
+
+### User Stories
+#### Story 1
+
+As a cluster administrator, I want to be able to stop new admissions in a specific ClusterQueue, with the option of evicting currently admitted Workloads or canceling QuotaReservations.
+
+### Notes/Constraints/Caveats
+Managing the cancellation of reservations and the eviction of workloads in other queues from the same cohort that
+are potentially borrowing resources from the stopped queue adds a considerable amount of complexity
+while having limited added value; therefore, these cases are not covered in this first iteration.
+
+### Risks and Mitigations
+
+## Design Details
+
+### API/ClusterQueue
+
+```go
+type ClusterQueueSpec struct {
+	// ....
+
+	// stopPolicy - if set the ClusterQueue is considered Inactive, no new reservation being
+	// made.
+	//
+	// Depending on its value, its associated workloads will:
+	//
+	// - None - Workloads are admitted
+	// - HoldAndDrain - Admitted workloads are evicted and Reserving workloads will cancel the reservation.
+	// - Hold - Admitted workloads will run to completion and Reserving workloads will cancel the reservation.
+	//
+	// +kubebuilder:validation:Enum=None;Hold;HoldAndDrain
+	// +kubebuilder:default="None"
+	StopPolicy StopPolicy `json:"stopPolicy,omitempty"`
+}
+
+type StopPolicy string
+
+const (
+	None         StopPolicy = "None"
+	Hold         StopPolicy = "Hold"
+	HoldAndDrain StopPolicy = "HoldAndDrain"
+)
+
+
+```
+### Controllers
+#### ClusterQueue
+
+Once the `stopPolicy` is set, the cluster queue is marked as inactive with a relevant status message.
+
+#### Workload
+
+If the `stopPolicy` of the cluster queue associated with a workload changes, then depending on the policy value and the state of the
+workload, the workload should be evicted or have its reservation canceled.
+
+### Test Plan
+
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+
+#### Unit Tests
+
+To be added depending on the added code complexity.
+
+#### Integration tests
+
+The `controllers/core` suite should check:
+
+1. ClusterQueue - Once the `stopPolicy` is set, a ClusterQueue becomes Inactive.
+2. Workload - Once its ClusterQueue `stopPolicy` is set, depending on the value:
+- The Reserving workloads cancel their reservation.
+- The Admitted workloads get evicted and the Reserving ones cancel their reservation.
+- A new workload is not admitted while the cluster queue is inactive.
+
+### Graduation Criteria
+
+
+## Implementation History
+
+
+## Drawbacks
+
+
+## Alternatives
+
diff --git a/keps/1284-cluster-queue-stop/kep.yaml b/keps/1284-cluster-queue-stop/kep.yaml
new file mode 100644
index 0000000000..dd42cc9cc4
--- /dev/null
+++ b/keps/1284-cluster-queue-stop/kep.yaml
@@ -0,0 +1,26 @@
+title: Add a mechanism to stop a ClusterQueue
+kep-number: 1284
+authors:
+  - "@trasc"
+status: implementable
+creation-date: 2023-10-30
+reviewers:
+  - "@tenzen-y"
+  - "@mwielgus"
+approvers:
+  - "@alculquicondor"
+
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: beta
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v0.6"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  beta: "v0.6"
+
+