From 333e719775cf7fe48738eef583a740d8a7ab2a6e Mon Sep 17 00:00:00 2001 From: Vicente Ferrara Date: Wed, 13 Mar 2024 20:43:34 +0000 Subject: [PATCH 01/10] added kep --- keps/77-dynamically-sized-jobs/README.md | 0 keps/77-dynamically-sized-jobs/kep.yaml | 24 ++++++++++++++++++++++++ 2 files changed, 24 insertions(+) create mode 100644 keps/77-dynamically-sized-jobs/README.md create mode 100644 keps/77-dynamically-sized-jobs/kep.yaml diff --git a/keps/77-dynamically-sized-jobs/README.md b/keps/77-dynamically-sized-jobs/README.md new file mode 100644 index 0000000000..e69de29bb2 diff --git a/keps/77-dynamically-sized-jobs/kep.yaml b/keps/77-dynamically-sized-jobs/kep.yaml new file mode 100644 index 0000000000..f9b80a2928 --- /dev/null +++ b/keps/77-dynamically-sized-jobs/kep.yaml @@ -0,0 +1,24 @@ +title: +kep-number: 77 +authors: + - "@vicentefb" +status: provisional +creation-date: 2024-03-11 +reviewers: + - "@andrewsykim" + - "@alculquicondor" + - "@astefanutti" +approvers: + - "@alculquicondor" + - "@andrewsykim" + - "@astefanutti" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v0.7" + +feature-gates: + - name: DynamicallySizedJobs \ No newline at end of file From 5cb5882029ff94c3a27ec68d5f7b3d626e1b9bf4 Mon Sep 17 00:00:00 2001 From: Vicente Ferrara Date: Fri, 15 Mar 2024 02:25:54 +0000 Subject: [PATCH 02/10] kep updated applied toc --- keps/77-dynamically-sized-jobs/README.md | 286 +++++++++++++++++++++++ keps/77-dynamically-sized-jobs/kep.yaml | 2 +- 2 files changed, 287 insertions(+), 1 deletion(-) diff --git a/keps/77-dynamically-sized-jobs/README.md b/keps/77-dynamically-sized-jobs/README.md index e69de29bb2..0725c03e61 100644 --- a/keps/77-dynamically-sized-jobs/README.md +++ b/keps/77-dynamically-sized-jobs/README.md @@ -0,0 +1,286 @@ +# KEP-77: Dynamically Sized JObs + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1 - RayCluster w/ autoscaling](#story-1---raycluster-w-autoscaling) +- [Design Details](#design-details) + - [Workload Slices](#workload-slices) + - [Creating Workload Slices](#creating-workload-slices) + - [Pod Scheduling Gates](#pod-scheduling-gates) + - [Garbage Collecting Workload Slices](#garbage-collecting-workload-slices) +- [Phases for MVP (alpha)](#phases-for-mvp-alpha) + - [Phase 1 - Scale Down](#phase-1---scale-down) + - [Job controller](#job-controller) + - [Phase 2 - Aggregating Workload Slices](#phase-2---aggregating-workload-slices) + - [Phase 3 - Scale up with Workload Slices and Scheduling Gates](#phase-3---scale-up-with-workload-slices-and-scheduling-gates) + - [Scheduler](#scheduler) +- [Additional Details](#additional-details) + - [Feature Gate](#feature-gate) + - [Locking Flavor Assignments for Workload Slices](#locking-flavor-assignments-for-workload-slices) + - [Webhook changes](#webhook-changes) + - [Test Plan](#test-plan) + - [Unit Tests](#unit-tests) + - [Integration tests](#integration-tests) + - [Graduation Criteria](#graduation-criteria) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Ignore Resize from Kuberay](#ignore-resize-from-kuberay) + + +## Summary + +Enable dynamic sizing of Kueue jobs. 
For the MVP, we will only focus on supporting RayClusters with autoscaling enabled, but other resources that can benefit from dynamic sizing should be supported eventually. + +See: [Support dynamically sized (elastic) jobs #77](https://github.com/kubernetes-sigs/kueue/issues/77) + +## Motivation + +Kueue currently lacks support for resizing jobs. When a job is resized, Kueue will recreate the Workload representation of the job, leading to a disruptive suspend and requeue process. This limitation hinders the usability of Kueue for resources like RayCluster that sometimes have autoscaling capabilities enabled. + +To properly support RayCluster, Kueue needs to gracefully handle the scale up and scale down of RayCluster nodes. Concretely this means that scaling the resources used by a RayCluster is appropriately reflected in the respective ClusterQueue without needing to suspend the entire cluster. + +### Goals + +- Gracefully handle resize operations for Kueue jobs (i.e. update quota usage without suspend, enqueue scale ups) +- Autoscaling RayCluster works with Kueue (MVP) + +### Non-Goals + +- Vertical scaling of workloads – Kueue will only handle resize operations that scale Pods horizontally +- Support resize for other Kueue jobs such as QueuedWorkload, JobSet, etc (future) +- Resizing of RayJobs +- Partial Preemption + +## Proposal + +Update the Job framework reconciler and introduce new controllers to orchestrate dynamic resizing of jobs. We are only interested in horizontal scaling of jobs (e.g. scaling more replicas). At a high level, this will be accomplished by: +- Creating Workload Slice objects that represent incremental scale up of jobs. This gives us per-replica control of workload admission. Workload Slices will be garbage collected and consolidated with their parent workloads after successful admission. +- Adding default scheduling gates to control the scheduling of new pods based on their admission status. +- Dynamically adjust quotas in ClusterQueue based on scaling events. + +For the MVP, we will only focus on admission of **RayClusters** with autoscaling enabled. + +### User Stories + +#### Story 1 - RayCluster w/ autoscaling + +1. The user creates a RayCluster. +2. Kueue admits the RayCluster based on the requested resources in the head pod and worker pod. +3. User updates RayCluster to enable autoscaling +4. Kueue will not suspend and requeue the RayCluster, instead it will dynamically update ClusterQueue usage based on the scaling event. + +## Design Details + +### Workload Slices +To support horizontal scaling of jobs, we will introduce the concept of a "Workload Slice”. A Workload Slice is a Workload object with an owner reference to the original Workload for a job. Workload Slices represent per-replica changes to a job that were not initially accounted for when the job was created. + +The benefit of Workload Slices is that Kueue can evaluate admission on a per-replica basis without changing the existing semantics of the Workload API. Once a Workload Slice is admitted, it will be garbage collected and its resources will be aggregated into the admission status of the parent workload. + + +### Creating Workload Slices + +The [GenericJob interface](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobframework/interface.go#L30-L55) will be updated to handle resize operations of jobs. + +```golang +type GenericJob interface { + ... + ... 
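+	// ResizeJob is the proposed resize hook: implementations are expected to
+	// create a Workload Slice for every newly added replica of the job.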
+ ResizeJob(wl *kueue.Workload) error +} +``` +Jobs implementing the ResizeJob method will create a Workload Slice for every new replica of a job. + +### Pod Scheduling Gates + +Inside **raycluster_webhook** implement schedulingGate injection for pods on RayCluster creation time. Which will then be ungated following a similar behavior as to how a job is suspended and then unsuspended in the beginning. When we have a scale up, the new pods will be gated due to the schedulingGates injection in the webhook. + +After the creation of each individual Workload Slice and admission of a Workload Slice, the **workload_scheduling_gates_controller** should be in charge of removing the scheduling gates from each pod. We only need to ungate the number of pods to match the number of admitted pods, this should be a counter. We don’t want to accidentally ungate too many pods since race conditions could happen and we also don’t want to double count. + +### Garbage Collecting Workload Slices + +The logic of folding and deleting would be isolated in this controller that takes a look at the number of pods that were ungated. We don’t necessarily have to say that this Workload Slice belongs to a specific pod. + +1. You increment the ungated counter and pass the UID of the workload you are folding to the parent workload. If we still don’t see the workload being deleted, we at least know it has been counted towards the parent workload and we cannot count it again. +2. This UID can be seen in the parent’s workload spec. + + +## Phases for MVP (alpha) + +### Phase 1 - Scale Down +Scaling down will be the first phase towards MVP because it can be implemented without introducing the Workload Slices. + +Scaling down a RayCluster won’t involve the creation of Workload Slices, instead it’ll involve an update to the current workload, no requeuing. + +1. Compare job's PodSet.Count vs Workload.Spec.PodSets[1].Count (worker group) inside the jobframework generic reconciler. +2. Call *updateWorkloadToMatchJob()*, this will construct and update the workload and in turn update the PodSet Count field. +3. Inside the *Update()* method from the *workload_controller* update the workload in cache and in queue. By updating the workload in cache this will update the cluster queue resource usage and by updating the workload in queue this will trigger the scheduler so that it re-assigns the flavors to the already assumed workload and in this way PodSetAssignments will be updated by applying admission based on the new assignments. +4. Inside the schedule logic in the scheduler, since the workload is already assumed in the cache we need to specify if the feature is enabled so that we can apply admission to the workload and update its PodSetAssignments. + +#### Job controller + +Rework *equivalentToWorkload()* and *reconcile()* to account for potential differences between the job’s number of replicas and the running workload’s PodSet.Count. + +Given these changes, check if the delta is positive or negative indicating a scaleup/scaledown. If it’s a scaledown the reconciler should trigger an update on the workload to match the new job’s spec and no quota needs to be checked or accounted for, the cluster queue should update the workload resource usage. + +### Phase 2 - Aggregating Workload Slices + +In Phase 2, aggregating Workload Slices into the parent workload will be implemented. 
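
To make the scale-up/scale-down branching described in Phase 1 and Phase 2 concrete, here is a minimal, self-contained Go sketch. The types and function names are illustrative placeholders, not Kueue's actual job-framework API. It computes the per-PodSet delta between the replicas the job now requests and the replicas recorded on the admitted Workload, which is the signal the reconciler would use to choose between an in-place scale-down update and scale-up admission.

```golang
package main

import "fmt"

// PodSetCount pairs a PodSet name with a replica count. These types are
// illustrative stand-ins for the job object and the admitted Workload.
type PodSetCount struct {
	Name  string
	Count int32
}

// resizeDelta returns desired-minus-admitted replicas per PodSet.
// A negative value is a scale down (update the Workload in place, Phase 1);
// a positive value is a scale up (new quota has to be admitted, Phase 3).
func resizeDelta(desired, admitted []PodSetCount) map[string]int32 {
	delta := make(map[string]int32)
	for _, ps := range admitted {
		delta[ps.Name] -= ps.Count
	}
	for _, ps := range desired {
		delta[ps.Name] += ps.Count
	}
	return delta
}

func main() {
	desired := []PodSetCount{{Name: "head", Count: 1}, {Name: "workers", Count: 3}}
	admitted := []PodSetCount{{Name: "head", Count: 1}, {Name: "workers", Count: 5}}
	for name, d := range resizeDelta(desired, admitted) {
		switch {
		case d < 0:
			fmt.Printf("%s: scale down by %d, shrink the Workload's PodSet count\n", name, -d)
		case d > 0:
			fmt.Printf("%s: scale up by %d, create Workload Slices\n", name, d)
		default:
			fmt.Printf("%s: no resize\n", name)
		}
	}
}
```

A negative delta maps to the in-place Workload update of Phase 1, while a positive delta maps to the Workload Slices introduced for scale up.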
+ +### Phase 3 - Scale up with Workload Slices and Scheduling Gates + +In Phase 3, scale up will be implemented by introducing Workload Slices and adding Pod scheduling gates as part of Kueue’s mutating admission for RayCluster. + +When the RayCluster scales, the RayCluster webhook would be modified to intercept and "gate" all pods. Every time there’s a resize, you create a dependable (child) workload slice and once it's admitted, you increase the count in the original workload, delete the old workload and remove the schedulingGates. +- Pros: You are able to hold the pods added by Kuberay +- Cons: The fact of having schedulingGates, means we need an API call for every pod, because all pods that are created by the RayCluster are going to have schedulingGates. We need to remove those gates and for every pod you need to make API calls. + +#### Scheduler +Since every scale up will have its own individual workload they should be proposed to the current scheduling cycle and continue the normal admission process. We should lock flavor assignments to Workload Slices we need to ensure that the Workload Slice is assigned the same resource flavor as the parent workload. + +The *nominate()* returns the workloads with their requirements (resource flavors, borrowing) if they were admitted by the clusterQueues in the snapshot, so we need to return the original workload that was already admitted with the resize information inside *TotalRequests* so that *PodSetAssignments* is also updated. + +## Additional Details + +### Feature Gate +In kube_features add Elastic/Dynamic size jobs feature gate. + +### Locking Flavor Assignments for Workload Slices + +We can extract the flavor(s) that the parent workload is using through wl.Status.Admission.PodSetAssigments[1].Flavors + +We add a new field to the Workload Info object to know which parent flavor(s) were used. + +```golang +// Info holds a Workload object and some pre-processing. +type Info struct { + Obj *kueue.Workload + // list of total resources requested by the podsets. + TotalRequests []PodSetResources + // Populated from the queue during admission or from the admission field if + // already admitted. + ClusterQueue string + LastAssignment *AssignmentClusterQueueState + // Parent Flavors + ParentFlavor []string +} +``` + +Which can be passed on as an extra parameter in *getAssignments()* in the scheduler to flavorassigner.go when it assigns a flavor through *assignFlavors()* + +```golang +func (a *FlavorAssigner) assignFlavors(log logr.Logger, requests []workload.PodSetResources, ParentFlavor []string) Assignment {} +``` + +Which then calls *findFlavorForPodSetResource()* and we can use the parent flavor(s) value to check if this flavor can fit to the pod. If it doesn’t we don’t try to find another flavor for it. + +### Webhook changes +In **workload_webhook** modify *validateWorkloadUpdate()* it to make PodSets and PodSetAssignments mutable for a running job. + +```diff +if workload.HasQuotaReservation(oldObj) { +- allErrs = append(allErrs, apivalidation.ValidateImmutableField(newObj.Spec.PodSets, oldObj.Spec.PodSets, specPath.Child("podSets"))...) + allErrs = append(allErrs, apivalidation.ValidateImmutableField(newObj.Spec.PriorityClassSource, oldObj.Spec.PriorityClassSource, specPath.Child("priorityClassSource"))...) + allErrs = append(allErrs, apivalidation.ValidateImmutableField(newObj.Spec.PriorityClassName, oldObj.Spec.PriorityClassName, specPath.Child("priorityClassName"))...) + } +... 
+- allErrs = append(allErrs, validateAdmissionUpdate(newObj.Status.Admission, oldObj.Status.Admission, field.NewPath("status", "admission"))...) + +``` + +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +#### Unit Tests + + + + + +#### Integration tests + +### Graduation Criteria + + +The feature starts at the alpha level, with a feature gate. + +In the Alpha version, Dynamically Sized Jobs will support RayCluster resizing. +- Scale down/up of replica workers + + +## Implementation History + + + +## Drawbacks + + + + +## Alternatives + + +### Ignore Resize from Kuberay + +- Idea: Ignoring the scale up or rejecting the scale up from the auto scaler and storing the value in an annotation so that Kueue takes a decision based on the annotation. We’d need a signal from Kueue to hold this in the raycluster_webhook and identify when the resize comes from Kueue, it has to be accepted. +- Pros: No need to intercept the pods and no need of using schedulingGates +- Cons: There would be a permanent race bewteen Kueue and Kuberay autoscaler to change the counter in the RayCluster replica number for the worker group. +- Exploration: See if autoscaler would indicate a desired size in the spec without altering the number of replicas directly. +- Discarded: Higher complexity than gating/ungating pods via SchedulingGates diff --git a/keps/77-dynamically-sized-jobs/kep.yaml b/keps/77-dynamically-sized-jobs/kep.yaml index f9b80a2928..72e2eb11e8 100644 --- a/keps/77-dynamically-sized-jobs/kep.yaml +++ b/keps/77-dynamically-sized-jobs/kep.yaml @@ -3,7 +3,7 @@ kep-number: 77 authors: - "@vicentefb" status: provisional -creation-date: 2024-03-11 +creation-date: 2024-03-14 reviewers: - "@andrewsykim" - "@alculquicondor" From ea42141513e318586184329ee455d7c2c015aed6 Mon Sep 17 00:00:00 2001 From: Vicente Ferrara Date: Thu, 21 Mar 2024 23:39:48 +0000 Subject: [PATCH 03/10] updated kep --- keps/77-dynamically-sized-jobs/README.md | 85 ++++++------------------ 1 file changed, 21 insertions(+), 64 deletions(-) diff --git a/keps/77-dynamically-sized-jobs/README.md b/keps/77-dynamically-sized-jobs/README.md index 0725c03e61..b3a7581691 100644 --- a/keps/77-dynamically-sized-jobs/README.md +++ b/keps/77-dynamically-sized-jobs/README.md @@ -8,6 +8,7 @@ - [Proposal](#proposal) - [User Stories](#user-stories) - [Story 1 - RayCluster w/ autoscaling](#story-1---raycluster-w-autoscaling) + - [Notes/Constraints/Caveats](#notes) - [Design Details](#design-details) - [Workload Slices](#workload-slices) - [Creating Workload Slices](#creating-workload-slices) @@ -20,9 +21,6 @@ - [Phase 3 - Scale up with Workload Slices and Scheduling Gates](#phase-3---scale-up-with-workload-slices-and-scheduling-gates) - [Scheduler](#scheduler) - [Additional Details](#additional-details) - - [Feature Gate](#feature-gate) - - [Locking Flavor Assignments for Workload Slices](#locking-flavor-assignments-for-workload-slices) - - [Webhook changes](#webhook-changes) - [Test Plan](#test-plan) - [Unit Tests](#unit-tests) - [Integration tests](#integration-tests) @@ -43,17 +41,17 @@ See: [Support dynamically sized (elastic) jobs #77](https://github.com/kubernete Kueue currently lacks support for resizing jobs. When a job is resized, Kueue will recreate the Workload representation of the job, leading to a disruptive suspend and requeue process. 
This limitation hinders the usability of Kueue for resources like RayCluster that sometimes have autoscaling capabilities enabled. -To properly support RayCluster, Kueue needs to gracefully handle the scale up and scale down of RayCluster nodes. Concretely this means that scaling the resources used by a RayCluster is appropriately reflected in the respective ClusterQueue without needing to suspend the entire cluster. +To properly support RayCluster, Kueue needs to gracefully handle the scale up and scale down of RayCluster nodes. Concretely this means that scaling the resources used by a RayCluster is appropriately reflected in the respective ClusterQueue without needing to suspend the entire RayCluster. ### Goals - Gracefully handle resize operations for Kueue jobs (i.e. update quota usage without suspend, enqueue scale ups) -- Autoscaling RayCluster works with Kueue (MVP) +- Autoscaling RayCluster works with Kueue since it has an autoscaler and there is high demand (MVP)s ### Non-Goals - Vertical scaling of workloads – Kueue will only handle resize operations that scale Pods horizontally -- Support resize for other Kueue jobs such as QueuedWorkload, JobSet, etc (future) +- Support resize for other Kueue jobs such as Job, JobSet, etc (future) - Resizing of RayJobs - Partial Preemption @@ -70,11 +68,15 @@ For the MVP, we will only focus on admission of **RayClusters** with autoscaling #### Story 1 - RayCluster w/ autoscaling -1. The user creates a RayCluster. +1. The user creates a RayCluster with `enableInTreeAutoscaling: true`. 2. Kueue admits the RayCluster based on the requested resources in the head pod and worker pod. -3. User updates RayCluster to enable autoscaling +3. Given the deman of the job, the Ray Autoscaler adjusts the replicas field as it adds or removes Pods from the cluster. 4. Kueue will not suspend and requeue the RayCluster, instead it will dynamically update ClusterQueue usage based on the scaling event. +### Notes/Constraints/Caveats (Optional) + +If Kueue needs to preempt the resized RayCluster, it would preempt it as a whole, regardless of whether the RayCluster was previously scaled up. + ## Design Details ### Workload Slices @@ -98,17 +100,18 @@ Jobs implementing the ResizeJob method will create a Workload Slice for every ne ### Pod Scheduling Gates -Inside **raycluster_webhook** implement schedulingGate injection for pods on RayCluster creation time. Which will then be ungated following a similar behavior as to how a job is suspended and then unsuspended in the beginning. When we have a scale up, the new pods will be gated due to the schedulingGates injection in the webhook. +Inside **raycluster_webhook** implement schedulingGate injection for pods on RayCluster creation time. +The Pods will be ungated following a similar behavior as to how a job is suspended and then unsuspended in the when admitted. +When the RayCluster scales up, the new pods will be gated due to the schedulingGates injection in the webhook. -After the creation of each individual Workload Slice and admission of a Workload Slice, the **workload_scheduling_gates_controller** should be in charge of removing the scheduling gates from each pod. We only need to ungate the number of pods to match the number of admitted pods, this should be a counter. We don’t want to accidentally ungate too many pods since race conditions could happen and we also don’t want to double count. 
+After the creation of each individual Workload Slice and admission of a Workload Slice, the **workload_scheduling_gates_controller** should be in charge of removing the scheduling gates from each pod. We only need to ungate the number of pods to match the number of admitted pods, this should be a counter. We don’t want to accidentally ungate too many pods since race conditions could happen and we also don’t want to double count. It's worth mentioning that for the case of recreated pods (i.e. machine failure for example), these pods will go through the admission/scheduuling check again, Kueue is responsible fo removing the scheduling gates when there's available quota and resources to spend on the RayCluster. ### Garbage Collecting Workload Slices -The logic of folding and deleting would be isolated in this controller that takes a look at the number of pods that were ungated. We don’t necessarily have to say that this Workload Slice belongs to a specific pod. - -1. You increment the ungated counter and pass the UID of the workload you are folding to the parent workload. If we still don’t see the workload being deleted, we at least know it has been counted towards the parent workload and we cannot count it again. -2. This UID can be seen in the parent’s workload spec. +The logic of folding and deleting would be isolated in this controller. We don’t necessarily have to say that this Workload Slice belongs to a specific pod. This controller will look at the Workload objects and check whether they have the Admitted condition or not. +1. The controller increments the `.status.admission.podSetAssignments.count` and passes the UID of the workload you are folding to the parent workload. +If we still don’t see the workload being deleted, we at least know it has been counted towards the parent workload and the controller shouldn't fold it again. ## Phases for MVP (alpha) @@ -120,7 +123,6 @@ Scaling down a RayCluster won’t involve the creation of Workload Slices, inste 1. Compare job's PodSet.Count vs Workload.Spec.PodSets[1].Count (worker group) inside the jobframework generic reconciler. 2. Call *updateWorkloadToMatchJob()*, this will construct and update the workload and in turn update the PodSet Count field. 3. Inside the *Update()* method from the *workload_controller* update the workload in cache and in queue. By updating the workload in cache this will update the cluster queue resource usage and by updating the workload in queue this will trigger the scheduler so that it re-assigns the flavors to the already assumed workload and in this way PodSetAssignments will be updated by applying admission based on the new assignments. -4. Inside the schedule logic in the scheduler, since the workload is already assumed in the cache we need to specify if the feature is enabled so that we can apply admission to the workload and update its PodSetAssignments. #### Job controller @@ -130,7 +132,7 @@ Given these changes, check if the delta is positive or negative indicating a sca ### Phase 2 - Aggregating Workload Slices -In Phase 2, aggregating Workload Slices into the parent workload will be implemented. +In Phase 2, aggregating Workload Slices into the parent workload will be implemented. This doesn't represent a usable feature to end users, but it can be reviewed independently from the phase 3. 
### Phase 3 - Scale up with Workload Slices and Scheduling Gates @@ -141,58 +143,13 @@ When the RayCluster scales, the RayCluster webhook would be modified to intercep - Cons: The fact of having schedulingGates, means we need an API call for every pod, because all pods that are created by the RayCluster are going to have schedulingGates. We need to remove those gates and for every pod you need to make API calls. #### Scheduler -Since every scale up will have its own individual workload they should be proposed to the current scheduling cycle and continue the normal admission process. We should lock flavor assignments to Workload Slices we need to ensure that the Workload Slice is assigned the same resource flavor as the parent workload. +Since every scale up will have its own individual workload they should proceed to the current scheduling cycle and continue the normal admission process. +However, we need to ensure that the Workload Slice is assigned the same resource flavor as the parent workload. -The *nominate()* returns the workloads with their requirements (resource flavors, borrowing) if they were admitted by the clusterQueues in the snapshot, so we need to return the original workload that was already admitted with the resize information inside *TotalRequests* so that *PodSetAssignments* is also updated. +The `nominate()` returns the workloads with their requirements (resource flavors, borrowing) if they were admitted by the clusterQueues in the snapshot, so we need to return the original workload that was already admitted with the resize information inside *TotalRequests* so that *PodSetAssignments* is locked to the parent Workload assignments. ## Additional Details -### Feature Gate -In kube_features add Elastic/Dynamic size jobs feature gate. - -### Locking Flavor Assignments for Workload Slices - -We can extract the flavor(s) that the parent workload is using through wl.Status.Admission.PodSetAssigments[1].Flavors - -We add a new field to the Workload Info object to know which parent flavor(s) were used. - -```golang -// Info holds a Workload object and some pre-processing. -type Info struct { - Obj *kueue.Workload - // list of total resources requested by the podsets. - TotalRequests []PodSetResources - // Populated from the queue during admission or from the admission field if - // already admitted. - ClusterQueue string - LastAssignment *AssignmentClusterQueueState - // Parent Flavors - ParentFlavor []string -} -``` - -Which can be passed on as an extra parameter in *getAssignments()* in the scheduler to flavorassigner.go when it assigns a flavor through *assignFlavors()* - -```golang -func (a *FlavorAssigner) assignFlavors(log logr.Logger, requests []workload.PodSetResources, ParentFlavor []string) Assignment {} -``` - -Which then calls *findFlavorForPodSetResource()* and we can use the parent flavor(s) value to check if this flavor can fit to the pod. If it doesn’t we don’t try to find another flavor for it. - -### Webhook changes -In **workload_webhook** modify *validateWorkloadUpdate()* it to make PodSets and PodSetAssignments mutable for a running job. - -```diff -if workload.HasQuotaReservation(oldObj) { -- allErrs = append(allErrs, apivalidation.ValidateImmutableField(newObj.Spec.PodSets, oldObj.Spec.PodSets, specPath.Child("podSets"))...) - allErrs = append(allErrs, apivalidation.ValidateImmutableField(newObj.Spec.PriorityClassSource, oldObj.Spec.PriorityClassSource, specPath.Child("priorityClassSource"))...) 
- allErrs = append(allErrs, apivalidation.ValidateImmutableField(newObj.Spec.PriorityClassName, oldObj.Spec.PriorityClassName, specPath.Child("priorityClassName"))...) - } -... -- allErrs = append(allErrs, validateAdmissionUpdate(newObj.Status.Admission, oldObj.Status.Admission, field.NewPath("status", "admission"))...) - -``` - ### Test Plan +The code will adhere to regular best practices for unit tests and coverage. + + #### Integration tests +Integration tests will be executed against mocked clients for RayClusters +that will provide predefined responses and allow to test various scenarios, +including situations like: + +* RayCluster has a scale up, workload slices get created, pods are gated +* RayCluster has a scale up, gated pods are admitted, pods get ungated and assigned same flavors as parent workload +* Workload slices are correctly folded and deleted +* RayCluster has a scale down, Workload spec reflects podset count values +* When Kueu preempt a resized RayCluster, it should preempt it as a whole ### Graduation Criteria - [Summary](#summary) @@ -80,17 +80,17 @@ If Kueue needs to preempt the resized RayCluster, it would preempt it as a whole ## Design Details ### Workload Slices -To support horizontal scaling of jobs, we will introduce the concept of a "Workload Slice”. A Workload Slice is a Workload object with an owner reference to the original Workload for a job. Workload Slices represent per-replica changes to a job that were not initially accounted for when the job was created. +To support horizontal scaling of jobs, we will introduce the concept of a "Workload Slice”. A Workload Slice is a Workload object with an owner reference to the original Workload for a job. Workload Slices represent per-replica changes to a job that were not initially accounted for when the job was created. The benefit of Workload Slices is that Kueue can evaluate admission on a per-replica basis without changing the existing semantics of the Workload API. Once a Workload Slice is admitted, it will be garbage collected and its resources will be aggregated into the admission status of the parent workload. - Workload Slices will be submitted to the same LocalQueue that's referenced by the top-level RayCluster - Workload Slices will be created by Kueue and use identical PodTemplates which is already enforced by Kuberay -- Workload Slices will beneed to belong to the same resource flavor as the top-level RayCluster that was initially admitted +- Workload Slices will need to belong to the same resource flavor as the top-level RayCluster that was initially admitted ### Creating Workload Slices -The [GenericJob interface](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobframework/interface.go#L30-L55) will be updated to handle resize operations of jobs. +The [GenericJob interface (7e778f5)](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobframework/interface.go#L30-L55) will be updated to handle resize operations of jobs. ```golang type GenericJob interface { @@ -109,7 +109,7 @@ Inside **raycluster_webhook** implement schedulingGate injection for pods on Ray The Pods will be ungated following a similar behavior as to how a job is suspended and then unsuspended in the when admitted. When the RayCluster scales up, the new pods will be gated due to the schedulingGates injection in the webhook. 
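
To illustrate the gating behaviour described above, the following is a hedged sketch of the helpers a webhook and the ungating controller might share. The gate name is an assumption made for the example, and the real logic would live in Kueue's webhooks and the workload_scheduling_gates_controller rather than in a standalone package.

```golang
package gating

import (
	corev1 "k8s.io/api/core/v1"
)

// gateName is an assumption for this sketch; the KEP does not fix the
// scheduling gate's name.
const gateName = "kueue.x-k8s.io/admission"

// AddGate is what the mutating webhook would do at Pod creation time, so the
// kube-scheduler ignores the Pod until Kueue admits the corresponding slice.
func AddGate(pod *corev1.Pod) {
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name == gateName {
			return
		}
	}
	pod.Spec.SchedulingGates = append(pod.Spec.SchedulingGates, corev1.PodSchedulingGate{Name: gateName})
}

// RemoveGate is what the ungating controller would apply to as many Pods as
// there are admitted replicas, followed by an Update call to the API server.
func RemoveGate(pod *corev1.Pod) {
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != gateName {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
}
```

Because every gated Pod needs an Update call to be released, the per-pod API cost noted in the Phase 3 cons applies to both helpers.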
-After the creation of each individual Workload Slice and admission of a Workload Slice, the **workload_scheduling_gates_controller** should be in charge of removing the scheduling gates from each pod. All worker pods from the same worker group share the same pod template, so we only need to ungate the number of pods to match the number of admitted pods, this should be a counter. We don’t want to accidentally ungate too many pods since race conditions could happen and we also don’t want to double count. It's worth mentioning that for the case of recreated pods (i.e. machine failure for example), these pods will go through the admission/scheduuling check again, Kueue is responsible fo removing the scheduling gates when there's available quota and resources to spend on the RayCluster. +After the creation of each individual Workload Slice and admission of a Workload Slice, the **workload_scheduling_gates_controller** should be in charge of removing the scheduling gates from each pod. All worker pods from the same worker group share the same pod template, so we only need to ungate the number of pods to match the number of admitted pods, this should be a counter. We don’t want to accidentally ungate too many pods since race conditions could happen and we also don’t want to double count. It's worth mentioning that for the case of recreated pods (i.e. machine failure for example), these pods will go through the admission/scheduling check again, Kueue is responsible fo removing the scheduling gates when there's available quota and resources to spend on the RayCluster. ### Garbage Collecting Workload Slices @@ -125,9 +125,12 @@ Scaling down will be the first phase towards MVP because it can be implemented w Scaling down a RayCluster won’t involve the creation of Workload Slices, instead it’ll involve an update to the current workload, no requeuing. -1. Compare job's PodSet.Count vs Workload.Spec.PodSets[1].Count (worker group) inside the jobframework generic reconciler. -2. Call *updateWorkloadToMatchJob()*, this will construct and update the workload and in turn update the PodSet Count field. -3. Inside the *Update()* method from the *workload_controller* update the workload in cache and in queue. By updating the workload in cache this will update the cluster queue resource usage and by updating the workload in queue this will trigger the scheduler so that it re-assigns the flavors to the already assumed workload and in this way PodSetAssignments will be updated by applying admission based on the new assignments. +1. Compare Pod Counts: Within the job framework, check if the PodSet.Count of the job matches the Workload.Spec.PodSets[1].Count (worker group). +2. Synchronize Workload: If the pod counts don't match, update the workload to align with the job's PodSet.Count. +3. Trigger Resource Updates: By updating the workload in cache and in queue, we'll signal the following: +- The cluster queue resource usage should recalculate, ensuring accurate resource management. +- The scheduler should re-evaluate and update PodSetAssignments related to the workload. 
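
As a worked example of the quota accounting implied by the steps above, the sketch below uses illustrative types rather than Kueue's cache or ClusterQueue API. It computes how much usage a ClusterQueue could release when a worker PodSet shrinks, without suspending the job.

```golang
package main

import "fmt"

// perReplicaRequests is an illustrative per-Pod resource request for one PodSet.
type perReplicaRequests map[string]int64 // resource name -> quantity (e.g. milli-CPU, bytes)

// releaseOnScaleDown returns how much quota a ClusterQueue can release when a
// PodSet shrinks from oldCount to newCount without requeuing the workload.
func releaseOnScaleDown(req perReplicaRequests, oldCount, newCount int32) map[string]int64 {
	released := map[string]int64{}
	if newCount >= oldCount {
		return released // not a scale down; nothing to release
	}
	removed := int64(oldCount - newCount)
	for name, qty := range req {
		released[name] = qty * removed
	}
	return released
}

func main() {
	workers := perReplicaRequests{"cpu-milli": 500, "memory-bytes": 1 << 30}
	// Scaling the worker group from 5 to 3 replicas frees two replicas' worth of quota.
	fmt.Println(releaseOnScaleDown(workers, 5, 3))
}
```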
+ #### Job controller From 6104b38289168f4fae3902c0ef3a06c6a571f379 Mon Sep 17 00:00:00 2001 From: Vicente Ferrara Date: Wed, 27 Mar 2024 23:51:10 +0000 Subject: [PATCH 08/10] updated and added details on slices, generalized design details and typos --- keps/77-dynamically-sized-jobs/README.md | 44 ++++++++++++++++-------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/keps/77-dynamically-sized-jobs/README.md b/keps/77-dynamically-sized-jobs/README.md index 3d53a26307..b441f71684 100644 --- a/keps/77-dynamically-sized-jobs/README.md +++ b/keps/77-dynamically-sized-jobs/README.md @@ -83,10 +83,22 @@ If Kueue needs to preempt the resized RayCluster, it would preempt it as a whole To support horizontal scaling of jobs, we will introduce the concept of a "Workload Slice”. A Workload Slice is a Workload object with an owner reference to the original Workload for a job. Workload Slices represent per-replica changes to a job that were not initially accounted for when the job was created. The benefit of Workload Slices is that Kueue can evaluate admission on a per-replica basis without changing the existing semantics of the Workload API. Once a Workload Slice is admitted, it will be garbage collected and its resources will be aggregated into the admission status of the parent workload. -- Workload Slices will be submitted to the same LocalQueue that's referenced by the top-level RayCluster -- Workload Slices will be created by Kueue and use identical PodTemplates which is already enforced by Kuberay -- Workload Slices will need to belong to the same resource flavor as the top-level RayCluster that was initially admitted +- Workload Slices will be submitted to the same LocalQueue that's referenced by the top-level Workload. +- In MultiKueue, Workload Slices would go into the same cluster in a multi-cluster environment. +- Workload Slices will be created by Kueue and use identical PodTemplates (which is already enforced by Kuberay in the case for RayCluster). +- Workload Slices will belong to the same resource flavor as the top-level Workload that was initially admitted. +The parent Workload should have a condition that reflects the scaling progression status. + +```golang +const ( + ... + ... + // WorkloadResizeRequested means that the Workload is in the process of scaling up or down + WorkloadResizeRequested = "ResizeRequested" +) + +``` ### Creating Workload Slices @@ -105,11 +117,11 @@ On scale down to M we will change the original Workload's resources and then on ### Pod Scheduling Gates -Inside **raycluster_webhook** implement schedulingGate injection for pods on RayCluster creation time. +Inside the job's webhook, implement schedulingGate injection for pods on creation time. The Pods will be ungated following a similar behavior as to how a job is suspended and then unsuspended in the when admitted. -When the RayCluster scales up, the new pods will be gated due to the schedulingGates injection in the webhook. +When the job scales up, the new pods will be gated due to the schedulingGates injection in the webhook. -After the creation of each individual Workload Slice and admission of a Workload Slice, the **workload_scheduling_gates_controller** should be in charge of removing the scheduling gates from each pod. All worker pods from the same worker group share the same pod template, so we only need to ungate the number of pods to match the number of admitted pods, this should be a counter. 
We don’t want to accidentally ungate too many pods since race conditions could happen and we also don’t want to double count. It's worth mentioning that for the case of recreated pods (i.e. machine failure for example), these pods will go through the admission/scheduling check again, Kueue is responsible fo removing the scheduling gates when there's available quota and resources to spend on the RayCluster.
+After the creation of each individual Workload Slice and admission of a Workload Slice, the **workload_scheduling_gates_controller** should be in charge of removing the scheduling gates from each pod. All worker pods from the same worker group share the same pod template, so we only need to ungate the number of pods to match the number of admitted pods; this should be a counter. We don’t want to accidentally ungate too many pods since race conditions could happen and we also don’t want to double count. It's worth mentioning that for the case of recreated pods (e.g. after a machine failure), these pods will go through the admission/scheduling check again, and Kueue is responsible for removing the scheduling gates when there's available quota and resources to spend on the Job.
 
 ### Garbage Collecting Workload Slices
 
@@ -123,7 +135,7 @@ If we still don’t see the workload being deleted, we at least know it has been
 ### Phase 1 - Scale Down
 Scaling down will be the first phase towards MVP because it can be implemented without introducing the Workload Slices.
 
-Scaling down a RayCluster won’t involve the creation of Workload Slices, instead it’ll involve an update to the current workload, no requeuing.
+Scaling down a Job won’t involve the creation of Workload Slices; instead, it’ll involve an update to the current workload, with no requeuing.
 
 1. Compare Pod Counts: Within the job framework, check if the PodSet.Count of the job matches the Workload.Spec.PodSets[1].Count (worker group).
 2. Synchronize Workload: If the pod counts don't match, update the workload to align with the job's PodSet.Count.
@@ -138,17 +150,19 @@ Rework *equivalentToWorkload()* and *reconcile()* to account for potential diffe
 
 Given these changes, check if the delta is positive or negative indicating a scaleup/scaledown. If it’s a scaledown the reconciler should trigger an update on the workload to match the new job’s spec and no quota needs to be checked or accounted for, the cluster queue should update the workload resource usage.
 
+**Important:** in Phase 1, if there's a scale up, the behaviour will be to suspend and requeue. This will be temporary while Phases 2 and 3 are not completed.
+
 ### Phase 2 - Aggregating Workload Slices
 
 In Phase 2, aggregating Workload Slices into the parent workload will be implemented. This doesn't represent a usable feature to end users, but it can be reviewed independently from the phase 3. 
 
 ### Phase 3 - Scale up with Workload Slices and Scheduling Gates
 
-In Phase 3, scale up will be implemented by introducing Workload Slices and adding Pod scheduling gates as part of Kueue’s mutating admission for RayCluster.
+In Phase 3, scale up will be implemented by introducing Workload Slices and adding Pod scheduling gates as part of Kueue’s mutating admission for the job.
 
-When the RayCluster scales, the RayCluster webhook would be modified to intercept and "gate" all pods. Every time there’s a resize, you create a dependable (child) workload slice and once it's admitted, you increase the count in the original workload, delete the old workload and remove the schedulingGates.
+When a job scales, its webhook would be modified to intercept and "gate" all pods. Every time there’s a resize, you create a dependent (child) Workload Slice and, once it's admitted, you increase the count in the original workload, delete the slice workload and remove the schedulingGates.
 - Pros: You are able to hold the pods added by Kuberay
-- Cons: The fact of having schedulingGates, means we need an API call for every pod, because all pods that are created by the RayCluster are going to have schedulingGates. We need to remove those gates and for every pod you need to make API calls.
+- Cons: Having schedulingGates means we need an API call for every pod, because all pods created by the job are going to have schedulingGates, and removing those gates requires an API call per pod.
 
 #### Scheduler
 Since every scale up will have its own individual workload they should proceed to the current scheduling cycle and continue the normal admission process.
@@ -198,15 +212,15 @@ The code will adhere to regular best practices for unit tests and coverage.
 
 #### Integration tests
 
-Integration tests will be executed against mocked clients for RayClusters
+Integration tests will be executed against mocked clients for Jobs
 that will provide predefined responses and allow to test various scenarios,
 including situations like:
 
-* RayCluster has a scale up, workload slices get created, pods are gated
-* RayCluster has a scale up, gated pods are admitted, pods get ungated and assigned same flavors as parent workload
+* Job has a scale up, workload slices get created, pods are gated
+* Job has a scale up, gated pods are admitted, pods get ungated and assigned same flavors as parent workload
 * Workload slices are correctly folded and deleted
-* RayCluster has a scale down, Workload spec reflects podset count values
-* When Kueu preempt a resized RayCluster, it should preempt it as a whole
+* Job has a scale down, Workload spec reflects podset count values
+* When Kueue preempts a resized Job, it should preempt it as a whole
 
 ### Graduation Criteria
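
The alpha graduation above is tied to the `DynamicallySizedJobs` feature gate declared in kep.yaml. As a rough, non-authoritative illustration of the kube_features registration mentioned earlier in the KEP, a component-base style gate definition could look like this (the registry location and wiring are assumptions of the sketch):

```golang
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

// DynamicallySizedJobs mirrors the gate name declared in kep.yaml.
const DynamicallySizedJobs featuregate.Feature = "DynamicallySizedJobs"

func main() {
	// A stand-alone gate registry for illustration; Kueue would register the
	// gate in its own feature package rather than create one ad hoc.
	gates := featuregate.NewFeatureGate()
	if err := gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		DynamicallySizedJobs: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}
	fmt.Println("resize handling enabled:", gates.Enabled(DynamicallySizedJobs))
}
```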
+When a job scales, its webhook would be modified to intercept and "gate" all pods. Every time there’s a resize, you create a dependable (child) workload slice and once it's admitted, you increase the count in the original workload, delete the old workload and remove the schedulingGates. - Pros: You are able to hold the pods added by Kuberay -- Cons: The fact of having schedulingGates, means we need an API call for every pod, because all pods that are created by the RayCluster are going to have schedulingGates. We need to remove those gates and for every pod you need to make API calls. +- Cons: The fact of having schedulingGates, means we need an API call for every pod, because all pods that are created by the job are going to have schedulingGates. We need to remove those gates and for every pod you need to make API calls. #### Scheduler Since every scale up will have its own individual workload they should proceed to the current scheduling cycle and continue the normal admission process. @@ -198,15 +212,15 @@ The code will adhere to regular best practices for unit tests and coverage. #### Integration tests -Integration tests will be executed against mocked clients for RayClusters +Integration tests will be executed against mocked clients for Jobs that will provide predefined responses and allow to test various scenarios, including situations like: -* RayCluster has a scale up, workload slices get created, pods are gated -* RayCluster has a scale up, gated pods are admitted, pods get ungated and assigned same flavors as parent workload +* Job has a scale up, workload slices get created, pods are gated +* Job has a scale up, gated pods are admitted, pods get ungated and assigned same flavors as parent workload * Workload slices are correctly folded and deleted -* RayCluster has a scale down, Workload spec reflects podset count values -* When Kueu preempt a resized RayCluster, it should preempt it as a whole +* Job has a scale down, Workload spec reflects podset count values +* When Kueu preempt a resized Job, it should preempt it as a whole ### Graduation Criteria