
Dynamically Sized Jobs - Scale Down #1852

Open

vicentefb wants to merge 1 commit into base: main from the ElasticJobs branch

Conversation

@vicentefb (Contributor) commented Mar 15, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This includes the Phase 1 implementation for the Dynamically Sized Jobs KEP #1851

Which issue(s) this PR fixes:

Part of #77

Special notes for your reviewer:

Scale-down-only implementation for RayClusters

Does this PR introduce a user-facing change?

Scale Down implementation for RayClusters

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 15, 2024
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 15, 2024
@k8s-ci-robot (Contributor)

Hi @vicentefb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 15, 2024
netlify bot commented Mar 15, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Latest commit: 64edc86
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/667c66c547e9ff0007fc8ebf
Deploy Preview: https://deploy-preview-1852--kubernetes-sigs-kueue.netlify.app

@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from f0f98d9 to d49df00 on March 15, 2024 02:54
@vicentefb (Contributor, Author)

cc @andrewsykim @astefanutti

@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from 0bcee5e to daa8d26 on March 15, 2024 16:57
@tenzen-y (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 15, 2024
Review comments on pkg/controller/core/workload_controller.go (3, outdated), pkg/scheduler/scheduler.go (outdated), and pkg/webhooks/workload_webhook.go were marked resolved.
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 18, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 18, 2024
@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from b194b89 to 35ca648 on April 5, 2024 21:18
@vicentefb vicentefb changed the title from "[WIP] Dynamically Sized Jobs - Scale Down" to "Dynamically Sized Jobs - Scale Down" on Apr 9, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2024
@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from 698cd3b to 1ed17a7 on June 17, 2024 18:49
Commits on the ElasticJobs branch (work-in-progress messages):

- patch error: field not declared in schema
- commented out podSet immutability from the workload webhook to be able to update that field
- added more comments
- clean code / nit / debugging
- cluster queue reconciliation fixed; it had to do with the TotalRequests info from admission inside the workload Go file
- working with scheduler
- cleaning code
- integration test, but it interferes with the parallelism test, which is expected
- updated parallelism integration test
- updated wrappers
- kep
- removed KEP
- removed log lines
- clean code
- added a better conditional for updating the resize if the job is a RayCluster
- added Kind condition
- updated test and equivalentToWorkload condition
- added podset assignments check
- updated feature gate
- updating equivalentWorkload
- fixed lint
- removed changes from scheduler and workload controller
- testing
- updated workload controller reconciler to update spec and status
- update feature gate
- update variables
- made code more generic
- updated workload controller helper method
- typo / nit
- addressed comments
- updated workload controller to use unused quota
- updated integration test to work
- added unit test in workload controller
- changed naming to resizeable and fixed lint
- addressed comments / lint / nit
@vicentefb (Contributor, Author)

/retest

@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2024
@akram (Contributor) commented Oct 13, 2024

Any update on this PR? It was looking like a good start.

@alculquicondor (Contributor)

This needs a contributor to pick up the work and complete the scale up design and implementation.

@VassilisVassiliadis commented Jan 16, 2025

My understanding is that Kueue supports plain pods (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#example-pod), so couldn't we have the head node and the worker pods carry a kueue.x-k8s.io/queue-name label?

Then, instead of supporting special logic for scaling a RayCluster up/down, we could just make sure that the RayCluster's workers carry the same label as the RayCluster object. The inTreeAutoScaler of Ray would then just create Pod objects, and they would be scheduled based on the capacity of the queue they point to.

Edit: the head node should also have the same label.
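
A minimal sketch of what this could look like, assuming the plain Pod integration is enabled and the RayCluster itself is not also queued by Kueue's RayCluster integration; the queue name, image, and sizes below are placeholders:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample        # placeholder name
spec:
  enableInTreeAutoscaling: true  # let KubeRay's autoscaler create/delete worker pods
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: user-queue   # head pod is queued by Kueue
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0             # placeholder image
  workerGroupSpecs:
  - groupName: workers
    replicas: 1
    minReplicas: 0
    maxReplicas: 10
    rayStartParams: {}
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: user-queue   # each autoscaled worker pod gets its own Workload
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
```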

@alculquicondor (Contributor)

That's a valid approach too. You would probably want the head nodes to have a higher priority than the workers, to reduce risk of preemption.
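
One way to express that, sketched with Kueue's WorkloadPriorityClass; the names and values are placeholders, and the exact label key should be checked against your Kueue version:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: ray-head-priority
value: 10000        # higher value: the head is preempted last
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: ray-worker-priority
value: 1000         # lower value: workers are preempted before the head
```

The head pod template would then carry the label kueue.x-k8s.io/priority-class: ray-head-priority, and the worker templates kueue.x-k8s.io/priority-class: ray-worker-priority.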

@troychiu (Contributor) commented Feb 3, 2025

> My understanding is that Kueue supports plain pods (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#example-pod), so couldn't we have the head node and the worker pods carry a kueue.x-k8s.io/queue-name label?
>
> Then, instead of supporting special logic for scaling a RayCluster up/down, we could just make sure that the RayCluster's workers carry the same label as the RayCluster object. The inTreeAutoScaler of Ray would then just create Pod objects, and they would be scheduled based on the capacity of the queue they point to.
>
> Edit: the head node should also have the same label.

Does this mean that we would need a Workload CR for each Ray worker? I think that would be a heavy load for the cluster. Also, users would have to enable the plain Pods integration if they want to use Ray autoscaling in this case?

@VassilisVassiliadis

> Does this mean that we would need a Workload CR for each Ray worker? I think that would be a heavy load for the cluster.

I don't think this is a big problem. For example, I expect we currently have one Workload CR per managed Job, and the number of Jobs on a cluster is greater than the number of Ray worker pods. This is because your average Ray worker pod tends to have several tasks running on it.

What would be the number of Ray worker pods that would cause Kueue to show signs of heavy load?

> Also, users would have to enable the plain Pods integration if they want to use Ray autoscaling in this case?

Yes, I think cluster admins - or people with privileges to configure Kueue - would need to enable the Pod integration (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#before-you-begin). It's probably best they restrict Kueue to managing the pods for just a limited number of namespaces: those which contain RayClusters managed by Kueue.
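
A rough sketch of the relevant piece of the Kueue controller configuration, assuming the podOptions.namespaceSelector field from the plain-pods docs; the label key and value are placeholders, and field names may differ between Kueue versions:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
  - "pod"            # enable the plain Pod integration
  podOptions:
    namespaceSelector:
      matchLabels:
        kueue-managed-rayclusters: "true"   # only manage pods in namespaces labeled like this
```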

@VassilisVassiliadis commented Feb 6, 2025

The problem I'm currently facing is that I am using RayClusters in a Kueue-managed K8s cluster. Kueue stops me from dynamically scaling my RayClusters, which in turn makes my hardware utilization poor :(.

Ray, with the help of KubeRay and the inTreeAutoScaler, already supports this. So I was thinking that we could just configure Kueue to react to the decisions of Ray/KubeRay by looking at the Pods that Ray/KubeRay create and delete. In turn, that would simplify the RayCluster-specific code in Kueue.

Of course, you rarely get something for free. This suggestion would require Kueue to manage pods.

@KunWuLuan (Member)

> Then, instead of supporting special logic for scaling a RayCluster up/down, we could just make sure that the RayCluster's workers carry the same label as the RayCluster object. The inTreeAutoScaler of Ray would then just create Pod objects, and they would be scheduled based on the capacity of the queue they point to.

Does this mean that each pod in the RayCluster would be considered a separate workload? Can we use TAS (topology-aware scheduling) for a RayCluster when we submit it like this?

@KunWuLuan (Member)

@vicentefb @alculquicondor Do you need contributors? Let me see how I can help.

@alculquicondor (Contributor)

I'm not involved in this project anymore.
cc @mimowo @tenzen-y

@VassilisVassiliadis

> Does this mean that each pod in the RayCluster would be considered a separate workload?

In my proposal, yes.

> Can we use TAS (topology-aware scheduling) for a RayCluster when we submit it like this?

Not sure about that. A RayCluster is effectively a collection of pod definitions. As long as Kueue supports using the TAS-related annotations/labels (https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/#example-1) on a Pod, it should work with the RayCluster workers/head node too.
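
For example, a single worker submitted as a plain Pod could look roughly like this; the annotation name is taken from the TAS docs, but whether it is honored for plain pods in a given Kueue version needs to be verified, and the topology key, queue name, image, and requests are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: raycluster-sample-worker-0
  labels:
    kueue.x-k8s.io/queue-name: user-queue
  annotations:
    # Prefer placing this pod's PodSet within one block of the assumed topology hierarchy.
    kueue.x-k8s.io/podset-preferred-topology: cloud.provider.com/topology-block
spec:
  containers:
  - name: ray-worker
    image: rayproject/ray:2.9.0
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
```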

@mimowo (Contributor) commented Feb 7, 2025

ack. IIUC, @VassilisVassiliadis, your proposal is to have a Workload per Pod to achieve scaling of RayClusters?

We apply that approach for Deployments. However, Deployments are by nature meant for serving and thus typically tolerate partial preemption and working at partial capacity. For workloads managed by Ray this is less obvious; some users who use Ray for training would not want that granularity, I think.

So the questions that come first to my mind:

  1. What would be the determining factor between the classic Workload-per-RayCluster and the Workload-per-Pod approaches? Some annotation?
  2. What are the advantages of this approach over the scale-down support proposed in the KEP Support dynamically sized (elastic) jobs #77? Is it "time to market" / "ease of implementation", or are there additional use cases supported thanks to it?
  3. Do you only care about scale-down? If so, it could probably be implemented just by using dynamic reclaim of quota for finished pods. That should also be relatively easy to implement and would be a more generic solution (applicable to training as well).
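
For reference on point 3, this is roughly how dynamic quota reclaim surfaces on a Workload today; the field comes from the Workload API, while the names and counts below are placeholders:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: raycluster-sample-xxxx   # placeholder name
status:
  # Reported by the job controller for pods that have finished;
  # Kueue releases the corresponding share of the admitted quota.
  reclaimablePods:
  - name: workers-group-0        # PodSet name (placeholder)
    count: 3                     # quota for 3 finished worker pods is returned to the ClusterQueue
```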

@VassilisVassiliadis commented Feb 7, 2025

> ack. IIUC, @VassilisVassiliadis, your proposal is to have a Workload per Pod to achieve scaling of RayClusters?

Yes, that's it. Instead of switching off Ray's inTreeAutoScaler and handling the scaling of workers as an all-or-nothing static decision within Kueue, we could let the autoscaler do its thing and just react to requests for individual pods.

Effectively, we're letting KubeRay submit pods to K8s whenever it decides to do that. The plain pods are then intercepted by Kueue and handled accordingly.

> We apply that approach for Deployments. However, Deployments are by nature meant for serving and thus typically tolerate partial preemption and working at partial capacity. For workloads managed by Ray this is less obvious; some users who use Ray for training would not want that granularity, I think.

That's understandable.

> So the questions that come first to my mind:
>
> 1. What would be the determining factor between the classic Workload-per-RayCluster and the Workload-per-Pod approaches? Some annotation?

I think you're asking how Kueue would tell between scheduling the entire RayCluster as one thing, or letting KubeRay dynamically scale the Ray workers up/down, i.e. Kueue just deals with the objects that KubeRay creates, not the RayCluster object itself. Perhaps there could be a label at the level of the RayCluster CR to configure which method to use.

> 2. What are the advantages of this approach over the scale-down support proposed in the KEP [Support dynamically sized (elastic) jobs #77](https://github.com/kubernetes-sigs/kueue/issues/77)? Is it "time to market" / "ease of implementation", or are there additional use cases supported thanks to it?

The proposal can be summarized as "don't treat a RayCluster as a resource Kueue manages; instead, handle the objects it creates".

The advantages are:

  1. Kueue does not need extra code to support dynamically resized RayClusters -> faster time to market and less code to manage inside Kueue, which is a kind of superset of "ease of implementation". This is because KubeRay/Ray take care of deciding when to scale the cluster resources up/down; Kueue doesn't need any logic for the RayCluster objects themselves.
  2. We don't have to switch off the inTreeAutoScaler of KubeRay - we can also make use of Ray's experimental v2 autoscaler, which comes with a couple of bug fixes not present in v1 (e.g. strict-spread policies).
  3. A RayCluster that's (indirectly) managed by Kueue works the same way as a cluster not managed by Kueue -> users don't get surprised by unexpected behaviour of their RayCluster.

Downsides:

  1. Cluster admins must configure Kueue to monitor plain pods.
  2. This approach doesn't take into account future plans for dynamically scaled Kueue workloads -> this isn't really a disadvantage, just that the approach for a RayCluster wouldn't work for a generic workload.

An alternative would be to modify KubeRay so that it works with Job objects instead of Pod objects. That would get rid of the first disadvantage, but of course it would need someone to update KubeRay first.

> 3. Do you only care about scale-down? If so, it could probably be implemented just by using dynamic reclaim of quota for finished pods. That should also be relatively easy to implement and would be a more generic solution (applicable to training as well).

I'm interested in both scale up and scale down. I'd like to create a cluster that starts with zero workers, just the head node, and have it scale to the appropriate size while reacting to the number and nature of the tasks I create. This is the behaviour I'd get in a vanilla K8s environment.
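
The scale-from-zero part already exists on the KubeRay side; a fragment of a RayCluster spec along these lines, where the numbers are placeholders:

```yaml
# Fragment of a RayCluster spec: a worker group that starts empty and is
# grown/shrunk by the Ray autoscaler between 0 and 10 replicas.
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    idleTimeoutSeconds: 60   # scale an idle worker down after 60s
  workerGroupSpecs:
  - groupName: workers
    replicas: 0              # start with only the head node
    minReplicas: 0
    maxReplicas: 10          # grow on demand as tasks arrive
```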

@VassilisVassiliadis

Another thing that came to mind: if the entire RayCluster is scheduled as one unit, then in the event of preemption the entire thing goes down. If you don't persist your Ray logs (something you might skip to avoid dealing with garbage collection), you risk losing them.

@troychiu (Contributor)

If we are considering what @VassilisVassiliadis proposed, I am wondering if KEP-77 can also still be done. In my case, and possibly the same for other users, we don't want to enable the Pod integration, since a CR is created for every pod and this can greatly slow down the cluster.
If yes, I am going to create a PR to resume KEP-77.

@k8s-ci-robot (Contributor)

@vicentefb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kueue-test-e2e-main-1-29 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-29 |
| pull-kueue-test-e2e-main-1-30 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-30 |
| pull-kueue-verify-main | 64edc86 | link | true | /test pull-kueue-verify-main |
| pull-kueue-build-image-main | 64edc86 | link | true | /test pull-kueue-build-image-main |
| pull-kueue-test-unit-main | 64edc86 | link | true | /test pull-kueue-test-unit-main |
| pull-kueue-test-scheduling-perf-main | 64edc86 | link | true | /test pull-kueue-test-scheduling-perf-main |
| pull-kueue-test-multikueue-e2e-main | 64edc86 | link | true | /test pull-kueue-test-multikueue-e2e-main |
| pull-kueue-test-integration-main | 64edc86 | link | true | /test pull-kueue-test-integration-main |
| pull-kueue-test-e2e-main-1-31 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-31 |
| pull-kueue-test-tas-e2e-main | 64edc86 | link | true | /test pull-kueue-test-tas-e2e-main |
| pull-kueue-test-e2e-main-1-32 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-32 |
| pull-kueue-test-multikueue-integration-main | 64edc86 | link | true | /test pull-kueue-test-multikueue-integration-main |
| pull-kueue-test-customconfigs-e2e-main | 64edc86 | link | true | /test pull-kueue-test-customconfigs-e2e-main |
| pull-kueue-test-e2e-tas-main | 64edc86 | link | true | /test pull-kueue-test-e2e-tas-main |
| pull-kueue-test-integration-multikueue-main | 64edc86 | link | true | /test pull-kueue-test-integration-multikueue-main |
| pull-kueue-test-e2e-customconfigs-main | 64edc86 | link | true | /test pull-kueue-test-e2e-customconfigs-main |
| pull-kueue-test-e2e-multikueue-main | 64edc86 | link | true | /test pull-kueue-test-e2e-multikueue-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tenzen-y (Member)

Sorry for the delayed response, folks.

First of all, I understand both the concerns and the use cases mentioned by @VassilisVassiliadis and @troychiu.
Both sound reasonable.

> If we are considering what @VassilisVassiliadis proposed, I am wondering if KEP-77 can also still be done. In my case, and possibly the same for other users, we don't want to enable the Pod integration, since a CR is created for every pod and this can greatly slow down the cluster.
> If yes, I am going to create a PR to resume KEP-77.

However, if we move this (and scale up as well) forward, I'm wondering if we should revisit the design and investigation, since our JobFramework has evolved and become much more complicated than it was when we designed this. Additionally, the next minor release is closing in, so I don't want to take on such big things right now.

So, my proposal is to support Ray's autoscaling step by step.

Step 1. Support the known-ownerReference standalone feature only for the plain Pod integration discussed in #4106 (I know this does not resolve @troychiu's concerns).
This could resolve @VassilisVassiliadis's use cases, IIUC, since you could enqueue Ray nodes through the Pod integration instead of the RayCluster integration.

Step 2. Consider whether we should expand the known-ownerReference standalone mechanism to other frameworks, including RayCluster. I mean that we want to investigate whether we should support another RayCluster enqueue mode that handles Ray workers as standalone pods, as opposed to members of the RayCluster. In this new mode, we would create the Workload object only for the head node, and the worker nodes would be handled as plain Pods.

For Ray, this is mostly the same behavior as Step 1. But for other frameworks, it is a good starting point for how to handle known owners. Actually, kubeflow/trainer has a similar use case.

This might be problematic, since we cannot guarantee when the Ray worker node Pods will be admitted by Kueue even though the Ray head has already been admitted, because the Workload objects corresponding to the Ray worker nodes are only created once the RayCluster head node is admitted.

I know this still does not resolve @troychiu's concerns, since it requires cluster admins to enable the Pod integration.

Step 3. We redesign this straightforward dynamic approach. But I think we can perform the investigation and design (not including implementation) during Step 2 so that we can deliver this faster.

In my mind, Step 1 could be done in v0.11 (the next release) or v0.12, Step 2 in v0.12, and Step 3 in v0.13. But Step 3 is a very tentative candidate, since it is a much bigger feature and we are currently focusing on other big workload-scheduling efforts like fair sharing, hierarchical cohorts, and topology-aware scheduling.

cc @mimowo

@VassilisVassiliadis

I think that as long as we can have Kueue monitor pods, but not do anything to those that don't have the kueue.x-k8s.io/queue-name label, my suggestion should be "good enough" for now. When Kueue gets more complete support for dynamically scaling workloads, it could switch to using that for managing the RayCluster custom resources.

@mimowo (Contributor) commented Feb 18, 2025

I support Step 1 from #1852 (comment): basically, create Workloads for Pods that have a queue-name. However, to avoid creating a Workload for the KubeRay APIs (RayJob or RayCluster), introduce the "standalone" annotation.

Is this aligned with your proposal, @VassilisVassiliadis?
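
A hypothetical shape of that, just to make the idea concrete; the annotation key below is a placeholder, since the real name is still being discussed in #4106:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: raycluster-sample-worker-0
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # pod gets its own Workload
  annotations:
    kueue.x-k8s.io/pod-standalone: "true"   # placeholder key: opt out of grouping under the owning RayCluster
spec:
  containers:
  - name: ray-worker
    image: rayproject/ray:2.9.0
```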
