
Dynamically Sized Jobs - Scale Down #1852

Open

vicentefb wants to merge 1 commit into base: main from the ElasticJobs branch

Conversation

@vicentefb (Contributor) commented Mar 15, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This includes the Phase 1 implementation for the Dynamically Sized Jobs KEP #1851

Which issue(s) this PR fixes:

Part of #77

Special notes for your reviewer:

Scale-down-only implementation for RayClusters

Does this PR introduce a user-facing change?

Scale Down implementation for RayClusters

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 15, 2024
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 15, 2024
@k8s-ci-robot (Contributor)

Hi @vicentefb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 15, 2024
netlify bot commented Mar 15, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Latest commit: 64edc86
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/667c66c547e9ff0007fc8ebf
Deploy Preview: https://deploy-preview-1852--kubernetes-sigs-kueue.netlify.app

@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from f0f98d9 to d49df00 on March 15, 2024 02:54
@vicentefb (Contributor, Author)

cc @andrewsykim @astefanutti

@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from 0bcee5e to daa8d26 on March 15, 2024 16:57
@tenzen-y (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 15, 2024
Review comments on pkg/controller/core/workload_controller.go (3, outdated), pkg/scheduler/scheduler.go (outdated), and pkg/webhooks/workload_webhook.go were marked resolved.
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 18, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 18, 2024
@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from b194b89 to 35ca648 on April 5, 2024 21:18
@vicentefb vicentefb changed the title from "[WIP] Dynamically Sized Jobs - Scale Down" to "Dynamically Sized Jobs - Scale Down" on Apr 9, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2024
@vicentefb vicentefb force-pushed the ElasticJobs branch 2 times, most recently from 698cd3b to 1ed17a7 on June 17, 2024 18:49
Commits on the ElasticJobs branch (work-in-progress messages):

- patch error: field not declared in schema
- commented out podSet immutability from the workload webhook to be able to update that field
- added more comments
- clean code / nit / debugging
- cluster queue reconciliation fixed; it had to do with the TotalRequests info from admission inside the workload Go file
- working with scheduler
- cleaning code
- integration test, but it interferes with the parallelism test, which is expected
- updated parallelism integration test
- updated wrappers
- kep
- removed KEP
- removed log lines
- clean code
- added a better conditional for updating the resize if the job is a RayCluster
- added Kind condition
- updated test and equivalentToWorkload condition
- added podset assignments check
- updated feature gate
- updating equivalentWorkload
- fixed lint
- removed changes from scheduler and workload controller
- testing
- updated workload controller reconciler to update spec and status
- update feature gate
- update variables
- made code more generic
- updated workload controller helper method
- typo / nit
- addressed comments
- updated workload controller to use unused quota
- updated integration test to work
- added unit test in workload controller
- changed naming to resizeable and fixed lint
- addressed comments / lint / nit
@vicentefb (Contributor, Author)

/retest

@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2024
@akram (Contributor) commented Oct 13, 2024

Any update on this PR? It was looking like a good start.

@alculquicondor (Contributor)

This needs a contributor to pick up the work and complete the scale up design and implementation.

@VassilisVassiliadis commented Jan 16, 2025

My understanding is that Kueue supports plain pods (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#example-pod), so couldn't we have the head node and the worker pods carry a kueue.x-k8s.io/queue-name label?

Then, instead of supporting special logic for scaling a RayCluster up/down, we could just make sure that the RayCluster's workers carry the same label as the RayCluster object. The inTreeAutoScaler of Ray would then just create Pod objects, and they would be scheduled based on the capacity of the queue they point to.

Edit: the head node should also have the same label.
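
A minimal sketch of what this could look like, assuming the plain Pod integration is enabled and the RayCluster itself is not also queued by Kueue's RayCluster integration; the queue name, image, and sizes below are placeholders:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample        # placeholder name
spec:
  enableInTreeAutoscaling: true  # let KubeRay's autoscaler create/delete worker pods
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: user-queue   # head pod is queued by Kueue
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0             # placeholder image
  workerGroupSpecs:
  - groupName: workers
    replicas: 1
    minReplicas: 0
    maxReplicas: 10
    rayStartParams: {}
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: user-queue   # each autoscaled worker pod gets its own Workload
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
```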

@alculquicondor (Contributor)

That's a valid approach too. You would probably want the head nodes to have a higher priority than the workers, to reduce risk of preemption.
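
One way to express that, sketched with Kueue's WorkloadPriorityClass; the names and values are placeholders, and the exact label key should be checked against your Kueue version:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: ray-head-priority
value: 10000        # higher value: the head is preempted last
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: ray-worker-priority
value: 1000         # lower value: workers are preempted before the head
```

The head pod template would then carry the label kueue.x-k8s.io/priority-class: ray-head-priority, and the worker templates kueue.x-k8s.io/priority-class: ray-worker-priority.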

@troychiu (Contributor) commented Feb 3, 2025

> My understanding is that Kueue supports plain pods (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#example-pod), so couldn't we have the head node and the worker pods carry a kueue.x-k8s.io/queue-name label?
>
> Then, instead of supporting special logic for scaling a RayCluster up/down, we could just make sure that the RayCluster's workers carry the same label as the RayCluster object. The inTreeAutoScaler of Ray would then just create Pod objects, and they would be scheduled based on the capacity of the queue they point to.
>
> Edit: the head node should also have the same label.

Does this mean that we would need a Workload CR for each Ray worker? I think that would be a heavy load for the cluster. Also, users would have to enable the plain Pods integration if they want to use Ray autoscaling in this case?

@VassilisVassiliadis

> Does this mean that we would need a Workload CR for each Ray worker? I think that would be a heavy load for the cluster.

I don't think this is a big problem. For example, I expect we currently have one Workload CR per managed Job, and the number of Jobs on a cluster is greater than the number of Ray worker pods. This is because your average Ray worker pod tends to have several tasks running on it.

What would be the number of Ray worker pods that would cause Kueue to show signs of heavy load?

> Also, users would have to enable the plain Pods integration if they want to use Ray autoscaling in this case?

Yes, I think cluster admins - or people with privileges to configure Kueue - would need to enable the Pod integration (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#before-you-begin). It's probably best they restrict Kueue to managing the pods for just a limited number of namespaces: those which contain RayClusters managed by Kueue.
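
A rough sketch of the relevant piece of the Kueue controller configuration, assuming the podOptions.namespaceSelector field from the plain-pods docs; the label key and value are placeholders, and field names may differ between Kueue versions:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
  - "pod"            # enable the plain Pod integration
  podOptions:
    namespaceSelector:
      matchLabels:
        kueue-managed-rayclusters: "true"   # only manage pods in namespaces labeled like this
```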

@VassilisVassiliadis commented Feb 6, 2025

The problem I'm currently facing is that I am using RayClusters in a Kueue-managed K8s cluster. Kueue stops me from dynamically scaling my RayClusters, which in turn makes my hardware utilization poor :(.

Ray, with the help of KubeRay and the inTreeAutoScaler, already supports this. So I was thinking that we could just configure Kueue to react to the decisions of Ray/KubeRay by looking at the Pods that Ray/KubeRay create and delete. In turn, that would simplify the RayCluster-specific code in Kueue.

Of course, you rarely get something for free. This suggestion would require Kueue to manage pods.

@KunWuLuan (Member)

> Then, instead of supporting special logic for scaling a RayCluster up/down, we could just make sure that the RayCluster's workers carry the same label as the RayCluster object. The inTreeAutoScaler of Ray would then just create Pod objects, and they would be scheduled based on the capacity of the queue they point to.

Does this mean that each pod in the RayCluster would be considered a separate workload? Can we use TAS (topology-aware scheduling) for a RayCluster when we submit it like this?

@KunWuLuan (Member)

@vicentefb @alculquicondor Do you need contributors? Let me see how I can help.

@alculquicondor (Contributor)

I'm not involved in this project anymore.
cc @mimowo @tenzen-y

@VassilisVassiliadis

> Does this mean that each pod in the RayCluster would be considered a separate workload?

In my proposal, yes.

> Can we use TAS (topology-aware scheduling) for a RayCluster when we submit it like this?

Not sure about that. A RayCluster is effectively a collection of pod definitions. As long as Kueue supports using the TAS-related annotations/labels (https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/#example-1) on a Pod, it should work with the RayCluster workers/head node too.
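
For example, a single worker submitted as a plain Pod could look roughly like this; the annotation name is taken from the TAS docs, but whether it is honored for plain pods in a given Kueue version needs to be verified, and the topology key, queue name, image, and requests are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: raycluster-sample-worker-0
  labels:
    kueue.x-k8s.io/queue-name: user-queue
  annotations:
    # Prefer placing this pod's PodSet within one block of the assumed topology hierarchy.
    kueue.x-k8s.io/podset-preferred-topology: cloud.provider.com/topology-block
spec:
  containers:
  - name: ray-worker
    image: rayproject/ray:2.9.0
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
```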

@mimowo (Contributor) commented Feb 7, 2025

ack. IIUC, @VassilisVassiliadis, your proposal is to have a Workload per Pod to achieve scaling of RayClusters?

We apply that approach for Deployments. However, Deployments are by nature meant for serving and thus typically tolerate partial preemption and working at partial capacity. For workloads managed by Ray this is less obvious; some users who use Ray for training would not want that granularity, I think.

So the questions that come first to my mind:

  1. What would be the determining factor between the classic Workload-per-RayCluster and the Workload-per-Pod approaches? Some annotation?
  2. What are the advantages of this approach over the scale-down support proposed in the KEP Support dynamically sized (elastic) jobs #77? Is it "time to market" / "ease of implementation", or are there additional use cases supported thanks to it?
  3. Do you only care about scale-down? If so, it could probably be implemented just by using dynamic reclaim of quota for finished pods. That should also be relatively easy to implement and would be a more generic solution (applicable to training as well).
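
For reference on point 3, this is roughly how dynamic quota reclaim surfaces on a Workload today; the field comes from the Workload API, while the names and counts below are placeholders:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: raycluster-sample-xxxx   # placeholder name
status:
  # Reported by the job controller for pods that have finished;
  # Kueue releases the corresponding share of the admitted quota.
  reclaimablePods:
  - name: workers-group-0        # PodSet name (placeholder)
    count: 3                     # quota for 3 finished worker pods is returned to the ClusterQueue
```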

@VassilisVassiliadis commented Feb 7, 2025

> ack. IIUC, @VassilisVassiliadis, your proposal is to have a Workload per Pod to achieve scaling of RayClusters?

Yes, that's it. Instead of switching off Ray's inTreeAutoScaler and handling the scaling of workers as an all-or-nothing static decision within Kueue, we could let the autoscaler do its thing and just react to requests for individual pods.

Effectively, we're letting KubeRay submit pods to K8s whenever it decides to do that. The plain pods are then intercepted by Kueue and handled accordingly.

> We apply that approach for Deployments. However, Deployments are by nature meant for serving and thus typically tolerate partial preemption and working at partial capacity. For workloads managed by Ray this is less obvious; some users who use Ray for training would not want that granularity, I think.

That's understandable.

> So the questions that come first to my mind:
>
> 1. What would be the determining factor between the classic Workload-per-RayCluster and the Workload-per-Pod approaches? Some annotation?

I think you're asking how Kueue would tell between scheduling the entire RayCluster as one thing, or letting KubeRay dynamically scale the Ray workers up/down, i.e. Kueue just deals with the objects that KubeRay creates, not the RayCluster object itself. Perhaps there could be a label at the level of the RayCluster CR to configure which method to use.

> 2. What are the advantages of this approach over the scale-down support proposed in the KEP [Support dynamically sized (elastic) jobs #77](https://github.com/kubernetes-sigs/kueue/issues/77)? Is it "time to market" / "ease of implementation", or are there additional use cases supported thanks to it?

The proposal can be summarized as "don't treat a RayCluster as a resource Kueue manages; instead, handle the objects it creates".

The advantages are:

  1. Kueue does not need extra code to support dynamically resized RayClusters -> faster time to market and less code to manage inside Kueue, which is a kind of superset of "ease of implementation". This is because KubeRay/Ray take care of deciding when to scale the cluster resources up/down; Kueue doesn't need any logic for the RayCluster objects themselves.
  2. We don't have to switch off the inTreeAutoScaler of KubeRay - we can also make use of Ray's experimental v2 autoscaler, which comes with a couple of bug fixes not present in v1 (e.g. strict-spread policies).
  3. A RayCluster that's (indirectly) managed by Kueue works the same way as a cluster not managed by Kueue -> users don't get surprised by unexpected behaviour of their RayCluster.

Downsides:

  1. Cluster admins must configure Kueue to monitor plain pods.
  2. This approach doesn't take into account future plans for dynamically scaled Kueue workloads -> this isn't really a disadvantage, just that the approach for a RayCluster wouldn't work for a generic workload.

An alternative would be to modify KubeRay so that it works with Job objects instead of Pod objects. That would get rid of the first disadvantage, but of course it would need someone to update KubeRay first.

> 3. Do you only care about scale-down? If so, it could probably be implemented just by using dynamic reclaim of quota for finished pods. That should also be relatively easy to implement and would be a more generic solution (applicable to training as well).

I'm interested in both scale up and scale down. I'd like to create a cluster that starts with zero workers, just the head node, and have it scale to the appropriate size while reacting to the number and nature of the tasks I create. This is the behaviour I'd get in a vanilla K8s environment.
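
The scale-from-zero part already exists on the KubeRay side; a fragment of a RayCluster spec along these lines, where the numbers are placeholders:

```yaml
# Fragment of a RayCluster spec: a worker group that starts empty and is
# grown/shrunk by the Ray autoscaler between 0 and 10 replicas.
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    idleTimeoutSeconds: 60   # scale an idle worker down after 60s
  workerGroupSpecs:
  - groupName: workers
    replicas: 0              # start with only the head node
    minReplicas: 0
    maxReplicas: 10          # grow on demand as tasks arrive
```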

@VassilisVassiliadis

Another thing that came to mind: if the entire RayCluster is scheduled as one unit, then in the event of preemption the entire thing goes down. If you don't persist your Ray logs (something you might skip to avoid dealing with garbage collection), you risk losing them.

@troychiu (Contributor)

If we are considering what @VassilisVassiliadis proposed, I am wondering if KEP-77 can also still be done. In my case, and possibly the same for other users, we don't want to enable the Pod integration, since a CR is created for every pod and this can greatly slow down the cluster.
If yes, I am going to create a PR to resume KEP-77.

@k8s-ci-robot (Contributor)

@vicentefb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kueue-test-e2e-main-1-29 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-29 |
| pull-kueue-test-e2e-main-1-30 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-30 |
| pull-kueue-verify-main | 64edc86 | link | true | /test pull-kueue-verify-main |
| pull-kueue-build-image-main | 64edc86 | link | true | /test pull-kueue-build-image-main |
| pull-kueue-test-unit-main | 64edc86 | link | true | /test pull-kueue-test-unit-main |
| pull-kueue-test-scheduling-perf-main | 64edc86 | link | true | /test pull-kueue-test-scheduling-perf-main |
| pull-kueue-test-multikueue-e2e-main | 64edc86 | link | true | /test pull-kueue-test-multikueue-e2e-main |
| pull-kueue-test-integration-main | 64edc86 | link | true | /test pull-kueue-test-integration-main |
| pull-kueue-test-e2e-main-1-31 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-31 |
| pull-kueue-test-tas-e2e-main | 64edc86 | link | true | /test pull-kueue-test-tas-e2e-main |
| pull-kueue-test-e2e-main-1-32 | 64edc86 | link | true | /test pull-kueue-test-e2e-main-1-32 |
| pull-kueue-test-multikueue-integration-main | 64edc86 | link | true | /test pull-kueue-test-multikueue-integration-main |
| pull-kueue-test-customconfigs-e2e-main | 64edc86 | link | true | /test pull-kueue-test-customconfigs-e2e-main |
| pull-kueue-test-e2e-tas-main | 64edc86 | link | true | /test pull-kueue-test-e2e-tas-main |
| pull-kueue-test-integration-multikueue-main | 64edc86 | link | true | /test pull-kueue-test-integration-multikueue-main |
| pull-kueue-test-e2e-customconfigs-main | 64edc86 | link | true | /test pull-kueue-test-e2e-customconfigs-main |
| pull-kueue-test-e2e-multikueue-main | 64edc86 | link | true | /test pull-kueue-test-e2e-multikueue-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tenzen-y (Member)

Sorry for the delayed response, folks.

First of all, I understand both the concerns and the use cases mentioned by @VassilisVassiliadis and @troychiu.
Both sound reasonable.

> If we are considering what @VassilisVassiliadis proposed, I am wondering if KEP-77 can also still be done. In my case, and possibly the same for other users, we don't want to enable the Pod integration, since a CR is created for every pod and this can greatly slow down the cluster.
> If yes, I am going to create a PR to resume KEP-77.

However, if we move this (and scale up as well) forward, I'm wondering if we should revisit the design and investigation, since our JobFramework has evolved and become much more complicated than it was when we designed this. Additionally, the next minor release is closing in, so I don't want to take on such big things right now.

So, my proposal is to support Ray's autoscaling step by step.

Step 1. Support the known-ownerReference standalone feature only for the plain Pod integration discussed in #4106 (I know this does not resolve @troychiu's concerns).
This could resolve @VassilisVassiliadis's use cases, IIUC, since you could enqueue Ray nodes through the Pod integration instead of the RayCluster integration.

Step 2. Consider whether we should expand the known-ownerReference standalone mechanism to other frameworks, including RayCluster. I mean that we want to investigate whether we should support another RayCluster enqueue mode that handles Ray workers as standalone pods, as opposed to members of the RayCluster. In this new mode, we would create the Workload object only for the head node, and the worker nodes would be handled as plain Pods.

For Ray, this is mostly the same behavior as Step 1. But for other frameworks, it is a good starting point for how to handle known owners. Actually, kubeflow/trainer has a similar use case.

This might be problematic, since we cannot guarantee when the Ray worker node Pods will be admitted by Kueue even though the Ray head has already been admitted, because the Workload objects corresponding to the Ray worker nodes are only created once the RayCluster head node is admitted.

I know this still does not resolve @troychiu's concerns, since it requires cluster admins to enable the Pod integration.

Step 3. We redesign this straightforward dynamic approach. But I think we can perform the investigation and design (not including implementation) during Step 2 so that we can deliver this faster.

In my mind, Step 1 could be done in v0.11 (the next release) or v0.12, Step 2 in v0.12, and Step 3 in v0.13. But Step 3 is a very tentative candidate, since it is a much bigger feature and we are currently focusing on other big workload-scheduling efforts like fair sharing, hierarchical cohorts, and topology-aware scheduling.

cc @mimowo

@VassilisVassiliadis

I think that as long as we can have Kueue monitor pods, but not do anything to those that don't have the kueue.x-k8s.io/queue-name label, my suggestion should be "good enough" for now. When Kueue gets more complete support for dynamically scaling workloads, it could switch to using that for managing the RayCluster custom resources.

@mimowo (Contributor) commented Feb 18, 2025

I support Step 1 from #1852 (comment): basically, create Workloads for Pods that have a queue-name. However, to avoid creating a Workload for the KubeRay APIs (RayJob or RayCluster), introduce the "standalone" annotation.

Is this aligned with your proposal, @VassilisVassiliadis?
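
A hypothetical shape of that, just to make the idea concrete; the annotation key below is a placeholder, since the real name is still being discussed in #4106:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: raycluster-sample-worker-0
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # pod gets its own Workload
  annotations:
    kueue.x-k8s.io/pod-standalone: "true"   # placeholder key: opt out of grouping under the owning RayCluster
spec:
  containers:
  - name: ray-worker
    image: rayproject/ray:2.9.0
```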
