Dynamically Sized Jobs - Scale Down #1852
Conversation
/ok-to-test
/retest
PR needs rebase.
Any update on this PR?
This needs a contributor to pick up the work and complete the scale-up design and implementation.
My understanding is that Kueue supports plain pods (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#example-pod), so couldn't we have the head node and the worker pods carry a queue-name label? Then, instead of supporting special logic for scaling a RayCluster up/down, we could just make sure that the RayCluster contains workers which have a label that's the same as the one in the RayCluster object. The inTreeAutoScaler of Ray would then just create Pod objects and they would be scheduled based on the capacity of the queue they point to. Edit: the head node should also have the same label.
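As a rough sketch of that idea (assuming the standard `kueue.x-k8s.io/queue-name` label and the usual KubeRay `headGroupSpec`/`workerGroupSpecs` layout; the manifest is trimmed to the fields relevant here, and the names are made up), the pod templates inside the RayCluster would carry the label so that every Pod KubeRay creates is gated by Kueue's plain-pod integration:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample   # hypothetical name
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: user-queue   # head Pod is queued via the plain-pod integration
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: workers
    rayStartParams: {}
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: user-queue   # every autoscaled worker Pod points at the same LocalQueue
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
```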
That's a valid approach too. You would probably want the head nodes to have a higher priority than the workers, to reduce the risk of preemption.
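One sketch of that, using plain Kubernetes PriorityClasses (the class names and values below are made up, and it assumes the Pods' priorities feed into how preemption picks victims): give the head group's pod template a higher class than the worker groups', so preemption should target workers first.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ray-head-priority      # hypothetical name, referenced from the head pod template's priorityClassName
value: 100000
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ray-worker-priority    # hypothetical name, referenced from the worker pod templates' priorityClassName
value: 1000
```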
Does this mean that we would need a Workload CR for each Ray worker? I think that would be a heavy load for the cluster. Also, would users have to enable the plain pods integration if they want to use Ray autoscaling in this case?
I don't think this is a big problem. For example, I expect that we currently have 1 Workload CR per managed Job, and the number of Jobs on a cluster is greater than the number of Ray worker pods. This is because your average Ray worker pod tends to have several tasks running on it. What would be the number of Ray worker pods that would cause Kueue to show signs of heavy load?
Yeah, I think cluster admins - or people with privileges to configure Kueue - would need to enable the pod integration (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/#before-you-begin). It's probably best they restrict Kueue to managing the pods for just a limited number of namespaces - those which contain RayClusters managed by Kueue.
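For reference, a sketch of what that restriction could look like in the Kueue Configuration (the `integrations.podOptions.namespaceSelector` field is how the pod-integration docs of that time described it, but double-check against the current Configuration API; the namespace label is made up):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
  - "pod"
  podOptions:
    namespaceSelector:
      matchLabels:
        # hypothetical label applied only to namespaces that contain Kueue-managed RayClusters
        kueue-managed-rayclusters: "true"
```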
The problem I'm currently facing is that I am using RayClusters in a Kueue-managed K8s cluster. Kueue stops me from dynamically scaling my RayClusters, which in turn makes my hardware utilization poor :(. Ray, with the help of KubeRay and the inTreeAutoScaler, already has support for that. So I was thinking that we could just configure Kueue to react to the decisions of Ray/KubeRay by looking at the Pods that Ray/KubeRay create/delete. In turn, that would simplify the RayCluster-specific code in Kueue. Of course, you rarely get something for free: this suggestion would need Kueue to manage pods.
Does this mean that each pod in the RayCluster will be considered a separate workload? Can we use TAS (topology-aware scheduling) for a RayCluster when we submit it like this?
@vicentefb @alculquicondor Do you need contributors? Let me see how I can help.
In my proposal, yes.
Not sure about that. A RayCluster is effectively a collection of pod definitions. As long as Kueue supports using the TAS-related annotations/labels (https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/#example-1) on a Pod, it should work with the RayCluster workers/head node too.
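For example, a single worker Pod could request a preferred topology placement like this (a sketch; it assumes the `kueue.x-k8s.io/podset-preferred-topology` annotation from the TAS docs also applies to plain Pods, and `cloud.example.com/rack` stands in for a level defined in an actual Topology resource):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ray-worker-0
  labels:
    kueue.x-k8s.io/queue-name: user-queue
  annotations:
    # prefer packing this worker within one rack; the level name is only an example
    kueue.x-k8s.io/podset-preferred-topology: cloud.example.com/rack
spec:
  containers:
  - name: ray-worker
    image: rayproject/ray:2.9.0
```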
ack. IIUC @VassilisVassiliadis your proposal is to have a Workload per Pod to achieve scaling of RayClusters? We apply that approach for Deployments. However, Deployments by nature are meant for serving and thus they typically support partial preemption and working at partial capacity. For workloads managed by Ray this is less obvious. Some users who use Ray for training would not want that granularity, I think. So these are the questions that first come to my mind:
Yes, that's it. Instead of switching off Ray's inTreeAutoScaler and handling the scaling of workers as an all-or-nothing static decision within Kueue, we could let the autoscaler do its thing and just react to requests for individual pods. Effectively, we're letting KubeRay submit pods to K8s whenever it decides to do that. The plain pods are then intercepted by Kueue and handled accordingly.
That's understandable.
I think you're asking how Kueue would tell the difference between scheduling the entire RayCluster as one thing and allowing KubeRay to dynamically scale the Ray workers up/down, i.e. Kueue just deals with the objects that KubeRay creates, not the RayCluster object itself. Perhaps there could be a label at the level of the RayCluster CR to configure which method to use.
The proposal can be considered as "don't treat a RayCluster as a resource Kueue manages, instead handle the objects it creates". The advantages are:
Downside:
An alternative would be to modify KubeRay so that it works with Job objects instead of Pod objects. That would get rid of the disadvantage. But of course it would need someone to update KubeRay first.
I'm interested in both scale up and scale down. I'd like to create a cluster that starts with 0 workers (just the head node) and have it scale to the appropriate size while reacting to the number and nature of the tasks I create. This is the behaviour I'd get on a vanilla K8s environment.
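Concretely, the scale-from-zero behaviour would come from the usual KubeRay autoscaler fields (a sketch of just the relevant part of the RayCluster spec; the numbers are only examples, and Kueue quota would still gate each worker Pod):

```yaml
spec:
  enableInTreeAutoscaling: true   # let KubeRay's autoscaler add and remove worker Pods
  workerGroupSpecs:
  - groupName: workers
    replicas: 0        # start with just the head node
    minReplicas: 0
    maxReplicas: 20    # upper bound for the autoscaler
```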
Another thing that came to mind is that if the entire RayCluster is scheduled as one thing, then in the event of preemption the entire thing goes down. If you don't persist your Ray logs - something you might choose in order to avoid dealing with garbage collection - you risk losing them.
If we are considering what @VassilisVassiliadis proposed, I am wondering if KEP-77 can also be done. In my case, and possibly the same for other users, we don't want to enable the pod integration, since a CR is created for every pod and this can greatly slow down the cluster.
Sorry for the delayed response, folks. First of all, I understand both the concerns and the use cases mentioned by @VassilisVassiliadis and @troychiu.
However, if we move this (and scale up as well) forward, I'm wondering if we should revisit the design and investigation, since our JobFramework has evolved and become much more complicated than it was when we designed this. Additionally, the next minor release is closing, so I do not want to include such big changes right now. So, my proposal is to support Ray's autoscaling step by step.

Step 1. Support the known-ownerReference standalone feature only for the Plain Pod Integration discussed in #4106 (I know this can not resolve @troychiu's concerns).

Step 2. Consider whether we should expand the known-ownerReference standalone mechanism to other frameworks, including RayCluster. I mean that we want to investigate whether we should support another RayCluster enqueue mode that handles Ray workers as standalone pods, as opposed to members of the RayCluster. In this new mode, we create the Workload object only for the head node, and the worker nodes will be handled just as Pods. For Ray, this is mostly the same behavior. This might be problematic, since we can not guarantee when the Ray worker node Pods will be admitted by Kueue even though the Ray head has already been admitted, because the Workload objects corresponding to the Ray worker nodes will only be created once the RayCluster head node is admitted. I know this still does not resolve @troychiu's concerns, since it requires cluster admins to enable the Pod Integration.

Step 3. We redesign this straightforward dynamic approach. But I think we can perform the investigation and design (not including implementation) during Step 2 so that we can deliver this faster.

cc @mimowo
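To illustrate Step 2, the Workload created for a RayCluster in this hypothetical enqueue mode might contain only the head podSet (a sketch against the kueue.x-k8s.io/v1beta1 Workload API with made-up names; whether this is the right shape is exactly what the Step 2 investigation would have to decide):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: raycluster-sample-head   # hypothetical name
spec:
  queueName: user-queue
  podSets:
  - name: head
    count: 1
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  # no podSet for the workers: each worker Pod would be queued individually via the pod integration
```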
I think that as long as we can have Kueue monitor pods, but not do anything to those that don't have the queue-name label, that would work for my use case.
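As far as I understand, that matches Kueue's default behaviour today: with `manageJobsWithoutQueueName: false` in the Configuration, objects without the queue-name label are left alone (a minimal sketch, assuming the pod integration is enabled):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# with the default below, Kueue only acts on objects that carry the kueue.x-k8s.io/queue-name label
manageJobsWithoutQueueName: false
integrations:
  frameworks:
  - "pod"
```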
I support Step 1 from #1852 (comment): basically, create Workloads for Pods with a queue-name. However, to avoid creating a Workload for the KubeRay APIs (RayJob or RayCluster), we would introduce the "standalone" annotation. Is that aligned with your proposal, @VassilisVassiliadis?
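Something along these lines, I imagine (the annotation key below is purely hypothetical; the actual naming would come out of the linked discussion): the RayCluster itself opts out of Workload creation, while its pod templates keep the queue-name label so each Pod is queued individually.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
  annotations:
    kueue.x-k8s.io/standalone: "true"   # hypothetical key: do not create a Workload for the RayCluster itself
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        labels:
          kueue.x-k8s.io/queue-name: user-queue   # Pods are still queued individually
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
```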
What type of PR is this?
/kind feature
What this PR does / why we need it:
This includes the Phase 1 implementation for the Dynamically Sized Jobs KEP #1851.
Which issue(s) this PR fixes:
Part of #77
Special notes for your reviewer:
Scale-down-only implementation for RayClusters.
Does this PR introduce a user-facing change?