Add support for RayCluster #1272

Closed
3 tasks done
andrewsykim opened this issue Oct 26, 2023 · 9 comments · Fixed by #1520 or #1607 · May be fixed by vicentefb/kueue#2
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@andrewsykim
Member

andrewsykim commented Oct 26, 2023

What would you like to be added:

Support RayCluster as a queue-able workload in Kueue (much like RayJob).

Why is this needed:

Currently, Kueue supports RayJob, which works great for Ray jobs that run on ephemeral Ray clusters. However, many use cases and existing workloads depend on long-lived RayClusters. Being able to account for these RayClusters in Kueue would greatly improve the integration between Kueue and Ray.

Completion requirements:

This probably needs a KEP, but very roughly the requirements would be:

  • Kueue should only account for ray worker nodes in ClusterQueue quotas (we may account for head nodes later)
  • Scaling of worker nodes should be blocked on quota reservations
  • Evicting or pre-empting RayClusters involves scaling worker groups to 0
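
As a rough illustration of the three requirements above (not Kueue's actual implementation), the sketch below models a RayCluster as a plain dictionary that follows the `headGroupSpec` / `workerGroupSpecs` layout of the KubeRay spec. The function names, the integer CPU values, and the single-resource quota check are assumptions made purely for illustration.

```python
from copy import deepcopy


def worker_cpu_demand(ray_cluster: dict) -> int:
    """Sum CPU requests over worker groups only; the head group is ignored."""
    total = 0
    for group in ray_cluster["spec"].get("workerGroupSpecs", []):
        containers = group["template"]["spec"]["containers"]
        per_pod = sum(c["resources"]["requests"]["cpu"] for c in containers)
        total += group.get("replicas", 0) * per_pod
    return total


def fits_quota(ray_cluster: dict, used_cpu: int, nominal_cpu: int) -> bool:
    """Workers may scale up only if their demand fits in the remaining quota."""
    return used_cpu + worker_cpu_demand(ray_cluster) <= nominal_cpu


def evict(ray_cluster: dict) -> dict:
    """Return a copy with every worker group scaled to 0; the head is untouched."""
    evicted = deepcopy(ray_cluster)
    for group in evicted["spec"].get("workerGroupSpecs", []):
        group["replicas"] = 0
    return evicted


# Hypothetical cluster: CPU requests are plain integers to keep the sketch short.
cluster = {
    "spec": {
        "headGroupSpec": {"template": {"spec": {"containers": [
            {"resources": {"requests": {"cpu": 2}}}]}}},
        "workerGroupSpecs": [
            {"groupName": "small", "replicas": 3, "template": {"spec": {"containers": [
                {"resources": {"requests": {"cpu": 4}}}]}}},
            {"groupName": "large", "replicas": 1, "template": {"spec": {"containers": [
                {"resources": {"requests": {"cpu": 8}}}]}}},
        ],
    }
}
```

Here the head's 2 CPUs never enter the total, so `worker_cpu_demand(cluster)` counts only the two worker groups, and `evict(cluster)` leaves a cluster whose worker demand is zero.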

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

andrewsykim added the kind/feature label Oct 26, 2023
@andrewsykim
Member Author

@alculquicondor initial thoughts on this? I can try to put together a KEP if this idea doesn't sound crazy :)

@trasc
Contributor

trasc commented Oct 27, 2023

Hi, in my opinion an easy way to do this is to have Ray create "Kueueble" pods for the workers.

@alculquicondor
Contributor

It sounds reasonable to me to add support for RayCluster.
The additional thing to take into consideration is to make sure RayJobs are not doubly queued once RayCluster is also queueable.

cc @kerthcet as one of the primary users of Ray+Kueue

@tenzen-y
Member

+1

I think that supporting long-lived resources would be worth it.

@kerthcet
Contributor

kerthcet commented Nov 7, 2023

Makes sense to me. Several points here:

  • RayCluster is somewhat similar to RayJob because RayJob reuses rayClusterSpec, so the code should be reusable.
  • RayCluster doesn't support suspend semantics, so it behaves as a long-lived resource, right? Then we can't guarantee resource utilization while the cluster has no jobs running.

> make sure RayJobs are not doubly queued once RayCluster is also queueable.

I think we can tell this via the clusterSelector field.
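
One way to read the clusterSelector suggestion, sketched under assumptions: a RayJob that sets `spec.clusterSelector` attaches to an existing RayCluster, so only jobs without a selector would be queued on their own. The helper name and the selector label shown are hypothetical, and Kueue's real check may differ.

```python
def should_queue_rayjob(ray_job: dict) -> bool:
    """Queue a RayJob only when it creates its own (ephemeral) RayCluster.

    Assumption: a non-empty spec.clusterSelector means the job runs on an
    existing RayCluster whose capacity is already accounted for elsewhere.
    """
    selector = ray_job.get("spec", {}).get("clusterSelector")
    return not selector


# Hypothetical examples; the selector label is illustrative only.
ephemeral_job = {"spec": {"rayClusterSpec": {}}}
attached_job = {"spec": {"clusterSelector": {"ray.io/cluster": "shared"}}}
```

With these inputs, `ephemeral_job` would be queued as a RayJob, while `attached_job` would be skipped because its resources belong to the selected RayCluster.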

@andrewsykim
Member Author

/assign

@andrewsykim
Member Author

Do we still want a KEP for this? I think with the new suspend API (ray-project/kuberay#1667), the implementation should be very straightforward.
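
A minimal sketch of how a suspend flag could drive admission, assuming the API from the linked PR is exposed as a boolean `spec.suspend` on the RayCluster. The `reconcile` helper and the `quota_reserved` input are hypothetical names for illustration, not Kueue's controller code.

```python
def set_suspend(ray_cluster: dict, suspended: bool) -> dict:
    """Return a copy of the RayCluster with spec.suspend set to the given value."""
    spec = {**ray_cluster.get("spec", {}), "suspend": suspended}
    return {**ray_cluster, "spec": spec}


def reconcile(ray_cluster: dict, quota_reserved: bool) -> dict:
    """Keep the cluster suspended until quota is reserved; suspend again on eviction."""
    return set_suspend(ray_cluster, suspended=not quota_reserved)


# Hypothetical cluster that is created in the suspended state.
pending = {"metadata": {"name": "demo"}, "spec": {"suspend": True}}
admitted = reconcile(pending, quota_reserved=True)
evicted = reconcile(admitted, quota_reserved=False)
```

The same toggle covers both directions: admission flips the flag off, and eviction or preemption flips it back on without touching the rest of the spec.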

@alculquicondor
Contributor

alculquicondor commented Dec 13, 2023

Historically, we haven't written KEPs for integrations.

FWIW, the implementation should be quite similar to that of Job, whereas the implementation for RayJob will change a little to be more similar to Job. Note that if a user is queuing a RayJob, its RayCluster shouldn't be doubly queued.
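
The double-queuing concern could be handled from the RayCluster side roughly as follows. This is only a sketch: it assumes a RayCluster created on behalf of a RayJob carries a standard Kubernetes `ownerReferences` entry with `kind: RayJob`, and the helper names are made up for illustration.

```python
def owned_by_rayjob(ray_cluster: dict) -> bool:
    """True if any owner reference on the RayCluster points at a RayJob."""
    owners = ray_cluster.get("metadata", {}).get("ownerReferences", [])
    return any(owner.get("kind") == "RayJob" for owner in owners)


def should_queue_raycluster(ray_cluster: dict) -> bool:
    # Assumption: the owning RayJob is already queued, so its cluster is skipped.
    return not owned_by_rayjob(ray_cluster)


# Hypothetical objects for illustration.
standalone = {"metadata": {"name": "shared"}}
job_owned = {"metadata": {"name": "job-cluster", "ownerReferences": [
    {"kind": "RayJob", "name": "train"}]}}
```

Under this assumption, `standalone` is queued as its own workload, while `job_owned` is left to the RayJob that created it.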

@andrewsykim
Member Author

FYI @vicentefb who is working on the feature now

5 participants