Support queue-related logic with kube-queue #1519
Comments
Are there any experimental results to support this? |
Sorry for the late reply. The improvement in cluster resource utilization depends on cluster size, tenant division, and the types of workloads. According to actual statistics, utilization improved by 5% to 30% after adopting kube-queue together with a reasonable quota management system. @terrytangyuan |
@denkensk do you mind if we repurpose this issue for kueue? :) |
cc @tenzen-y, as I see you involved in both kubeflow and kueue :) |
@alculquicondor Thanks for doing cc. |
Hi! Do you have any further progress? |
@KunWuLuan We don't have any progress. Before we move the suspend feature forward, we need to work on #1714. |
I'll work on this issue after #1809 is completed. /assign |
I started this implementation right now. |
Completed: #1859 |
@tenzen-y: Closing this issue. |
In a deep learning cluster, it is common for all kinds of tasks (like TFJob, MPIJob, Deployment, StatefulSet, etc.) submitted by users to wait for resources to be allocated. Unfortunately, the Pod is the minimal scheduling unit, which makes it hard to manage tasks the way other cluster managers like Slurm do.
To fill this gap, @denkensk and I worked with other contributors to build a new queue system for tasks on Kubernetes clusters, called kube-queue. Unlike the queue in Volcano, kube-queue does not hijack the creation/submission of tasks. Instead, kube-queue relies on the operator of each task API (like TFJob, MPIJob) to wait until a clear ready-to-go message is confirmed by kube-queue and delivered to the task itself via an annotation on the CR.
We'd like to integrate kube-queue with training-operator, which requires only minimal changes to the `Reconcile` method. Such logic can be turned on and off via a launch argument of training-operator.
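As a rough illustration of the gating described above, here is a minimal Go sketch of the decision an operator could make at the top of its reconcile loop. It is not the actual training-operator or kube-queue code: the `Job` type, the `shouldReconcile` helper, and the annotation key and value are all hypothetical stand-ins, since the real names are not given in this issue.

```go
package main

// Hypothetical annotation key and value. The real key that kube-queue
// writes on the CR is not stated in this issue.
const (
	dequeuedAnnotation = "example.com/kube-queue-dequeued" // assumed name
	dequeuedValue      = "true"                            // assumed value
)

// Job is a stand-in for a training CR such as TFJob or MPIJob.
// Only the annotations matter for this sketch.
type Job struct {
	Name        string
	Annotations map[string]string
}

// ReconcileResult records whether reconciliation should continue and why.
type ReconcileResult struct {
	Proceed bool
	Reason  string
}

// shouldReconcile decides whether the operator may go on to create Pods.
// queueEnabled models the launch argument that turns the feature on or off.
func shouldReconcile(job Job, queueEnabled bool) ReconcileResult {
	if !queueEnabled {
		// Feature switched off: behave as the operator does without a queue.
		return ReconcileResult{Proceed: true, Reason: "queueing disabled"}
	}
	// Reading a missing key (or a nil map) yields the empty string in Go.
	if job.Annotations[dequeuedAnnotation] == dequeuedValue {
		return ReconcileResult{Proceed: true, Reason: "ready-to-go annotation present"}
	}
	// No ready-to-go message yet: skip Pod creation and wait.
	return ReconcileResult{Proceed: false, Reason: "waiting for kube-queue"}
}
```

In a real operator, a check like this would sit near the start of `Reconcile` and return early (without creating Pods) while the job is still queued; the annotation update from the queue would then trigger the next reconcile.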
The kube-queue proposal has been submitted to Kubernetes wg-batch and is pending further discussion. The implementation is now managing thousands of tasks within Alibaba and Baidu.