Support queue-related logic with kube-queue #1519

Closed
zw0610 opened this issue Jan 6, 2022 · 11 comments

@zw0610
Member

zw0610 commented Jan 6, 2022

In a deep learning cluster, it is common for all kinds of tasks submitted by users (like TFJob, MPIJob, Deployment, StatefulSet, etc.) to wait for resources to be allocated. Unfortunately, the Pod is the minimal scheduling unit in Kubernetes, which makes it hard to manage tasks the way cluster managers like Slurm do.

To fill this feature gap, @denkensk and I, together with other contributors, have built a new queue system for tasks on Kubernetes clusters called kube-queue. Unlike the queue in Volcano, kube-queue does not hijack the creation/submission of tasks. Instead, kube-queue relies on the operator of each task API (like TFJob, MPIJob) to wait until a clear ready-to-go signal is confirmed by kube-queue and delivered to the task itself via an annotation on the CR.

We'd like to integrate kube-queue with training-operator, which requires minimal changes to the Reconcile method:

import (
    "time"
    ...
    queuev1alpha1 "github.com/kube-queue/pkg/apis/scheduling/v1alpha1"
    ...
)

func (r *XXJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    ...
    // If kube-queue has not released the job yet, requeue and check again later.
    if queuev1alpha1.JobSuspended(job) {
        logger.Info("job suspended by kube-queue")
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }
    ...
}
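For readers less familiar with the annotation-based hand-off described above, here is a rough sketch of what such a check could look like on the operator side. This is only an illustration: the real check is the JobSuspended helper imported above, and the annotation key kube-queue.io/suspend used below is an assumption for demonstration, not the actual kube-queue API.

package controller

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// jobSuspendedByAnnotation is an illustrative stand-in for queuev1alpha1.JobSuspended.
// The annotation key is hypothetical; kube-queue defines the real contract.
func jobSuspendedByAnnotation(obj metav1.Object) bool {
    annotations := obj.GetAnnotations()
    if annotations == nil {
        return false
    }
    // kube-queue would remove or flip this annotation once the job is allowed to run.
    return annotations["kube-queue.io/suspend"] == "true"
}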

Certainly, such logic can be turned on and off via a launch argument of training-operator, as sketched below.
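As a rough illustration of that toggle, assuming a hypothetical --enable-kube-queue flag (the actual argument name and wiring would be decided during implementation):

package main

import "flag"

// Hypothetical command-line toggle; the real training-operator flag may be named differently.
var enableKubeQueue = flag.Bool("enable-kube-queue", false,
    "If true, respect kube-queue suspension signals before reconciling jobs.")

func main() {
    flag.Parse()
    // ... set up the controller manager and pass *enableKubeQueue into the job
    // reconcilers, which would then guard the check shown above with:
    //     if *enableKubeQueue && queuev1alpha1.JobSuspended(job) { ... }
}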

The kube-queue proposal has been submitted to Kubernetes wg-batch and is pending further discussion, and the implementation is already managing thousands of tasks within Alibaba and Baidu.

@terrytangyuan
Member

kube-queue automates and optimizes workload and resource quota management to maximize cluster resource utilization.

Are there any experimental results to support this?

@denkensk
Member

Are there any experimental results to support this?

Sorry for the late reply. The improvement in cluster resource utilization depends on cluster size, tenant division, and the types of workloads. According to actual statistics, there is a 5%~30% improvement after adopting Kube-queue together with a reasonable quota management system. @terrytangyuan

@alculquicondor

@denkensk do you mind if we repurpose this issue for kueue? :)

ref kubernetes-sigs/kueue#65

@alculquicondor

cc @tenzen-y, as I see you are involved in both Kubeflow and Kueue :)

@tenzen-y
Member

tenzen-y commented Jan 5, 2023

@alculquicondor Thanks for the cc.
Yes, I am aiming to eventually support Kueue in both training-operator and mpi-operator.
So we need to work on kubeflow/common#196.

@KunWuLuan

Hi! Is there any further progress?
I think suspend semantics are also needed for other workload types, for both kube-queue and Kueue.
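For reference, suspend semantics in batch/v1 Job let the controller skip creating pods while .spec.suspend is true, and an external queue flips the field when the job may start. A hedged sketch of how the same idea could look inside a Kubeflow job reconciler follows; the RunPolicy.Suspend field name is an assumption about the eventual API, not the final design:

    // Sketch only: modeled on batch/v1 Job's .spec.suspend; the RunPolicy.Suspend
    // field name for Kubeflow jobs is assumed here and may differ in the final API.
    if job.Spec.RunPolicy.Suspend != nil && *job.Spec.RunPolicy.Suspend {
        // While suspended, avoid creating (or delete) the job's pods and record a
        // Suspended condition, so a queue controller can decide when to resume it.
        logger.Info("job is suspended; skipping pod creation")
        return ctrl.Result{}, nil
    }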

@tenzen-y
Member

@KunWuLuan We don't have any progress yet. Before we move the suspend feature forward, we need to work on #1714.

@tenzen-y
Member

I'll work on this issue after #1809 is completed.

/assign

@tenzen-y
Member

tenzen-y commented Jul 5, 2023

I have just started working on this implementation.

@tenzen-y
Member

tenzen-y commented Aug 7, 2023

Completed: #1859
/close

@google-oss-prow

@tenzen-y: Closing this issue.

In response to this:

Completed: #1859
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
