Support coscheduling plugin #500
I support this feature enhancement. Meanwhile, will mpi-operator keep support for Volcano? |
Yes, I was thinking of adding support for the coscheduling plugin, not replacing volcano. |
I was trying to support this in the following pull requests: But there were still some version conflicts between the coscheduling plugin and the training operator the last time I tried to push this feature forward. I will close those two once new pull requests are submitted. |
@zw0610 I created an issue to support the coscheduling plugin for the training-operator at kubeflow/training-operator#1722 since I had missed those PRs. As for mpi-operator v2, I think we can support the scheduler plugin by simply implementing it in this repository, since mpi-operator v2 doesn't depend on kubeflow/common. |
Note that the coscheduling plugin is still not part of a standard Kubernetes installation. The enhancement is tracked at kubernetes/enhancements#3370, but it's currently stalled. I'm not sure if the design would change much. Perhaps @denkensk and @Huang-Wei could provide some updates.
Can you expand on this? |
@alculquicondor Thanks for letting me know about the PodGroup enhancement. It would be helpful for batch workloads. First, we can introduce the current coscheduling plugin and then adopt the new coscheduling specification once kubernetes/enhancements#3370 is released.
First, I plan to take over kubeflow/common#185 and kubeflow/training-operator#1526. Then, once those PRs are completed, I will work on introducing support for the coscheduling plugin in this repository based on kubeflow/common, without importing kubeflow/common into this repository. |
I mentioned this topic in today's SIG Scheduling meeting. The conclusion is that the PodGroup KEP needs some work, but the current contributors don't have enough bandwidth. @Huang-Wei might add some updates. I would say we are a little late for the 1.27 release timeline, but it would be nice to make some progress. If you are interested in contributing or taking over the design, you are welcome to. |
@alculquicondor Thanks for sharing the progress of the PodGroup KEP. I'm not familiar with kube-scheduler internals or cluster-autoscaler specifications, so I will dive into the kube-scheduler and cluster-autoscaler implementations. Once I understand those components better, I will let you know, and you can assign the PodGroup KEP to me. Although, I'm ok with someone else moving forward with the PodGroup KEP. BTW, I would like to confirm the deadline: as far as I can see from the 1.27 release schedule, we must complete the PodGroup KEP by 10th February 2023, right? |
That is correct, so I don't think we have enough time for this to go through for 1.27. But 1.28 is certainly possible. |
Thanks. I'm ok with either 1.27 or 1.28. |
Thanks @tenzen-y. I will continue the work on PodGroup. If it becomes a native API, we can upgrade to that API version to support it. |
Yes, I plan to work on that PR. We can track the progress of the training-operator with kubeflow/training-operator#1722. For the mpi-operator, I will work on this after implementations for the training-operator are completed.
Thanks for your effort. Yes, that's right. We can migrate to a native PodGroup API once the feature is graduated to beta or stable. |
Once #502 is completed, I'm going to work on this. /assign |
Now, training-operator supports coscheduling plugin 🎉 Next, I'll work on mpi-operator! |
Thank you for your work on this! |
I will implement the logic to support scheduler-plugins using the master branch, since scheduler-plugins switched the API group, ref:
NOTE: A new scheduler-plugins release, which includes the above changes, will be published this month. |
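As a rough sketch of what that group switch means for integrating controllers, assuming the old group was scheduling.sigs.k8s.io and the new one is scheduling.x-k8s.io (both names are assumptions here, not taken from this thread):

```go
// Illustrative only: a controller targeting scheduler-plugins PodGroups has to
// switch the GroupVersionResource it uses. Both group names below are assumptions.
package podgroupgvr

import "k8s.io/apimachinery/pkg/runtime/schema"

var (
	// Assumed legacy group used by earlier scheduler-plugins releases.
	legacyPodGroupGVR = schema.GroupVersionResource{Group: "scheduling.sigs.k8s.io", Version: "v1alpha1", Resource: "podgroups"}
	// Assumed new group used by the master branch / upcoming release.
	newPodGroupGVR = schema.GroupVersionResource{Group: "scheduling.x-k8s.io", Version: "v1alpha1", Resource: "podgroups"}
)
```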
Devil's advocate here 😅 Could the scheduler-plugins (and volcano for that matter) use a webhook to create the PodGroup objects based on the .schedulingPolicy field? |
OMG 😱
Does that mean the webhook converts the API group like conversion webhooks? |
No, the webhook (or a reconciler) would create the Volcano or scheduler-plugins PodGroup object based on the .schedulingPolicy field. My point is that we don't need to embed this code in kubeflow. |
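To make that alternative concrete, here is a minimal sketch (not an actual implementation from either project) of a standalone reconciler that reads an MPIJob as an unstructured object and creates a scheduler-plugins PodGroup from .spec.runPolicy.schedulingPolicy. The GVKs, the field path, and all names are assumptions for illustration only:

```go
// Hypothetical standalone reconciler, outside kubeflow, that creates a
// scheduler-plugins PodGroup from an MPIJob's schedulingPolicy.
// GVKs and the schedulingPolicy field path are assumptions, not verified APIs.
package podgroupsync

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var (
	mpiJobGVK   = schema.GroupVersionKind{Group: "kubeflow.org", Version: "v2beta1", Kind: "MPIJob"}
	podGroupGVK = schema.GroupVersionKind{Group: "scheduling.x-k8s.io", Version: "v1alpha1", Kind: "PodGroup"} // assumed group/version
)

type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the MPIJob without importing its Go types.
	mpiJob := &unstructured.Unstructured{}
	mpiJob.SetGroupVersionKind(mpiJobGVK)
	if err := r.Get(ctx, req.NamespacedName, mpiJob); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Read the gang size from the (assumed) schedulingPolicy path.
	minAvailable, found, err := unstructured.NestedInt64(mpiJob.Object,
		"spec", "runPolicy", "schedulingPolicy", "minAvailable")
	if err != nil || !found {
		return ctrl.Result{}, err
	}

	// Build a PodGroup named after the job with minMember = minAvailable.
	podGroup := &unstructured.Unstructured{}
	podGroup.SetGroupVersionKind(podGroupGVK)
	podGroup.SetNamespace(mpiJob.GetNamespace())
	podGroup.SetName(mpiJob.GetName())
	if err := unstructured.SetNestedField(podGroup.Object, minAvailable, "spec", "minMember"); err != nil {
		return ctrl.Result{}, err
	}

	if err := r.Create(ctx, podGroup); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

Wiring the pods to the PodGroup (labels and schedulerName) is deliberately left out of this sketch.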
A webhook may not be a good choice.
Also, Volcano is already integrated with Kubeflow in the code, so we can easily reuse that original logic for the scheduler-plugins integration as well. |
Thanks for your work on this. 👍🏻 |
A controller then, but in volcano/scheduling-plugins. I'm just trying to put a guard against adding support for more and more "schedulers", given that there are alternative ways for them to support kubeflow that are more sustainable and reduce maintenance burden in kubeflow. |
I think what you say is reasonable, but it may require an explicit design and a longer time. It's also not just Kubeflow's training-operator; it's probably a broader topic of how operators and scheduling are integrated. On the other side, Volcano is working on code integration with many operators, such as Spark-Operator and Spark itself. For now, we can reuse the logic Volcano already uses in the training-operator to integrate with scheduler-plugins as well. |
I see. I will add an implementation similar to the training operator's: this means the mpijob-controller creates the PodGroup using a reconciler. |
That's kind of my point. There will be a dependency problem somewhere, so we need to think of a good long-term solution. I fear that if we blindly accept new dependencies in kubeflow, nobody will think about this long-term solution and we will always fall into the easy path of "add it to kubeflow". My alternate intermediate proposal is to have integration controller(s) in a standalone repo. In kubeflow's defense, the RunPolicy field is common among all controllers and lives in the same place, so it can be used by duplicating just that part of the library instead of embedding a dependency.
You could introduce a webhook that only sets the suspend field to false once the PodGroup object exists. |
@alculquicondor Does that mean we create a repo under k-sigs org? Then do we add controllers to watch CustomJobs (e.g., MPIJob, VolcanoJob, RayJob) and to create PodGroups for custom schedulers (e.g., scheduler-plugins, and volcano-scheduler)?
IIRC, @Huang-Wei wants to avoid introducing webhooks to scheduler-plugins as much as possible. |
I'd be ok having a new repo in k-sigs for scheduler-plugins integration, but no other schedulers. Although maybe a submodule in the scheduler-plugins repo is enough to separate the dependencies. |
But to be fair, I'm just one owner of mpi-operator. Others probably need to chime in. |
Sounds good. cc: @kubeflow/wg-training-leads |
@Huang-Wei @denkensk What do you think about the above proposal? We would like to add other controllers, separate from the scheduler-plugins controller, to the scheduler-plugins repo to watch CustomJobs like MPIJob. |
@alculquicondor If my understanding is not correct, please correct it. |
Creating a k-sigs project doesn't work for all supported schedulers, right? Some projects are not Kubernetes subprojects. Personally, if we all agree to reverse the dependency, I'd spend some time evaluating whether watching the custom resource (via the k8s dynamic client) is compelling, so that a scheduler offering doesn't need to explicitly pin mpi-operator's dependency in go.mod. BTW: I'm not a fan of using an admission controller here, as that brings further complexity. An admission controller should be treated as part of the API server stack and should be tested thoroughly to meet certain SLO standards; however, in reality that's not (always) the case. I won't be surprised if an ill-maintained admission controller times out intermittently, returns slowly, or gets stuck on certificate issues, and those would impact the overall SLO for all workloads. |
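A minimal sketch of that dynamic-client approach, assuming the MPIJob GVR shown below (the GVR and the function name are illustrative assumptions):

```go
// Minimal sketch of watching MPIJobs through the dynamic client so that a
// scheduler-side controller does not need to pin mpi-operator's Go module.
// The GVR below is an assumption for illustration.
package watchsketch

import (
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

var mpiJobGVR = schema.GroupVersionResource{Group: "kubeflow.org", Version: "v2beta1", Resource: "mpijobs"}

func WatchMPIJobs(cfg *rest.Config, stopCh <-chan struct{}) error {
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(dyn, 30*time.Second)
	informer := factory.ForResource(mpiJobGVR).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			job := obj.(*unstructured.Unstructured)
			// A real controller would enqueue the job here and create or
			// refresh the corresponding PodGroup.
			_ = job.GetName()
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return nil
}
```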
It doesn't, and that's the point. Each scheduler should deal with its own problems as long as we don't have a common interface (a job scheduling subresource?). My hope is that this adds pressure to make progress on such a thing.
Exactly, we should be thinking about these potential solutions. |
I don't think reversing the dependencies is a reasonable approach. A scheduler, whether scheduler-plugins or Volcano, holds a relatively low-level position in the overall architecture, providing basic primitives like PodGroup and Quota for upper-level integrations such as the different job operators, rather than integrating all the different job APIs (MPIJob, TFJob, PyTorchJob, VolcanoJob, SparkJob, PaddleJob, RayJob...), which would be unreasonable in the architecture. The basic primitives will not undergo significant changes in the long term, but the API definitions for jobs will continue to evolve. |
Why is it reasonable to add the complexity in Kubeflow instead? I don't think either place is good. The ideal solution is a standard "Job", "PodGroup", or a job-queueing subresource. But in the meantime, a separate controller makes a lot of sense. It doesn't compromise the maintainability of either of the repositories, which are not supposed to know about each other. |
But maybe we don't need a separate repository for now. One good thing about all kubeflow jobs is that they all have the same fields relevant for scheduling in the same jsonpath. So you just need to duplicate that part of the API. |
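For illustration, a scheduler-side integration could carry a tiny local copy of just those fields and decode them from the shared JSON path; the field names below are inferred from this discussion and are not a verified copy of the kubeflow/common type:

```go
// Sketch of duplicating only the scheduling-relevant fields instead of
// importing kubeflow's libraries. Field names are assumptions.
package schedpolicy

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
)

// SchedulingPolicy is a local duplicate of the fields every Kubeflow job is
// assumed to expose at the same JSON path.
type SchedulingPolicy struct {
	MinAvailable           *int32              `json:"minAvailable,omitempty"`
	Queue                  string              `json:"queue,omitempty"`
	MinResources           corev1.ResourceList `json:"minResources,omitempty"`
	PriorityClass          string              `json:"priorityClass,omitempty"`
	ScheduleTimeoutSeconds *int32              `json:"scheduleTimeoutSeconds,omitempty"`
}

// SchedulingPolicyFrom extracts the policy from any Kubeflow job object
// without importing its Go types.
func SchedulingPolicyFrom(job *unstructured.Unstructured) (*SchedulingPolicy, error) {
	raw, found, err := unstructured.NestedMap(job.Object, "spec", "runPolicy", "schedulingPolicy")
	if err != nil || !found {
		return nil, err
	}
	policy := &SchedulingPolicy{}
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(raw, policy); err != nil {
		return nil, err
	}
	return policy, nil
}
```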
@alculquicondor Does that mean we add the logic for the PodGroup to the mpi-operator, similar to the training-operator? |
No, I'm saying that scheduler-plugins and volcano can use the fields without having to import all of kubeflow's libraries. |
TBH, I think my PoV doesn't have enough support. But I wanted to share my concerns and how I think going down the easy path is just going to prevent us from making progress toward the right solution in the future. |
I think this discussion is a good starting point for integrating CustomJobs with other schedulers. However, it might be better to proceed with this discussion upstream (k/k), not downstream (mpi-operator). So, I would like to support scheduler-plugins in the mpi-operator as originally planned. WDYT? |
In the long run, I think it's necessary to provide a Native PodGroup API that includes fields such as suspend, queue, minNumOfWorkers, etc. Different schedulers can implement the same Native PodGroup API. For example, the operator only needs to add a specific label to the pod, and the scheduler creates and maintains the PodGroup. This way, the operator and scheduler won't depend on each other. I think we can work towards this direction in the future. This will take a long time, but currently, it is necessary to support different schedulers in operators, which can actually help users solve problems. I think this is very important. |
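Purely as an illustration of that idea (every name below is hypothetical; no such native API exists today), such a PodGroup API might look like:

```go
// Hypothetical sketch of a "native" PodGroup-style API as proposed above.
// The API group, fields, and semantics are all assumptions, not an existing
// Kubernetes API.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroup is a hypothetical in-tree gang-scheduling primitive.
type PodGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupSpec `json:"spec,omitempty"`
}

type PodGroupSpec struct {
	// Suspend gates scheduling of the whole group, mirroring Job.spec.suspend.
	Suspend *bool `json:"suspend,omitempty"`
	// Queue names the queue the group should be admitted through.
	Queue string `json:"queue,omitempty"`
	// MinMember is the minimum number of pods ("workers") that must be
	// schedulable together before any of them is scheduled.
	MinMember int32 `json:"minMember,omitempty"`
}
```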
The issues in this discussion are short-term solutions. |
Let me summarize the topics for others. We share similar opinions for the long term: we would like to add more features for batch workloads (e.g., the PodGroup, a subresource...) to k/k. However, we have differing opinions for the short term, like the following:
|
@alculquicondor I think getting consent from @denkensk and @Huang-Wei to create a separate controller in the scheduler-plugins repo is hard. In the mpi-operator repo, I would like to add support for scheduler-plugins similar to the training-operator. |
I said all these things not because it was going to be easy, but because I believe it's the right direction :) But, given that the training-operator already has support for scheduler-plugins and there's some chance that the coscheduling plugin will be upstream in the not-so-far future, I'm ok with the proposal. |
Yes, I agree. |
Do you have a PR for E2E tests? I'm hoping to release kueue, and I would prefer we don't have to use an unreleased version of mpi-operator |
I agree. I will create a PR today. |
/close as completed. |
/close |
@tenzen-y: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/kind feature
The MPI Operator now supports all-or-nothing semantics, queuing logic, and more features for batch workloads via Volcano.
However, I think the maintenance cost of Volcano is a bit high for users who only want the all-or-nothing semantic.
So I would like to support that semantic via the coscheduling plugin.
By supporting the coscheduling plugin, users could get that semantic without additional components.
@alculquicondor @gaocegege @terrytangyuan @zw0610 WDYT?
Tasks: