Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support coscheduling plugin #500

Closed
2 tasks done
Tracked by #507
tenzen-y opened this issue Jan 12, 2023 · 67 comments
Closed
2 tasks done
Tracked by #507

Support coscheduling plugin #500

tenzen-y opened this issue Jan 12, 2023 · 67 comments
Assignees

Comments

@tenzen-y
Copy link
Member

tenzen-y commented Jan 12, 2023

/kind feature

MPI Operator now supports the all-or-nothing semantic, queuing logic, and more features for batch workload by the Volcano.
Although, I think the maintenance cost for Volcano is a bit high for users who want to use only all-or-nothing semantic.

So I would like to support that semantic by coscheduling plugin.

Supporting the coscheduling plugin, users could use that semantic without additional components.

@alculquicondor @gaocegege @terrytangyuan @zw0610 WDYT?


Tasks:

  • Implement the logic to support the coscheduling plugin
  • Add E2E for gang-scheduling using the scheduler plugins
@zw0610
Copy link
Member

zw0610 commented Jan 12, 2023

I support this feature enhancement. Meanwhile, will mpi-operator keep support for Volcano?

@tenzen-y
Copy link
Member Author

I support this feature enhancement. Meanwhile, will mpi-operator keep support for Volcano?

Yes, I was thinking of adding support for the coscheduling plugin, not replacing volcano.

@zw0610
Copy link
Member

zw0610 commented Jan 12, 2023

I was trying to support this in the following pull requests:
kubeflow/common#185
kubeflow/training-operator#1526

But there are still some version conflict between the coscheduling plugin and training operator when the last time I try to push forward this feature.

I will close these two when new pull requests are submitted.

@tenzen-y
Copy link
Member Author

@zw0610 I created an issue to support the coscheduling plugin for training-operator at kubeflow/training-operator#1722 since I missed those PRs.

As I can see mpi-operator v2, I think we can support the scheduler plugin by simply implementing it in this repository since mpi-operator v2 doesn't depend on kubeflow/common.

@alculquicondor
Copy link
Collaborator

Note that the coscheduling plugin is still not part of a standard kubernetes installation. The enhancement is tracked here, but it's currently stalled kubernetes/enhancements#3370

I'm not sure if the design would change much. Perhaps @denkensk and @Huang-Wei could provide some updates.

But there are still some version conflict between the coscheduling plugin and training operator when the last time I try to push forward this feature.

Can you expand on this?

@tenzen-y
Copy link
Member Author

tenzen-y commented Jan 12, 2023

Note that the coscheduling plugin is still not part of a standard kubernetes installation. The enhancement is tracked here, but it's currently stalled kubernetes/enhancements#3370

@alculquicondor Thanks for letting me know about PodGroup enhancement. It would be helpful for batch workloads.

First, we can introduce the current coscheduling plugin and then adapt new coscheduling specifications once kubernetes/enhancements#3370 is released.

But there are still some version conflict between the coscheduling plugin and training operator when the last time I try to push forward this feature.

Can you expand on this?

First, I plan to take over kubeflow/common#185 and kubeflow/training-operator#1526. Then I will work on introducing support for the coscheduling plugin in this repository based on kubeflow/common, not importing kubeflow/common to this repository, once those PRs completed.

@alculquicondor
Copy link
Collaborator

I mentioned this topic in today's SIG Scheduling meeting. The conclusion is that the PodGroup KEP needs some work but current contributors don't have enough bandwidth. @Huang-Wei might add some updates.

I would say we are a little late for the 1.27 release timeline, but it would be nice to make some progress. If you are interested in contributing or taking over the design, you are welcomed to.

@tenzen-y
Copy link
Member Author

@alculquicondor Thanks for sharing the progress for PoDGroup KEP.

I'm not familiar with kube-scheduler internal implementations and cluster-autoscaler specifications. So I will dive into kube-scheduler and cluster-autoscaler implementations.

Once I understand those components better, I will let you know and assign me PodGroup KEP.

Although I'm ok with someone else moving forward with PodGroup KEP.

BTW, I would like to confirm closing. As I can see about the 1.27 release schedule, we must complete PodGroup KEP until 10th February 2023, right?

@alculquicondor
Copy link
Collaborator

That is correct, so I don't think we have enough time for this to go through for 1.27. But 1.28 is certainly possible.

@tenzen-y
Copy link
Member Author

Thanks. I'm ok with either 1.27 or 1.28.

@denkensk
Copy link
Member

Thanks @tenzen-y
I suggest that you can start to continue the work kubeflow/common#185 to integrate with scheduler-plugins now. It's worth to do this because so many users have used co-scheduling in their cluster. This work will make it more convenient.

I will continue the work on podgroup. If it can be the in native api. We can upgrade to the api version to support.

@tenzen-y
Copy link
Member Author

Thanks @tenzen-y I suggest that you can start to continue the work kubeflow/common#185 to integrate with scheduler-plugins now. It's worth to do this because so many users have used co-scheduling in their cluster. This work will make it more convenient.

Yes, I plan to work on that PR. We can track the progress of the training-operator with kubeflow/training-operator#1722.

For the mpi-operator, I will work on this after implementations for the training-operator are completed.

I will continue the work on podgroup. If it can be the in native api. We can upgrade to the api version to support.

Thanks for your effort. Yes, that's right. We can migrate to a native PodGroup API once the feature is graduated to beta or stable.

@tenzen-y
Copy link
Member Author

Once #502 is completed, I'm going to work on this.

/assign

@tenzen-y
Copy link
Member Author

Now, training-operator supports coscheduling plugin 🎉

Next, I'll work on mpi-operator!

@terrytangyuan
Copy link
Member

Thank you for your work on this!

@tenzen-y tenzen-y mentioned this issue Jan 26, 2023
10 tasks
@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 3, 2023

I will implement the logic to support scheduler-plugins using the master branch since the scheduler-plugins switched the API group to x-k8s.io. Also, it switched labels for PodGroup to scheduling.x-k8s.io/pod-group.

ref:

NOTE: The new scheduler-plugins will be released this month, which include the above changes.

kubernetes-sigs/scheduler-plugins#503

cc: @alculquicondor @denkensk

@alculquicondor
Copy link
Collaborator

Devil's advocate here 😅

Could the scheduler-plugins (and volcano for that matter) use a webhook to create the PodGroup objects based on the .schedulingPolicy field?

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 3, 2023

Devil's advocate here 😅

OMG 😱

Could the scheduler-plugins (and volcano for that matter) use a webhook to create the PodGroup objects based on the .schedulingPolicy field?

Does that mean the webhook converts the API group like conversion webhooks?

@alculquicondor
Copy link
Collaborator

No, the webhook (or a reconciler) would create the volcano or scheduler-plugins PodGroup object based on the .schedulingPolicy

My point is that we don't need to embed this code in kubeflow.

@denkensk
Copy link
Member

denkensk commented Mar 3, 2023

My point is that we don't need to embed this code in kubeflow.

webhook may be not a good choice.

  1. deploying and managing webhook brings extra work
  2. webhook need to understand the different job api definitions in the train-operator

And also Volcano is already integrated with Kubeflow in the code, so we can easily reuse the original logic to talk about scheduling-plugins integration together.

@denkensk
Copy link
Member

denkensk commented Mar 3, 2023

NOTE: The new scheduler-plugins will be released this month, which include the above changes.

Thanks for your work on this. 👍🏻

@alculquicondor
Copy link
Collaborator

A controller then, but in volcano/scheduling-plugins.

I'm just trying to put a guard against adding support for more and more "schedulers", given that there are alternative ways for them to support kubeflow that are more sustainable and reduce maintenance burden in kubeflow.

@denkensk
Copy link
Member

denkensk commented Mar 3, 2023

I think what you say is reasonable, but it may require explicit design and a longer time.

It's also not just Kubeflow's training-operator. Probably a topic of how Operator and Scheduling are integrated. On the other side of the road, Volcano is working on code integration with many Operators, such as Spark-Operator and Spark itself.

We can reuse the original logic used by Volcano in training-operator to integrate with scheduler-plugins together firstly.

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 3, 2023

No, the webhook (or a reconciler) would create the volcano or scheduler-plugins PodGroup object based on the .schedulingPolicy

My point is that we don't need to embed this code in kubeflow.

I see. I will add a similar implementation to the training operator:

https://github.com/kubeflow/common/blob/4dbf18c85417b54158914d1d01c1744b4e3542f2/pkg/controller.v1/control/podgroup_control.go

This means mpijob-controller creates PodGroup using a reconcile.

@alculquicondor
Copy link
Collaborator

I'm sort of on the fence. I can see that reversing the dependency would make mpi-operator more sustainable, but the individual scheduler would have to vendor mpi-operator, so the cost of managing deps is still here - it just gets shifted.

That's kind of my point. There will be a dependency problem somewhere. So we need to think of a good long term solution. I fear that if we blindly accept new dependencies in kubeflow, nobody will think about this long term solution and always fall into the easy path of "add it to kubeflow".

My alternate intermediate proposal is that there are integration controller(s) in a standalone repo.

On kubeflow's defense, the RunPolicy field is common among all controllers in the same place, so it can be used by duplicating just that part of the library, instead of embedding a dependency.

creating PodGroup immediately in mpi-operator vs. relying on an external controller watching the CR (e.g., TFJob) and then creating a PodGroup

You could introduce a webhook that sets the suspend field to false until the PodGroup object exists.

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 8, 2023

My alternate intermediate proposal is that there are integration controller(s) in a standalone repo.

@alculquicondor Does that mean we create a repo under k-sigs org? Then do we add controllers to watch CustomJobs (e.g., MPIJob, VolcanoJob, RayJob) and to create PodGroups for custom schedulers (e.g., scheduler-plugins, and volcano-scheduler)?

You could introduce a webhook that sets the suspend field to false until the PodGroup object exists.

IIRC, @Huang-Wei wants to avoid introducing webhooks to scheduler-plugins as much as possible.

kubernetes-sigs/scheduler-plugins#518

@alculquicondor
Copy link
Collaborator

I'd be ok having a new repo in k-sigs for scheduler-plugins integration, but no other schedulers. Although maybe a submodule in the scheduler-plugins repo is enough to separate the dependencies.

@alculquicondor
Copy link
Collaborator

But to be fair, I'm just one owner of mpi-operator. Others probably need to chime-in.

@terrytangyuan @johnugeorge

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 8, 2023

I'd be ok having a new repo in k-sigs for scheduler-plugins integration, but no other schedulers. Although maybe a submodule in the scheduler-plugins repo is enough to separate the dependencies.

Sounds good.

cc: @kubeflow/wg-training-leads

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 8, 2023

I'd be ok having a new repo in k-sigs for scheduler-plugins integration, but no other schedulers. Although maybe a submodule in the scheduler-plugins repo is enough to separate the dependencies.

@Huang-Wei @denkensk What do you think about the above proposal? We would like to add other controllers, not part of the scheduler-plugins-controller, to the scheduler-plugins repo to watch CustomJobs like MPIJob.

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 8, 2023

@alculquicondor If my understanding is not correct, please correct it.

@Huang-Wei
Copy link

Creating a k-sigs project doesn't work for all supported schedulers, right? as some projects are not kubernetes subprojects.

Personally, if we all agree to reverse the dependency, I'd spend some time evaluating whether watching on custom resource is compelling (via k8s dynamic client), so that a scheduler offering doesn't need to explicitly pin mpi-operator's dependency in go.mod.

BTW: I'm not a fan of using admission controller here as that brings further complexity - admission controller should be treated as part of API server stack and should be tested thoroughly to meet certain SLO standards, however in reality it's not (always) the case. I won't be surprised if an ill-maintained admission controller would timeout intermittently or return slowly or stuck in certificate issues, and those would impact the overall SLO for all workloads.

@alculquicondor
Copy link
Collaborator

Creating a k-sigs project doesn't work for all supported schedulers, right?

It doesn't and that's the point. Each scheduler should deal with its own problems as long as we don't have a common interface (job scheduling subresource?). My hope is that this adds pressure to make progress on such thing.

via k8s dynamic client

Exactly, we should be thinking about these potential solutions.

@denkensk
Copy link
Member

denkensk commented Mar 9, 2023

I don't think reversing dependencies is a reasonable way.

As a scheduler, no matter scheduler-plugins or volcano, they holds a relatively low-level position in the entire architecture, providing basic primitives like podGroup, Quota for upper-level business integrations, such as different job operators, rather than integrating all different jobs api(MPIJob, TFJob, PytorchJob, VolcanoJob, SparkJob, PaddleJob, RayJob...) , which is unreasonable in the architecture.

The basic primitives will not undergo significant changes in the long term, but the API definitions for jobs will continue to evolve.

@alculquicondor
Copy link
Collaborator

which is unreasonable in the architecture.

Why is it reasonable to add the complexity in Kubeflow instead?

I don't think either place is good. The ideal solution is a standard "Job", "PodGroup" or a jobQueueing subresource.

But in the meantime, a separate controller makes a lot of sense. It doesn't compromise the maintainability of either of the repositories that are not supposed to know about each another.

@alculquicondor
Copy link
Collaborator

alculquicondor commented Mar 9, 2023

But maybe we don't need a separate repository for now. One good thing about all kubeflow jobs is that they all have the same fields relevant for scheduling in the same jsonpath. So you just need to duplicate that part of the API.

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 9, 2023

But maybe we don't need a separate repository for now. One good thing about all kubeflow jobs is that they all have the same fields relevant for scheduling in the same jsonpath. So you just need to duplicate that part of the API.

@alculquicondor Does that mean we add the logic for the PodGroup to the mpi-operator, similar to the training-operator?

@alculquicondor
Copy link
Collaborator

No, I'm saying that scheduler-plugins and volcano can use the fields without having to import all of kubeflow's libraries.

@alculquicondor
Copy link
Collaborator

TBH, I think my PoV doesn't have enough support. But I wanted to share my concerns and how I think going the easy path is just going to prevent us doing progress for the right solution in the future.

@tenzen-y
Copy link
Member Author

tenzen-y commented Mar 9, 2023

TBH, I think my PoV doesn't have enough support. But I wanted to share my concerns and how I think going the easy path is just going to prevent us doing progress for the right solution in the future.

I think this discussion is a good starting point for integrating CustomJobs with other scheduler.

However, it might be better to proceed with this discussion upstream (k/k), not downstream (mpi-operator).
This means, first of all, it might be better to start discussions at kubernetes/kubernetes#107294 and kubernetes/enhancements#3370. And then, after we add those features to k/k, we can come back downstream.

So, I would like to support the scheduler-plugins for the mpi-operator as originally planned.

WDYT?

@Syulin7
Copy link

Syulin7 commented Mar 10, 2023

In the long run, I think it's necessary to provide a Native PodGroup API that includes fields such as suspend, queue, minNumOfWorkers, etc. Different schedulers can implement the same Native PodGroup API.

For example, the operator only needs to add a specific label to the pod, and the scheduler creates and maintains the PodGroup. This way, the operator and scheduler won't depend on each other. I think we can work towards this direction in the future.

This will take a long time, but currently, it is necessary to support different schedulers in operators, which can actually help users solve problems. I think this is very important.

@tenzen-y
Copy link
Member Author

This will take a long time, but currently, it is necessary to support different schedulers in operators, which can actually help users solve problems. I think this is very important.

The issues in this discussion are short-term solutions.

@tenzen-y
Copy link
Member Author

Let me summarize the topics for others.

We have similar opinions in the long term. This means we would like to add more features for the batch workloads (e.g., the PodGroup, subresource...) to k/k.

However, we have other opinions in the short term, like the following:

  1. Add a separate controller to the scheduler-plugins repo, volcano repo, and so on for the CustomJobs (e.g., MPIJob, SparkJob...).
  2. Add the logic for doing CRUD for the PodGroup to downstream controllers (e.g., mpi-operator, spark-operator).

@tenzen-y
Copy link
Member Author

TBH, I think my PoV doesn't have enough support. But I wanted to share my concerns and how I think going the easy path is just going to prevent us doing progress for the right solution in the future.

I think this discussion is a good starting point for integrating CustomJobs with other scheduler.

However, it might be better to proceed with this discussion upstream (k/k), not downstream (mpi-operator). This means, first of all, it might be better to start discussions at kubernetes/kubernetes#107294 and kubernetes/enhancements#3370. And then, after we add those features to k/k, we can come back downstream.

So, I would like to support the scheduler-plugins for the mpi-operator as originally planned.

WDYT?

@alculquicondor I think getting consent to create a separate controller in the scheduler-plugins repo from @denkensk @Huang-Wei is hard.

In the mpi-operator repo, I would like to add support for the scheduler-plugins similar to the training-operator.

@alculquicondor
Copy link
Collaborator

I said all these things not because it was going to be easy, but because I believe it's the right direction :)

But, given that training-plugins already has support for scheduling-plugins and there's some chance that the coscheduling plugin will be upstream in the not so far future, I'm ok with the proposal.

@tenzen-y
Copy link
Member Author

I believe it's the right direction :)

Yes, I agree.
So I'd like to take over the KEP for the native PodGroup API if the current author of its KEP doesn't have enough bandwidth after we start the release cycle for the k/k v1.28 :)

@alculquicondor
Copy link
Collaborator

Do you have a PR for E2E tests? I'm hoping to release kueue, and I would prefer we don't have to use an unreleased version of mpi-operator

@tenzen-y
Copy link
Member Author

I would prefer we don't have to use an unreleased version of mpi-operator

I agree. I will create a PR today.

@tenzen-y
Copy link
Member Author

tenzen-y commented Apr 4, 2023

/close as completed.

@tenzen-y
Copy link
Member Author

tenzen-y commented Apr 4, 2023

/close

@google-oss-prow
Copy link

@tenzen-y: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants