Support coscheduling plugin #500
I support this feature enhancement. Meanwhile, will mpi-operator keep support for Volcano? |
Yes, I was thinking of adding support for the coscheduling plugin, not replacing volcano. |
I was trying to support this in the following pull requests: But there were still some version conflicts between the coscheduling plugin and the training operator the last time I tried to push this feature forward. I will close those two once new pull requests are submitted. |
@zw0610 I created an issue to support the coscheduling plugin for the training-operator at kubeflow/training-operator#1722 since I had missed those PRs. As for mpi-operator v2, I think we can support the scheduler plugin by simply implementing it in this repository, since mpi-operator v2 doesn't depend on kubeflow/common. |
Note that the coscheduling plugin is still not part of a standard Kubernetes installation. The enhancement is tracked at kubernetes/enhancements#3370, but it's currently stalled. I'm not sure if the design would change much. Perhaps @denkensk and @Huang-Wei could provide some updates.
Can you expand on this? |
@alculquicondor Thanks for letting me know about the PodGroup enhancement. It would be helpful for batch workloads. First, we can introduce the current coscheduling plugin and then adopt the new coscheduling specification once kubernetes/enhancements#3370 is released.
First, I plan to take over kubeflow/common#185 and kubeflow/training-operator#1526. Then, once those PRs are completed, I will work on introducing support for the coscheduling plugin in this repository based on kubeflow/common, without importing kubeflow/common into this repository. |
I mentioned this topic in today's SIG Scheduling meeting. The conclusion is that the PodGroup KEP needs some work, but the current contributors don't have enough bandwidth. @Huang-Wei might add some updates. I would say we are a little late for the 1.27 release timeline, but it would be nice to make some progress. If you are interested in contributing or taking over the design, you are welcome to. |
@alculquicondor Thanks for sharing the progress of the PodGroup KEP. I'm not familiar with kube-scheduler internals or cluster-autoscaler specifications, so I will dive into the kube-scheduler and cluster-autoscaler implementations. Once I understand those components better, I will let you know, and you can assign the PodGroup KEP to me. Although, I'm ok with someone else moving forward with the PodGroup KEP. BTW, I would like to confirm the deadline: as far as I can see from the 1.27 release schedule, we must complete the PodGroup KEP by 10th February 2023, right? |
That is correct, so I don't think we have enough time for this to go through for 1.27. But 1.28 is certainly possible. |
Thanks. I'm ok with either 1.27 or 1.28. |
Thanks @tenzen-y. I will continue the work on PodGroup. If it becomes a native API, we can upgrade to that API version to support it. |
Yes, I plan to work on that PR. We can track the progress of the training-operator with kubeflow/training-operator#1722. For the mpi-operator, I will work on this after implementations for the training-operator are completed.
Thanks for your effort. Yes, that's right. We can migrate to a native PodGroup API once the feature is graduated to beta or stable. |
Once #502 is completed, I'm going to work on this. /assign |
Now, training-operator supports coscheduling plugin 🎉 Next, I'll work on mpi-operator! |
Thank you for your work on this! |
I will implement the logic to support scheduler-plugins using the master branch, since scheduler-plugins switched the API group, ref:
NOTE: A new scheduler-plugins release, which includes the above changes, will be published this month. |
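As a rough sketch of what that group switch means for integrating controllers, assuming the old group was scheduling.sigs.k8s.io and the new one is scheduling.x-k8s.io (both names are assumptions here, not taken from this thread):

```go
// Illustrative only: a controller targeting scheduler-plugins PodGroups has to
// switch the GroupVersionResource it uses. Both group names below are assumptions.
package podgroupgvr

import "k8s.io/apimachinery/pkg/runtime/schema"

var (
	// Assumed legacy group used by earlier scheduler-plugins releases.
	legacyPodGroupGVR = schema.GroupVersionResource{Group: "scheduling.sigs.k8s.io", Version: "v1alpha1", Resource: "podgroups"}
	// Assumed new group used by the master branch / upcoming release.
	newPodGroupGVR = schema.GroupVersionResource{Group: "scheduling.x-k8s.io", Version: "v1alpha1", Resource: "podgroups"}
)
```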
Devil's advocate here 😅 Could the scheduler-plugins (and volcano for that matter) use a webhook to create the PodGroup objects based on the .schedulingPolicy field? |
OMG 😱
Does that mean the webhook converts the API group like conversion webhooks? |
No, the webhook (or a reconciler) would create the Volcano or scheduler-plugins PodGroup object based on the .schedulingPolicy field. My point is that we don't need to embed this code in kubeflow. |
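To make that alternative concrete, here is a minimal sketch (not an actual implementation from either project) of a standalone reconciler that reads an MPIJob as an unstructured object and creates a scheduler-plugins PodGroup from .spec.runPolicy.schedulingPolicy. The GVKs, the field path, and all names are assumptions for illustration only:

```go
// Hypothetical standalone reconciler, outside kubeflow, that creates a
// scheduler-plugins PodGroup from an MPIJob's schedulingPolicy.
// GVKs and the schedulingPolicy field path are assumptions, not verified APIs.
package podgroupsync

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var (
	mpiJobGVK   = schema.GroupVersionKind{Group: "kubeflow.org", Version: "v2beta1", Kind: "MPIJob"}
	podGroupGVK = schema.GroupVersionKind{Group: "scheduling.x-k8s.io", Version: "v1alpha1", Kind: "PodGroup"} // assumed group/version
)

type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the MPIJob without importing its Go types.
	mpiJob := &unstructured.Unstructured{}
	mpiJob.SetGroupVersionKind(mpiJobGVK)
	if err := r.Get(ctx, req.NamespacedName, mpiJob); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Read the gang size from the (assumed) schedulingPolicy path.
	minAvailable, found, err := unstructured.NestedInt64(mpiJob.Object,
		"spec", "runPolicy", "schedulingPolicy", "minAvailable")
	if err != nil || !found {
		return ctrl.Result{}, err
	}

	// Build a PodGroup named after the job with minMember = minAvailable.
	podGroup := &unstructured.Unstructured{}
	podGroup.SetGroupVersionKind(podGroupGVK)
	podGroup.SetNamespace(mpiJob.GetNamespace())
	podGroup.SetName(mpiJob.GetName())
	if err := unstructured.SetNestedField(podGroup.Object, minAvailable, "spec", "minMember"); err != nil {
		return ctrl.Result{}, err
	}

	if err := r.Create(ctx, podGroup); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

Wiring the pods to the PodGroup (labels and schedulerName) is deliberately left out of this sketch.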
A webhook may not be a good choice.
Also, Volcano is already integrated with Kubeflow in the code, so we can easily reuse that original logic for the scheduler-plugins integration as well. |
Thanks for your work on this. 👍🏻 |
A controller then, but in volcano/scheduling-plugins. I'm just trying to put a guard against adding support for more and more "schedulers", given that there are alternative ways for them to support kubeflow that are more sustainable and reduce maintenance burden in kubeflow. |
I think what you say is reasonable, but it may require an explicit design and a longer time. It's also not just Kubeflow's training-operator; it's probably a broader topic of how operators and scheduling are integrated. On the other side, Volcano is working on code integration with many operators, such as Spark-Operator and Spark itself. For now, we can reuse the logic Volcano already uses in the training-operator to integrate with scheduler-plugins as well. |
I see. I will add an implementation similar to the training operator's: this means the mpijob-controller creates the PodGroup using a reconciler. |
That's kind of my point. There will be a dependency problem somewhere, so we need to think of a good long-term solution. I fear that if we blindly accept new dependencies in kubeflow, nobody will think about this long-term solution and we will always fall into the easy path of "add it to kubeflow". My alternate intermediate proposal is to have integration controller(s) in a standalone repo. In kubeflow's defense, the RunPolicy field is common among all controllers and lives in the same place, so it can be used by duplicating just that part of the library instead of embedding a dependency.
You could introduce a webhook that only sets the suspend field to false once the PodGroup object exists. |
@alculquicondor Does that mean we create a repo under k-sigs org? Then do we add controllers to watch CustomJobs (e.g., MPIJob, VolcanoJob, RayJob) and to create PodGroups for custom schedulers (e.g., scheduler-plugins, and volcano-scheduler)?
IIRC, @Huang-Wei wants to avoid introducing webhooks to scheduler-plugins as much as possible. |
I'd be ok having a new repo in k-sigs for scheduler-plugins integration, but no other schedulers. Although maybe a submodule in the scheduler-plugins repo is enough to separate the dependencies. |
But to be fair, I'm just one owner of mpi-operator. Others probably need to chime in. |
Sounds good. cc: @kubeflow/wg-training-leads |
@Huang-Wei @denkensk What do you think about the above proposal? We would like to add other controllers, separate from the scheduler-plugins controller, to the scheduler-plugins repo to watch CustomJobs like MPIJob. |
@alculquicondor If my understanding is not correct, please correct it. |
Creating a k-sigs project doesn't work for all supported schedulers, right? Some projects are not Kubernetes subprojects. Personally, if we all agree to reverse the dependency, I'd spend some time evaluating whether watching the custom resource (via the k8s dynamic client) is compelling, so that a scheduler offering doesn't need to explicitly pin mpi-operator's dependency in go.mod. BTW: I'm not a fan of using an admission controller here, as that brings further complexity. An admission controller should be treated as part of the API server stack and should be tested thoroughly to meet certain SLO standards; however, in reality that's not (always) the case. I won't be surprised if an ill-maintained admission controller times out intermittently, returns slowly, or gets stuck on certificate issues, and those would impact the overall SLO for all workloads. |
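A minimal sketch of that dynamic-client approach, assuming the MPIJob GVR shown below (the GVR and the function name are illustrative assumptions):

```go
// Minimal sketch of watching MPIJobs through the dynamic client so that a
// scheduler-side controller does not need to pin mpi-operator's Go module.
// The GVR below is an assumption for illustration.
package watchsketch

import (
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

var mpiJobGVR = schema.GroupVersionResource{Group: "kubeflow.org", Version: "v2beta1", Resource: "mpijobs"}

func WatchMPIJobs(cfg *rest.Config, stopCh <-chan struct{}) error {
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(dyn, 30*time.Second)
	informer := factory.ForResource(mpiJobGVR).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			job := obj.(*unstructured.Unstructured)
			// A real controller would enqueue the job here and create or
			// refresh the corresponding PodGroup.
			_ = job.GetName()
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return nil
}
```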
It doesn't, and that's the point. Each scheduler should deal with its own problems as long as we don't have a common interface (a job scheduling subresource?). My hope is that this adds pressure to make progress on such a thing.
Exactly, we should be thinking about these potential solutions. |
I don't think reversing the dependencies is a reasonable approach. A scheduler, whether scheduler-plugins or Volcano, holds a relatively low-level position in the overall architecture, providing basic primitives like PodGroup and Quota for upper-level integrations such as the different job operators, rather than integrating all the different job APIs (MPIJob, TFJob, PyTorchJob, VolcanoJob, SparkJob, PaddleJob, RayJob...), which would be unreasonable in the architecture. The basic primitives will not undergo significant changes in the long term, but the API definitions for jobs will continue to evolve. |
Why is it reasonable to add the complexity in Kubeflow instead? I don't think either place is good. The ideal solution is a standard "Job", "PodGroup", or a job-queueing subresource. But in the meantime, a separate controller makes a lot of sense. It doesn't compromise the maintainability of either of the repositories, which are not supposed to know about each other. |
But maybe we don't need a separate repository for now. One good thing about all kubeflow jobs is that they all have the same fields relevant for scheduling in the same jsonpath. So you just need to duplicate that part of the API. |
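For illustration, a scheduler-side integration could carry a tiny local copy of just those fields and decode them from the shared JSON path; the field names below are inferred from this discussion and are not a verified copy of the kubeflow/common type:

```go
// Sketch of duplicating only the scheduling-relevant fields instead of
// importing kubeflow's libraries. Field names are assumptions.
package schedpolicy

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
)

// SchedulingPolicy is a local duplicate of the fields every Kubeflow job is
// assumed to expose at the same JSON path.
type SchedulingPolicy struct {
	MinAvailable           *int32              `json:"minAvailable,omitempty"`
	Queue                  string              `json:"queue,omitempty"`
	MinResources           corev1.ResourceList `json:"minResources,omitempty"`
	PriorityClass          string              `json:"priorityClass,omitempty"`
	ScheduleTimeoutSeconds *int32              `json:"scheduleTimeoutSeconds,omitempty"`
}

// SchedulingPolicyFrom extracts the policy from any Kubeflow job object
// without importing its Go types.
func SchedulingPolicyFrom(job *unstructured.Unstructured) (*SchedulingPolicy, error) {
	raw, found, err := unstructured.NestedMap(job.Object, "spec", "runPolicy", "schedulingPolicy")
	if err != nil || !found {
		return nil, err
	}
	policy := &SchedulingPolicy{}
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(raw, policy); err != nil {
		return nil, err
	}
	return policy, nil
}
```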
@alculquicondor Does that mean we add the logic for the PodGroup to the mpi-operator, similar to the training-operator? |
No, I'm saying that scheduler-plugins and volcano can use the fields without having to import all of kubeflow's libraries. |
TBH, I think my PoV doesn't have enough support. But I wanted to share my concerns and how I think going down the easy path is just going to prevent us from making progress toward the right solution in the future. |
I think this discussion is a good starting point for integrating CustomJobs with other schedulers. However, it might be better to proceed with this discussion upstream (k/k), not downstream (mpi-operator). So, I would like to support scheduler-plugins in the mpi-operator as originally planned. WDYT? |
In the long run, I think it's necessary to provide a Native PodGroup API that includes fields such as suspend, queue, minNumOfWorkers, etc. Different schedulers can implement the same Native PodGroup API. For example, the operator only needs to add a specific label to the pod, and the scheduler creates and maintains the PodGroup. This way, the operator and scheduler won't depend on each other. I think we can work towards this direction in the future. This will take a long time, but currently, it is necessary to support different schedulers in operators, which can actually help users solve problems. I think this is very important. |
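Purely as an illustration of that idea (every name below is hypothetical; no such native API exists today), such a PodGroup API might look like:

```go
// Hypothetical sketch of a "native" PodGroup-style API as proposed above.
// The API group, fields, and semantics are all assumptions, not an existing
// Kubernetes API.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroup is a hypothetical in-tree gang-scheduling primitive.
type PodGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupSpec `json:"spec,omitempty"`
}

type PodGroupSpec struct {
	// Suspend gates scheduling of the whole group, mirroring Job.spec.suspend.
	Suspend *bool `json:"suspend,omitempty"`
	// Queue names the queue the group should be admitted through.
	Queue string `json:"queue,omitempty"`
	// MinMember is the minimum number of pods ("workers") that must be
	// schedulable together before any of them is scheduled.
	MinMember int32 `json:"minMember,omitempty"`
}
```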
The issues in this discussion are short-term solutions. |
Let me summarize the topics for others. We share similar opinions for the long term: we would like to add more features for batch workloads (e.g., the PodGroup, a subresource...) to k/k. However, we have differing opinions for the short term, like the following:
|
@alculquicondor I think getting consent from @denkensk and @Huang-Wei to create a separate controller in the scheduler-plugins repo is hard. In the mpi-operator repo, I would like to add support for scheduler-plugins similar to the training-operator. |
I said all these things not because it was going to be easy, but because I believe it's the right direction :) But, given that the training-operator already has support for scheduler-plugins and there's some chance that the coscheduling plugin will be upstream in the not-so-far future, I'm ok with the proposal. |
Yes, I agree. |
Do you have a PR for E2E tests? I'm hoping to release kueue, and I would prefer we don't have to use an unreleased version of mpi-operator |
I agree. I will create a PR today. |
/close as completed. |
/close |
@tenzen-y: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/kind feature
The MPI Operator now supports all-or-nothing semantics, queuing logic, and more features for batch workloads via Volcano.
However, I think the maintenance cost of Volcano is a bit high for users who only want the all-or-nothing semantic.
So I would like to support that semantic via the coscheduling plugin.
By supporting the coscheduling plugin, users could get that semantic without additional components.
@alculquicondor @gaocegege @terrytangyuan @zw0610 WDYT?
Tasks: