
Support coscheduling plugin #1722

Closed
tenzen-y opened this issue Jan 12, 2023 · 10 comments · Fixed by #1724

@tenzen-y
Member

/kind feature

Training Operator now supports all-or-nothing semantics, queuing logic, and other batch-workload features via Volcano.
However, I think the maintenance cost of Volcano is a bit high for users who only want the all-or-nothing semantics.

So I would like to support those semantics via the coscheduling plugin.

With support for the coscheduling plugin, users could get those semantics without deploying additional components.
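For context, the coscheduling plugin in kubernetes-sigs/scheduler-plugins expresses the all-or-nothing requirement with a PodGroup resource that pods join via a label. A minimal sketch (the exact `apiVersion` and the pod-group label key depend on the scheduler-plugins release, and the names here are illustrative, not from this repo):

```yaml
# Illustrative PodGroup for a job with two workers; the exact apiVersion
# and pod-group label key vary across scheduler-plugins releases.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pytorch-dist-mnist
spec:
  minMember: 2   # place pods only when all 2 can be scheduled together
```

Each worker pod would then carry a pod-group label referencing `pytorch-dist-mnist` so the plugin treats them as one gang.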

@kubeflow/wg-training-leads WDYT?

@johnugeorge
Member

Yes, this is great. We need to remove the direct dependency on Volcano in the code and make the scheduler configurable. Currently, the scheduler is passed as a command-line argument, but the code still has a hard dependency on Volcano.

Related:
#1683 (comment)
#1688 (comment)

@tenzen-y
Member Author

@johnugeorge Does that mean we stop supporting Volcano?
I was thinking of just adding coscheduling plugin support, not replacing Volcano.

But I'm OK either way on removing Volcano support.

@johnugeorge
Member

No. I meant it should be dynamic, i.e. decoupling the main code from the Volcano implementation. See https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/pytorchjob_controller.go#L88

In Katib, trial resources can be added via a command-line argument: https://github.com/kubeflow/katib/blob/master/cmd/katib-controller/v1beta1/main.go#L62. I was wondering if we could achieve something like this.

@tenzen-y
Member Author

tenzen-y commented Jan 12, 2023

> No. I meant it should be dynamic, i.e. decoupling the main code from the Volcano implementation. See https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/pytorchjob_controller.go#L88

Ah, I see. That makes sense.

> In Katib, trial resources can be added via a command-line argument: https://github.com/kubeflow/katib/blob/master/cmd/katib-controller/v1beta1/main.go#L62. I was wondering if we could achieve something like this.

Maybe that solution would work fine.
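The Katib-style approach could be sketched as a small registry keyed by a command-line flag. This is only an illustrative Go sketch under stated assumptions: the flag name `gang-scheduler-name`, the `GangScheduler` interface, and the backend names are hypothetical, not the actual training-operator code.

```go
// Hypothetical sketch (not the actual training-operator API): selecting a
// gang-scheduling backend by name from a command-line flag, so the
// controller has no hard-coded dependency on any one scheduler.
package main

import (
	"flag"
	"fmt"
	"os"
)

// GangScheduler is an illustrative abstraction over gang-scheduling backends.
type GangScheduler interface {
	Name() string
}

type volcano struct{}

func (volcano) Name() string { return "volcano" }

type coscheduling struct{}

func (coscheduling) Name() string { return "scheduler-plugins" }

// registry maps flag values to backends; a new backend plugs in here
// without touching the reconciler code.
var registry = map[string]GangScheduler{
	"volcano":           volcano{},
	"scheduler-plugins": coscheduling{},
}

// pickScheduler resolves a backend name to an implementation.
func pickScheduler(name string) (GangScheduler, error) {
	s, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown gang scheduler %q", name)
	}
	return s, nil
}

func main() {
	name := flag.String("gang-scheduler-name", "volcano",
		"gang-scheduling backend to use (illustrative flag name)")
	flag.Parse()

	s, err := pickScheduler(*name)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using gang scheduler:", s.Name())
}
```

Running this sketch with `-gang-scheduler-name=scheduler-plugins` would select the coscheduling backend without the binary importing Volcano-specific code at the call site.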

@tenzen-y
Member Author

Similar to #1518

@tenzen-y
Member Author

As mentioned by @zw0610 in kubeflow/mpi-operator#500 (comment), I will work on kubeflow/common#185 and #1526.

@johnugeorge
Member

@tenzen-y Do you want to wait till #1714 (comment) is done?

@tenzen-y
Member Author

@johnugeorge If the community agrees to include this feature in the next training-operator release (v1.6.0), I would like to work on it as soon as possible.

WDYT?

@johnugeorge
Member

It would be great if you could make it into this release.

@tenzen-y
Member Author

> It would be great if you could make it into this release.

Sounds good. I'll start working on this ASAP.

/assign
