Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider the way to inform users that the specified gang scheduler isn't supported #1811

Closed
tenzen-y opened this issue May 23, 2023 · 15 comments · Fixed by #1929
Closed

Consider the way to inform users that the specified gang scheduler isn't supported #1811

tenzen-y opened this issue May 23, 2023 · 15 comments · Fixed by #1929
Assignees

Comments

@tenzen-y
Copy link
Member

tenzen-y commented May 23, 2023

Since #1747, we have treated the scheduler-plugins as a default gang-scheduler, and we use the scheduler-plugins if users set anything other than volcano to https://github.com/kubeflow/training-operator/blob/6d6d83dfe7d8c3f4f3938c00c04d8b7410caa205/cmd/training-operator.v1/main.go#LL76C1-L77C105.

So, if the specified gang scheduler isn't supported in the training operator, informing that would be better.

/cc @johnugeorge

@tenzen-y
Copy link
Member Author

/remove-kind discussion
/kind feature

@tenzen-y
Copy link
Member Author

cc: @Syulin7

@Syulin7
Copy link
Contributor

Syulin7 commented May 24, 2023

Do we need to check if the podgroup CRD exists? Users should ensure that their scheduler can support Volcano podgroup or scheduler-plugins podgroup.

@tenzen-y
Copy link
Member Author

Do we need to check if the podgroup CRD exists? Users should ensure that their scheduler can support Volcano podgroup or scheduler-plugins podgroup.

It sounds good to me. If users set --gang-scheduler-name=volcano, the training operator checks whether there is the podgroup CRD in the cluster; if users set another value to --gang-scheduler-name, the training operator checks the scheduler-plugins CRD in the cluster.

@johnugeorge wdyt?

@johnugeorge
Copy link
Member

The key question is, Are we planning to support only volcano and scheduler plugins in the future? If we are,(due to overhead in management), it sounds good to me. Else, it might be better to make CRD check to be dynamic(like in https://github.com/kubeflow/katib/blob/master/cmd/katib-controller/v1beta1/main.go#L62)

@tenzen-y
Copy link
Member Author

Else, it might be better to make CRD check to be dynamic(like in https://github.com/kubeflow/katib/blob/master/cmd/katib-controller/v1beta1/main.go#L62)

Makes sense.
We probably need to support native PodGroup CRD (kube-controller-manager) once the feature graduates to beta.

@tenzen-y
Copy link
Member Author

@Syulin7 Any thoughts?

@Syulin7
Copy link
Contributor

Syulin7 commented May 25, 2023

Else, it might be better to make CRD check to be dynamic(like in https://github.com/kubeflow/katib/blob/master/cmd/katib-controller/v1beta1/main.go#L62)

Agree, users decide whether or not to check the podgroup CRD themselves?

@tenzen-y
Copy link
Member Author

Else, it might be better to make CRD check to be dynamic(like in https://github.com/kubeflow/katib/blob/master/cmd/katib-controller/v1beta1/main.go#L62)

Agree, users decide whether or not to check the podgroup CRD themselves?

Probably, we can have podgroups for the volcano and the scheudler-plugins as a default.

@tenzen-y
Copy link
Member Author

@Syulin7 Do you have enough time to work on this?

@Syulin7
Copy link
Contributor

Syulin7 commented May 27, 2023

/assign

@tenzen-y
Copy link
Member Author

tenzen-y commented Jul 5, 2023

@Syulin7 Do you have any progress?

@Syulin7
Copy link
Contributor

Syulin7 commented Sep 5, 2023

@Syulin7 Do you have any progress?

@tenzen-y Sorry for late reply(Forgot to check the message.). there is no progress at the moment. I will work on this.

@tenzen-y
Copy link
Member Author

tenzen-y commented Sep 5, 2023

@Syulin7 Do you have any progress?

@tenzen-y Sorry for late reply(Forgot to check the message.). there is no progress at the moment. I will work on this.

Thanks for the response. Feel free to send a ping to me once the PR is ready.

@johnugeorge
Copy link
Member

@Syulin7 Do you have any update?
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants