Support koordinator gang scheduler plugin #1746

Closed
Tracked by #1809
Syulin7 opened this issue Jan 29, 2023 · 11 comments · Fixed by #1747

Comments

@Syulin7
Contributor

Syulin7 commented Jan 29, 2023

/kind feature

The Training Operator now supports several gang schedulers (Volcano, scheduler-plugins), so we can easily add the Koordinator gang scheduler.

The Koordinator gang scheduler uses the PodGroup defined in scheduler-plugins/coscheduling and will remain compatible with it.

About Koordinator: https://github.com/koordinator-sh/koordinator

Koordinator gang scheduler: https://koordinator.sh/docs/user-manuals/gang-scheduling

So I would like to add support for the Koordinator gang scheduler plugin.

@tenzen-y @johnugeorge WDYT?

@tenzen-y
Member

@Syulin7 Thanks for creating this issue.

Generally, it sounds good to me.

OTOH, I have some questions.
From a quick look at the docs, Koordinator seems to use the PodGroup of the coscheduling plugin.

  1. What is the difference between the coscheduling plugin and Koordinator gang scheduling?
  2. Does Koordinator have any custom resources for gang scheduling?
  3. Does the training-operator need to operate (CRUD) custom resources only for Koordinator to manage gang scheduling?

@Syulin7
Contributor Author

Syulin7 commented Jan 30, 2023

@tenzen-y Thanks for the review.

  1. The PodGroup API and design are the same in the coscheduling plugin and Koordinator; the only difference right now is the scheduler name. The Koordinator scheduler name is koord-scheduler. Some users run Koordinator as a second scheduler and can set the scheduler name to "koord-scheduler" directly to use the gang scheduling capability.
  2. No, Koordinator doesn't introduce any new custom resources for gang scheduling.
  3. No, Koordinator does not have custom resources for gang scheduling; it uses the PodGroup defined in scheduler-plugins/coscheduling.

@tenzen-y
Member

@Syulin7 I appreciate your clear explanation.

I designed the PodGroup control for scheduler-plugins in the kubeflow/common repo so that a secondary scheduler name can be set.

So you can use the Koordinator integration by setting koord-scheduler as the scheduler name in each PodSpec, as described in https://www.kubeflow.org/docs/components/training/job-scheduling/#scheduler-plugins-with-coscheduling.

Also, as far as I can see from kubeflow/common#209, almost all of the code is the same as the code for the coscheduling plugin.

Hence, I would suggest that we only modify training-operator.v1/main.go in the kubeflow/training-operator repo as follows, since keeping the same code in multiple places would incur high maintenance costs:

Before:

} else if strings.EqualFold(gangSchedulerName, string(common.GangSchedulerSchedulerPlugins)) {
	gangSchedulingSetupFunc = common.GenSchedulerPluginsSetupFunc(mgr.GetClient())
}

After:

} else if strings.EqualFold(gangSchedulerName, string(common.GangSchedulerSchedulerPlugins)) ||
	strings.EqualFold(gangSchedulerName, string(common.GangSchedulerKoordScheduler)) {
	gangSchedulingSetupFunc = common.GenSchedulerPluginsSetupFunc(mgr.GetClient())
}
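
For readers who want to try the idea outside the operator, here is a minimal, self-contained sketch of the same case-insensitive dispatch. It is not the training-operator code: the constants and the podGroupBackend function are hypothetical stand-ins, the string values "volcano" and "scheduler-plugins" are assumptions, and only "koord-scheduler" is taken from this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-ins for the gang scheduler constants in kubeflow/common.
// The values are assumptions for this sketch; "koord-scheduler" is the
// scheduler name mentioned in this thread.
const (
	gangSchedulerVolcano          = "volcano"
	gangSchedulerSchedulerPlugins = "scheduler-plugins"
	gangSchedulerKoordScheduler   = "koord-scheduler"
)

// podGroupBackend reports which PodGroup setup would be selected for the
// given gang scheduler name. Matching ignores case, as strings.EqualFold does
// in the snippet above. The returned labels are illustrative only.
func podGroupBackend(name string) (string, bool) {
	switch {
	case strings.EqualFold(name, gangSchedulerVolcano):
		return "volcano PodGroup", true
	case strings.EqualFold(name, gangSchedulerSchedulerPlugins),
		strings.EqualFold(name, gangSchedulerKoordScheduler):
		// Both names share one setup path because Koordinator reuses the
		// scheduler-plugins PodGroup.
		return "scheduler-plugins PodGroup", true
	default:
		return "", false
	}
}

func main() {
	for _, name := range []string{"volcano", "Scheduler-Plugins", "koord-scheduler", "none"} {
		backend, ok := podGroupBackend(name)
		fmt.Printf("%s: supported=%t backend=%q\n", name, ok, backend)
	}
}
```

Routing both names to one setup function keeps a single PodGroup code path, which is the maintenance argument made above.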

@Syulin7
Contributor Author

Syulin7 commented Jan 30, 2023

@tenzen-y Thanks for your suggestion.

My view is that users should not need to specify the scheduler name (koord-scheduler) in CustomJob resources, similar to the Volcano implementation, which is more user-friendly. That is why almost all of the code is the same as the code for the coscheduling plugin.

The only difference is:

func (s *KoordinatorControl) DecoratePodTemplateSpec(pts *corev1.PodTemplateSpec, job metav1.Object, _ string) {
	if len(pts.Spec.SchedulerName) == 0 {
		pts.Spec.SchedulerName = s.GetSchedulerName()
	}
	...
}

Perhaps in the future, when we need to support new Koordinator gang features, the implementation of the Koordinator PodGroup control will diverge, and there will not be much duplicated code.

For now, I plan to modify it according to your suggestion. WDYT?

@tenzen-y
Member

My view is that users should not need to specify the scheduler name (koord-scheduler) in CustomJob resources, similar to the Volcano implementation, which is more user-friendly. That is why almost all of the code is the same as the code for the coscheduling plugin.

@Syulin7 I agree with that. However, I would like to avoid duplicated code, since it makes the training-operator chaotic.

So, I would suggest using an embedded struct and overriding the method to reduce duplicated code, as in my review comments in kubeflow/common#209.
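
A minimal sketch of the embedded-struct approach, assuming simplified stand-in types rather than the real kubeflow/common ones: podTemplate replaces corev1.PodTemplateSpec, schedulerPluginsControl and koordinatorControl are hypothetical names, and the label key and the base scheduler name are assumptions for illustration.

```go
package main

import "fmt"

// podGroupLabel is an assumed label key used to tie a pod to its PodGroup.
const podGroupLabel = "scheduling.x-k8s.io/pod-group"

// podTemplate is a hypothetical stand-in for corev1.PodTemplateSpec.
type podTemplate struct {
	SchedulerName string
	Labels        map[string]string
}

// schedulerPluginsControl is a hypothetical stand-in for the coscheduling
// PodGroup control.
type schedulerPluginsControl struct{}

// GetSchedulerName returns a placeholder name for the base scheduler.
func (c *schedulerPluginsControl) GetSchedulerName() string {
	return "scheduler-plugins-scheduler"
}

// DecoratePodTemplateSpec labels the pod template with its PodGroup name.
func (c *schedulerPluginsControl) DecoratePodTemplateSpec(pts *podTemplate, jobName string) {
	if pts.Labels == nil {
		pts.Labels = map[string]string{}
	}
	pts.Labels[podGroupLabel] = jobName
}

// koordinatorControl embeds the base control, so every method it does not
// redefine is promoted unchanged.
type koordinatorControl struct {
	schedulerPluginsControl
}

// GetSchedulerName shadows the embedded method.
func (c *koordinatorControl) GetSchedulerName() string {
	return "koord-scheduler"
}

// DecoratePodTemplateSpec defaults the scheduler name only when the user has
// not set one, then delegates the shared work to the embedded control.
func (c *koordinatorControl) DecoratePodTemplateSpec(pts *podTemplate, jobName string) {
	if len(pts.SchedulerName) == 0 {
		pts.SchedulerName = c.GetSchedulerName()
	}
	c.schedulerPluginsControl.DecoratePodTemplateSpec(pts, jobName)
}

func main() {
	k := &koordinatorControl{}
	pts := &podTemplate{}
	k.DecoratePodTemplateSpec(pts, "my-job")
	fmt.Println(pts.SchedulerName, pts.Labels[podGroupLabel])
}
```

One caveat with this design: Go embedding is not virtual dispatch. If a method of the embedded type calls GetSchedulerName internally, it gets the embedded version, so the outer type has to redefine every method where the scheduler name matters.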

@Syulin7
Contributor Author

Syulin7 commented Jan 30, 2023

@Syulin7 I agree with that. However, I would like to avoid duplicated code, since it makes the training-operator chaotic.

So, I would suggest using an embedded struct and overriding the method to reduce duplicated code, as in my review comments in kubeflow/common#209.

Agreed, I will modify it according to your suggestion.
Thank you very much for your suggestion. @tenzen-y

@Syulin7
Contributor Author

Syulin7 commented Jan 31, 2023

@tenzen-y I have updated the PR, PTAL.

kubeflow/common#209
#1747

By the way, the Slack channel link in the README is no longer available. How can I join the Slack channel?

@tenzen-y
Member

@tenzen-y I have updated the PR, PTAL.

kubeflow/common#209 #1747

By the way, the Slack channel link in the README is no longer available. How can I join the Slack channel?

@Syulin7
Oh...
Maybe you can join the Slack workspace using https://github.com/kubeflow/community/tree/master/slack.

@lowang-bh
Member

So we don't support the Koordinator gang scheduler now?

@tenzen-y
Member

So we don't support the Koordinator gang scheduler now?

@lowang-bh No, we do support the Koordinator gang scheduler.

@tenzen-y
Member

The Koordinator gang scheduler uses the kubernetes-sigs/scheduler-plugins PodGroup; it doesn't have a custom PodGroup resource of its own.


4 participants