Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make scheduler-plugins the default gang scheduler. #1747

Merged
merged 1 commit into from
May 23, 2023

Conversation

Syulin7
Copy link
Contributor

@Syulin7 Syulin7 commented Jan 29, 2023

What this PR does / why we need it:
Training Operator now supports many gang schedulers(volcano, scheduler-plugins), and now we can easily add koordinator gang scheduler.

Related: #1746

Reviewers can check the koordinator gang schedule feature with the koordinator in the following steps:

# install koordinator
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/
$ helm repo update
$ helm install koordinator koordinator-sh/koordinator --version 1.1.0
# pull this pr and run training-operator with --gang-scheduler-name=koord-scheduler
$ cat <<EOF | kubectl apply -f -
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 10
      scheduleTimeoutSeconds: 120
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: koord-scheduler
          containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            command:
            - "python"
            - "/var/tf_mnist/mnist_with_summaries.py"
EOF
$ kubectl get pods -n kubeflow
NAME                    READY   STATUS    RESTARTS   AGE
tfjob-simple-worker-0   0/1     Pending   0          82s
tfjob-simple-worker-1   0/1     Pending   0          82s
$ kubectl describe pods -n kubeflow tfjob-simple-worker-0 | grep -A10  Events
Events:
  Type     Reason            Age   From             Message
  ----     ------            ----  ----             -------
  Warning  FailedScheduling  15s   koord-scheduler  running PreFilter plugin "Coscheduling": gang child pod not collect enough, gangName: kubeflow/tfjob-simple, podName: kubeflow/tfjob-simple-worker-0

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1746

Checklist:

  • Docs included if any changes are user facing

@Syulin7
Copy link
Contributor Author

Syulin7 commented Jan 29, 2023

@tenzen-y @johnugeorge PTAL, Related: kubeflow/common#209

@tenzen-y
Copy link
Member

/hold

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -73,7 +73,8 @@ func main() {
flag.StringVar(&leaderElectionID, "leader-election-id", "1ca428e5.training-operator.kubeflow.org", "The ID for leader election.")
flag.Var(&enabledSchemes, "enable-scheme", "Enable scheme(s) as --enable-scheme=tfjob --enable-scheme=pytorchjob, case insensitive."+
" Now supporting TFJob, PyTorchJob, MXNetJob, XGBoostJob, PaddleJob. By default, all supported schemes will be enabled.")
flag.StringVar(&gangSchedulerName, "gang-scheduler-name", "none", "The scheduler to gang-schedule kubeflow jobs, defaults to none")
flag.StringVar(&gangSchedulerName, "gang-scheduler-name", "none", "The scheduler to gang-schedule kubeflow jobs, defaults to none."+
" Now supporting node, volcano, scheduler-plugins, koord-scheduler.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
" Now supporting node, volcano, scheduler-plugins, koord-scheduler.")
" Now supporting none, volcano, scheduler-plugins, koord-scheduler.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@coveralls
Copy link

coveralls commented Mar 13, 2023

Pull Request Test Coverage Report for Build 4412816813

  • 0 of 1 (0.0%) changed or added relevant line in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 39.555%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/apis/kubeflow.org/v1/openapi_generated.go 0 1 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob_controller.go 2 77.16%
Totals Coverage Status
Change from base Build 4405001546: 0.2%
Covered Lines: 2740
Relevant Lines: 6927

💛 - Coveralls

@johnugeorge
Copy link
Member

@tenzen-y Are there any blockers here?

@tenzen-y
Copy link
Member

@johnugeorge We need to cut a release on the common repository.

kubeflow/common#209 (comment)

@johnugeorge
Copy link
Member

@tenzen-y
Copy link
Member

@johnugeorge Thank you!

@Syulin7 Can you update this PR with a new common library version?

@Syulin7
Copy link
Contributor Author

Syulin7 commented May 18, 2023

/assign
Thanks, I will update it this week.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Syulin7 Thank you for the updates!
Mostly, LGTM. I left a comment for a nit.

@@ -73,7 +73,8 @@ func main() {
flag.StringVar(&leaderElectionID, "leader-election-id", "1ca428e5.training-operator.kubeflow.org", "The ID for leader election.")
flag.Var(&enabledSchemes, "enable-scheme", "Enable scheme(s) as --enable-scheme=tfjob --enable-scheme=pytorchjob, case insensitive."+
" Now supporting TFJob, PyTorchJob, MXNetJob, XGBoostJob, PaddleJob. By default, all supported schemes will be enabled.")
flag.StringVar(&gangSchedulerName, "gang-scheduler-name", "none", "The scheduler to gang-schedule kubeflow jobs, defaults to none")
flag.StringVar(&gangSchedulerName, "gang-scheduler-name", "", "The scheduler to gang-schedule kubeflow jobs."+
" Now supporting volcano, default-scheduler, scheduler-plugins, koord-scheduler.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
" Now supporting volcano, default-scheduler, scheduler-plugins, koord-scheduler.")
" Now Supporting volcano and scheduler-plugins. Note: If you set another scheduler name, the training-operator assumes it's the scheduler-plugins.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Syulin7 <735122171@qq.com>
@Syulin7 Syulin7 changed the title Support Koordinator Gang Scheduler Plugin Make scheduler-plugins the default gang scheduler. May 20, 2023
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Syulin7 Great! Thank you for the awesome contribution!

/lgtm
/assign @johnugeorge

@tenzen-y
Copy link
Member

/hold cancel

@johnugeorge
Copy link
Member

Thanks @Syulin7

Need to update docs( https://www.kubeflow.org/docs/components/training/job-scheduling/#running-jobs-with-gang-scheduling) regarding this.

/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, Syulin7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 6d6d83d into kubeflow:master May 23, 2023
google-oss-prow bot pushed a commit that referenced this pull request May 31, 2023
* Support for k8s v1.25 in CI

* Support for k8s v1.25 in CI

* Change k8s api to v1.25

* Upgrade golangci-lint version

* Add changelog

* Update CHANGELOG.md

* Update Changelog

* Merge common repo

* Avoid to depend on local env when installing the code-generators (#1810)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Make scheduler-plugins the default gang scheduler. (#1747)

Signed-off-by: Syulin7 <735122171@qq.com>

* Fix tests

* Fix merge conflicts

* Fix CI issues

* Fix CI issues

* Fix review comments

* Add contributors in Readme file

* Fix review comments

* Fix review comments

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Syulin7 <735122171@qq.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: yu lin <37265556+Syulin7@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support koordinator gang scheduler plugin
4 participants