Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adopting coschduling plugin #1724

Merged
merged 5 commits into from
Jan 21, 2023
Merged

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Jan 14, 2023

What this PR does / why we need it:
Adopts updated kubeflow/common with PodGroupControl designed for more generic gang-schedulers.

Follow up on #1526.

Reviewers can check the coscheduling feature with the scheduler-plugins in the following steps:

$ kind create cluster --image kindest/node:v1.24.7
$ git clone git@github.com:kubernetes-sigs/scheduler-plugins.git
$ cd scheduler-plugins/manifests/install/charts
$ helm install scheduler-plugins as-a-second-scheduler/
$ kubectl apply -k "github.com/tenzen-y/training-operator/manifests/overlays/standalone?ref=coscheduling"
$ kubectl patch -n kubeflow deployments training-operator -p '{"spec":{"template":{"spec":{"containers":[{"command":["/manager","--gang-scheduler-name=scheduler-plugins"],"image":"ytenzen/training-operator:test","name":"training-operator", "imagePullPolicy": "Always"}]}}}}'
$ cat <<EOF | kubectl apply -f -
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 10
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: scheduler-plugins-scheduler
          containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            command:
            - "python"
            - "/var/tf_mnist/mnist_with_summaries.py"
EOF
$ kubectl get pods -n kubeflow
NAME                                 READY   STATUS    RESTARTS   AGE
tfjob-simple-worker-0                0/1     Pending   0          7m37s
tfjob-simple-worker-1                0/1     Pending   0          7m37s
training-operator-67cc5b6584-t7qtl   1/1     Running   0          8m39s
$ kubectl describe pods -n kubeflow tfjob-simple-worker-0 | grep -A10  Events
Events:
  Type     Reason            Age    From                         Message
  ----     ------            ----   ----                         -------
  Warning  FailedScheduling  8m26s  scheduler-plugins-scheduler  0/1 nodes are available: 1 pre-filter pod tfjob-simple-worker-0 cannot find enough sibling pods, current pods number: 1, minMember of group: 10. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling., PodGroup kubeflow/tfjob-simple gets rejected due to Pod tfjob-simple-worker-0 is unschedulable even after PostFilter, not eligible due to failed to read from cycleState
  Warning  FailedScheduling  2m57s  scheduler-plugins-scheduler  0/1 nodes are available: 1 pre-filter pod tfjob-simple-worker-0 cannot find enough sibling pods, current pods number: 2, minMember of group: 10. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling., PodGroup kubeflow/tfjob-simple gets rejected due to Pod tfjob-simple-worker-0 is unschedulable even after PostFilter, not eligible due to failed to read from cycleState

Also, this PR includes HPA changes and upgrading PriorityClass apiVersion.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1722 #1725

Checklist:

  • Docs included if any changes are user facing

@tenzen-y tenzen-y changed the title [WIP] Add scheduler plugin for coscheduling [WIP] Adopting coschduling plugins Jan 14, 2023
@coveralls
Copy link

Pull Request Test Coverage Report for Build 3920653750

  • 16 of 49 (32.65%) changed or added relevant lines in 7 files are covered.
  • 6 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.2%) to 39.383%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controller.v1/mxnet/mxjob_controller.go 0 3 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go 0 3 0.0%
pkg/controller.v1/tensorflow/tfjob_controller.go 4 9 44.44%
pkg/controller.v1/mpi/mpijob_controller.go 6 16 37.5%
pkg/controller.v1/register_controller.go 0 12 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/register_controller.go 1 50.0%
pkg/controller.v1/mpi/mpijob_controller.go 2 77.47%
pkg/common/util/scheduler.go 3 0%
Totals Coverage Status
Change from base Build 3894456879: 0.2%
Covered Lines: 2669
Relevant Lines: 6777

💛 - Coveralls

@coveralls
Copy link

coveralls commented Jan 14, 2023

Pull Request Test Coverage Report for Build 3975832997

  • 38 of 164 (23.17%) changed or added relevant lines in 8 files are covered.
  • 6 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.09%) to 39.145%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go 9 21 42.86%
pkg/controller.v1/pytorch/pytorchjob_controller.go 9 21 42.86%
pkg/controller.v1/register_controller.go 0 12 0.0%
pkg/apis/kubeflow.org/v1/openapi_generated.go 0 15 0.0%
pkg/controller.v1/tensorflow/tfjob_controller.go 9 26 34.62%
pkg/controller.v1/mpi/mpijob_controller.go 11 29 37.93%
pkg/controller.v1/mxnet/mxjob_controller.go 0 20 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go 0 20 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mxnet/mxjob_controller.go 1 0%
pkg/controller.v1/register_controller.go 1 50.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go 1 0%
pkg/common/util/scheduler.go 3 0%
Totals Coverage Status
Change from base Build 3975791653: -0.09%
Covered Lines: 2692
Relevant Lines: 6877

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the coscheduling branch 3 times, most recently from 4ae57d6 to f9fa8ac Compare January 15, 2023 19:24
@tenzen-y tenzen-y changed the title [WIP] Adopting coschduling plugins Adopting coschduling plugins Jan 15, 2023
@tenzen-y
Copy link
Member Author

/hold for kubeflow/common#203

@tenzen-y
Copy link
Member Author

This PR is ready for review. PTAL.
/assign @johnugeorge @zw0610 @terrytangyuan @gaocegege

go.mod Outdated Show resolved Hide resolved
@tenzen-y
Copy link
Member Author

@zw0610 @johnugeorge @terrytangyuan Please take a look. I would like to include PodGroup with the scheduler-plugin feature in this release.

@johnugeorge
Copy link
Member

@tenzen-y What is the best way to add e2e test for this?

Also, can you update docs in a separate PR? (In kubeflow/website?)

@tenzen-y
Copy link
Member Author

What is the best way to add e2e test for this?

Maybe, we could run e2e test by updating integration-tests.yaml like the following:

...
jobs:
  integration-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.23.12", "v1.24.6", "v1.25.2"]
+       gang-scheduler-name: ["none", "scheduler-plugins"]
    steps:
      - name: Checkout
...
      - name: Deploy training operator
        run: |
          ./scripts/gha/setup-training-operator.sh
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
+         GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
...

Also, can you update docs in a separate PR? (In kubeflow/website?)

Yes, sure.

@johnugeorge

@johnugeorge
Copy link
Member

@tenzen-y You can use this release https://github.com/kubeflow/common/releases/tag/v0.4.6

Note: You need to take changes for HPA fixes from #1732 as the release contains HPA selector field change as well.

@tenzen-y
Copy link
Member Author

@tenzen-y You can use this release https://github.com/kubeflow/common/releases/tag/v0.4.6

Note: You need to take changes for HPA fixes from #1732 as the release contains HPA selector field change as well.

@johnugeorge Thanks for letting me know.
I will upgrade kubeflow/common version after #1732 is merged.

@tenzen-y
Copy link
Member Author

@johnugeorge Completed.

@johnugeorge
Copy link
Member

What is the best way to add e2e test for this?

Maybe, we could run e2e test by updating integration-tests.yaml like the following:

...
jobs:
  integration-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.23.12", "v1.24.6", "v1.25.2"]
+       gang-scheduler-name: ["none", "scheduler-plugins"]
    steps:
      - name: Checkout
...
      - name: Deploy training operator
        run: |
          ./scripts/gha/setup-training-operator.sh
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
+         GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
...

Also, can you update docs in a separate PR? (In kubeflow/website?)

Yes, sure.

@johnugeorge

Do you plan to add CI tests in this PR or separate one ?

@tenzen-y
Copy link
Member Author

What is the best way to add e2e test for this?

Maybe, we could run e2e test by updating integration-tests.yaml like the following:

...
jobs:
  integration-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.23.12", "v1.24.6", "v1.25.2"]
+       gang-scheduler-name: ["none", "scheduler-plugins"]
    steps:
      - name: Checkout
...
      - name: Deploy training operator
        run: |
          ./scripts/gha/setup-training-operator.sh
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
+         GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
...

Also, can you update docs in a separate PR? (In kubeflow/website?)

Yes, sure.

@johnugeorge

@johnugeorge I will work on E2E on another PR.

tenzen-y and others added 4 commits January 22, 2023 02:19
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Wang Zhang <zw199006@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
…1beta1 PodGroup

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y
Copy link
Member Author

@johnugeorge Rebased.

@johnugeorge
Copy link
Member

Great work @tenzen-y Thank you for adding this feature

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Jan 21, 2023
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member Author

Great work @tenzen-y Thank you for adding this feature

/lgtm /approve

@johnugeorge
Thanks for the quick and detailed review!

I'll create a PR to add E2E test for scheduler-plugins by the end of Monday.

@tenzen-y
Copy link
Member Author

/hold cancel

@tenzen-y tenzen-y changed the title Adopting coschduling plugins Adopting coschduling plugin Jan 21, 2023
@google-oss-prow google-oss-prow bot merged commit 0126d96 into kubeflow:master Jan 21, 2023
@tenzen-y tenzen-y deleted the coscheduling branch January 21, 2023 18:44
@tenzen-y tenzen-y mentioned this pull request Jan 22, 2023
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support coscheduling plugin
6 participants