
Support Apache YuniKorn as one batch scheduler option #2184

Merged
6 commits merged on Jul 24, 2024

Conversation

yangwwei
Contributor

@yangwwei yangwwei commented Jun 9, 2024

Why are these changes needed?

Apache YuniKorn is a widely used batch scheduler for Kubernetes. This PR adds support for Apache YuniKorn as an option for scheduling Ray workloads.

The integration is very simple. Apache YuniKorn doesn't require any CR to be created; the changes in the job controller code automatically inject the required labels into Ray pods. Only 2 extra labels are needed:

  • yunikorn.apache.org/application-id
  • yunikorn.apache.org/queue-name

When all pods have the above labels, the YuniKorn scheduler will automatically recognize that these pods belong to the same Ray application and schedule them in the given queue. The Ray workload can then benefit from all the batch scheduling features YuniKorn provides: https://yunikorn.apache.org/docs/next/get_started/core_features
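To illustrate, here is a minimal sketch of what this injection amounts to in the operator (constant and function names are illustrative, not necessarily the ones used in this PR):

package yunikorn

import (
    corev1 "k8s.io/api/core/v1"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// Label keys from the YuniKorn docs.
const (
    yuniKornAppIDLabel = "yunikorn.apache.org/application-id"
    yuniKornQueueLabel = "yunikorn.apache.org/queue-name"
)

// addYuniKornLabels copies the YuniKorn labels from the RayCluster onto one of its pods,
// so the scheduler can recognize that all pods belong to the same Ray application.
func addYuniKornLabels(cluster *rayv1.RayCluster, pod *corev1.Pod) {
    if pod.Labels == nil {
        pod.Labels = map[string]string{}
    }
    for _, key := range []string{yuniKornAppIDLabel, yuniKornQueueLabel} {
        if value, ok := cluster.Labels[key]; ok {
            pod.Labels[key] = value
        }
    }
}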

Related issue number

#1457

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • [v] Unit tests
    • [v] Manual tests

@yangwwei yangwwei marked this pull request as draft June 9, 2024 22:34
@kevin85421
Member

kevin85421 commented Jun 10, 2024

Hi @yangwwei, thank you for the PR! Are you in the Ray Slack workspace? My Slack handle is "Kai-Hsun Chen (ray team)". We can have a quick sync on Slack to discuss how the KubeRay/Ray community works (e.g., how to propose a new enhancement).

@yangwwei
Contributor Author

@kevin85421 please see proposal: ray-project/enhancements#53

@kevin85421
Member

Hi @yangwwei, I plan to review this PR next week because the REP has already been merged. Is this PR ready for review? I see it is still marked as a draft.

@yangwwei yangwwei marked this pull request as ready for review July 16, 2024 00:29
@yangwwei
Contributor Author

Hi @kevin85421, can you help review this PR please? Thanks!

@kevin85421
Member

I will review the PR tomorrow. Thanks!

func (y *YuniKornScheduler) populatePodLabels(app *rayv1.RayCluster, pod *v1.Pod, sourceKey string, targetKey string) {
// check labels
Member

Is it necessary to enable users to configure both labels and annotations? Maybe annotations are enough.

Contributor Author

Actually, we can do labels only, not annotations. Please see the doc here: https://yunikorn.apache.org/docs/user_guide/workloads/workload_overview.
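For reference, a labels-only version of this helper could look roughly like the sketch below (the body is only an assumption for illustration, not necessarily the merged implementation):

func (y *YuniKornScheduler) populatePodLabels(app *rayv1.RayCluster, pod *v1.Pod, sourceKey string, targetKey string) {
    // Copy the label value from the RayCluster (sourceKey) onto the pod under targetKey.
    if value, exist := app.Labels[sourceKey]; exist {
        if pod.Labels == nil {
            pod.Labels = make(map[string]string)
        }
        pod.Labels[targetKey] = value
    }
}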


@kevin85421
Member

Could you (1) fix the CI lint error (install the pre-commit hooks) and (2) add some instructions about how you manually tested it with Yunikorn to the PR description? I will also try it manually. Thanks!

@yangwwei
Contributor Author

yangwwei commented Jul 23, 2024

Prerequisites:

  • a local Kind cluster (or a real k8s cluster)
  • ray-operator image built
  • comment out this line to work around this issue:

Install kuberay

The docker image needs to be pushed to the kind registry first

helm install kuberay-operator kuberay/kuberay-operator \
   --version 1.0.0 --set batchScheduler.enabled=true \
   --set image.repository=kind-registry.vsl:5000/kuberay --set image.tag=v1

The log should mention that the batch scheduler is enabled:

{"level":"info","ts":"2024-07-23T22:26:41.945Z","logger":"setup","msg":"Feature flag enable-batch-scheduler is enabled."}
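To check the log (the deployment name below assumes the default chart install):

kubectl logs deployment/kuberay-operator | grep enable-batch-scheduler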

Install yunikorn

Doc: https://yunikorn.apache.org/docs/#install. Note: I reduced the memory requests to fit my local env.

helm repo add yunikorn https://apache.github.io/yunikorn-release
helm repo update
kubectl create namespace yunikorn
helm install yunikorn yunikorn/yunikorn --namespace yunikorn --set resources.requests.memory=200M --set web.resources.requests.memory=50M
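Optionally confirm the scheduler pods are running before continuing:

kubectl get pods -n yunikorn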

Test

Run a simple Ray cluster; this is what I was using:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    meta.helm.sh/release-name: raycluster
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2024-01-12T19:14:07Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: raycluster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kuberay
    helm.sh/chart: ray-cluster-1.0.0
    ray.io/scheduler-name: yunikorn
    yunikorn.apache.org/application-id: my-ray-cluster-0001
  name: raycluster-kuberay
  namespace: default
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      metadata:
        labels:
          app.kubernetes.io/instance: raycluster
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/name: kuberay
          helm.sh/chart: ray-cluster-1.0.0
      spec:
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-head
          resources:
            limits:
              cpu: "1"
            requests:
              cpu: "1"
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        tolerations:
        - effect: NoSchedule
          key: kwok.x-k8s.io/node
          operator: Equal
          value: fake
        volumes:
        - emptyDir: {}
          name: log-volume
  workerGroupSpecs:
  - groupName: workergroup
    rayStartParams: {}
    maxReplicas: 2147483647
    minReplicas: 0
    replicas: 1
    template:
      metadata:
        labels:
          app.kubernetes.io/instance: raycluster
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/name: kuberay
          helm.sh/chart: ray-cluster-1.0.0
      spec:
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-worker
          resources:
            limits:
              cpu: "1"
            requests:
              cpu: "1"
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        tolerations:
        - effect: NoSchedule
          key: kwok.x-k8s.io/node
          operator: Equal
          value: fake
        volumes:
        - emptyDir: {}
          name: log-volume

Once applied, we should see the pods being scheduled by yunikorn. Verify this by describing the head and worker pods; you'll see events like the following:

  Type    Reason             Age   From      Message
  ----    ------             ----  ----      -------
  Normal  Scheduling         14s   yunikorn  default/raycluster-kuberay-head-tvtn4 is queued and waiting for allocation
  Normal  Scheduled          14s   yunikorn  Successfully assigned default/raycluster-kuberay-head-tvtn4 to node kind-worker
  Normal  PodBindSuccessful  14s   yunikorn  Pod default/raycluster-kuberay-head-tvtn4 is successfully bound to node kind-worker
  Normal  Pulling            14s   kubelet   Pulling image "rayproject/ray:2.7.0"
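You can also confirm the injected YuniKorn labels directly on the pods (the head pod name below is the one from the events above):

kubectl get pod raycluster-kuberay-head-tvtn4 --show-labels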

@kevin85421 kevin85421 left a comment

I remember you mentioning that if we enable the batch scheduler without installing the Volcano CRD, it will report an error. Have we resolved this issue?

@kevin85421
Member

Could you fix the lint error? You can refer to this doc to install pre-commit: https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md

@kevin85421
Member

Btw, I hope to include this PR in v1.2.0. I will do the branch cut next week.

@kevin85421
Member

I tested it manually, and I can see events from yunikorn in both head and worker Pods. I am wondering what the scheduling strategy (e.g., gang scheduling) is in this example.

[Screenshots: yunikorn scheduling events on the head and worker Pods]

@yangwwei
Contributor Author

> I tested it manually, and I can see events from yunikorn in both head and worker Pods. I am wondering what the scheduling strategy (e.g., gang scheduling) is in this example.

Gang scheduling support is not included in this PR yet; I will work on that after this gets merged. I intended to keep the PRs small for easier review.

> I remember you mentioning that if we enable the batch scheduler without installing the Volcano CRD, it will report an error. Have we resolved this issue?

Yes, that's still an issue. I will work on another PR with the proposed solution.

@kevin85421 kevin85421 merged commit 72a63ac into ray-project:master Jul 24, 2024
25 checks passed