-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Apache YuniKorn as one batch scheduler option #2184
Conversation
Hi @yangwwei, thank you for the PR! Are you in the Ray Slack workspace? My Slack handle is "Kai-Hsun Chen (ray team)" We can have a quick sync on Slack to discuss how the KubeRay/Ray community works (e.g., how to propose a new enhancement). |
@kevin85421 please see proposal: ray-project/enhancements#53 |
Hi @yangwwei, I plan to review this PR next week because the REP has already been merged. Is this PR ready for review? I see it is still marked as a draft. |
hi @kevin85421 can you help to review this PR please, thanks! |
I will review the PR tmr. Thanks! |
} | ||
|
||
func (y *YuniKornScheduler) populatePodLabels(app *rayv1.RayCluster, pod *v1.Pod, sourceKey string, targetKey string) { | ||
// check labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to enable users to configure both labels
and annotations
? Maybe annotations
are enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we can do labels only but not the annotations. Pls see doc here: https://yunikorn.apache.org/docs/user_guide/workloads/workload_overview.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably relevant here: https://issues.apache.org/jira/plugins/servlet/mobile#issue/YUNIKORN-1351
Could you (1) fix the CI lint error (install the pre-commit hooks) (2) add some instructions about how do you manually test it with Yunikorn in the PR description? I will also try it manually. Thanks! |
Prerequisits:
Install kuberayThe docker image needs to be pushed to the kind registry first
the log should mention the batch scheduler is enabled:
Install yunikorndoc: https://yunikorn.apache.org/docs/#install, note, I reduced the memory request to fit my local env
TestRun a simple Ray cluster, this is what I was using:
once applied, we should see the pods being scheduled by yunikorn, verify this by describing the head and worker pods, you'll see events like the belowing:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember you mentioning that if we enable the batch scheduler without installing the Volcano CRD, it will report an error. Have we resolved this issue?
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
Outdated
Show resolved
Hide resolved
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
Show resolved
Hide resolved
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
Outdated
Show resolved
Hide resolved
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
Outdated
Show resolved
Hide resolved
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
Outdated
Show resolved
Hide resolved
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler.go
Outdated
Show resolved
Hide resolved
ray-operator/controllers/ray/batchscheduler/yunikorn/yunikorn_scheduler_test.go
Outdated
Show resolved
Hide resolved
Could you fix the lint error? You can refer to this doc to install pre-commit https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md |
Btw, I hope to include this PR in v1.2.0. I will do the branch cut next week. |
Gang scheduling support is not included in this PR yet, I will work on that after this gets merged. I intended to keep the PRs small for easier review.
Yes, thats still an issue. I will work on another PR with the proposed solution. |
Why are these changes needed?
Apache YuniKorn is a widely used batch scheduler for Kubernetes, this PR is to support Apache yunikorn as a option for scheduling Ray workloads.
The integration is very simpler, Apache YuniKorn doesn't require any CR to be created, the changes in the job controller code is to automatically inject required labels to Ray pods, only 2 extra lables are needed
when all pods have the above labels, the yunikorn scheduler will automatically recognize these pods belong to the same Ray application, and schedule them in the given queue. Then the Ray workload can benifit all batch scheduling features yunikorn provided: https://yunikorn.apache.org/docs/next/get_started/core_features
Related issue number
#1457
Checks