RayCluster integration with Kueue #1520

Merged
merged 2 commits into from
Jan 26, 2024

Conversation

vicentefb
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Support RayCluster as a queue-able workload in Kueue since there are many use-cases and existing workloads that depend on long-lived RayClusters.

Which issue(s) this PR fixes:

Fixes #1272

Special notes for your reviewer:

This implementation uses the master branch of KubeRay and the v1 API version instead of v1alpha1, because it relies on the new suspend API for RayCluster (ray-project/kuberay#1667).
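
For readers unfamiliar with suspend-based queueing, here is a minimal sketch of the general idea, not the code in this PR. `RayClusterSpec`, `IsSuspended`, and `SetSuspended` are hypothetical stand-ins written for illustration; the only assumption taken from the suspend API is that the spec carries an optional boolean suspend field.

```go
package main

import "fmt"

// RayClusterSpec is a simplified, hypothetical stand-in for the KubeRay v1
// RayCluster spec. Only the suspend field discussed above is modeled, and it
// is assumed to be an optional boolean.
type RayClusterSpec struct {
	Suspend *bool
}

// IsSuspended reports whether the cluster is currently held back.
// A nil pointer is treated as "not suspended".
func IsSuspended(spec RayClusterSpec) bool {
	return spec.Suspend != nil && *spec.Suspend
}

// SetSuspended records the desired suspend state on the spec.
func SetSuspended(spec *RayClusterSpec, suspended bool) {
	spec.Suspend = &suspended
}

func main() {
	var spec RayClusterSpec

	// While the workload waits for quota, the cluster stays suspended.
	SetSuspended(&spec, true)
	fmt.Println("suspended while queued:", IsSuspended(spec))

	// Once the workload is admitted, the cluster is unsuspended.
	SetSuspended(&spec, false)
	fmt.Println("suspended after admission:", IsSuspended(spec))
}
```

The pointer lets an unset field be distinguished from an explicit `false`, which is why the sketch treats nil as "not suspended".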

Does this PR introduce a user-facing change?

Support RayCluster as a queue-able workload in Kueue

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 27, 2023

netlify bot commented Dec 27, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit c9080e3
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65b3f6f0609d1d00082fe61c

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 27, 2023
@k8s-ci-robot
Contributor

Hi @vicentefb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 27, 2023
@vicentefb
Contributor Author

Will squash commits once final review is given.

@tenzen-y
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2023
@vicentefb vicentefb force-pushed the KueueRayCluster branch 2 times, most recently from c59763f to 79b174a Compare December 27, 2023 20:16
@andrewsykim
Member

/hold

Holding until there's a kuberay release that has ray-project/kuberay#1711

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 28, 2023
@vicentefb
Contributor Author

/retest

@vicentefb vicentefb force-pushed the KueueRayCluster branch 3 times, most recently from 673c99c to 0765c64 Compare January 2, 2024 20:57
@vicentefb vicentefb changed the title [WIP] RayCluster integration with Kueue RayCluster integration with Kueue Jan 3, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2024
@astefanutti
Member

Did you mean the feature or the test?
Node selectors are fundamental to the way flavors work, so we shouldn't miss the feature. I would be cautious about leaving the unit test for later.

I meant the feature. But I agree having the unit test and the feature in the upcoming v0.6.0 release would be ideal.

@alculquicondor
Contributor

So it doesn't currently work? I'm rather opposed to merging without it.

/hold

Comment on lines +197 to +200
gomega.Expect(len(createdJob.Spec.HeadGroupSpec.Template.Spec.NodeSelector)).Should(gomega.Equal(1))
gomega.Expect(createdJob.Spec.HeadGroupSpec.Template.Spec.NodeSelector[instanceKey]).Should(gomega.Equal(onDemandFlavor.Name))
gomega.Expect(len(createdJob.Spec.WorkerGroupSpecs[0].Template.Spec.NodeSelector)).Should(gomega.Equal(1))
gomega.Expect(createdJob.Spec.WorkerGroupSpecs[0].Template.Spec.NodeSelector[instanceKey]).Should(gomega.Equal(spotFlavor.Name))
Contributor

OK, it is tested here. It would be useful to have this in unit tests, but it's fine to leave that for a follow-up.

cc @astefanutti

/hold cancel.

}{
"when workload is admitted, cluster is unsuspended": {
job: *baseJobWrapper.Clone().
NodeSelectorHeadGroup("provisioning", "spot").
Member

Should this be removed from the test input, as it's expected to be added during the reconciliation?

Contributor

Also a flavor needs to be defined, because that's where the selectors are obtained from.
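
To illustrate the point in this thread, here is a rough sketch of how node selectors can be derived from a flavor rather than set in the test input. `Flavor`, `PodGroup`, and `InjectNodeSelectors` are hypothetical names invented for this example and are not Kueue's actual types or helpers; the `provisioning`/`spot` pair is borrowed from the test case quoted above.

```go
package main

import "fmt"

// Flavor is a hypothetical, simplified stand-in for a resource flavor:
// a name plus the node labels that identify the nodes it covers.
type Flavor struct {
	Name       string
	NodeLabels map[string]string
}

// PodGroup is a hypothetical stand-in for one RayCluster pod template
// (the head group or one worker group).
type PodGroup struct {
	Name         string
	NodeSelector map[string]string
}

// InjectNodeSelectors merges the flavor's node labels into the group's
// existing node selector and returns the result as a new map.
// The group's own map is left unchanged; flavor labels win on key conflicts.
func InjectNodeSelectors(group PodGroup, flavor Flavor) map[string]string {
	merged := make(map[string]string, len(group.NodeSelector)+len(flavor.NodeLabels))
	for k, v := range group.NodeSelector {
		merged[k] = v
	}
	for k, v := range flavor.NodeLabels {
		merged[k] = v
	}
	return merged
}

func main() {
	spot := Flavor{Name: "spot", NodeLabels: map[string]string{"provisioning": "spot"}}

	// The input has no node selector; it only appears after injection.
	head := PodGroup{Name: "head"}
	head.NodeSelector = InjectNodeSelectors(head, spot)

	fmt.Println(head.Name, head.NodeSelector)
}
```

With this shape, a unit test can start from a group without selectors, define a flavor, and assert that the selector shows up only after the injection step.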

@vicentefb vicentefb force-pushed the KueueRayCluster branch 4 times, most recently from f8c09fb to 916328e Compare January 25, 2024 22:01
Member

@astefanutti astefanutti left a comment

LGTM

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, astefanutti, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Contributor

/lgtm

based on #1520 (review)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 4ec47e5614815d8b5ed50a02d7a1e3b40c34ba50

@vicentefb
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 26, 2024
@alculquicondor
Contributor

The periodic E2E tests are healthy... so could it be that this PR is breaking them?

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

updated kubebuilder markers but still workload isn't created

updating branch

update

unit tests working ?

changes

ray cluster admitted as a workload and running, it has a lot of debug log lines that need to be removed

added comma

fixed helper method, need to implement tests

changes v1alpha1 to v1 but still workload isn't created

pods are not getting suspended :(

charts

WIP admit RayCluster as kueuable workload

updated kubebuilder markers but still workload isn't created

updating branch

update

unit tests working ?

changes

ray cluster admitted as a workload and running, it has a lot of debug log lines that need to be removed

fixed helper method, need to implement tests

changes v1alpha1 to v1 but still workload isn't created

updated wrappers

not all tests are passing but still working on them

debugging podReady test

update

updated controller and webhook tests, now working

updated raycluster webhook test

added ray cluster sample yaml

updated go modules since this is using kuberay masters version

removing diffs from reconciler file

removing diffs from reconciler file

removing diffs from reconciler file

updated role yaml file

added TODO comment for autoscaler

updated raycluster controller test

added scheme for v1 inside register file

update rayjob import to reference v1 otherwise tests are not passing in PR

changed the order of jobs list

updated pod controller api version to be v1 instead of v1alpha1

update go files

fixed pull kueue test

reverted changes made to rayjob import library version

fixed rayjob tests

updated example raycluster

nit

updated ray cluster controller unit test and wrapper

updated tests and charts

updated ownerReference for rayJob and rayCluster

removed extra configuration for pods and duplicated text generated by script

added third argument for reconciler
reverted git tag change

moved the sample yaml file to a different branch

addressed comments and used generalised method call in reconciler to check ownership

updated new reconciler variable

addressed comments

nit

removed register changes

revert changes to go dependencies to test something

updated go files

nit

added schema

nit

added files generated by make verify

update charts

test

test

updated register

test

fixed test added builder

n

update

fix go modules

fixed test

nit

added case for ray cluster completion

nit

debugging

added test for coverage

updated the restore node test

nit

addressed comments

update

added scheme for rayv1 in tests

nit

updated test to inject node selector with flavor defined

nit

nit
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2024
@alculquicondor
Contributor

tests are definitely flaky #1658

@alculquicondor
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 9882d57638bb7418ba59d98bf166f1829eff6ef9

@k8s-ci-robot k8s-ci-robot merged commit 3b37fbf into kubernetes-sigs:main Jan 26, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Jan 26, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Add support for RayCluster
6 participants