RayCluster integration with Kueue #1520
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Hi @vicentefb. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Will squash commits once final review is given.
(force-pushed from 32d1e36 to 1fb4a7c)
/ok-to-test
(three resolved review threads on test/integration/controller/jobs/raycluster/raycluster_controller_test.go, now outdated)
(force-pushed from c59763f to 79b174a)
/hold Holding until there's a kuberay release that has ray-project/kuberay#1711
/retest
(force-pushed from 673c99c to 0765c64)
(force-pushed from 13b200f to 24e1dd5)
So it doesn't currently work? I'm rather opposed to merging without it. /hold
gomega.Expect(len(createdJob.Spec.HeadGroupSpec.Template.Spec.NodeSelector)).Should(gomega.Equal(1))
gomega.Expect(createdJob.Spec.HeadGroupSpec.Template.Spec.NodeSelector[instanceKey]).Should(gomega.Equal(onDemandFlavor.Name))
gomega.Expect(len(createdJob.Spec.WorkerGroupSpecs[0].Template.Spec.NodeSelector)).Should(gomega.Equal(1))
gomega.Expect(createdJob.Spec.WorkerGroupSpecs[0].Template.Spec.NodeSelector[instanceKey]).Should(gomega.Equal(spotFlavor.Name))
ok, it is tested here. It would be useful to have this in unit tests, but it's ok to leave for a follow-up.
cc @astefanutti
/hold cancel
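As a possible shape for that follow-up, here is a minimal unit-test sketch. The injectFlavorSelectors helper is a hypothetical stand-in for the reconciler's injection step, not Kueue's actual code, and the flavor values mirror the integration-test assertions quoted above.

```go
package raycluster_test

import (
	"testing"

	"github.com/onsi/gomega"
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	corev1 "k8s.io/api/core/v1"
)

const instanceKey = "instance-type" // same node-selector key the integration test uses

// injectFlavorSelectors mimics, in miniature, what the reconciler does when a
// workload is admitted: copy the assigned flavors' node labels into the head
// and worker pod templates. Hypothetical stand-in, not Kueue's real code path.
func injectFlavorSelectors(rc *rayv1.RayCluster, head, worker map[string]string) {
	rc.Spec.HeadGroupSpec.Template.Spec.NodeSelector = head
	for i := range rc.Spec.WorkerGroupSpecs {
		rc.Spec.WorkerGroupSpecs[i].Template.Spec.NodeSelector = worker
	}
}

func TestFlavorNodeSelectorInjection(t *testing.T) {
	g := gomega.NewWithT(t)

	rc := &rayv1.RayCluster{
		Spec: rayv1.RayClusterSpec{
			HeadGroupSpec:    rayv1.HeadGroupSpec{Template: corev1.PodTemplateSpec{}},
			WorkerGroupSpecs: []rayv1.WorkerGroupSpec{{Template: corev1.PodTemplateSpec{}}},
		},
	}

	injectFlavorSelectors(rc,
		map[string]string{instanceKey: "on-demand"},
		map[string]string{instanceKey: "spot"},
	)

	// Same shape as the integration test: exactly one selector per group,
	// head on the on-demand flavor, workers on the spot flavor.
	g.Expect(rc.Spec.HeadGroupSpec.Template.Spec.NodeSelector).Should(gomega.HaveLen(1))
	g.Expect(rc.Spec.HeadGroupSpec.Template.Spec.NodeSelector[instanceKey]).Should(gomega.Equal("on-demand"))
	g.Expect(rc.Spec.WorkerGroupSpecs[0].Template.Spec.NodeSelector).Should(gomega.HaveLen(1))
	g.Expect(rc.Spec.WorkerGroupSpecs[0].Template.Spec.NodeSelector[instanceKey]).Should(gomega.Equal("spot"))
}
```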
}{
"when workload is admitted, cluster is unsuspended": {
job: *baseJobWrapper.Clone().
NodeSelectorHeadGroup("provisioning", "spot").
Should this be removed from the test input, as it's expected to be added during reconciliation?
Also a flavor needs to be defined, because that's where the selectors are obtained from.
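For readers following the thread: the selectors are copied from the assigned ResourceFlavor's node labels, so the test setup needs flavors along these lines. A minimal sketch against the Kueue v1beta1 API; the names and label values are illustrative assumptions, not taken from this PR.

```go
package raycluster_test

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// Two flavors whose NodeLabels Kueue copies into the pod templates of an
// admitted workload as node selectors. Names and labels are illustrative.
var (
	onDemandFlavor = kueue.ResourceFlavor{
		ObjectMeta: metav1.ObjectMeta{Name: "on-demand"},
		Spec:       kueue.ResourceFlavorSpec{NodeLabels: map[string]string{"instance-type": "on-demand"}},
	}
	spotFlavor = kueue.ResourceFlavor{
		ObjectMeta: metav1.ObjectMeta{Name: "spot"},
		Spec:       kueue.ResourceFlavorSpec{NodeLabels: map[string]string{"instance-type": "spot"}},
	}
)
```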
(force-pushed from f8c09fb to 916328e)
LGTM
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, astefanutti, vicentefb. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm based on #1520 (review)
LGTM label has been added. Git tree hash: 4ec47e5614815d8b5ed50a02d7a1e3b40c34ba50
/hold cancel
The periodic E2E tests are healthy... so could it be that this PR is breaking them? /test pull-kueue-test-e2e-main-1-26
(commit lists from the force-pushes omitted: they repeat the branch's squashed work-in-progress commit messages)
(force-pushed from 916328e to c9080e3)
tests are definitely flaky #1658
/lgtm
LGTM label has been added. Git tree hash: 9882d57638bb7418ba59d98bf166f1829eff6ef9
What type of PR is this?
/kind feature
What this PR does / why we need it:
Support RayCluster as a queue-able workload in Kueue, since many use cases and existing workloads depend on long-lived RayClusters.
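For illustration, opting a RayCluster into Kueue is expected to work like the other job integrations: label it with the target LocalQueue. A minimal sketch; the queue name and object metadata are hypothetical.

```go
package example

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A RayCluster managed by Kueue: the queue-name label ties it to a
// LocalQueue; "user-queue" is an illustrative name.
var queuedCluster = rayv1.RayCluster{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "raycluster-sample",
		Namespace: "default",
		Labels:    map[string]string{"kueue.x-k8s.io/queue-name": "user-queue"},
	},
	// Head and worker group specs omitted for brevity.
}
```

The kueue.x-k8s.io/queue-name label is the same hook the existing Job and RayJob integrations use, which keeps the user-facing surface consistent.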
Which issue(s) this PR fixes:
Fixes #1272
Special notes for your reviewer:
This implementation uses kuberay's master version and the `v1` API version instead of `v1alpha1`, because it relies on the new suspend API for RayCluster (ray-project/kuberay#1667).

Does this PR introduce a user-facing change?
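For context, the suspend semantics the PR builds on: a RayCluster created with spec.suspend set to true gets no pods, and the Kueue reconciler clears the flag once the Workload is admitted. The helper below is an illustrative stand-in for the reconciler, not its actual code.

```go
package example

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"k8s.io/utils/ptr"
)

// unsuspendIfAdmitted sketches the core of the integration: a RayCluster is
// created suspended (kuberay then creates no head or worker pods) and is
// unsuspended only after its Workload is admitted by Kueue.
func unsuspendIfAdmitted(rc *rayv1.RayCluster, admitted bool) {
	if admitted {
		rc.Spec.Suspend = ptr.To(false) // kuberay starts the head and worker pods
		return
	}
	rc.Spec.Suspend = ptr.To(true) // keep the cluster gated until admission
}
```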