RayCluster integration with Kueue (#1520)
* WIP admit RayCluster as kueuable workload

updated kubebuilder markers but workload still isn't created

updating branch

update

unit tests working ?

changes

ray cluster admitted as a workload and running, it has a lot of debug log lines that need to be removed

added comma

fixed helper method, need to implement tests

changed v1alpha1 to v1 but workload still isn't created

pods are not getting suspended :(

charts

updated wrappers

not all tests are passing but still working on them

debugging podReady test

update

updated controller and webhook tests, now working

updated raycluster webhook test

added ray cluster sample yaml

updated go modules since this uses kuberay's master version

removing diffs from reconciler file

removing diffs from reconciler file

removing diffs from reconciler file

updated role yaml file

added TODO comment for autoscaler

updated raycluster controller test

added scheme for v1 inside register file

update rayjob import to reference v1 otherwise tests are not passing in PR

changed the order of jobs list

updated pod controller api version to be v1 instead of v1alpha1

update go files

fixed pull kueue test

reverted changes made to rayjob import library version

fixed rayjob tests

updated example raycluster

nit

updated ray cluster controller unit test and wrapper

updated tests and charts

updated ownerReference for rayJob and rayCluster

removed extra configuration for pods and duplicated text generated by script

added third argument for reconciler

* added logic to ignore rayclusters created by a RayJob

reverted git tag change

moved the sample yaml file to a different branch

addressed comments and used generalised method call in reconciler to check ownership

updated new reconciler variable

addressed comments

nit

removed register changes

revert changes to go dependencies to test something

updated go files

nit

added schema

nit

added files generated by make verify

update charts

test

test

updated register

test

fixed test added builder

n

update

fix go modules

fixed test

nit

added case for ray cluster completion

nit

debugging

added test for coverage

updated the restore node test

nit

addressed comments

update

added scheme for rayv1 in tests

nit

updated test to inject node selector with flavor defined

nit

nit
vicentefb committed Jan 26, 2024
1 parent 59980e2 commit 3b37fbf
Showing 27 changed files with 2,223 additions and 13 deletions.
1 change: 1 addition & 0 deletions apis/config/v1beta1/configuration_types.go
@@ -240,6 +240,7 @@ type Integrations struct {
 	// - "batch/job"
 	// - "kubeflow.org/mpijob"
 	// - "ray.io/rayjob"
+	// - "ray.io/raycluster"
 	// - "jobset.x-k8s.io/jobset"
 	// - "kubeflow.org/mxjob"
 	// - "kubeflow.org/paddlejob"
27 changes: 27 additions & 0 deletions charts/kueue/templates/rbac/raycluster_editor_role.yaml
@@ -0,0 +1,27 @@
+# permissions for end users to edit jobs.
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: '{{ include "kueue.fullname" . }}-raycluster-editor-role'
+  labels:
+    rbac.kueue.x-k8s.io/batch-admin: "true"
+    rbac.kueue.x-k8s.io/batch-user: "true"
+rules:
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters
+  verbs:
+  - create
+  - delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/status
+  verbs:
+  - get
22 changes: 22 additions & 0 deletions charts/kueue/templates/rbac/raycluster_viewer_role.yaml
@@ -0,0 +1,22 @@
+# permissions for end users to view jobs.
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: '{{ include "kueue.fullname" . }}-raycluster-viewer-role'
+  labels:
+    rbac.kueue.x-k8s.io/batch-admin: "true"
+rules:
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/status
+  verbs:
+  - get
24 changes: 24 additions & 0 deletions charts/kueue/templates/rbac/role.yaml
@@ -474,6 +474,30 @@ rules:
   - get
   - list
   - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters
+  verbs:
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/finalizers
+  verbs:
+  - get
+  - update
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/status
+  verbs:
+  - get
+  - update
 - apiGroups:
   - ray.io
   resources:
20 changes: 20 additions & 0 deletions charts/kueue/templates/webhook/webhook.yaml
@@ -595,3 +595,23 @@ webhooks:
     resources:
     - pods
   sideEffects: None
+- admissionReviewVersions:
+  - v1
+  clientConfig:
+    service:
+      name: '{{ include "kueue.fullname" . }}-webhook-service'
+      namespace: '{{ .Release.Namespace }}'
+      path: /validate-ray-io-v1-raycluster
+  failurePolicy: Fail
+  name: vraycluster.kb.io
+  rules:
+  - apiGroups:
+    - ray.io
+    apiVersions:
+    - v1
+    operations:
+    - CREATE
+    - UPDATE
+    resources:
+    - rayclusters
+  sideEffects: None
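The webhook path registered above reads as `/validate-<group>-<version>-<kind>`, with the dots in the API group turned into dashes and the kind lowercased; the mutating webhook in this commit uses the same shape with a `mutate` prefix. A minimal sketch of that naming, assuming the pattern holds in general. `webhookPath` is a hypothetical helper written for illustration, not a function from Kueue or controller-runtime:

```go
package main

import (
	"fmt"
	"strings"
)

// webhookPath is a hypothetical helper that builds a path of the form
// /<prefix>-<group with dots as dashes>-<version>-<lowercased kind>.
func webhookPath(prefix, group, version, kind string) string {
	return "/" + prefix + "-" +
		strings.ReplaceAll(group, ".", "-") + "-" +
		version + "-" +
		strings.ToLower(kind)
}

func main() {
	// The group, version and kind are taken from the webhook rules in the diff.
	fmt.Println(webhookPath("validate", "ray.io", "v1", "RayCluster"))
	fmt.Println(webhookPath("mutate", "ray.io", "v1", "RayCluster"))
}
```

Comparing the generated string with the `path` field is a quick way to spot a webhook entry whose group, version, or kind was not updated together with the API version.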
1 change: 1 addition & 0 deletions charts/kueue/values.yaml
@@ -91,6 +91,7 @@ managerConfig:
       - "batch/job"
       - "kubeflow.org/mpijob"
       - "ray.io/rayjob"
+      - "ray.io/raycluster"
       - "jobset.x-k8s.io/jobset"
       - "kubeflow.org/mxjob"
       - "kubeflow.org/paddlejob"
2 changes: 1 addition & 1 deletion cmd/kueue/main_test.go
@@ -122,7 +122,7 @@ integrations:
 		{
 			name: "bad integrations config",
 			configFile: badIntegrationsConfig,
-			wantError: fmt.Errorf("integrations.frameworks: Unsupported value: \"unregistered/jobframework\": supported values: \"batch/job\", \"jobset.x-k8s.io/jobset\", \"kubeflow.org/mpijob\", \"kubeflow.org/mxjob\", \"kubeflow.org/paddlejob\", \"kubeflow.org/pytorchjob\", \"kubeflow.org/tfjob\", \"kubeflow.org/xgboostjob\", \"pod\", \"ray.io/rayjob\""),
+			wantError: fmt.Errorf("integrations.frameworks: Unsupported value: \"unregistered/jobframework\": supported values: \"batch/job\", \"jobset.x-k8s.io/jobset\", \"kubeflow.org/mpijob\", \"kubeflow.org/mxjob\", \"kubeflow.org/paddlejob\", \"kubeflow.org/pytorchjob\", \"kubeflow.org/tfjob\", \"kubeflow.org/xgboostjob\", \"pod\", \"ray.io/raycluster\", \"ray.io/rayjob\""),
 		},
 	}
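The expected error in this test lists the supported frameworks in lexicographic order, which is why "ray.io/raycluster" appears between "pod" and "ray.io/rayjob". A minimal sketch of how such a list could be produced, assuming the names are simply sorted and quoted. `supportedValues` is a hypothetical helper, not Kueue's validation code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// supportedValues is a hypothetical helper that returns the names sorted
// lexicographically, each quoted, joined with ", ".
func supportedValues(names []string) string {
	sorted := append([]string(nil), names...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	quoted := make([]string, len(sorted))
	for i, n := range sorted {
		quoted[i] = fmt.Sprintf("%q", n)
	}
	return strings.Join(quoted, ", ")
}

func main() {
	// A subset of the framework names from the diff, in arbitrary order.
	fmt.Println(supportedValues([]string{
		"ray.io/rayjob", "pod", "ray.io/raycluster", "batch/job",
	}))
}
```

Because the order is derived rather than hand-written, registering a new integration changes the expected message in exactly one position, as in the one-line change to `wantError` here.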
1 change: 1 addition & 0 deletions config/components/manager/controller_manager_config.yaml
@@ -34,6 +34,7 @@ integrations:
   - "batch/job"
   - "kubeflow.org/mpijob"
   - "ray.io/rayjob"
+  - "ray.io/raycluster"
   - "jobset.x-k8s.io/jobset"
   - "kubeflow.org/mxjob"
   - "kubeflow.org/paddlejob"
27 changes: 27 additions & 0 deletions config/components/rbac/raycluster_editor_role.yaml
@@ -0,0 +1,27 @@
+# permissions for end users to edit jobs.
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: raycluster-editor-role
+  labels:
+    rbac.kueue.x-k8s.io/batch-admin: "true"
+    rbac.kueue.x-k8s.io/batch-user: "true"
+rules:
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters
+  verbs:
+  - create
+  - delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/status
+  verbs:
+  - get
22 changes: 22 additions & 0 deletions config/components/rbac/raycluster_viewer_role.yaml
@@ -0,0 +1,22 @@
+# permissions for end users to view jobs.
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: raycluster-viewer-role
+  labels:
+    rbac.kueue.x-k8s.io/batch-admin: "true"
+rules:
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/status
+  verbs:
+  - get
24 changes: 24 additions & 0 deletions config/components/rbac/role.yaml
@@ -475,6 +475,30 @@ rules:
   - get
   - list
   - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters
+  verbs:
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/finalizers
+  verbs:
+  - get
+  - update
+- apiGroups:
+  - ray.io
+  resources:
+  - rayclusters/status
+  verbs:
+  - get
+  - update
 - apiGroups:
   - ray.io
   resources:
39 changes: 39 additions & 0 deletions config/components/webhook/manifests.yaml
@@ -175,6 +175,25 @@ webhooks:
     resources:
     - pods
   sideEffects: None
+- admissionReviewVersions:
+  - v1
+  clientConfig:
+    service:
+      name: webhook-service
+      namespace: system
+      path: /mutate-ray-io-v1-raycluster
+  failurePolicy: Fail
+  name: mraycluster.kb.io
+  rules:
+  - apiGroups:
+    - ray.io
+    apiVersions:
+    - v1
+    operations:
+    - CREATE
+    resources:
+    - rayclusters
+  sideEffects: None
 - admissionReviewVersions:
   - v1
   clientConfig:
@@ -438,6 +457,26 @@ webhooks:
     resources:
     - pods
   sideEffects: None
+- admissionReviewVersions:
+  - v1
+  clientConfig:
+    service:
+      name: webhook-service
+      namespace: system
+      path: /validate-ray-io-v1-raycluster
+  failurePolicy: Fail
+  name: vraycluster.kb.io
+  rules:
+  - apiGroups:
+    - ray.io
+    apiVersions:
+    - v1
+    operations:
+    - CREATE
+    - UPDATE
+    resources:
+    - rayclusters
+  sideEffects: None
 - admissionReviewVersions:
   - v1
   clientConfig:
4 changes: 2 additions & 2 deletions go.mod
@@ -13,7 +13,7 @@ require (
 	github.com/open-policy-agent/cert-controller v0.10.1
 	github.com/prometheus/client_golang v1.18.0
 	github.com/prometheus/client_model v0.5.0
-	github.com/ray-project/kuberay/ray-operator v1.0.0
+	github.com/ray-project/kuberay/ray-operator v0.0.0-20240120000125-c45d959a2e14
 	go.uber.org/zap v1.26.0
 	k8s.io/api v0.28.6
 	k8s.io/apimachinery v0.28.6
@@ -117,7 +117,7 @@ require (
 	gopkg.in/natefinch/lumberjack.v2 v2.2.1 // indirect
 	gopkg.in/yaml.v2 v2.4.0 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
-	k8s.io/apiextensions-apiserver v0.28.3 // indirect
+	k8s.io/apiextensions-apiserver v0.28.4 // indirect
 	k8s.io/gengo v0.0.0-20230829151522-9cce18d56c01 // indirect
 	k8s.io/kms v0.28.6 // indirect
 	sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.1.2 // indirect
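The new kuberay requirement in go.mod is a Go module pseudo-version rather than a tagged release: it carries a base version, a UTC commit timestamp, and a 12-character commit hash prefix. A small sketch that splits one apart. `parsePseudoVersion` is a hypothetical helper and only handles the simple `vX.Y.Z-timestamp-hash` form that appears in this diff:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// parsePseudoVersion is a hypothetical helper for the simple
// "vX.Y.Z-yyyymmddhhmmss-hash" form; other pseudo-version forms are rejected.
func parsePseudoVersion(v string) (base string, ts time.Time, hash string, err error) {
	parts := strings.Split(v, "-")
	if len(parts) != 3 {
		return "", time.Time{}, "", fmt.Errorf("unexpected pseudo-version form: %q", v)
	}
	// The timestamp is fourteen digits, interpreted as UTC.
	ts, err = time.Parse("20060102150405", parts[1])
	if err != nil {
		return "", time.Time{}, "", err
	}
	return parts[0], ts, parts[2], nil
}

func main() {
	base, ts, hash, err := parsePseudoVersion("v0.0.0-20240120000125-c45d959a2e14")
	if err != nil {
		panic(err)
	}
	fmt.Println(base, ts.Format(time.RFC3339), hash)
}
```

Read this way, the dependency points at kuberay commit `c45d959a2e14` from 20 January 2024, which matches the commit-message note about building against kuberay's master branch.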
8 changes: 4 additions & 4 deletions go.sum
@@ -281,8 +281,8 @@ github.com/prometheus/common v0.45.0 h1:2BGz0eBc2hdMDLnO/8n0jeB3oPrt2D08CekT0lne
 github.com/prometheus/common v0.45.0/go.mod h1:YJmSTw9BoKxJplESWWxlbyttQR4uaEcGyv9MZjVOJsY=
 github.com/prometheus/procfs v0.12.0 h1:jluTpSng7V9hY0O2R9DzzJHYb2xULk9VTR1V1R/k6Bo=
 github.com/prometheus/procfs v0.12.0/go.mod h1:pcuDEFsWDnvcgNzo4EEweacyhjeA9Zk3cnaOZAZEfOo=
-github.com/ray-project/kuberay/ray-operator v1.0.0 h1:i69nvbV7az2FG41VHQgxrmhD+SUl8ca+ek4RPbSE2Q0=
-github.com/ray-project/kuberay/ray-operator v1.0.0/go.mod h1:7C7ebIkxtkmOX8w1iiLrKM1j4hkZs/Guzm3WdePk/yg=
+github.com/ray-project/kuberay/ray-operator v0.0.0-20240120000125-c45d959a2e14 h1:V2Wux1nlt/fp9YURKxJ1Lg6/+m8poFBm7hXz8N2fVbA=
+github.com/ray-project/kuberay/ray-operator v0.0.0-20240120000125-c45d959a2e14/go.mod h1:C96fIymVf98OVpZ8P1PiN9p+cM9OieL5JgyybA6QDp4=
 github.com/rogpeppe/fastuuid v1.2.0/go.mod h1:jVj6XXZzXRy/MSR5jhDC/2q6DgLz+nrA6LYCDYWNEvQ=
 github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFRclV5y23lUDJ4=
 github.com/rogpeppe/go-internal v1.10.0 h1:TMyTOH3F/DB16zRVcYyreMH6GnZZrwQVAoYjRBZyWFQ=
@@ -702,8 +702,8 @@ honnef.co/go/tools v0.0.1-2020.1.3/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9
 honnef.co/go/tools v0.0.1-2020.1.4/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9vFzvIQ3k=
 k8s.io/api v0.28.6 h1:yy6u9CuIhmg55YvF/BavPBBXB+5QicB64njJXxVnzLo=
 k8s.io/api v0.28.6/go.mod h1:AM6Ys6g9MY3dl/XNaNfg/GePI0FT7WBGu8efU/lirAo=
-k8s.io/apiextensions-apiserver v0.28.3 h1:Od7DEnhXHnHPZG+W9I97/fSQkVpVPQx2diy+2EtmY08=
-k8s.io/apiextensions-apiserver v0.28.3/go.mod h1:NE1XJZ4On0hS11aWWJUTNkmVB03j9LM7gJSisbRt8Lc=
+k8s.io/apiextensions-apiserver v0.28.4 h1:AZpKY/7wQ8n+ZYDtNHbAJBb+N4AXXJvyZx6ww6yAJvU=
+k8s.io/apiextensions-apiserver v0.28.4/go.mod h1:pgQIZ1U8eJSMQcENew/0ShUTlePcSGFq6dxSxf2mwPM=
 k8s.io/apimachinery v0.28.6 h1:RsTeR4z6S07srPg6XYrwXpTJVMXsjPXn0ODakMytSW0=
 k8s.io/apimachinery v0.28.6/go.mod h1:QFNX/kCl/EMT2WTSz8k4WLCv2XnkOLMaL8GAVRMdpsA=
 k8s.io/apiserver v0.28.6 h1:SfS5v4I5UGvh0q/1rzvNwLFsK+r7YzcsixnUc0NwoEk=
3 changes: 2 additions & 1 deletion pkg/controller/jobframework/setup_test.go
@@ -25,6 +25,7 @@ import (
 	"github.com/google/go-cmp/cmp/cmpopts"
 	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
 	kftraining "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
+	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
 	rayjobapi "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
 	batchv1 "k8s.io/api/batch/v1"
 	corev1 "k8s.io/api/core/v1"
@@ -74,7 +75,7 @@ func TestSetupControllers(t *testing.T) {
 	for name, tc := range cases {
 		t.Run(name, func(t *testing.T) {
 			_, logger := utiltesting.ContextWithLog(t)
-			k8sClient := utiltesting.NewClientBuilder(jobset.AddToScheme, kubeflow.AddToScheme, rayjobapi.AddToScheme, kftraining.AddToScheme).Build()
+			k8sClient := utiltesting.NewClientBuilder(jobset.AddToScheme, kubeflow.AddToScheme, rayjobapi.AddToScheme, kftraining.AddToScheme, rayv1.AddToScheme).Build()

 			mgrOpts := ctrlmgr.Options{
 				Scheme: k8sClient.Scheme(),
1 change: 1 addition & 0 deletions pkg/controller/jobs/jobs.go
@@ -23,5 +23,6 @@ import (
 	_ "sigs.k8s.io/kueue/pkg/controller/jobs/kubeflow/jobs"
 	_ "sigs.k8s.io/kueue/pkg/controller/jobs/mpijob"
 	_ "sigs.k8s.io/kueue/pkg/controller/jobs/pod"
+	_ "sigs.k8s.io/kueue/pkg/controller/jobs/raycluster"
 	_ "sigs.k8s.io/kueue/pkg/controller/jobs/rayjob"
 )
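The blank import added to jobs.go pulls in the raycluster package only for its side effects. In Go that usually means the package registers itself from an `init` function, and this sketch assumes that is what happens here. It shows the pattern in a single file; `registry`, `register`, and `registered` are illustrative names, not Kueue's jobframework API:

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a hypothetical stand-in for a framework registry.
// Package-level variables are initialized before any init function runs.
var registry = map[string]bool{}

// register records an integration name in the registry.
func register(name string) { registry[name] = true }

// init plays the role of the init functions in the blank-imported
// integration packages: it runs before main without being called explicitly.
func init() {
	register("ray.io/raycluster")
	register("ray.io/rayjob")
}

// registered returns the registered names in sorted order.
func registered() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println(registered())
}
```

Under that assumption, leaving out the one-line import would mean the integration never registers, so "ray.io/raycluster" would be rejected as an unsupported framework even with the rest of the commit in place.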