added documentation for RayCluster integration (#1607)
added ray cluster sample yaml file

added more limitations

added version
vicentefb authored Jan 31, 2024
1 parent 0b0961d commit 5ed62e4
Showing 4 changed files with 166 additions and 2 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ Read the [overview](https://kueue.sigs.k8s.io/docs/overview/) to learn more.
 - **Resource management:** Support resource fair sharing and [preemption](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
 - **Dynamic resource reclaim:** A mechanism to [release](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim) quota as the pods of a Job complete.
 - **Resource flavor fungibility:** Quota [borrowing or preemption](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#flavorfungibility) in ClusterQueue and Cohort.
-- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](https://kueue.sigs.k8s.io/docs/tasks/run_jobs/), [Kubeflow training jobs](https://kueue.sigs.k8s.io/docs/tasks/run_kubeflow_jobs/), [RayJob](https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/), [JobSet](https://kueue.sigs.k8s.io/docs/tasks/run_jobsets/), [plain Pod](https://kueue.sigs.k8s.io/docs/tasks/run_plain_pods/).
+- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](https://kueue.sigs.k8s.io/docs/tasks/run_jobs/), [Kubeflow training jobs](https://kueue.sigs.k8s.io/docs/tasks/run_kubeflow_jobs/), [RayJob](https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/), [RayCluster](https://kueue.sigs.k8s.io/docs/tasks/run_rayclusters/), [JobSet](https://kueue.sigs.k8s.io/docs/tasks/run_jobsets/), [plain Pod](https://kueue.sigs.k8s.io/docs/tasks/run_plain_pods/).
 - **System insight:** Build-in [prometheus metrics](https://kueue.sigs.k8s.io/docs/reference/metrics/) to help monitor the state of the system, as well as Conditions.
 - **AdmissionChecks:** A mechanism for internal or external components to influence whether a workload can be [admitted](https://kueue.sigs.k8s.io/docs/concepts/admission_check/).
 - **Advanced autoscaling support:** Integration with cluster-autoscaler's [provisioningRequest](https://kueue.sigs.k8s.io/docs/admission-check-controllers/provisioning/#job-using-a-provisioningrequest) via admissionChecks.
2 changes: 1 addition & 1 deletion site/content/en/docs/overview/_index.md
@@ -28,7 +28,7 @@ A core design principle for Kueue is to avoid duplicating mature functionality i
 - **Resource management:** Support resource fair sharing and [preemption](/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
 - **Dynamic resource reclaim:** A mechanism to [release](/docs/concepts/workload/#dynamic-reclaim) quota as the pods of a Job complete.
 - **Resource flavor fungibility:** Quota [borrowing or preemption](/docs/concepts/cluster_queue/#flavorfungibility) in ClusterQueue and Cohort.
-- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](/docs/tasks/run_jobs/), [Kubeflow training jobs](/docs/tasks/run_kubeflow_jobs/), [RayJob](/docs/tasks/run_rayjobs/), [JobSet](/docs/tasks/run_jobsets/), [plain Pod](/docs/tasks/run_plain_pods/).
+- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](/docs/tasks/run_jobs/), [Kubeflow training jobs](/docs/tasks/run_kubeflow_jobs/), [RayJob](/docs/tasks/run_rayjobs/), [RayCluster](/docs/tasks/run_rayclusters/), [JobSet](/docs/tasks/run_jobsets/), [plain Pod](/docs/tasks/run_plain_pods/).
 - **System insight:** Build-in [prometheus metrics](/docs/reference/metrics/) to help monitor the state of the system, as well as Conditions.
 - **AdmissionChecks:** A mechanism for internal or external components to influence whether a workload can be [admitted](/docs/concepts/admission_check/).
 - **Advanced autoscaling support:** Integration with cluster-autoscaler's [provisioningRequest](/docs/admission-check-controllers/provisioning/#job-using-a-provisioningrequest) via admissionChecks.
92 changes: 92 additions & 0 deletions site/content/en/docs/tasks/run_rayclusters.md
@@ -0,0 +1,92 @@
---
title: "Run A RayCluster"
date: 2024-01-17
weight: 6
description: >
  Run a RayCluster on Kueue.
---

This page shows how to leverage Kueue's scheduling and resource management capabilities when running a [RayCluster](https://docs.ray.io/en/latest/cluster/getting-started.html).

This guide is for [batch users](/docs/tasks#batch-user) that have a basic understanding of Kueue. For more information, see [Kueue's overview](/docs/overview).

## Before you begin

1. Make sure you are using Kueue v0.6.0 or newer and KubeRay v1.1.0 or newer.

2. Check [Administer cluster quotas](/docs/tasks/administer_cluster_quotas) for details on the initial Kueue setup; a minimal sketch follows this list.

3. See [KubeRay Installation](https://ray-project.github.io/kuberay/deploy/installation/) for installation and configuration details of KubeRay.
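
For reference, a minimal single-queue setup might look like the following. This is only a sketch: the names (`default-flavor`, `cluster-queue`, `local-queue`) and the quota values are illustrative assumptions, not requirements; adapt them to your cluster.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {} # match Workloads in all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 9      # illustrative quota
      - name: "memory"
        nominalQuota: 36Gi   # illustrative quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: local-queue
spec:
  clusterQueue: cluster-queue
```

The `local-queue` name defined here is what the RayCluster references in the next section.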

## RayCluster definition

When running [RayClusters](https://docs.ray.io/en/latest/cluster/getting-started.html) on
Kueue, take into consideration the following aspects:

### a. Queue selection

The target [local queue](/docs/concepts/local_queue) should be specified in the `metadata.labels` section of the RayCluster configuration.

```yaml
metadata:
  name: raycluster-sample
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: local-queue
```
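
Once a labeled RayCluster is applied, Kueue creates a corresponding Workload object and keeps the cluster suspended via `spec.suspend` (the reason KubeRay v1.1.0 is required) until quota is available. A quick way to inspect this, assuming the manifest is saved as `raycluster-sample.yaml` (a hypothetical filename):

```shell
kubectl -n default apply -f raycluster-sample.yaml

# The Workload created by Kueue shows whether the cluster was admitted.
kubectl -n default get workloads

# Once admitted, the head and worker Pods are created as usual.
kubectl -n default get pods
```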
### b. Configure the resource needs
The resource needs of the workload can be configured in the `spec`.

```yaml
headGroupSpec:
  template:
    spec:
      affinity: {}
      containers:
      - env: []
        image: rayproject/ray:2.7.0
        imagePullPolicy: IfNotPresent
        name: ray-head
        resources:
          limits:
            cpu: "1"
            memory: 2G
          requests:
            cpu: "1"
            memory: 2G
        securityContext: {}
        volumeMounts:
        - mountPath: /tmp/ray
          name: log-volume
workerGroupSpecs:
- template:
    spec:
      affinity: {}
      containers:
      - env: []
        image: rayproject/ray:2.7.0
        imagePullPolicy: IfNotPresent
        name: ray-worker
        resources:
          limits:
            cpu: "1"
            memory: 1G
          requests:
            cpu: "1"
            memory: 1G
```

Note that a RayCluster will hold resource quotas while it exists. For optimal resource management, you should delete a RayCluster that is no longer in use.
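
For example, a cluster created from the sample manifest in this guide could be cleaned up with (names assumed from this page):

```shell
# Deleting the RayCluster releases its quota back to the ClusterQueue.
kubectl -n default delete raycluster raycluster-sample
```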

### c. Limitations
- Limited Worker Groups: Because a Kueue workload can have a maximum of 8 PodSets, the maximum number of `spec.workerGroupSpecs` is 7.
- In-Tree Autoscaling Disabled: Kueue manages resource allocation for the RayCluster; therefore, the cluster's internal autoscaling mechanisms need to be disabled, as sketched after this list.
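
A minimal sketch of keeping in-tree autoscaling off, using KubeRay's `enableInTreeAutoscaling` field (it defaults to false, so omitting it has the same effect):

```yaml
spec:
  # Kueue owns quota and admission decisions, so the Ray autoscaler stays off.
  enableInTreeAutoscaling: false
```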

## Example RayCluster

The complete RayCluster manifest looks like the following:

{{< include "examples/jobs/ray-cluster-sample.yaml" "yaml" >}}

You can submit a Ray job using the [CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/quickstart.html) or log into the Ray head and execute a job following this [example](https://ray-project.github.io/kuberay/deploy/helm-cluster/#end-to-end-example) with a kind cluster.
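
As a sketch of the CLI path, assuming the sample cluster above (KubeRay exposes the head through a `<cluster-name>-head-svc` Service, and 8265 is Ray's default dashboard and job-server port):

```shell
# Forward the Ray job server port from the head Service to localhost.
kubectl -n default port-forward svc/raycluster-sample-head-svc 8265:8265 &

# Submit a trivial job that prints the cluster's resources.
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
```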
72 changes: 72 additions & 0 deletions site/static/examples/jobs/ray-cluster-sample.yaml
@@ -0,0 +1,72 @@
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: local-queue
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-head
          resources:
            limits:
              cpu: "1"
              memory: 2G
            requests:
              cpu: "1"
              memory: 2G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
  workerGroupSpecs:
  - groupName: workergroup
    maxReplicas: 10
    minReplicas: 1
    rayStartParams: {}
    replicas: 4
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-worker
          resources:
            limits:
              cpu: "1"
              memory: 1G
            requests:
              cpu: "1"
              memory: 1G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
