added documentation for RayCluster integration (#1607)
added ray cluster sample yaml file

added more limitations

added version
vicentefb authored Jan 31, 2024
1 parent 0b0961d commit 5ed62e4
Showing 4 changed files with 166 additions and 2 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ Read the [overview](https://kueue.sigs.k8s.io/docs/overview/) to learn more.
 - **Resource management:** Support resource fair sharing and [preemption](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
 - **Dynamic resource reclaim:** A mechanism to [release](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim) quota as the pods of a Job complete.
 - **Resource flavor fungibility:** Quota [borrowing or preemption](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#flavorfungibility) in ClusterQueue and Cohort.
-- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](https://kueue.sigs.k8s.io/docs/tasks/run_jobs/), [Kubeflow training jobs](https://kueue.sigs.k8s.io/docs/tasks/run_kubeflow_jobs/), [RayJob](https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/), [JobSet](https://kueue.sigs.k8s.io/docs/tasks/run_jobsets/), [plain Pod](https://kueue.sigs.k8s.io/docs/tasks/run_plain_pods/).
+- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](https://kueue.sigs.k8s.io/docs/tasks/run_jobs/), [Kubeflow training jobs](https://kueue.sigs.k8s.io/docs/tasks/run_kubeflow_jobs/), [RayJob](https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/), [RayCluster](https://kueue.sigs.k8s.io/docs/tasks/run_rayclusters/), [JobSet](https://kueue.sigs.k8s.io/docs/tasks/run_jobsets/), [plain Pod](https://kueue.sigs.k8s.io/docs/tasks/run_plain_pods/).
 - **System insight:** Build-in [prometheus metrics](https://kueue.sigs.k8s.io/docs/reference/metrics/) to help monitor the state of the system, as well as Conditions.
 - **AdmissionChecks:** A mechanism for internal or external components to influence whether a workload can be [admitted](https://kueue.sigs.k8s.io/docs/concepts/admission_check/).
 - **Advanced autoscaling support:** Integration with cluster-autoscaler's [provisioningRequest](https://kueue.sigs.k8s.io/docs/admission-check-controllers/provisioning/#job-using-a-provisioningrequest) via admissionChecks.
2 changes: 1 addition & 1 deletion site/content/en/docs/overview/_index.md
@@ -28,7 +28,7 @@ A core design principle for Kueue is to avoid duplicating mature functionality i
 - **Resource management:** Support resource fair sharing and [preemption](/docs/concepts/cluster_queue/#preemption) with a variety of policies between different tenants.
 - **Dynamic resource reclaim:** A mechanism to [release](/docs/concepts/workload/#dynamic-reclaim) quota as the pods of a Job complete.
 - **Resource flavor fungibility:** Quota [borrowing or preemption](/docs/concepts/cluster_queue/#flavorfungibility) in ClusterQueue and Cohort.
-- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](/docs/tasks/run_jobs/), [Kubeflow training jobs](/docs/tasks/run_kubeflow_jobs/), [RayJob](/docs/tasks/run_rayjobs/), [JobSet](/docs/tasks/run_jobsets/), [plain Pod](/docs/tasks/run_plain_pods/).
+- **Integrations:** Built-in support for popular jobs, e.g. [BatchJob](/docs/tasks/run_jobs/), [Kubeflow training jobs](/docs/tasks/run_kubeflow_jobs/), [RayJob](/docs/tasks/run_rayjobs/), [RayCluster](/docs/tasks/run_rayclusters/), [JobSet](/docs/tasks/run_jobsets/), [plain Pod](/docs/tasks/run_plain_pods/).
 - **System insight:** Build-in [prometheus metrics](/docs/reference/metrics/) to help monitor the state of the system, as well as Conditions.
 - **AdmissionChecks:** A mechanism for internal or external components to influence whether a workload can be [admitted](/docs/concepts/admission_check/).
 - **Advanced autoscaling support:** Integration with cluster-autoscaler's [provisioningRequest](/docs/admission-check-controllers/provisioning/#job-using-a-provisioningrequest) via admissionChecks.
92 changes: 92 additions & 0 deletions site/content/en/docs/tasks/run_rayclusters.md
@@ -0,0 +1,92 @@
---
title: "Run A RayCluster"
date: 2024-01-17
weight: 6
description: >
  Run a RayCluster on Kueue.
---

This page shows how to leverage Kueue's scheduling and resource management capabilities when running a [RayCluster](https://docs.ray.io/en/latest/cluster/getting-started.html).

This guide is for [batch users](/docs/tasks#batch-user) that have a basic understanding of Kueue. For more information, see [Kueue's overview](/docs/overview).

## Before you begin

1. Make sure you are using Kueue v0.6.0 or newer and KubeRay v1.1.0 or newer.

2. Check [Administer cluster quotas](/docs/tasks/administer_cluster_quotas) for details on the initial Kueue setup; a minimal sketch follows this list.

3. See [KubeRay Installation](https://ray-project.github.io/kuberay/deploy/installation/) for installation and configuration details of KubeRay.
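
For reference, a minimal single-queue setup might look like the following. This is only a sketch: the names (`default-flavor`, `cluster-queue`, `local-queue`) and the quota values are illustrative assumptions, not requirements; adapt them to your cluster.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {} # match Workloads in all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 9      # illustrative quota
      - name: "memory"
        nominalQuota: 36Gi   # illustrative quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: local-queue
spec:
  clusterQueue: cluster-queue
```

The `local-queue` name defined here is what the RayCluster references in the next section.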

## RayCluster definition

When running [RayClusters](https://docs.ray.io/en/latest/cluster/getting-started.html) on
Kueue, take into consideration the following aspects:

### a. Queue selection

The target [local queue](/docs/concepts/local_queue) should be specified in the `metadata.labels` section of the RayCluster configuration.

```yaml
metadata:
  name: raycluster-sample
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: local-queue
```
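
Once a labeled RayCluster is applied, Kueue creates a corresponding Workload object and keeps the cluster suspended via `spec.suspend` (the reason KubeRay v1.1.0 is required) until quota is available. A quick way to inspect this, assuming the manifest is saved as `raycluster-sample.yaml` (a hypothetical filename):

```shell
kubectl -n default apply -f raycluster-sample.yaml

# The Workload created by Kueue shows whether the cluster was admitted.
kubectl -n default get workloads

# Once admitted, the head and worker Pods are created as usual.
kubectl -n default get pods
```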
### b. Configure the resource needs
The resource needs of the workload can be configured in the `spec`.

```yaml
headGroupSpec:
  template:
    spec:
      affinity: {}
      containers:
      - env: []
        image: rayproject/ray:2.7.0
        imagePullPolicy: IfNotPresent
        name: ray-head
        resources:
          limits:
            cpu: "1"
            memory: 2G
          requests:
            cpu: "1"
            memory: 2G
        securityContext: {}
        volumeMounts:
        - mountPath: /tmp/ray
          name: log-volume
workerGroupSpecs:
- template:
    spec:
      affinity: {}
      containers:
      - env: []
        image: rayproject/ray:2.7.0
        imagePullPolicy: IfNotPresent
        name: ray-worker
        resources:
          limits:
            cpu: "1"
            memory: 1G
          requests:
            cpu: "1"
            memory: 1G
```

Note that a RayCluster will hold resource quotas while it exists. For optimal resource management, you should delete a RayCluster that is no longer in use.
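
For example, a cluster created from the sample manifest in this guide could be cleaned up with (names assumed from this page):

```shell
# Deleting the RayCluster releases its quota back to the ClusterQueue.
kubectl -n default delete raycluster raycluster-sample
```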

### c. Limitations
- Limited Worker Groups: Because a Kueue workload can have a maximum of 8 PodSets, the maximum number of `spec.workerGroupSpecs` is 7.
- In-Tree Autoscaling Disabled: Kueue manages resource allocation for the RayCluster; therefore, the cluster's internal autoscaling mechanisms need to be disabled, as sketched after this list.
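
A minimal sketch of keeping in-tree autoscaling off, using KubeRay's `enableInTreeAutoscaling` field (it defaults to false, so omitting it has the same effect):

```yaml
spec:
  # Kueue owns quota and admission decisions, so the Ray autoscaler stays off.
  enableInTreeAutoscaling: false
```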

## Example RayCluster

The complete RayCluster manifest looks like the following:

{{< include "examples/jobs/ray-cluster-sample.yaml" "yaml" >}}

You can submit a Ray job using the [CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/quickstart.html) or log into the Ray head and execute a job following this [example](https://ray-project.github.io/kuberay/deploy/helm-cluster/#end-to-end-example) with a kind cluster.
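
As a sketch of the CLI path, assuming the sample cluster above (KubeRay exposes the head through a `<cluster-name>-head-svc` Service, and 8265 is Ray's default dashboard and job-server port):

```shell
# Forward the Ray job server port from the head Service to localhost.
kubectl -n default port-forward svc/raycluster-sample-head-svc 8265:8265 &

# Submit a trivial job that prints the cluster's resources.
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
```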
72 changes: 72 additions & 0 deletions site/static/examples/jobs/ray-cluster-sample.yaml
@@ -0,0 +1,72 @@
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: local-queue
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-head
          resources:
            limits:
              cpu: "1"
              memory: 2G
            requests:
              cpu: "1"
              memory: 2G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
  workerGroupSpecs:
  - groupName: workergroup
    maxReplicas: 10
    minReplicas: 1
    rayStartParams: {}
    replicas: 4
    template:
      metadata:
        annotations: {}
      spec:
        affinity: {}
        containers:
        - env: []
          image: rayproject/ray:2.7.0
          imagePullPolicy: IfNotPresent
          name: ray-worker
          resources:
            limits:
              cpu: "1"
              memory: 1G
            requests:
              cpu: "1"
              memory: 1G
          securityContext: {}
          volumeMounts:
          - mountPath: /tmp/ray
            name: log-volume
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
