diff --git a/doc/source/cluster/cli.rst b/doc/source/cluster/cli.rst index 619f0028e26d..d388c6df93a1 100644 --- a/doc/source/cluster/cli.rst +++ b/doc/source/cluster/cli.rst @@ -56,4 +56,4 @@ This section contains commands for managing Ray clusters. .. click:: ray.scripts.scripts:monitor :prog: ray monitor - :show-nested: \ No newline at end of file + :show-nested: diff --git a/doc/source/cluster/configure-manage-dashboard.md b/doc/source/cluster/configure-manage-dashboard.md index 8195b2003bcc..ce8eb9c9e941 100644 --- a/doc/source/cluster/configure-manage-dashboard.md +++ b/doc/source/cluster/configure-manage-dashboard.md @@ -24,10 +24,10 @@ Pass the keyword argument ``dashboard_port`` in your call to ``ray.init()``. :::{tab-item} VM Cluster Launcher Include the ``--dashboard-port`` argument in the `head_start_ray_commands` section of the [Cluster Launcher's YAML file](https://github.com/ray-project/ray/blob/0574620d454952556fa1befc7694353d68c72049/python/ray/autoscaler/aws/example-full.yaml#L172). ```yaml -head_start_ray_commands: - - ray stop +head_start_ray_commands: + - ray stop # Replace ${YOUR_PORT} with the port number you need. - - ulimit -n 65536; ray start --head --dashboard-port=${YOUR_PORT} --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml + - ulimit -n 65536; ray start --head --dashboard-port=${YOUR_PORT} --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml ``` ::: @@ -66,7 +66,7 @@ The dashboard is now visible at ``http://localhost:8265``. :::{tab-item} KubeRay The KubeRay operator makes Dashboard available via a Service targeting the Ray head pod, named ``-head-svc``. Access -Dashboard from within the Kubernetes cluster at ``http://-head-svc:8265``. +Dashboard from within the Kubernetes cluster at ``http://-head-svc:8265``. There are two ways to expose Dashboard outside the Cluster: @@ -77,7 +77,7 @@ Follow the [instructions](kuberay-ingress) to set up ingress to access Ray Dashb You can also view the dashboard from outside the Kubernetes cluster by using port-forwarding: ```shell -$ kubectl port-forward service/${RAYCLUSTER_NAME}-head-svc 8265:8265 +$ kubectl port-forward service/${RAYCLUSTER_NAME}-head-svc 8265:8265 # Visit ${YOUR_IP}:8265 for the Dashboard (e.g. 127.0.0.1:8265 or ${YOUR_VM_IP}:8265) ``` @@ -199,7 +199,7 @@ Grafana is a tool that supports advanced visualizations of Prometheus metrics an To view embedded time-series visualizations in Ray Dashboard, the following must be set up: 1. The head node of the cluster is able to access Prometheus and Grafana. -2. The browser of the dashboard user is able to access Grafana. +2. The browser of the dashboard user is able to access Grafana. Configure these settings using the `RAY_GRAFANA_HOST`, `RAY_PROMETHEUS_HOST`, `RAY_PROMETHEUS_NAME`, and `RAY_GRAFANA_IFRAME_HOST` environment variables when you start the Ray Clusters. diff --git a/doc/source/cluster/faq.rst b/doc/source/cluster/faq.rst index f453f078abb4..71a68f40b6a7 100644 --- a/doc/source/cluster/faq.rst +++ b/doc/source/cluster/faq.rst @@ -76,7 +76,7 @@ connections. The solution for this problem is to start the worker nodes more slo Problems getting a SLURM cluster to work ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A class of issues exist with starting Ray on SLURM clusters. While the exact causes aren't understood, (as of June 2023), some Ray +A class of issues exist with starting Ray on SLURM clusters. 
While the exact causes aren't understood, (as of June 2023), some Ray improvements mitigate some of the resource contention. Some of the issues reported are as follows: @@ -100,4 +100,4 @@ any of the options `--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources` or `--entrypoint-memory` to `ray job submit`, or the corresponding arguments if using the Python SDK. If these are specified, the job entrypoint will be scheduled on a node that has the requested resources -available. \ No newline at end of file +available. diff --git a/doc/source/cluster/images/ray-job-diagram.svg b/doc/source/cluster/images/ray-job-diagram.svg index 5ee79c9abb15..375725802176 100644 --- a/doc/source/cluster/images/ray-job-diagram.svg +++ b/doc/source/cluster/images/ray-job-diagram.svg @@ -1 +1 @@ - \ No newline at end of file + diff --git a/doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md b/doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md index f9f8abfcb2ea..534f5871f073 100644 --- a/doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md +++ b/doc/source/cluster/kubernetes/benchmarks/memory-scalability-benchmark.md @@ -86,4 +86,4 @@ In addition, the number of custom resources in the Kubernetes cluster does not h * Note that the x-axis "Number of Pods" is the number of Pods that are created rather than running. If the Kubernetes cluster does not have enough computing resources, the GKE Autopilot adds a new Kubernetes node into the cluster. This process may take a few minutes, so some Pods may be pending in the process. -This lag may can explain why the memory usage is somewhat throttled. \ No newline at end of file +This lag may can explain why the memory usage is somewhat throttled. diff --git a/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml b/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml index e5b96c581b18..5a2d01839e9b 100644 --- a/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml +++ b/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml @@ -1,4 +1,4 @@ -# This is a RayCluster configuration for PyTorch image training benchmark with a 1Gi training set. +# This is a RayCluster configuration for PyTorch image training benchmark with a 1Gi training set. apiVersion: ray.io/v1alpha1 kind: RayCluster metadata: diff --git a/doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml b/doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml index 35cf942b20fc..c3a919b1da7a 100644 --- a/doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml +++ b/doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml @@ -1,4 +1,4 @@ -# If your Kubernetes has a default deny network policy for pods, you need to manually apply this network policy +# If your Kubernetes has a default deny network policy for pods, you need to manually apply this network policy # to allow the bidirectional communication among the head and worker nodes in the Ray cluster. 
# Ray Head Ingress @@ -92,4 +92,4 @@ spec: - to: - podSelector: matchLabels: - app: ray-cluster-head \ No newline at end of file + app: ray-cluster-head diff --git a/doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml b/doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml index 5507a9065033..2d56d1050769 100644 --- a/doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml +++ b/doc/source/cluster/kubernetes/configs/static-ray-cluster.tls.yaml @@ -14,7 +14,7 @@ apiVersion: v1 kind: ConfigMap metadata: name: tls -data: +data: gencert_head.sh: | #!/bin/sh ## Create tls.key @@ -380,4 +380,4 @@ spec: # Kubernetes testing environments such as Kind and minikube. requests: cpu: "500m" - memory: "1G" \ No newline at end of file + memory: "1G" diff --git a/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml b/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml index 84b46c36f5f5..aabbb5d694e9 100644 --- a/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml +++ b/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml @@ -1,5 +1,5 @@ -# This section is only required for deploying Redis on Kubernetes for the purpose of enabling Ray -# to write GCS metadata to an external Redis for fault tolerance. If you have already deployed Redis +# This section is only required for deploying Redis on Kubernetes for the purpose of enabling Ray +# to write GCS metadata to an external Redis for fault tolerance. If you have already deployed Redis # on Kubernetes, this section can be removed. kind: ConfigMap apiVersion: v1 diff --git a/doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md b/doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md index c8c512095fee..9f28cf609278 100644 --- a/doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md +++ b/doc/source/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.md @@ -179,7 +179,7 @@ else: ) ``` -You can verify automatic checkpoint recovery by redeploying the same RayJob: +You can verify automatic checkpoint recovery by redeploying the same RayJob: ``` kubectl create -f ray-job.pytorch-image-classifier.yaml ``` @@ -205,92 +205,92 @@ Result( If the previous job failed at an earlier checkpoint, the job should resume from the last saved checkpoint and run until `max_epochs=10`. 
For example, if the last run failed at epoch 7, the training automatically resumes using `checkpoint_000006` and run 3 more iterations until epoch 10: ``` -(TorchTrainer pid=611, ip=10.108.2.65) Restored on 10.108.2.65 from checkpoint: Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000006) -(RayTrainWorker pid=671, ip=10.108.2.65) Setting up process group for: env:// [rank=0, world_size=4] -(TorchTrainer pid=611, ip=10.108.2.65) Started distributed worker processes: -(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.2.65, pid=671) world_rank=0, local_rank=0, node_rank=0 -(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.1.83, pid=589) world_rank=1, local_rank=0, node_rank=1 -(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.0.72, pid=590) world_rank=2, local_rank=0, node_rank=2 -(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.3.76, pid=590) world_rank=3, local_rank=0, node_rank=3 -(RayTrainWorker pid=589, ip=10.108.1.83) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth -(RayTrainWorker pid=671, ip=10.108.2.65) +(TorchTrainer pid=611, ip=10.108.2.65) Restored on 10.108.2.65 from checkpoint: Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_96923_00000_0_2024-04-29_17-21-29/checkpoint_000006) +(RayTrainWorker pid=671, ip=10.108.2.65) Setting up process group for: env:// [rank=0, world_size=4] +(TorchTrainer pid=611, ip=10.108.2.65) Started distributed worker processes: +(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.2.65, pid=671) world_rank=0, local_rank=0, node_rank=0 +(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.1.83, pid=589) world_rank=1, local_rank=0, node_rank=1 +(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.0.72, pid=590) world_rank=2, local_rank=0, node_rank=2 +(TorchTrainer pid=611, ip=10.108.2.65) - (ip=10.108.3.76, pid=590) world_rank=3, local_rank=0, node_rank=3 +(RayTrainWorker pid=589, ip=10.108.1.83) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth +(RayTrainWorker pid=671, ip=10.108.2.65) 0%| | 0.00/97.8M [00:00 Label: tench, Tinca tinca diff --git a/doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md b/doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md index 686e90d34530..18c919bfc661 100644 --- a/doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md +++ b/doc/source/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.md @@ -27,8 +27,8 @@ to manage resources that RayJob and RayCluster consume. See the Gang scheduling is essential when working with expensive, limited hardware accelerators like GPUs. It prevents RayJobs from partially provisioning Ray clusters and claiming but not using the GPUs. -Kueue suspends a RayJob until the Kubernetes cluster and the underlying cloud provider can guarantee -the capacity that the RayJob needs to execute. This approach greatly improves GPU utilization and +Kueue suspends a RayJob until the Kubernetes cluster and the underlying cloud provider can guarantee +the capacity that the RayJob needs to execute. This approach greatly improves GPU utilization and cost, especially when GPU availability is limited. 
## Create a Kubernetes cluster on GKE @@ -191,7 +191,7 @@ Following is the expected behavior when you deploy a GPU-requiring RayJob to a c * Once the required GPU nodes are available, the ProvisioningRequest is satisfied. * Kueue admits the RayJob, allowing Kubernetes to schedule the Ray nodes on the newly provisioned nodes, and the RayJob execution begins. -If GPUs are unavailable, Kueue keeps suspending the RayJob. In addition, the node autoscaler avoids +If GPUs are unavailable, Kueue keeps suspending the RayJob. In addition, the node autoscaler avoids provisioning new nodes until it can fully satisfy the RayJob's GPU requirements. Upon creating a RayJob, notice that the RayJob status is immediately `suspended` despite the ClusterQueue having GPU quotas available. diff --git a/doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md b/doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md index cc24929f2ed4..d15213cbd5a6 100644 --- a/doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md +++ b/doc/source/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.md @@ -103,7 +103,7 @@ The YAML manifest configures: * **LocalQueue** * The LocalQueue `user-queue` is a namespaced object in the `default` namespace which belongs to a ClusterQueue. A typical practice is to assign a namespace to a tenant, team or user, of an organization. Users submit jobs to a LocalQueue, instead of to a ClusterQueue directly. * **WorkloadPriorityClass** - * The WorkloadPriorityClass `prod-priority` has a higher value than the WorkloadPriorityClass `dev-priority`. This means that RayJob custom resources with the `prod-priority` priority class take precedence over RayJob custom resources with the `dev-priority` priority class. + * The WorkloadPriorityClass `prod-priority` has a higher value than the WorkloadPriorityClass `dev-priority`. This means that RayJob custom resources with the `prod-priority` priority class take precedence over RayJob custom resources with the `dev-priority` priority class. Create the Kueue resources: ```bash diff --git a/doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md b/doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md index 6f469d2e9995..4ba8c6c91a09 100644 --- a/doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md +++ b/doc/source/cluster/kubernetes/examples/stable-diffusion-rayservice.md @@ -2,7 +2,7 @@ # Serve a StableDiffusion text-to-image model on Kubernetes -> **Note:** The Python files for the Ray Serve application and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples) repo +> **Note:** The Python files for the Ray Serve application and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples) repo and [the Ray documentation](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html). ## Step 1: Create a Kubernetes cluster with GPUs @@ -56,7 +56,7 @@ Note that the RayService's Kubernetes service will be created after the Serve ap ## Step 5: Send a request to the text-to-image model ```sh -# Step 5.1: Download `stable_diffusion_req.py` +# Step 5.1: Download `stable_diffusion_req.py` curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/stable_diffusion/stable_diffusion_req.py # Step 5.2: Set your `prompt` in `stable_diffusion_req.py`. 
@@ -66,4 +66,4 @@ python stable_diffusion_req.py # Check output.png ``` -* You can refer to the document ["Serving a Stable Diffusion Model"](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html) for an example output image. \ No newline at end of file +* You can refer to the document ["Serving a Stable Diffusion Model"](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html) for an example output image. diff --git a/doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md b/doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md index d0e3c5dbea9a..de465c13d7c7 100644 --- a/doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md +++ b/doc/source/cluster/kubernetes/examples/text-summarizer-rayservice.md @@ -54,7 +54,7 @@ Note that the RayService's Kubernetes service will be created after the Serve ap ## Step 5: Send a request to the text_summarizer model ```sh -# Step 5.1: Download `text_summarizer_req.py` +# Step 5.1: Download `text_summarizer_req.py` curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/text_summarizer/text_summarizer_req.py # Step 5.2: Send a request to the Summarizer model. @@ -71,4 +71,4 @@ kubectl delete -f ray-service.text-summarizer.yaml ## Step 7: Uninstall your kuberay operator -Follow [this document](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-operator) to uninstall the latest stable KubeRay operator via Helm repository. \ No newline at end of file +Follow [this document](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-operator) to uninstall the latest stable KubeRay operator via Helm repository. diff --git a/doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md b/doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md index 42e7b5e15939..72326b6bbf09 100644 --- a/doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md +++ b/doc/source/cluster/kubernetes/getting-started/raycluster-quick-start.md @@ -85,7 +85,7 @@ Note that in production scenarios, you will want to use larger Ray pods. In fact ## Step 4: Run an application on a RayCluster -Now, let's interact with the RayCluster we've deployed. +Now, let's interact with the RayCluster we've deployed. ### Method 1: Execute a Ray job in the head Pod @@ -102,7 +102,7 @@ kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.clust # 2023-04-07 10:57:46,472 INFO worker.py:1243 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS # 2023-04-07 10:57:46,472 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379... -# 2023-04-07 10:57:46,482 INFO worker.py:1550 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265 +# 2023-04-07 10:57:46,482 INFO worker.py:1550 -- Connected to Ray cluster. 
View the dashboard at http://10.244.0.6:8265 # {'object_store_memory': 802572287.0, 'memory': 3000000000.0, 'node:10.244.0.6': 1.0, 'CPU': 2.0, 'node:10.244.0.7': 1.0} ``` diff --git a/doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md b/doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md index 14b9c73aefe2..770ff1e70727 100644 --- a/doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md +++ b/doc/source/cluster/kubernetes/getting-started/rayjob-quick-start.md @@ -164,8 +164,8 @@ kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.1.1/ra ``` The `ray-job.shutdown.yaml` defines a RayJob custom resource with `shutdownAfterJobFinishes: true` and `ttlSecondsAfterFinished: 10`. -Hence, the KubeRay operator deletes the RayCluster 10 seconds after the Ray job finishes. Note that the submitter job is not deleted -because it contains the ray job logs and does not use any cluster resources once completed. In addition, the submitter job will always +Hence, the KubeRay operator deletes the RayCluster 10 seconds after the Ray job finishes. Note that the submitter job is not deleted +because it contains the ray job logs and does not use any cluster resources once completed. In addition, the submitter job will always be cleaned up when the RayJob is eventually deleted due to its owner reference back to the RayJob. ## Step 8: Check the RayJob status diff --git a/doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md b/doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md index 0851d1378400..4bd634d5ecbd 100644 --- a/doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md +++ b/doc/source/cluster/kubernetes/getting-started/rayservice-quick-start.md @@ -43,7 +43,7 @@ Please note that the YAML file in this example uses `serveConfigV2` to specify a kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.1.1/ray-operator/config/samples/ray-service.sample.yaml ``` -## Step 4: Verify the Kubernetes cluster status +## Step 4: Verify the Kubernetes cluster status ```sh # Step 4.1: List all RayService custom resources in the `default` namespace. 
diff --git a/doc/source/cluster/kubernetes/images/AutoscalerOperator.svg b/doc/source/cluster/kubernetes/images/AutoscalerOperator.svg index 1af7fed88348..a2a5729e96f6 100644 --- a/doc/source/cluster/kubernetes/images/AutoscalerOperator.svg +++ b/doc/source/cluster/kubernetes/images/AutoscalerOperator.svg @@ -1 +1 @@ - \ No newline at end of file + diff --git a/doc/source/cluster/kubernetes/images/kubeflow-architecture.svg b/doc/source/cluster/kubernetes/images/kubeflow-architecture.svg index af920873a932..784fbe35d0f5 100644 --- a/doc/source/cluster/kubernetes/images/kubeflow-architecture.svg +++ b/doc/source/cluster/kubernetes/images/kubeflow-architecture.svg @@ -1 +1 @@ - \ No newline at end of file + diff --git a/doc/source/cluster/kubernetes/images/rbac-clusterrole.svg b/doc/source/cluster/kubernetes/images/rbac-clusterrole.svg index 742676b7e3b6..154a9d1d9ca5 100644 --- a/doc/source/cluster/kubernetes/images/rbac-clusterrole.svg +++ b/doc/source/cluster/kubernetes/images/rbac-clusterrole.svg @@ -1 +1 @@ - \ No newline at end of file + diff --git a/doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg b/doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg index ab62a76865a3..88b1b299cdee 100644 --- a/doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg +++ b/doc/source/cluster/kubernetes/images/rbac-role-multi-namespaces.svg @@ -1 +1 @@ - \ No newline at end of file + diff --git a/doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg b/doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg index dd751180ad66..f5cf33e374f7 100644 --- a/doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg +++ b/doc/source/cluster/kubernetes/images/rbac-role-one-namespace.svg @@ -1 +1 @@ - \ No newline at end of file + diff --git a/doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md b/doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md index 66ae8fdfeb49..bb006ccc67b0 100644 --- a/doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md +++ b/doc/source/cluster/kubernetes/k8s-ecosystem/ingress.md @@ -188,7 +188,7 @@ helm repo update helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1 # Step 4: Install RayCluster and create an ingress separately. -# More information about change of setting was documented in https://github.com/ray-project/kuberay/pull/699 +# More information about change of setting was documented in https://github.com/ray-project/kuberay/pull/699 # and `ray-operator/config/samples/ray-cluster.separate-ingress.yaml` curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.1.1/ray-operator/config/samples/ray-cluster.separate-ingress.yaml kubectl apply -f ray-operator/config/samples/ray-cluster.separate-ingress.yaml diff --git a/doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md b/doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md index 91320b342677..2553f8c6924a 100644 --- a/doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md +++ b/doc/source/cluster/kubernetes/k8s-ecosystem/kubeflow.md @@ -71,9 +71,9 @@ kubectl get pod -l ray.io/cluster=raycluster-kuberay * As mentioned in Step 4, Ray is very sensitive to the Python versions and Ray versions between the server (RayCluster) and client (JupyterLab) sides. Open a terminal in the JupyterLab: ```sh # Check Python version. The version's MAJOR and MINOR should match with RayCluster (i.e. 
Python 3.8) - python --version + python --version # Python 3.8.10 - + # Install Ray 2.2.0 pip install -U ray[default]==2.2.0 ``` diff --git a/doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md b/doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md index d963789745e5..9d821fffb00f 100644 --- a/doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md +++ b/doc/source/cluster/kubernetes/k8s-ecosystem/kueue.md @@ -11,7 +11,7 @@ Refer to [Priority Scheduling with RayJob and Kueue](kuberay-kueue-priority-sche * To admit a job to start, which triggers Kubernetes to create pods. * To preempt a job, which triggers Kubernetes to delete active pods. -Kueue has native support for some KubeRay APIs. Specifically, you can use Kueue to manage resources consumed by RayJob and RayCluster. +Kueue has native support for some KubeRay APIs. Specifically, you can use Kueue to manage resources consumed by RayJob and RayCluster. See the [Kueue documentation](https://kueue.sigs.k8s.io/docs/overview/) to learn more. ## Step 0: Create a Kind cluster @@ -94,7 +94,7 @@ The YAML manifest configures: * **LocalQueue** * The LocalQueue `user-queue` is a namespaced object in the `default` namespace which belongs to a ClusterQueue. A typical practice is to assign a namespace to a tenant, team, or user of an organization. Users submit jobs to a LocalQueue, instead of to a ClusterQueue directly. * **WorkloadPriorityClass** - * The WorkloadPriorityClass `prod-priority` has a higher value than the WorkloadPriorityClass `dev-priority`. RayJob custom resources with the `prod-priority` priority class take precedence over RayJob custom resources with the `dev-priority` priority class. + * The WorkloadPriorityClass `prod-priority` has a higher value than the WorkloadPriorityClass `dev-priority`. RayJob custom resources with the `prod-priority` priority class take precedence over RayJob custom resources with the `dev-priority` priority class. Create the Kueue resources: ```bash diff --git a/doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md b/doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md index f8809103b398..2b41c7dee42d 100644 --- a/doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md +++ b/doc/source/cluster/kubernetes/k8s-ecosystem/pyspy.md @@ -52,7 +52,7 @@ kubectl port-forward svc/raycluster-py-spy-head-svc 8265:8265 kubectl exec -it ${YOUR_HEAD_POD} -- bash # (Head Pod) Run a sample job in the Pod -# `long_running_task` includes a `while True` loop to ensure the task remains actively running indefinitely. +# `long_running_task` includes a `while True` loop to ensure the task remains actively running indefinitely. # This allows you ample time to view the Stack Trace and CPU Flame Graph via Ray Dashboard. 
python3 samples/long_running_task.py ``` diff --git a/doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md b/doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md index e163521301e9..37368fbdcd55 100644 --- a/doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md +++ b/doc/source/cluster/kubernetes/k8s-ecosystem/volcano.md @@ -34,7 +34,7 @@ batchScheduler: * Pass the `--set batchScheduler.enabled=true` flag when running on the command line: ```shell -# Install the Helm chart with --enable-batch-scheduler flag set to true +# Install the Helm chart with --enable-batch-scheduler flag set to true helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0 --set batchScheduler.enabled=true ``` @@ -79,7 +79,7 @@ For guidance, see [examples](https://github.com/volcano-sh/volcano/tree/master/e ## Example -Before going through the example, remove any running Ray Clusters to ensure a successful run through of the example below. +Before going through the example, remove any running Ray Clusters to ensure a successful run through of the example below. ```shell kubectl delete raycluster --all ``` diff --git a/doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md b/doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md index 3b92e26e3553..06b8d3542a6a 100644 --- a/doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md +++ b/doc/source/cluster/kubernetes/troubleshooting/troubleshooting.md @@ -7,7 +7,7 @@ If you don't find an answer to your question here, please don't hesitate to conn # Contents -- [Use ARM-based docker images for Apple M1 or M2 MacBooks](#docker-image-for-apple-macbooks) +- [Use ARM-based docker images for Apple M1 or M2 MacBooks](#docker-image-for-apple-macbooks) - [Upgrade KubeRay](#upgrade-kuberay) - [Worker init container](#worker-init-container) - [Cluster domain](#cluster-domain) @@ -17,7 +17,7 @@ If you don't find an answer to your question here, please don't hesitate to conn (docker-image-for-apple-macbooks)= ## Use ARM-based docker images for Apple M1 or M2 MacBooks -Ray builds different images for different platforms. Until Ray moves to building multi-architecture images, [tracked by this Github issue](https://github.com/ray-project/ray/issues/39364), use platform-specific docker images in the head and worker group specs of the [RayCluster config](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#image). +Ray builds different images for different platforms. Until Ray moves to building multi-architecture images, [tracked by this Github issue](https://github.com/ray-project/ray/issues/39364), use platform-specific docker images in the head and worker group specs of the [RayCluster config](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#image). Use an image with the tag `aarch64`, for example, `image: rayproject/ray:2.20.0-aarch64`), if you are running KubeRay on a MacBook M1 or M2. 
diff --git a/doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md b/doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md index 79a492590972..1f691875db4c 100644 --- a/doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md +++ b/doc/source/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md @@ -7,7 +7,7 @@ The configuration outlined here can be applied to most KubeRay examples found in ## Step 1: Create a Kubernetes cluster on Amazon EKS -Follow the first two steps in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#) to: +Follow the first two steps in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#) to: (1) create your Amazon EKS cluster and (2) configure your computer to communicate with your cluster. ## Step 2: Create node groups for the Amazon EKS cluster @@ -17,7 +17,7 @@ The following section provides more detailed information. ### Create a CPU node group -Typically, avoid running GPU workloads on the Ray head. Create a CPU node group for all Pods except Ray GPU +Typically, avoid running GPU workloads on the Ray head. Create a CPU node group for all Pods except Ray GPU workers, such as the KubeRay operator, Ray head, and CoreDNS Pods. Here's a common configuration that works for most KubeRay examples in the docs: @@ -46,11 +46,11 @@ Create a GPU node group for Ray GPU workers. ```sh # Install the DaemonSet kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml - + # Verify that your nodes have allocatable GPUs. If the GPU node fails to detect GPUs, # please verify whether the DaemonSet schedules the Pod on the GPU node. kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" - + # Example output: # NAME GPU # ip-....us-west-2.compute.internal 4 diff --git a/doc/source/cluster/kubernetes/user-guides/config.md b/doc/source/cluster/kubernetes/user-guides/config.md index eacb07535901..5ca8df8f1c73 100644 --- a/doc/source/cluster/kubernetes/user-guides/config.md +++ b/doc/source/cluster/kubernetes/user-guides/config.md @@ -171,7 +171,7 @@ the same Ray version as the CR's `spec.rayVersion`. If you are using a nightly or development Ray image, you can specify Ray's latest release version under `spec.rayVersion`. -For Apple M1 or M2 MacBooks, see [Use ARM-based docker images for Apple M1 or M2 MacBooks](docker-image-for-apple-macbooks) to specify the +For Apple M1 or M2 MacBooks, see [Use ARM-based docker images for Apple M1 or M2 MacBooks](docker-image-for-apple-macbooks) to specify the correct image. You must install code dependencies for a given Ray task or actor on each Ray node that diff --git a/doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md b/doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md index c457b1de01be..0dacd38f4c7c 100644 --- a/doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md +++ b/doc/source/cluster/kubernetes/user-guides/configuring-autoscaling.md @@ -16,7 +16,7 @@ Autoscaling can reduce workload costs, but adds node launch overheads and can be We recommend starting with non-autoscaling clusters if you're new to Ray. ``` -```{admonition} Ray Autoscaling V2 alpha with KubeRay (@ray 2.10.0) +```{admonition} Ray Autoscaling V2 alpha with KubeRay (@ray 2.10.0) With Ray 2.10, Ray Autoscaler V2 alpha is available with KubeRay. It has improvements on observability and stability. 
Please see the [section](kuberay-autoscaler-v2) for more details. ``` @@ -263,7 +263,7 @@ The [ray-cluster.autoscaler.yaml](https://github.com/ray-project/kuberay/blob/v1 ### 1. Enabling autoscaling * **`enableInTreeAutoscaling`**: By setting `enableInTreeAutoscaling: true`, the KubeRay operator automatically configures an autoscaling sidecar container for the Ray head Pod. -* **`minReplicas` / `maxReplicas` / `replicas`**: +* **`minReplicas` / `maxReplicas` / `replicas`**: Set the `minReplicas` and `maxReplicas` fields to define the range for `replicas` in an autoscaling `workerGroup`. Typically, you would initialize both `replicas` and `minReplicas` with the same value during the deployment of an autoscaling cluster. Subsequently, the Ray Autoscaler adjusts the `replicas` field as it adds or removes Pods from the cluster. @@ -305,7 +305,7 @@ The default values are indicated below: * **`image`**: This field overrides the Autoscaler container image. -The container uses the same **image** as the Ray container by default. +The container uses the same **image** as the Ray container by default. * **`imagePullPolicy`**: This field overrides the Autoscaler container's image pull policy. @@ -377,7 +377,7 @@ See [(Advanced) Understanding the Ray Autoscaler in the Context of Kubernetes](r The release of Ray 2.10.0 introduces the alpha version of Ray Autoscaler V2 integrated with KubeRay, bringing enhancements in terms of observability and stability: -1. **Observability**: The Autoscaler V2 provides instance level tracing on each Ray worker's lifecycle, making it easier to debug and understand the Autoscaler behavior. It also reports +1. **Observability**: The Autoscaler V2 provides instance level tracing on each Ray worker's lifecycle, making it easier to debug and understand the Autoscaler behavior. It also reports the idle information (why it's idle, why it's not idle) of each node: ```bash @@ -463,6 +463,6 @@ spec: - replicas: 1 template: spec: - restartPolicy: Never + restartPolicy: Never ... ``` diff --git a/doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md b/doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md index 7e94ba7177fb..c9c6c24c8e49 100644 --- a/doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md +++ b/doc/source/cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md @@ -59,11 +59,11 @@ If you encounter any issues with the GPU drivers installed by GKE, you can manua # Install NVIDIA GPU device driver kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml -# Verify that your nodes have allocatable GPUs +# Verify that your nodes have allocatable GPUs kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" -# Verify that your nodes have allocatable GPUs +# Verify that your nodes have allocatable GPUs # NAME GPU # ...... # ...... 
1 -``` \ No newline at end of file +``` diff --git a/doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md b/doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md index 82ca976a0399..c55997d8c1d8 100644 --- a/doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md +++ b/doc/source/cluster/kubernetes/user-guides/gke-gcs-bucket.md @@ -59,7 +59,7 @@ kubectl annotate serviceaccount my-ksa \ ## Create a Google Cloud Storage Bucket and allow the Google Cloud Service Account to access it -Please follow the documentation at to create a bucket using the Google Cloud Console or the `gsutil` command line tool. +Please follow the documentation at to create a bucket using the Google Cloud Console or the `gsutil` command line tool. This example gives the principal `my-iam-sa@my-project-id.iam.gserviceaccount.com` "Storage Admin" permissions on the bucket. Enable the permissions in the Google Cloud Console ("Permissions" tab under "Buckets" > "Bucket Details") or with the following command: @@ -100,7 +100,7 @@ Use `kubectl get pod` to get the name of the Ray head pod. Then run the followi kubectl exec -it raycluster-mini-head-xxxx -- /bin/bash ``` -In the shell, run `pip install google-cloud-storage` to install the Google Cloud Storage Python client library. +In the shell, run `pip install google-cloud-storage` to install the Google Cloud Storage Python client library. (For production use cases, you will need to make sure `google-cloud-storage` is installed on every node of your cluster, or use `ray.init(runtime_env={"pip": ["google-cloud-storage"]})` to have the package installed as needed at runtime -- see for more details.) @@ -121,13 +121,13 @@ def check_gcs_read_write(): client = storage.Client() bucket = client.get_bucket(GCP_GCS_BUCKET) blob = bucket.blob(GCP_GCS_FILE) - + # Write to the bucket blob.upload_from_string("Hello, Ray on GKE!") - + # Read from the bucket content = blob.download_as_text() - + return content result = ray.get(check_gcs_read_write.remote()) diff --git a/doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md b/doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md index d219ffa34f72..4e5c3daa5b89 100644 --- a/doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md +++ b/doc/source/cluster/kubernetes/user-guides/helm-chart-rbac.md @@ -34,7 +34,7 @@ helm install kuberay-operator . * Set to `true` in most cases. Set to `false` in the uncommon case of using a Kubernetes cluster managed by GitOps tools such as ArgoCD. For additional details, refer to [ray-project/kuberay#1162](https://github.com/ray-project/kuberay/pull/1162). Default: true. The [values.yaml](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml) file contains detailed descriptions of the parameters. 
-See these pull requests for more context on parameters: +See these pull requests for more context on parameters: * [ray-project/kuberay#1106](https://github.com/ray-project/kuberay/pull/1106) * [ray-project/kuberay#1162](https://github.com/ray-project/kuberay/pull/1162) * [ray-project/kuberay#1190](https://github.com/ray-project/kuberay/pull/1190) diff --git a/doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md b/doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md index 036ebd176d34..2051e819960c 100644 --- a/doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md +++ b/doc/source/cluster/kubernetes/user-guides/kuberay-gcs-ft.md @@ -48,7 +48,7 @@ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operat kubectl apply -f ray-cluster.external-redis.yaml ``` -### Step 4: Verify the Kubernetes cluster status +### Step 4: Verify the Kubernetes cluster status ```sh # Step 4.1: List all Pods in the `default` namespace. diff --git a/doc/source/cluster/kubernetes/user-guides/logging.md b/doc/source/cluster/kubernetes/user-guides/logging.md index 25fc5ffa6207..26a82e97a518 100644 --- a/doc/source/cluster/kubernetes/user-guides/logging.md +++ b/doc/source/cluster/kubernetes/user-guides/logging.md @@ -179,7 +179,7 @@ Run the following command to create a ConfigMap named `cluster-info` with the cl ```shell ClusterName=fluent-bit-demo -RegionName=us-west-2 +RegionName=us-west-2 FluentBitHttpPort='2020' FluentBitReadFromHead='Off' [[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off'|| FluentBitReadFromTail='On' @@ -293,6 +293,3 @@ Follow the steps below to set the environment variable ``RAY_LOG_TO_STDERR=1`` o :::: - - - diff --git a/doc/source/cluster/kubernetes/user-guides/observability.md b/doc/source/cluster/kubernetes/user-guides/observability.md index a789c9fe7586..a614fa3c26f6 100644 --- a/doc/source/cluster/kubernetes/user-guides/observability.md +++ b/doc/source/cluster/kubernetes/user-guides/observability.md @@ -79,4 +79,4 @@ kubectl exec -it $HEAD_POD -- ray summary actors # 9 ... ALIVE: 1 # 10 ... ALIVE: 1 # 11 ... ALIVE: 1 -``` \ No newline at end of file +``` diff --git a/doc/source/cluster/kubernetes/user-guides/pod-command.md b/doc/source/cluster/kubernetes/user-guides/pod-command.md index 002401accd4f..942a8a190d42 100644 --- a/doc/source/cluster/kubernetes/user-guides/pod-command.md +++ b/doc/source/cluster/kubernetes/user-guides/pod-command.md @@ -52,7 +52,7 @@ Note that this environment variable doesn't include the `ulimit` command. ```sh # Example of the environment variable `KUBERAY_GEN_RAY_START_CMD` in the head Pod. ray start --head --dashboard-host=0.0.0.0 --num-cpus=1 --block --metrics-export-port=8080 --memory=2147483648 - ``` + ``` The head Pod's `command`/`args` looks like the following: diff --git a/doc/source/cluster/kubernetes/user-guides/pod-security.md b/doc/source/cluster/kubernetes/user-guides/pod-security.md index 14302ef0928e..8b1c75e66a09 100644 --- a/doc/source/cluster/kubernetes/user-guides/pod-security.md +++ b/doc/source/cluster/kubernetes/user-guides/pod-security.md @@ -3,10 +3,10 @@ # Pod Security Kubernetes defines three different Pod Security Standards, including `privileged`, `baseline`, and `restricted`, to broadly -cover the security spectrum. The `privileged` standard allows users to do known privilege escalations, and thus it is not +cover the security spectrum. 
The `privileged` standard allows users to do known privilege escalations, and thus it is not safe enough for security-critical applications. -This document describes how to configure RayCluster YAML file to apply `restricted` Pod security standard. The following +This document describes how to configure RayCluster YAML file to apply `restricted` Pod security standard. The following references can help you understand this document better: * [Kubernetes - Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted) @@ -30,8 +30,8 @@ The `kind-config.yaml` enables audit logging with the audit policy defined in `a defines an auditing policy to listen to the Pod events in the namespace `pod-security`. With this policy, we can check whether our Pods violate the policies in `restricted` standard or not. -The feature [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/) is firstly -introduced in Kubernetes v1.22 (alpha) and becomes stable in Kubernetes v1.25. In addition, KubeRay currently supports +The feature [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/) is firstly +introduced in Kubernetes v1.22 (alpha) and becomes stable in Kubernetes v1.25. In addition, KubeRay currently supports Kubernetes from v1.19 to v1.24. (At the time of writing, we have not tested KubeRay with Kubernetes v1.25). Hence, I use **Kubernetes v1.24** in this step. # Step 2: Check the audit logs @@ -54,9 +54,9 @@ kubectl label --overwrite ns pod-security \ pod-security.kubernetes.io/enforce-version=latest ``` -With the `pod-security.kubernetes.io` labels, the built-in Kubernetes Pod security admission controller will apply the +With the `pod-security.kubernetes.io` labels, the built-in Kubernetes Pod security admission controller will apply the `restricted` Pod security standard to all Pods in the namespace `pod-security`. The label -`pod-security.kubernetes.io/enforce=restricted` means that the Pod will be rejected if it violate the policies defined in +`pod-security.kubernetes.io/enforce=restricted` means that the Pod will be rejected if it violate the policies defined in `restricted` security standard. See [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/) for more details about the labels. # Step 4: Install the KubeRay operator @@ -124,7 +124,7 @@ python3 samples/xgboost_example.py # Check the job status in the dashboard on your browser. # http://127.0.0.1:8265/#/job => The job status should be "SUCCEEDED". -# (Head Pod) Make sure Python dependencies can be installed under `restricted` security standard +# (Head Pod) Make sure Python dependencies can be installed under `restricted` security standard pip3 install jsonpatch echo $? # Check the exit code of `pip3 install jsonpatch`. It should be 0. diff --git a/doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md b/doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md index d5530ecb6a19..7c6d0a562220 100644 --- a/doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md +++ b/doc/source/cluster/kubernetes/user-guides/static-ray-cluster-without-kuberay.md @@ -58,10 +58,10 @@ namespace, specify the namespace in your kubectl commands: ! 
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml -# Note that the Ray cluster has fault tolerance enabled by default using the external Redis. +# Note that the Ray cluster has fault tolerance enabled by default using the external Redis. # Please set the Redis IP address in the config. -# The password is currently set as '' for the external Redis. +# The password is currently set as '' for the external Redis. # Please download the config file and substitute the real password for the empty string if the external Redis has a password. ``` diff --git a/doc/source/cluster/metrics.md b/doc/source/cluster/metrics.md index ec687eee606e..6ef27ca5af76 100644 --- a/doc/source/cluster/metrics.md +++ b/doc/source/cluster/metrics.md @@ -233,7 +233,7 @@ Here are some instructions for each of the paths: (grafana)= ### Simplest: Setting up Grafana with Ray-provided configurations -Grafana is a tool that supports advanced visualizations of Prometheus metrics and allows you to create custom dashboards with your favorite metrics. +Grafana is a tool that supports advanced visualizations of Prometheus metrics and allows you to create custom dashboards with your favorite metrics. ::::{tab-set} @@ -241,7 +241,7 @@ Grafana is a tool that supports advanced visualizations of Prometheus metrics an ```{admonition} Note :class: note -The instructions below describe one way of starting a Grafana server on a macOS machine. Refer to the [Grafana documentation](https://grafana.com/docs/grafana/latest/setup-grafana/start-restart-grafana/#start-the-grafana-server) for how to start Grafana servers in different systems. +The instructions below describe one way of starting a Grafana server on a macOS machine. Refer to the [Grafana documentation](https://grafana.com/docs/grafana/latest/setup-grafana/start-restart-grafana/#start-the-grafana-server) for how to start Grafana servers in different systems. For KubeRay users, follow [these instructions](kuberay-prometheus-grafana) to set up Grafana. ``` diff --git a/doc/source/cluster/running-applications/job-submission/cli.rst b/doc/source/cluster/running-applications/job-submission/cli.rst index 2c0193ef55b6..0b74e9608d75 100644 --- a/doc/source/cluster/running-applications/job-submission/cli.rst +++ b/doc/source/cluster/running-applications/job-submission/cli.rst @@ -3,7 +3,7 @@ Ray Jobs CLI API Reference ========================== -This section contains commands for :ref:`Ray Job Submission `. +This section contains commands for :ref:`Ray Job Submission `. .. _ray-job-submit-doc: @@ -12,7 +12,7 @@ This section contains commands for :ref:`Ray Job Submission `. .. warning:: - When using the CLI, do not wrap the entrypoint command in quotes. For example, use + When using the CLI, do not wrap the entrypoint command in quotes. For example, use ``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``. Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``. @@ -50,4 +50,4 @@ This section contains commands for :ref:`Ray Job Submission `. .. 
click:: ray.dashboard.modules.job.cli:delete :prog: ray job delete - :show-nested: \ No newline at end of file + :show-nested: diff --git a/doc/source/cluster/running-applications/job-submission/openapi.yml b/doc/source/cluster/running-applications/job-submission/openapi.yml index f5e109d8f913..e0c7a453f4e3 100644 --- a/doc/source/cluster/running-applications/job-submission/openapi.yml +++ b/doc/source/cluster/running-applications/job-submission/openapi.yml @@ -275,8 +275,8 @@ paths: get: summary: Tail Job Logs description: | - WebSocket endpoint for tailing the logs of a job - (Not documented in OpenAPI, see + WebSocket endpoint for tailing the logs of a job + (Not documented in OpenAPI, see https://docs.ray.io/en/latest/_modules/ray/dashboard/modules/job/sdk.html#JobSubmissionClient.tail_job_logs for example usage). parameters: @@ -306,7 +306,7 @@ paths: schema: description: The error message. type: string - + components: @@ -471,4 +471,4 @@ components: - STOPPED - SUCCEEDED - FAILED - type: string \ No newline at end of file + type: string diff --git a/doc/source/cluster/running-applications/job-submission/quickstart.rst b/doc/source/cluster/running-applications/job-submission/quickstart.rst index 39e6b662b926..c40e344e3b38 100644 --- a/doc/source/cluster/running-applications/job-submission/quickstart.rst +++ b/doc/source/cluster/running-applications/job-submission/quickstart.rst @@ -53,7 +53,7 @@ Start with a sample script that you can run locally. The following script uses R ray.init() print(ray.get(hello_world.remote())) -Create an empty working directory with the preceding Python script inside a file named ``script.py``. +Create an empty working directory with the preceding Python script inside a file named ``script.py``. .. code-block:: bash @@ -79,7 +79,7 @@ Alternatively, you can also pass the ``--address=http://127.0.0.1:8265`` flag ex Additionally, if you wish to pass headers per HTTP request to the Cluster, use the `RAY_JOB_HEADERS` environment variable. This environment variable must be in JSON form. .. code-block:: bash - + $ export RAY_JOB_HEADERS='{"KEY": "VALUE"}' To submit the job, use ``ray job submit``. @@ -88,7 +88,7 @@ For local clusters this argument isn't strictly necessary, but for remote cluste .. code-block:: bash - $ ray job submit --working-dir your_working_directory -- python script.py + $ ray job submit --working-dir your_working_directory -- python script.py # Job submission server address: http://127.0.0.1:8265 @@ -119,9 +119,9 @@ This command runs the entrypoint script on the Ray Cluster's head node and waits .. note:: - By default the entrypoint script runs on the head node. To override this behavior, specify one of the - `--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources`, or - `--entrypoint-memory` arguments to the `ray job submit` command. + By default the entrypoint script runs on the head node. To override this behavior, specify one of the + `--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources`, or + `--entrypoint-memory` arguments to the `ray job submit` command. See :ref:`Specifying CPU and GPU resources ` for more details. Interacting with Long-running Jobs @@ -150,7 +150,7 @@ Now submit the job: .. 
code-block:: shell - $ ray job submit --no-wait --working-dir your_working_directory -- python script.py + $ ray job submit --no-wait --working-dir your_working_directory -- python script.py # Job submission server address: http://127.0.0.1:8265 # ------------------------------------------------------- @@ -216,7 +216,7 @@ Run the following command on your local machine, where ``cluster.yaml`` is the c ray dashboard cluster.yaml -Once this command is running, verify that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``. +Once this command is running, verify that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``. Also, verify that you set the environment variable ``RAY_ADDRESS`` to ``"http://127.0.0.1:8265"``. After this setup, you can use the Jobs CLI on the local machine as in the preceding example to interact with the remote Ray cluster. Using the CLI on Kubernetes @@ -255,13 +255,13 @@ Submit this job using the default environment. This environment is the environme .. code-block:: bash - $ ray job submit -- python script.py + $ ray job submit -- python script.py # Job submission server address: http://127.0.0.1:8265 - # + # # ------------------------------------------------------- # Job 'raysubmit_seQk3L4nYWcUBwXD' submitted successfully # ------------------------------------------------------- - # + # # Next steps # Query the logs of the job: # ray job logs raysubmit_seQk3L4nYWcUBwXD @@ -269,10 +269,10 @@ Submit this job using the default environment. This environment is the environme # ray job status raysubmit_seQk3L4nYWcUBwXD # Request the job to be stopped: # ray job stop raysubmit_seQk3L4nYWcUBwXD - # + # # Tailing logs until the job exits (disable with --no-wait): # requests version: 2.28.1 - # + # # ------------------------------------------ # Job 'raysubmit_seQk3L4nYWcUBwXD' succeeded # ------------------------------------------ @@ -281,7 +281,7 @@ Now submit the job with a runtime environment that pins the version of the ``req .. code-block:: bash - $ ray job submit --runtime-env-json='{"pip": ["requests==2.26.0"]}' -- python script.py + $ ray job submit --runtime-env-json='{"pip": ["requests==2.26.0"]}' -- python script.py # Job submission server address: http://127.0.0.1:8265 # ------------------------------------------------------- @@ -308,6 +308,6 @@ Now submit the job with a runtime environment that pins the version of the ``req If both the Driver and Job specify a runtime environment, Ray tries to merge them and raises an exception if they conflict. See :ref:`runtime environments ` for more details. -- See :ref:`Ray Jobs CLI ` for a full API reference of the CLI. +- See :ref:`Ray Jobs CLI ` for a full API reference of the CLI. - See :ref:`Ray Jobs SDK ` for a full API reference of the SDK. - For more information, see :ref:`Programmatic job submission ` and :ref:`Job submission using REST `. diff --git a/doc/source/cluster/running-applications/job-submission/ray-client.rst b/doc/source/cluster/running-applications/job-submission/ray-client.rst index 78e36f66b59f..94f3d5f8b478 100644 --- a/doc/source/cluster/running-applications/job-submission/ray-client.rst +++ b/doc/source/cluster/running-applications/job-submission/ray-client.rst @@ -87,7 +87,7 @@ Step 2: Configure Access Ensure that your local machine can access the Ray Client port on the head node. -The easiest way to accomplish this is to use SSH port forwarding or `K8s port-forwarding `_. 
+The easiest way to accomplish this is to use SSH port forwarding or `K8s port-forwarding `_. This allows you to connect to the Ray Client server on the head node via ``localhost``. First, open up an SSH connection with your Ray cluster and forward the @@ -300,6 +300,6 @@ Ray workers are started in the ``/tmp/ray/session_latest/runtime_resources/_ray_ Troubleshooting --------------- -Error: Attempted to reconnect a session that has already been cleaned up +Error: Attempted to reconnect a session that has already been cleaned up ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This error happens when Ray Client reconnects to a head node that does not recognize the client. This can happen if the head node restarts unexpectedly and loses state. On Kubernetes, this can happen if the head pod restarts after being evicted or crashing. diff --git a/doc/source/cluster/running-applications/job-submission/sdk.rst b/doc/source/cluster/running-applications/job-submission/sdk.rst index 87ce51323a98..99a12c95a400 100644 --- a/doc/source/cluster/running-applications/job-submission/sdk.rst +++ b/doc/source/cluster/running-applications/job-submission/sdk.rst @@ -48,7 +48,7 @@ Let's start with a sample script that can be run locally. The following script u ray.init() print(ray.get(hello_world.remote())) -SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses; see +SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses; see :ref:`Using a Remote Cluster ` for details on setting up port forwarding. .. code-block:: python @@ -154,8 +154,8 @@ The output should look something like the following: To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information. -Job information (status and associated metadata) is stored on the cluster indefinitely. -To delete this information, you may call ``client.delete_job(job_id)`` for any job that is already in a terminal state. +Job information (status and associated metadata) is stored on the cluster indefinitely. +To delete this information, you may call ``client.delete_job(job_id)`` for any job that is already in a terminal state. See the :ref:`SDK API Reference ` for more details. Dependency Management @@ -210,19 +210,19 @@ If any of these arguments are specified, the entrypoint script will be scheduled The same arguments are also available as options ``--entrypoint-num-cpus``, ``--entrypoint-num-gpus``, ``--entrypoint-memory``, and ``--entrypoint-resources`` to ``ray job submit`` in the Jobs CLI; see :ref:`Ray Job Submission CLI Reference `. -If ``num_gpus`` is not specified, GPUs will still be available to the entrypoint script, but Ray will not provide isolation in terms of visible devices. +If ``num_gpus`` is not specified, GPUs will still be available to the entrypoint script, but Ray will not provide isolation in terms of visible devices. 
To be precise, the environment variable ``CUDA_VISIBLE_DEVICES`` will not be set in the entrypoint script; it will only be set inside tasks and actors that have `num_gpus` specified in their ``@ray.remote()`` decorator. .. note:: Resources specified by ``entrypoint_num_cpus``, ``entrypoint_num_gpus``, ``entrypoint-memory``, and ``entrypoint_resources`` are separate from any resources specified - for tasks and actors within the job. - + for tasks and actors within the job. + For example, if you specify ``entrypoint_num_gpus=1``, then the entrypoint script will be scheduled on a node with at least 1 GPU, but if your script also contains a Ray task defined with ``@ray.remote(num_gpus=1)``, then the task will be scheduled to use a different GPU (on the same node if the node has at least 2 GPUs, or on a different node otherwise). .. note:: - + As with the ``num_cpus``, ``num_gpus``, ``resources``, and ``_memory`` arguments to ``@ray.remote()`` described in :ref:`resource-requirements`, these arguments only refer to logical resources used for scheduling purposes. The actual CPU and GPU utilization is not controlled or limited by Ray. @@ -234,7 +234,7 @@ To be precise, the environment variable ``CUDA_VISIBLE_DEVICES`` will not be set Client Configuration -------------------------------- -Additional client connection options, such as custom HTTP headers and cookies, can be passed to the ``JobSubmissionClient`` class. +Additional client connection options, such as custom HTTP headers and cookies, can be passed to the ``JobSubmissionClient`` class. A full list of options can be found in the :ref:`API Reference `. TLS Verification diff --git a/doc/source/cluster/vms/references/ray-cluster-configuration.rst b/doc/source/cluster/vms/references/ray-cluster-configuration.rst index c7e831943aee..f10dfde25b45 100644 --- a/doc/source/cluster/vms/references/ray-cluster-configuration.rst +++ b/doc/source/cluster/vms/references/ray-cluster-configuration.rst @@ -1158,7 +1158,7 @@ If enabled, Ray will use private IP addresses for communication between nodes. This should be omitted if your network interfaces use public IP addresses. If enabled, Ray CLI commands (e.g. ``ray up``) will have to be run from a machine -that is part of the same VPC as the cluster. +that is part of the same VPC as the cluster. This option does not affect the existence of public IP addresses for the nodes, it only affects which IP addresses are used by Ray. The existence of public IP addresses is @@ -1184,10 +1184,10 @@ controlled by your cloud provider's configuration. .. tab-item:: Azure If enabled, Ray will provision and use a public IP address for communication with the head node, - regardless of the value of ``use_internal_ips``. This option can be used in combination with + regardless of the value of ``use_internal_ips``. This option can be used in combination with ``use_internal_ips`` to avoid provisioning excess public IPs for worker nodes (i.e., communicate among nodes using private IPs, but provision a public IP for head node communication only). If - ``use_internal_ips`` is ``False``, then this option has no effect. + ``use_internal_ips`` is ``False``, then this option has no effect. 
* **Required:** No * **Importance:** Low diff --git a/doc/source/cluster/vms/user-guides/community/slurm.rst b/doc/source/cluster/vms/user-guides/community/slurm.rst index f0c1eb33e012..62602835548f 100644 --- a/doc/source/cluster/vms/user-guides/community/slurm.rst +++ b/doc/source/cluster/vms/user-guides/community/slurm.rst @@ -283,4 +283,3 @@ Here are some community-contributed templates for using SLURM with Ray: .. _`YASPI`: https://github.com/albanie/yaspi .. _`Convenient python interface`: https://github.com/pengzhenghao/use-ray-with-slurm - diff --git a/doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md b/doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md index 2e5975ad8185..8b2bc32047f1 100644 --- a/doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md +++ b/doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md @@ -16,13 +16,13 @@ Another way to prepare the vSphere environment is with VMware Cloud Foundation ( ## Prepare the frozen VM -The vSphere Ray cluster launcher requires the vSphere environment to have a VM in a frozen state for deploying a Ray cluster. This VM has all the dependencies installed and is later used to rapidly create head and worker nodes by VMware's [instant clone](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-853B1E2B-76CE-4240-A654-3806912820EB.html) technology. The details of the Ray cluster provisioning process using frozen VM can be found in this [Ray on vSphere architecture document](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/vsphere/ARCHITECTURE.md). +The vSphere Ray cluster launcher requires the vSphere environment to have a VM in a frozen state for deploying a Ray cluster. This VM has all the dependencies installed and is later used to rapidly create head and worker nodes by VMware's [instant clone](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-853B1E2B-76CE-4240-A654-3806912820EB.html) technology. The details of the Ray cluster provisioning process using frozen VM can be found in this [Ray on vSphere architecture document](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/vsphere/ARCHITECTURE.md). You can follow the vm-packer-for-ray's [document](https://github.com/vmware-ai-labs/vm-packer-for-ray/blob/main/README.md) to use Packer to create and set up the frozen VM, or a set of frozen VMs in which each one will be hosted on a distinct ESXi host in the vSphere cluster. By default, Ray clusters' head and worker node VMs will be placed in the same resource pool as the frozen VM. When building and deploying the frozen VM, there are a couple of things to note: * The VM's network adapter should be connected to the port group or NSX segment configured in the above section. And the `Connect At Power On` check box should be selected. * After the frozen VM is built, a private key file (`ray-bootstrap-key.pem`) and a public key file (`ray_bootstrap_public_key.key`) will be generated under the HOME directory of the current user. If you want to deploy Ray clusters from another machine, these files should be copied to that machine's HOME directory to be picked up by the vSphere cluster launcher. -* An OVF will be generated in the content library. 
If you want to deploy Ray clusters in other vSphere deployments, you can use the content library's [publish and subscribe](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-254B2CE8-20A8-43F0-90E8-3F6776C2C896.html) feature to sync the frozen VM's template to another vSphere environment. Then you can leverage Ray Cluster Launcher to help you create a single frozen VM or multiple frozen VMs firstly, then help you create the Ray cluster, check the [document](https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html?highlight=yaml#vsphere-config-frozen-vm) for how to compose the yaml file to help to deploy the frozen VM(s) from an OVF template.
+* An OVF will be generated in the content library. If you want to deploy Ray clusters in other vSphere deployments, you can use the content library's [publish and subscribe](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-254B2CE8-20A8-43F0-90E8-3F6776C2C896.html) feature to sync the frozen VM's template to another vSphere environment. Then you can use the Ray cluster launcher to first create a single frozen VM (or multiple frozen VMs) and then create the Ray cluster. See the [document](https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html?highlight=yaml#vsphere-config-frozen-vm) for how to compose the YAML file that deploys the frozen VM(s) from an OVF template.

## Install Ray cluster launcher