Ray on GKE

This repository contains a Terraform template for running Ray on Google Kubernetes Engine. We've also included some example notebooks (applications/ray/example_notebooks), including one that serves a GPT-J-6B model with Ray AIR (adapted from the original Ray AIR example notebook).

This module assumes you already have a functional GKE cluster. If not, follow the instructions under infrastructure/README.md to install a Standard or Autopilot GKE cluster, then follow the instructions in this module to install Ray.

This module deploys the following, once per user:

  • User namespace
  • Kubernetes service accounts
  • KubeRay cluster
  • Prometheus monitoring
  • Logging container

Installation

Preinstall the following on your computer:

  • Terraform
  • gcloud (the Google Cloud CLI)
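You can quickly confirm that both tools are installed and that gcloud is authenticated; a minimal sketch:

terraform version
gcloud version
gcloud auth login                         # interactive login, if not already done
gcloud auth application-default login     # credentials used by Terraform's Google provider
gcloud config set project <your GCP project>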

NOTE: Terraform keeps state metadata in a local file called terraform.tfstate. Deleting this file may prevent some resources from being cleaned up correctly, even if you delete the cluster. We suggest running terraform destroy before reapplying or reinstalling.
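For reference, teardown mirrors the apply command used in step 5 below; a sketch, assuming you applied with workloads.tfvars:

terraform destroy --var-file workloads.tfvars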

  1. If needed, git clone https://github.com/GoogleCloudPlatform/ai-on-gke

  2. cd applications/ray

  3. Find the name and location of the GKE cluster you want to use. Run gcloud container clusters list --project=<your GCP project> to see all available clusters. Note: if you created the GKE cluster via the infrastructure repo, you can get the cluster info from platform.tfvars.

  4. Edit workloads.tfvars with your environment specific variables and configurations.

  5. Run terraform init && terraform apply --var-file workloads.tfvars
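Put together, the flow looks roughly like this (a sketch; replace the placeholders with your own project and cluster values):

git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/applications/ray

# find the name and location of the target cluster
gcloud container clusters list --project=<your GCP project>

# edit workloads.tfvars (project, cluster name, location, namespace, ...), then:
terraform init
terraform apply --var-file workloads.tfvars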

Using Ray with Ray Jobs API

  1. To connect to the remote GKE cluster with the Ray API, set up access to the Ray dashboard. Run the following command to port-forward to the Ray head service:
kubectl port-forward -n <namespace> service/example-cluster-kuberay-head-svc 8265:8265

Then open the dashboard at the following URL:

http://localhost:8265
  2. Set the RAY_ADDRESS environment variable: export RAY_ADDRESS="http://127.0.0.1:8265"

  3. Create a working directory containing a job file ray_job.py (a minimal end-to-end sketch follows this list).

  4. Submit the job: ray job submit --working-dir %your_working_directory% -- python ray_job.py

  5. Note the job submission ID from the output, e.g.: Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
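Putting steps 1-5 together, a minimal end-to-end sketch (the job script below is a placeholder, not a file shipped with this repo):

export RAY_ADDRESS="http://127.0.0.1:8265"

mkdir ray_working_dir && cd ray_working_dir

# write a minimal placeholder job script
cat > ray_job.py <<'EOF'
import ray

ray.init()                        # connects to the running cluster inside the job
print(ray.cluster_resources())    # print the cluster's available resources
EOF

ray job submit --working-dir . -- python ray_job.py

# check on the job using the submission ID printed above
ray job status <submission_id>
ray job logs <submission_id>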

See Ray docs for more info.

Using Ray with Jupyter

If you want to connect to the Ray cluster via a Jupyter notebook or to try the example notebooks in the repo, please first install JupyterHub via applications/jupyter/README.md.

Logging and Monitoring

This repository comes with out-of-the-box integrations with Google Cloud Logging and Managed Prometheus for monitoring. To see your Ray cluster logs:

  1. Open the Cloud Console and go to Logging.
  2. If you submitted the job from a Jupyter notebook, use the following query:
resource.type="k8s_container"
resource.labels.cluster_name=%CLUSTER_NAME%
resource.labels.pod_name=%RAY_HEAD_POD_NAME%
resource.labels.container_name="fluentbit"
  3. If using the Ray Jobs API:
     (a) Note the job submission ID returned by ray job submit, e.g.:
         Job submission: ray job submit --working-dir /Users/imreddy/ray_working_directory -- python script.py
         Job submission ID: Job 'raysubmit_kFWB6VkfyqK1CbEV' submitted successfully
     (b) Get the namespace name from user/variables.tf or kubectl get namespaces
     (c) Use the following query to search for the job logs:
resource.labels.namespace_name=%NAMESPACE_NAME%
jsonPayload.job_id=%RAY_JOB_ID%
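The same query can also be run from the command line with gcloud; a sketch, substituting your namespace and job ID:

gcloud logging read \
  'resource.labels.namespace_name="%NAMESPACE_NAME%" AND jsonPayload.job_id="%RAY_JOB_ID%"' \
  --project=<your GCP project> --limit=100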

To see monitoring metrics:

  1. Open the Cloud Console and go to Metrics Explorer.
  2. In "Target", select "Prometheus Target" and then "Ray".
  3. Select the metric you want to view, and then click "Apply".