
AWS do Ray (aws-do-ray)
Create and manage your Ray clusters on Amazon EKS using the do-framework

Fig. 1 - Ray on EKS cluster sample



Overview

The aws-do-ray project aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon SageMaker HyperPod. Following the principles of the do-framework and based on the Depend on Docker template, it uses Docker to containerize all tools necessary to deploy and manage Ray clusters, jobs, and services. The aws-do-ray container shell is equipped with intuitive action scripts and comes pre-configured with convenient shortcuts which save extensive typing and increase productivity. This project provides a streamlined solution for administrators and developers, enabling them to focus on the task at hand rather than on infrastructure management. In summary, aws-do-ray is a simple, flexible, and universal DevOps solution for Ray workloads on AWS.

Prerequisites

The only prerequisites needed to run this project are:

  • Docker, used to build and run the aws-do-ray container
  • AWS CLI credentials with access to your target AWS account
  • An existing Amazon EKS or SageMaker HyperPod cluster to connect to

Usage

A typical workflow for the aws-do-ray project is described below.

Fig.2 - Typical workflow for the aws-do-ray project



To use the project, clone and configure it, then run the ./build.sh, ./run.sh, and ./exec.sh scripts to open the aws-do-ray shell. Execute the ./setup-dependencies.sh script, then from the raycluster directory execute ./raycluster-config.sh and ./raycluster-create.sh. Once the cluster is created, examples can be executed from the jobs folder; submit one of the jobs using ./job-submit.sh <job-name>.

# Build and run container
git clone https://github.com/aws-samples/aws-do-ray
cd aws-do-ray
./config.sh
./build.sh
./run.sh
./exec.sh

# Create Ray cluster
./setup-dependencies.sh
cd raycluster
./raycluster-config.sh
./raycluster-create.sh
./raycluster-status.sh

# Run job
cd jobs
./job-submit.sh quickstart
./job-list.sh
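
The job scripts in the jobs folder are plain Python programs that use the Ray API. As an illustration only, not the actual quickstart code shipped with the project, a minimal script of that shape might look like this:

import ray

ray.init()  # inside the cluster this connects to the running Ray head

@ray.remote
def square(x: int) -> int:
    return x * x

# fan a few tasks out across the cluster and gather the results
print(ray.get([square.remote(i) for i in range(5)]))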

Configure

To configure your AWS client, execute aws configure within the container shell, or outside the container if you have AWS CLI v2.x installed.

If you are not connected to an EKS cluster yet, follow the instructions here to generate a ~/.kube/config file. The project will automatically use the current Kubernetes context to retrieve the name of the EKS cluster it should connect to. Your current EKS cluster context can be displayed with the following command: kubectl config current-context.

All configuration settings of the aws-do-ray project are centralized in its .env file. To review or change any of the settings, simply execute ./config.sh. The project automatically sets all variables, but you can manually override any of them by setting your preferred value in the .env file.

  • AWS_REGION should match the AWS Region where the cluster is deployed.
  • The AWS_EKS_CLUSTER setting should match the name of your existing EKS Cluster.
  • The AWS_EKS_HYPERPOD_CLUSTER setting should match the name of your existing EKS HyperPod cluster.
  • The CLUSTER_TYPE setting must be either "eks" or "hyperpod", depending on the type of cluster you are using.

To configure credentials for your AWS client, run aws configure. Credentials you configure either on the host or in the container will be mounted into the aws-do-ray container according to the VOL_MAP setting in .env. If you set the environment variables AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN, they will be carried over into the aws-do-ray container when the ./run.sh script is executed.
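
To verify from Python that the credentials available in the container are valid, assuming boto3 is present in the container image, a quick check could look like this:

import boto3

# prints the account ID, user ID, and ARN of the identity currently in use
print(boto3.client("sts").get_caller_identity())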

Build

This project follows the Depend on Docker template to build a container including all needed tools and utilities for the creation and management of Ray clusters. Please execute the ./build.sh script to create the aws-do-ray container image and tag it using the registry and version tag specified in the project configuration. If desired, the image name or registry address can be modified in the project configuration file .env. A pre-built aws-do-ray container is available on the AWS public registry.

Run

The ./run.sh script starts the project container. If you would like to run the pre-built aws-do-ray container, you can configure the project with REGISTRY=public.ecr.aws/hpc-cloud/aws-do-ray prior to executing ./run.sh, or just execute the following command:

docker run --rm -it -v ${HOME}/.aws:/root/.aws -v ${HOME}/.kube:/root/.kube --workdir /ray public.ecr.aws/hpc-cloud/aws-do-ray bash

Status

To check the status of the container, execute ./status.sh. If the container is in the Exited state, it can be started with ./start.sh.

Exec

After the container is started, use the ./exec.sh script to open a bash shell in the container. All necessary tools to allow creation, management, and operation of Ray are available in this shell.

Deploy KubeRay operator

Once you have opened the aws-do-ray shell, you will be dropped into the /ray directory, where you will find the ./setup-dependencies.sh script. This script creates a kuberay namespace and deploys the kuberay-operator pod into it. It then dynamically provisions an FSx for Lustre volume to serve as a shared file system for your Ray cluster. Upon successful deployment, you will have the kuberay-operator pod running in the kuberay namespace and a bound persistent volume claim (PVC) in your current namespace. To check the state of the kuberay-operator pod, use the command kubectl -n kuberay get pods, and to check the state of your PVC, run kubectl get pvc.

The KubeRay operator

The KubeRay operator gets deployed on the EKS cluster through the ./setup-dependencies.sh script. KubeRay creates the following Custom Resource Definitions (CRDs): RayCluster, RayJob, and RayService.

  1. RayCluster: primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The RayCluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS cluster.

  2. RayJob: represents a single executable job that runs on a RayCluster. It is a higher-level abstraction used to submit tasks or batches of tasks that should be executed by the RayCluster.

  3. RayService: Kubernetes resource that enables long-running Ray applications. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.
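
Because these CRDs are ordinary Kubernetes resources, they can be inspected with any Kubernetes client in addition to kubectl. A minimal sketch using the official Python client, assuming the ray.io/v1 API version installed by recent KubeRay releases and resources in the default namespace:

from kubernetes import client, config

config.load_kube_config()  # uses the current context from ~/.kube/config
api = client.CustomObjectsApi()

# list RayCluster custom resources in the default namespace
clusters = api.list_namespaced_custom_object(
    group="ray.io", version="v1", namespace="default", plural="rayclusters"
)
for item in clusters.get("items", []):
    print(item["metadata"]["name"])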

Fig.3 - Types of Ray custom resources in Kubernetes



The KubeRay operator relies on the Kubernetes API and works on EKS as well as HyperPod clusters with EKS support. A diagram showing deployment of Ray on SageMaker HyperPod is shown below.

Fig.4 - KubeRay operator deployment on SageMaker HyperPod EKS cluster



Distributed training jobs

Additional information about running distributed training jobs:

  1. Per the Ray documentation, specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it is required for multi-node clusters. Using a local path will raise an error during checkpointing for multi-node clusters. This is why the ./setup-dependencies.sh script creates an FSx for Lustre volume. For other deployments, like Mountpoint for Amazon S3, please refer to the Deploy scripts section of this document. Once you have a shared storage path, use storage_path in the RunConfig of your Python training scripts to save checkpoints, logs, and model artifacts. By default, it points to your new FSx for Lustre mount.

  2. Within the Python code provided, you can also set num_workers to an integer (the number of Ray workers you are using) and use_gpu to a boolean. The defaults are num_workers=2 and use_gpu=True. A minimal sketch of how these settings fit together is shown after this list.
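
As an illustration only, not the exact code shipped in the jobs folder, a Ray Train sketch that wires storage_path, num_workers, and use_gpu together might look like this, assuming the shared FSx for Lustre volume is mounted at /fsx:

import ray
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # your PyTorch training loop goes here; report metrics and
    # checkpoints with ray.train.report(...)
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),  # one Ray worker per GPU pod
    run_config=RunConfig(
        name="dt-pytorch",
        storage_path="/fsx/checkpoints",  # shared storage path (assumed FSx mount point)
    ),
)
result = trainer.fit()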

Create a RayCluster

Within the /ray directory, you will find the /raycluster directory. This directory contains the following scripts:

  • ./raycluster-create.sh : this script creates the ray cluster specified in the raycluster-template.yaml file.
  • ./raycluster-delete.sh : this script deletes the ray cluster specified in the raycluster-template.yaml file.
  • ./raycluster-pods.sh : this script allows you to see the currently running pods (or nodes) of your Ray cluster.
  • ./raycluster-status.sh : this script retrieves the status of your current raycluster.
  • You can run [re] to expose the Ray cluster dashboard on port 8265, and [rh] to hide it. This is also done automatically when needed by other scripts.
  • ./raycluster-config.sh : run this to edit the raycluster-template.yaml, or simply open the raycluster-template.yaml in your favorite editor.
  • raycluster-template.yaml : a default ray cluster configuration with every option you can have in a ray cluster. "Batteries included but swappable".
  • raycluster-template-autoscaler.yaml : the same ray cluster configuration but with the ray autoscaler enabled.
  • ./jobs/job-submit.sh <job> : this script allows you to submit a Python script as a job. Place your code in the /jobs folder of the repo, in a directory named after the script you want to execute, with that script inside the directory. Alternatively, you can submit a script that resides on a file system attached to your Ray pods.
    • If your script is in the /jobs folder, it will be submitted via the Ray job submission SDK (the dashboard must be exposed via re) or directly through the head pod. Just run ./job-submit.sh <script name>, e.g. ./job-submit.sh dt-pytorch. A sketch of the underlying SDK call is shown after this list.
    • If your script is on a file system attached to your Ray pods, you must specify the directory that the script is in, relative to your head pod. Run ./job-submit.sh <script name> <directory>, e.g. ./job-submit.sh dt-pytorch fsx/code/dt-pytorch, where the dt-pytorch.py file is located in the directory fsx/code/dt-pytorch.
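
For reference, when the dashboard is exposed on localhost:8265 (via re), a job can also be submitted programmatically with the Ray job submission SDK, which is one of the paths job-submit.sh uses. A minimal sketch, with the entrypoint and working directory as placeholder values:

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")  # dashboard exposed via 're'
job_id = client.submit_job(
    entrypoint="python dt-pytorch.py",                 # command to run inside the cluster
    runtime_env={"working_dir": "./jobs/dt-pytorch"},  # local directory shipped with the job (placeholder path)
)
print(client.get_job_status(job_id))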

RayCluster template

For everything you need to know about the details of a RayCluster configuration, please refer to the comments in the template, as well as this doc. But as a quick reference, here are the main concepts in the template:

  • metadata: name:
    • This is where you can name your raycluster.
  • nodeSelector in both headGroupSpec and workerGroupSpecs:
    • This is where you can specify which nodes your head pod and worker pods get assigned to. Preferably assign the worker group pods to the nodes with GPUs.
  • replicas
    • This defines the minimum, maximum, and desired number of worker pods in your RayCluster.
  • containers: resources: limits/requests:
    • These fields are under both headGroupSpec and workerGroupSpecs and these values set resource limits and requests for your pods. Please confirm your node resource capabilities before setting these values.
  • containers: image:
    • This is the container image each pod runs. It is best practice that the head pod and worker pods use the same container image, e.g. "rayproject/ray-ml:latest".
  • containers: env: name: (AWS KEYS)
    • After deploying your kubectl secret by running ./deploy/kubectl-secrets/kubectl-secret-keys.sh, your Ray pods will have the AWS credentials they need to access other buckets, file systems, etc. If this is needed, please uncomment this section in the template.
  • volumeMounts and volumes under headGroupSpec and workerGroupSpecs
    • This is where you can mount volumes such as S3, EFS, or FSx for Lustre onto your pods. This is needed for multi-node distributed training jobs.

Ray dashboard

In order to access the Ray Dashboard, the dashboard port of the Ray head service needs to be exposed outside the cluster. In a production deployment an Application Load Balancer (ALB) is typically used, however this requires a DNS domain registration and a matching SSL certificate.

For an easy way to expose the Ray Dashboard, we can use kubectl port-forward. To start the port-forward, simply execute ray-expose.sh or re. To stop the port-forward, simply execute ray-hide.sh or rh.

If you are on a machine with its own browser, just navigate to http://localhost:8265 to open the Ray Dashboard.

Fig.5 - Ray Dashboard Overview



Fig.6 - Ray Dashboard Jobs



Fig.7 - Ray Dashboard Metrics



Create a RayJob

Within the /ray directory, you will find the /rayjob directory. Within this directory, you will find these scripts:

RayJob documentation

You can find RayJob Documentation here

Create a RayService

Within the /ray directory, you will find the /rayservice directory, which contains the following scripts:

Ray Serve QuickStart

RayServe Quickstart on Kubernetes can be found here

Serve Config V2 Section of RayServe Template

This section defines the configuration for Ray Serve applications. More details here.

applications: A list of applications to be deployed.

  • name: The name of the application, in this case, image_classifier.
  • import_path: The import path for the application's module, serve-train-images.app.
  • route_prefix: The route prefix for accessing the application, /classify.
  • runtime_env: Specifies the runtime environment for the application.
    • working_dir: The working directory for the application, specified as an S3 path.
    • pip: A list of Python packages to be installed in the runtime environment.
  • deployments: A list of deployments for the application.
    • name: The name of the deployment, ImageClassificationModel.
    • num_replicas: The number of replicas for the deployment, set to 1.
    • ray_actor_options: Options for the Ray actors.
      • num_cpus: The number of CPUs allocated for each actor, set to 1.
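
For context, the deployment that such a config refers to is ordinary Ray Serve Python code. A minimal, hypothetical sketch of what the module and app attribute referenced by the import_path above could contain:

from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 1})
class ImageClassificationModel:
    def __init__(self):
        # load the model here (placeholder)
        self.model = None

    async def __call__(self, request: Request) -> dict:
        # run inference on the request payload (placeholder)
        return {"result": "not implemented"}

# 'app' is the attribute that import_path points to
app = ImageClassificationModel.bind()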

Deploy scripts

Prometheus & Grafana

The aws-do-ray project provides an example setup to monitor Ray clusters in Kubernetes using Prometheus & Grafana.

Action scripts are located in the /ray/deploy/prometheus folder.

  • ./deploy-prometheus.sh : deploys all prometheus/grafana pods in order to scrape your Ray pod metrics
  • ./expose-prometheus.sh : port-forwards the Prometheus/Grafana dashboard so you can open the UI locally. In order to see data on your Ray dashboard, please follow these steps:
    • Sign in with username: admin, password: prom-operator
    • Import the Grafana dashboard: click "Dashboards" → "New" → "Import" → "Upload JSON file" and select ray/deploy/prometheus/kuberay/config/grafana/default_grafana_dashboard.json
    • After reloading the page in your browser, you should be able to see the Grafana metrics on the Ray dashboard


Fig.8 - Ray Dashboard Prometheus & Grafana Metrics

Kubectl secrets

The /ray/deploy/kubectl-secrets folder contains the following script:

./kubectl-secret-keys.sh : creates a kubectl secret for cases when Python code needs access to your AWS credentials.

S3 Mountpoint

The /ray/deploy/s3-mountpoint folder contains the following scripts:

./deploy.sh: creates an IAM OIDC identity provider for your cluster, creates an IAM policy, creates an IAM role, and installs the Mountpoint for Amazon S3 CSI driver. Please ensure you have either exported S3_BUCKET_NAME as an environment variable or manually replaced $S3_BUCKET_NAME in the ./deploy.sh script.

./s3-create.sh: creates a PV and a PVC which you can then mount to your Ray pods within the "volumes" section of your raycluster template.

FSx for Lustre

Related scripts are found in the /ray/deploy/fsx folder.

Please ensure your "AWS_EKS_CLUSTER" and "AWS_REGION" are set in your .env file. If not, you can manually set these variables within the deploy.sh code.

./deploy.sh: creates an IAM OIDC identity provider for your cluster, deploys FSx for Lustre CSI driver, and creates an IAM role bound to the service account used by the driver.

The Amazon FSx for Lustre CSI driver presents you with two options for provisioning a file system.

Dynamic provisioning: This option leverages Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with desired storage specifications. The CSI Driver automatically provisions the FSx file system for you based on the PVC request. This allows for easier scaling and eliminates the need to manually create file systems.

Static provisioning: In this method, you manually create the FSx file system before using the CSI driver. You'll need to configure details like subnet ID and security groups for the file system. Then, you can use the driver to bind the PV to a PVC and mount this pre-created file system within your container as a volume.

Dynamic Provisioning

The setup-dependencies.sh script creates a dynamic FSx for Lustre volume, but this can also be done independently.

If you would like to use dynamic provisioning, ensure you have your desired configuration in dynamic-storageclass.yaml, including your subnetId and securityGroupIds.

These values are retrieved once the container is built and set as environment variables, but if you would like to alter them, you can change $SUBNET_ID and $SECURITYGROUP_ID in dynamic-storageclass.yaml.

  • subnetId - The subnet ID that the FSx for Lustre filesystem should be created inside. Using the $SUBNET_ID environment variable, we are referencing the same private subnet that was used for EKS or EKS HyperPod cluster creation.

  • securityGroupIds - A list of security group IDs that should be attached to the filesystem. Using the $SECURITY_GROUP environment variable, we are referencing the same security group that was used for EKS or EKS HyperPod cluster creation.

Once the storage class has been configured, you can run ./dynamic-create.sh to create an FSx for Lustre volume.

Static Provisioning

If you would like to use static provisioning, ensure your volumeHandle: is set with your FSx file system ID, dnsname: is set with your FSx file system DNS name, and your mountname: is set with your FSx file system mount name in static-pv.yaml. Also ensure that your fileSystemId: is set with your FSx file system ID, subnetId: is set with your subnet ID, and your securityGroupIds: are set with your security group ID(s) within static-storageclass.yaml.

Running ./static-create.sh creates a PV and a PVC which you can then use to mount to your Ray pods within the "volumes" section in your raycluster template.

Container command reference

The project home folder offers a number of additional scripts for management of the aws-do-ray container.

  • ./config.sh – configure aws-do-ray project settings interactively
  • ./build.sh – build aws-do-ray container image
  • ./push.sh – push aws-do-ray container image to configured registry
  • ./pull.sh – pull aws-do-ray container image from a configured existing registry
  • ./run.sh – run aws-do-ray container
  • ./status.sh – show the status of the aws-do-ray container
  • ./logs.sh – show logs of the running aws-do-ray container
  • ./start.sh – start the aws-do-ray container if it is currently in “Exited” status
  • ./exec.sh – execute a command inside the running aws-do-ray container, the default command is bash
  • ./stop.sh – stop and remove the aws-do-ray container
  • ./test.sh – run container unit tests

Troubleshooting

  • Worker pods can't be scheduled to worker nodes

    • This can be due to taints present on your nodes or tolerations missing from your pods. Make sure the worker node group contains the taints that are specified as tolerations in the Ray cluster YAML. Alternatively, you can take out the taints and tolerations altogether.
  • Error: You must be logged in to the server (Unauthorized)

    • Ensure you are connected to the right AWS account; run aws sts get-caller-identity in the terminal to verify your current identity.
    • Ensure you are connected to the right EKS cluster and region; run kubectl config current-context to check the current context and aws eks update-kubeconfig --region <region-code> --name <my-cluster> to change the current context if needed.
  • EKS API Server Unauthorized Error (trouble accessing the Ray cluster from another EC2 instance)

  • [An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid]

    • You may need to run [unset AWS_PROFILE] to rely on the AWS credentials provided through the environment variables rather than the default profile in ~/.aws/credentials or ~/.aws/config.
    • You may need to delete ACCESS_TOKEN from your ~/.aws/credentials file if your token has expired.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the MIT-0 License. See the LICENSE file.

Disclaimer

This sample code should not be used in production accounts, on production workloads, or on production or other critical data. You are responsible for testing, securing, and optimizing the sample code as appropriate for production-grade use based on your specific quality control practice and standards.

References

Credits

  • Mark Vinciguerra - @mvincig
  • Alex Iankoulski - @iankouls
  • Florian Stahl - @flostahl
  • Milena Boytchef - @boytchef