[k8s] Support autoscaling TPU node pools #4633

Merged: 8 commits into master from k8s_gke_tpu_autoscaling on Feb 1, 2025

Conversation

romilbhardwaj
Collaborator

Adds support for autoscaling TPU node pools on GKE.

Currently, only single-host TPU v5 is supported.

Users must set autoscaler: gke and a large provision_timeout (in seconds) in their SkyPilot config, so that provisioning waits long enough for the node pool to scale up:

# ~/.sky/config.yaml
kubernetes:
  autoscaler: gke
  provision_timeout: 900

Tested on a scale-from-zero node pool with ct5lp-hightpu-1t (TPU v5e, 1x1):

  1. Script to create a GKE cluster with an autoscaling TPU node pool:

     gcloud beta container --project "sky-dev-465" clusters create "tputest" \
       --region "us-west1" --tier "standard" --no-enable-basic-auth \
       --cluster-version "1.31.4-gke.1256000" --release-channel "regular" \
       --machine-type "e2-standard-8" --image-type "COS_CONTAINERD" \
       --disk-type "pd-balanced" --disk-size "100" \
       --metadata disable-legacy-endpoints=true \
       --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
       --num-nodes "1" --logging=SYSTEM,WORKLOAD \
       --monitoring=SYSTEM,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET \
       --enable-ip-alias --network "projects/sky-dev-465/global/networks/default" \
       --subnetwork "projects/sky-dev-465/regions/us-west1/subnetworks/default" \
       --no-enable-intra-node-visibility --default-max-pods-per-node "110" \
       --enable-ip-access --security-posture=standard \
       --workload-vulnerability-scanning=disabled \
       --no-enable-master-authorized-networks --no-enable-google-cloud-access \
       --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
       --enable-autoupgrade --enable-autorepair \
       --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
       --binauthz-evaluation-mode=DISABLED --enable-managed-prometheus \
       --enable-shielded-nodes && \
     gcloud beta container --project "sky-dev-465" node-pools create "tpupool" \
       --cluster "tputest" --region "us-west1" \
       --machine-type "ct5lp-hightpu-1t" --image-type "COS_CONTAINERD" \
       --disk-type "pd-balanced" --disk-size "100" \
       --metadata disable-legacy-endpoints=true \
       --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
       --num-nodes "0" --enable-autoscaling \
       --total-min-nodes "0" --total-max-nodes "2" \
       --location-policy "ANY" --enable-autoupgrade --enable-autorepair \
       --max-surge-upgrade 1 --max-unavailable-upgrade 0

  2. Run sky launch -c tpu --cloud kubernetes --gpus tpu-v5-lite-podslice
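
For repeatable runs, the sky launch command above can also be written as a task file. This is a minimal sketch and not part of the PR: the filename tpu.yaml and the run command are placeholders chosen for illustration, and the accelerator name is the one passed to --gpus above with an assumed count of 1.

```yaml
# tpu.yaml (hypothetical filename, not from this PR)
resources:
  cloud: kubernetes
  # Same accelerator as in the sky launch command; a count of 1 is assumed.
  accelerators: tpu-v5-lite-podslice:1

# Placeholder workload; replace with the actual job.
run: |
  echo "Running on a TPU node"
```

Launching it with sky launch -c tpu tpu.yaml should request the same resources as the command-line form.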

Collaborator

Michaelvll left a comment


Thanks @romilbhardwaj! LGTM.

sky/provision/kubernetes/utils.py (review thread: outdated, resolved)
romilbhardwaj merged commit 269dfb1 into master on Feb 1, 2025
18 checks passed
romilbhardwaj deleted the k8s_gke_tpu_autoscaling branch on February 1, 2025 at 02:48