Ray TPU Webhook Autoscaling Support #740

Merged (6 commits) on Jul 19, 2024

ryanaoleary
Collaborator

This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.
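
For readers following the thread, here is a minimal sketch of what injecting the variable at Pod admission could look like. It builds the JSON Patch operations a mutating webhook might return for one Pod. The `patchOp`, `envVar`, and `container` types and the `hostnamesPatch` helper are hypothetical stand-ins written for this illustration, not the webhook's actual code, and applying the variable to every container is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON Patch operation, the format a mutating
// admission webhook returns to the API server.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// envVar mirrors the name/value shape of a container env entry.
type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// container is a hypothetical, trimmed-down stand-in for a Pod container.
type container struct {
	Name string
	Env  []envVar
}

// hostnamesPatch builds operations that add TPU_WORKER_HOSTNAMES to every
// container of a single Pod (assumption: all containers receive it).
func hostnamesPatch(containers []container, hostnames string) []patchOp {
	env := envVar{Name: "TPU_WORKER_HOSTNAMES", Value: hostnames}
	ops := make([]patchOp, 0, len(containers))
	for i, c := range containers {
		if len(c.Env) == 0 {
			// No env list yet: create the array with a single entry.
			path := fmt.Sprintf("/spec/containers/%d/env", i)
			ops = append(ops, patchOp{Op: "add", Path: path, Value: []envVar{env}})
			continue
		}
		// "-" appends to the end of an existing array.
		path := fmt.Sprintf("/spec/containers/%d/env/-", i)
		ops = append(ops, patchOp{Op: "add", Path: path, Value: env})
	}
	return ops
}

func main() {
	// Placeholder hostnames; the real values come from the slice's DNS names.
	ops := hostnamesPatch([]container{{Name: "ray-worker"}}, "host-0.svc,host-1.svc")
	out, err := json.Marshal(ops)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the patch is computed per Pod, Pods that the autoscaler adds later go through the same path without any change to the RayCluster CR.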

This PR has been tested as follows:

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@andrewsykim
Collaborator

> This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.

Will existing pods need to have the TPU_WORKER_HOSTNAMES env var updated when new replicas are added or does it only apply for new pods added by the autoscaler?

@ryanaoleary
Collaborator Author

> > This PR adds support for autoscaling RayClusters by generating TPU_WORKER_HOSTNAMES when intercepting each Pod, rather than when intercepting the RayCluster CR.
>
> Will existing pods need to have the TPU_WORKER_HOSTNAMES env var updated when new replicas are added or does it only apply for new pods added by the autoscaler?

We don't need to update existing Pods. We assume that TPU PodSlices are scaled atomically, so the TPU_WORKER_HOSTNAMES value we add to each Pod includes the DNS hostnames of all the Pods in the slice, even those that haven't been created yet.
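
As a rough illustration of why existing Pods need no update, the sketch below derives the full hostname list from slice-level inputs alone, so every Pod in a slice gets the same value whether it is admitted first or last. The naming scheme `<group>-<replica>-<host>.<service>` and the `sliceHostnames` helper are hypothetical; the thread does not show the webhook's actual hostname format.

```go
package main

import (
	"fmt"
	"strings"
)

// sliceHostnames returns a comma-separated list of DNS names, one per host
// in a TPU slice. The naming scheme is a hypothetical placeholder, not the
// webhook's real format.
func sliceHostnames(group string, replicaIndex, numOfHosts int, service string) string {
	names := make([]string, 0, numOfHosts)
	for host := 0; host < numOfHosts; host++ {
		names = append(names, fmt.Sprintf("%s-%d-%d.%s", group, replicaIndex, host, service))
	}
	return strings.Join(names, ",")
}

func main() {
	// The list depends only on the slice's identity and size, not on which
	// Pods already exist, so it can be computed before the slice is complete.
	fmt.Println(sliceHostnames("tpu-group", 1, 4, "headless-svc"))
}
```

This only holds under the atomic-scaling assumption stated above: if a slice could grow or shrink host by host, a list computed at admission time could go stale.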

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@andrewsykim
Collaborator

andrewsykim commented Jul 19, 2024

I think this is good to merge, but let's make sure #723 is also merged before cutting a new tag with the autoscaling changes.

@ryanaoleary ryanaoleary merged commit dc3a615 into GoogleCloudPlatform:main Jul 19, 2024
5 checks passed
ryanaoleary added a commit to ryanaoleary/ai-on-gke that referenced this pull request Jul 19, 2024
* Generate hostnames at Pod creation
* Should not fatal log in deletePod
* deletePod admission always succeeds
* Remove unused tests make command
* Update tests and add error checking
* Just return an error instead of logging

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary added a commit that referenced this pull request Jul 25, 2024
* Generate hostnames at Pod creation
* Update tests and add error checking
* Make webhook stateless in between mutate calls
* Formatting changes
* Fix bug causing incorrect IDs
* Add cluster role and log formatting changes
* Filter pods by Ray worker group label
* Vulnerability fixes
* Better names and add ServiceAccount
* Change version back to v1.1
* Change implementation to use PodInformer
* Use PodLister
* updateSliceToWorkerIDs returns error
* Use mutex lock in updateSliceToWorkerIDs
* Update unit tests and fix comments
* Remove global client var
* Just return err instead of logging
* TODO comment
* Lock when reading from shared sliceToWorkerIDs mapping
* Switch to using RWMutex
* Ray TPU Webhook Autoscaling Support (#740)
  * Generate hostnames at Pod creation
  * Should not fatal log in deletePod
  * deletePod admission always succeeds
  * Remove unused tests make command
  * Update tests and add error checking
  * Just return an error instead of logging
* Generate hostnames at Pod creation
* Update tests and add error checking
* Close stop channel on webhook termination
* Refactor webhook to avoid using global vars
* Fix comments
* Change service account name
* Return BadRequest if invalid kind
* Fix comments
* Change error messages
* Fatal log in main
* Update function comments
* Refactor to minimize indentations
* Change sliceToWorkerIDs nil check to use len
* Write http.Error to header
* Don't fatal log in validateRayCluster
* Check for nil admission request
* Add doc comment
* Update expected errors
* Better getNextWorkerID logic
* Update replicaIndex and nextWorkerID tests
* Refactor webhook unit tests
* Create numOfHosts pods for Pod List
* Log admission request object name
* Fix nits and go vet output
* Initial cloudbuil commit
* Fix vet command
* Update cloudbuild
* Fix cloudbuild errors
* Add dir
* Remove arg
* Change to bash command
* increase timeout time
* Fix validateRayCluster test
* Fix nits for cloudbuild
* Break early in validateRayCluster
* Remove unnecessary args from validateRayCluster test
* Change break to continue
* Remove unused vars from webhook tests and add edge cases
* Update helm chart

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
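
Several of the commits above mention a shared sliceToWorkerIDs mapping guarded by an RWMutex and a getNextWorkerID helper. The sketch below shows one way such a structure could work; it is not the webhook's implementation. The `sliceWorkerIDs` type, its method names, and the "lowest unused ID" policy are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// sliceWorkerIDs is a hypothetical tracker of the worker IDs in use per
// slice. The RWMutex lets read-only callers proceed concurrently.
type sliceWorkerIDs struct {
	mu  sync.RWMutex
	ids map[string]map[int]bool
}

func newSliceWorkerIDs() *sliceWorkerIDs {
	return &sliceWorkerIDs{ids: make(map[string]map[int]bool)}
}

// reserveNext picks the lowest unused ID for the slice and records it.
// Finding and recording happen under one write lock so two concurrent
// admissions cannot receive the same ID.
func (s *sliceWorkerIDs) reserveNext(slice string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.ids[slice] == nil {
		s.ids[slice] = make(map[int]bool)
	}
	next := 0
	for s.ids[slice][next] {
		next++
	}
	s.ids[slice][next] = true
	return next
}

// release frees an ID, for example when its Pod is deleted.
func (s *sliceWorkerIDs) release(slice string, id int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.ids[slice], id)
}

// count only reads the map, so a read lock is sufficient.
func (s *sliceWorkerIDs) count(slice string) int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.ids[slice])
}

func main() {
	ids := newSliceWorkerIDs()
	a := ids.reserveNext("slice-0")
	b := ids.reserveNext("slice-0")
	ids.release("slice-0", a)
	c := ids.reserveNext("slice-0") // reuses the ID freed above
	fmt.Println(a, b, c, ids.count("slice-0"))
}
```

Reserving under a single write lock, rather than reading the next ID and writing it in two steps, is the design choice that avoids duplicate IDs when several Pods are admitted at once.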