Support for Multiple Separate TPU Worker Groups per RayCluster #467

Merged 5 commits on Apr 5, 2024

@ryanaoleary (Collaborator) commented Mar 27, 2024

This PR adds support for multiple worker groups that request TPUs (both single-host and multi-host) in the same RayCluster. It was tested by manually creating a RayCluster with single-host, multi-host, single-slice, and multi-slice worker groups and verifying that each was injected with the correct TPU_NAME, TPU_WORKER_ID, replicaIndex, and TPU_WORKER_HOSTNAMES.

This PR also adds unit tests for getReplicaIndex and getNextWorkerID. Run them with go test from the applications/ray/kuberay-tpu-webhook folder.
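For readers unfamiliar with the injected values, here is a minimal sketch of how per-slice worker IDs and hostname lists could be derived when several TPU worker groups share one cluster. It is not the webhook's actual code: the sliceKey struct, the nextWorkerID and workerHostnames helpers, and the hostname naming scheme are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sliceKey identifies one TPU slice. The fields are illustrative
// assumptions, not the webhook's actual struct.
type sliceKey struct {
	namespace    string
	clusterName  string
	groupName    string
	replicaIndex int
}

// nextWorkerID returns the lowest non-negative ID not present in used,
// so an ID freed by a deleted pod is handed out again first.
func nextWorkerID(used []int) int {
	sorted := append([]int(nil), used...) // copy so the caller's slice is not reordered
	sort.Ints(sorted)
	next := 0
	for _, id := range sorted {
		if id == next {
			next++
		} else if id > next {
			break // found a gap below this ID
		}
	}
	return next
}

// workerHostnames builds a comma-separated hostname list for one slice.
// The "<group>-<replica>-<host>" naming scheme is hypothetical.
func workerHostnames(key sliceKey, numHosts int) string {
	names := make([]string, 0, numHosts)
	for host := 0; host < numHosts; host++ {
		names = append(names, fmt.Sprintf("%s-%d-%d", key.groupName, key.replicaIndex, host))
	}
	return strings.Join(names, ",")
}

func main() {
	// Worker IDs are tracked per slice, so two worker groups in the
	// same cluster do not share an ID space.
	used := map[sliceKey][]int{}
	a := sliceKey{"default", "demo-cluster", "tpu-group-a", 0}
	b := sliceKey{"default", "demo-cluster", "tpu-group-b", 0}
	used[a] = []int{0, 1, 3}

	fmt.Println(nextWorkerID(used[a]))  // prints 2
	fmt.Println(nextWorkerID(used[b]))  // prints 0
	fmt.Println(workerHostnames(b, 2)) // prints tpu-group-b-0-0,tpu-group-b-0-1
}
```

Keying the state by namespace, cluster, group, and replica index is one way to keep slices from different worker groups from colliding; the merged implementation may track this differently.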

@ryanaoleary self-assigned this Mar 27, 2024
@ryanaoleary force-pushed the multi-worker-groups branch 2 times, most recently from f3b6ced to 6438777, April 1, 2024 18:31
@ryanaoleary (Collaborator, Author) commented:

/gcbrun

@ryanaoleary enabled auto-merge (squash) April 5, 2024 02:50
@ryanaoleary merged commit f9b6038 into main Apr 5, 2024
8 checks passed
@ryanaoleary deleted the multi-worker-groups branch April 12, 2024 21:03
kfswain pushed a commit that referenced this pull request Apr 15, 2024
* Support for multiple separate TPU worker groups per RayCluster

* Add namespace to slice struct, logs, and comments

* Added unit tests for getReplicaIndex and getNextWorkerID

* Added two more test cases for edge cases

* Fixed comments