Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Webhook Unit Test Suite #578

Merged
merged 17 commits into from
Apr 22, 2024
Merged

Add Webhook Unit Test Suite #578

merged 17 commits into from
Apr 22, 2024

Conversation

ryanaoleary
Copy link
Collaborator

@ryanaoleary ryanaoleary commented Apr 10, 2024

This PR adds a comprehensive unit test suite for the Ray TPU webhook. These unit tests can be run with the command go test from the applications/ray/kuberay-tpu-webhook folder. These tests cover 75.3% of statements in the webhook including all major pieces of logic.

This PR also updates some of the logic in the webhook code to resolve corner cases found during testing. This includes adding a getNumTPUChipsRequested function which returns the google.com/tpu request value of a container, using that value to calculate the # TPU VM hosts for a given TPU worker group (rather than using acceleratorType to guess the machineType). This PR also removes the headlessServiceName var and instead dynamically generates the headless service name for any given cluster.

A follow-up PR will add end-to-end tests for the webhook.

This PR has been tested with:

  • Unit tests
  • Manual tests

@ryanaoleary ryanaoleary marked this pull request as ready for review April 10, 2024 19:08
@ryanaoleary
Copy link
Collaborator Author

/gcbrun

@ryanaoleary ryanaoleary merged commit f0ee9e6 into main Apr 22, 2024
6 of 8 checks passed
@ryanaoleary ryanaoleary deleted the webhook-unit-tests branch April 22, 2024 21:04
alpha-amundson pushed a commit that referenced this pull request May 3, 2024
* Test suite initial commit

* Remove unused annotations from certs/

* remove duplicate sample

* Added more unit tests

* Change logging level and fix tpu v5e chips per host calculation

* Add remaining unit tests

* Calculate TPU chips per VM host using google.com/tpu Resource value

* Remove headlessServiceName var (dynamically generate it with current cluster name instead)

* Resolve merge issues

* Fix headlessServiceName and finish tests

* Change RayCluster name back to default value

* Final test changes

* Fix bug for determining replicaIndex with workergroup name with dashes

* Add troubleshooting guide

* Add test for nil workerGroupSpecs in validateRayCluster

* Update troubleshooting guide

* Add example Pod env
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants