[Provisioner] Remove ray dependency for GCP and move TPU node to new provisioner #2943
Conversation
Awesome!! May want to do an install speed comparison, and/or smoke tests?
Thanks for the comment @concretevitamin! Added the smoke tests and the speed comparison in the PR description.
Awesome to see the installation speedups, @Michaelvll! Did a pass.
LGTM @Michaelvll, some minor nits, thanks.
sky/provision/gcp/instance_utils.py
# Delete TPU node.
"""Delete a TPU node with gcloud CLI.

This is used for both stopping and terminating a TPU node. It is ok to call
In this case maybe best to name it stop_or_delete_tpu_node(). How does the CLI cmd determine when to stop and when to delete?
Ahh, maybe I was not clear enough in the docstring. This function always deletes the TPU accelerator, whether we are stopping or terminating the cluster, because the host VM itself will be correctly stopped or terminated. Whenever we restart a stopped TPU node cluster, we create a new TPU accelerator and attach it to the host VM.
Just updated the docstring. PTAL
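For readers without the full diff: a minimal sketch of what such a delete helper might look like, assuming it shells out to gcloud compute tpus delete. The function name, signature, and error handling below are illustrative assumptions, not the PR's actual code.

```python
import subprocess


def delete_tpu_node(tpu_name: str, zone: str) -> None:
    """Delete a TPU accelerator with the gcloud CLI.

    Called for both stop and terminate: the accelerator is always deleted,
    while the host VM is stopped or terminated separately. Restarting a
    stopped TPU node cluster attaches a freshly created accelerator.
    """
    proc = subprocess.run(
        ['gcloud', 'compute', 'tpus', 'delete', tpu_name,
         f'--zone={zone}', '--quiet'],
        capture_output=True, text=True)
    if proc.returncode != 0 and 'NOT_FOUND' not in proc.stderr:
        # A missing node is fine (keeps the call idempotent for repeated
        # stop/terminate); surface anything else.
        raise RuntimeError(
            f'Failed to delete TPU node {tpu_name!r}: {proc.stderr}')
```

Treating "already deleted" as success is what makes the same helper safe to back both the stop and terminate paths.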
@@ -44,7 +44,7 @@
 # e.g., when we add new events to skylet, or we fix a bug in skylet.
 #
 # TODO(zongheng,zhanghao): make the upgrading of skylet automatic?
-SKYLET_VERSION = '5'
+SKYLET_VERSION = '6'
Can we update the comment above to add this case as a reason we must bump the version? For future guidance.
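To make the version-bump rationale concrete: a hypothetical sketch of the gate the constant drives, where the version pinned in the local client is compared against the version recorded on the remote VM and a mismatch restarts skylet. The helper name and exact check are assumptions for illustration, not SkyPilot's actual code.

```python
# Bumping SKYLET_VERSION forces a skylet restart on existing clusters:
# when a cluster is reused, the version shipped with the local client is
# compared against the one recorded remotely, and a mismatch restarts
# skylet so new events, bug fixes, or provisioner changes take effect.
SKYLET_VERSION = '6'


def skylet_needs_restart(remote_version: str) -> bool:
    """Return True if the remote skylet is stale and should be restarted."""
    return remote_version != SKYLET_VERSION
```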
Referenced in later commits:
This used to be true, but since skypilot-org#2943, 'ray' is the only provisioner. Add other keys that are now present instead.
* [perf] use uv for venv creation and pip install (#4414)
  * Revert "remove `uv` from runtime setup due to azure installation issue (#4401)"; this reverts commit 0b20d56
  * on azure, use --prerelease=allow to install azure-cli
  * use uv venv --seed
  * fix backwards compatibility; really fix backwards compatibility
  * use uv to set up controller dependencies
  * fix python 3.8
  * lint; add missing file; update comment
  * split out azure-cli dep; fix lint for dependencies
  * use runpy.run_path rather than modifying sys.path
  * fix cloud dependency installation commands
  * Update sky/utils/controller_utils.py
* [Minor] README updates. (#4436)
* make --fast robust against credential or wheel updates (#4289)
  * add config_dict['config_hash'] output to write_cluster_config
  * fix docstring for write_cluster_config: this used to be true, but since #2943, 'ray' is the only provisioner; add other keys that are now present instead
  * when using --fast, check if config_hash matches, and if not, provision
  * mock hashing method in unit test; this is needed since some files in the fake file mounts don't actually exist, like the wheel path
  * check config hash within provision with lock held
  * address PR review comments; rename to skip_if_no_cluster_updates; add assert details
  * address PR comments and update docstrings; fix test; fix lint and tests
  * Update sky/backends/cloud_vm_ray_backend.py
  * refactor skip_if_no_cluster_update var; clarify comment; format exception

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
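Since the referenced --fast work (#4289) is described above only by its commit messages, here is a minimal sketch of the config-hash idea under stated assumptions: write_cluster_config is the real function it extends, but the helper names and hashing scheme below are illustrative only.

```python
# Illustrative sketch of the config-hash check: hash the generated cluster
# config and, with --fast, re-provision only when the hash changes.
import hashlib
import json
from typing import Optional


def config_hash(config: dict) -> str:
    """Stable hash of a cluster config (sorted keys for determinism)."""
    blob = json.dumps(config, sort_keys=True).encode('utf-8')
    return hashlib.sha256(blob).hexdigest()


def needs_provision(new_config: dict, recorded_hash: Optional[str]) -> bool:
    """With --fast, skip provisioning only if the recorded hash matches."""
    return recorded_hash is None or config_hash(new_config) != recorded_hash
```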
Changes
- skypilot[gcp]
- gcp_utils

Tested (run the relevant ones):
- bash format.sh
- sky launch -c test-gcp --cloud gcp --cpus 2+ echo hi; sky exec test-gcp echo hi
- sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky stop test-tpu-node; sky start test-tpu-node; sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky down test-tpu-node
- sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node; sky status -r test-tpu-node; sky start test-tpu-node; sky autostop -i 0 --down test-tpu-node; sky status -r test-tpu-node
- pytest tests/test_smoke.py --gcp (with only skypilot[gcp] installed): failed the following tests due to the lack of credentials for AWS: pytest tests/test_smoke.py::test_fill_in_the_name
- bash tests/backward_comaptibility_tests.sh (only with the cloud-related dependencies installed):
  - sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; this branch: sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node
  - sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; this branch: sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node
Install speed comparison for pip install .[gcp]:
- master: 42.337s
- this branch: 26.453s
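For reference, one hedged way to reproduce such a timing; the harness below is an illustration, not how the numbers above were produced, and absolute results will vary by machine, network, and pip cache state.

```python
# Rough timing harness for the install comparison above. Assumes it is run
# from the repo root inside a fresh virtualenv on each branch.
import subprocess
import time

start = time.perf_counter()
subprocess.run(['pip', 'install', '.[gcp]'], check=True)
print(f'pip install .[gcp] took {time.perf_counter() - start:.3f}s')
```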