Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Provisioner] Remove ray dependency for GCP and move TPU node to new provisioner #2943

Merged
merged 25 commits into from
Jan 12, 2024

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jan 4, 2024

Changes

  1. Removes the dependency for ray for skypilot[gcp].
  2. Moved the TPU node creation and deletion in the new provisioner to avoid the depedencies of ray in the backend.
  3. Refactor the minimal permissions added in Relax required permissions for custom VPC (GCP) #2944 to gcp_utils.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-gcp --cloud gcp --cpus 2+ echo hi; sky exec test-gcp echo hi
    • sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky stop test-tpu-node; sky start test-tpu-node; sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky down test-tpu-node
    • sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node; sky status -r test-tpu-node; sky start test-tpu-node; sky autostop -i 0 --down test-tpu-node; sky status -r test-tpu-node
  • All smoke tests: pytest tests/test_smoke.py --gcp (with only skypilot[gcp] installed): Failed the following tests due to the lack of credentials for AWS.
    tests/test_smoke.py::TestStorageWithCredentials::test_nonexistent_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_upload_to_existing_bucket
    tests/test_smoke.py::test_example_app
    tests/test_smoke.py::TestStorageWithCredentials::test_bucket_bulk_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_upload_source_with_spaces
    tests/test_smoke.py::TestStorageWithCredentials::test_multiple_buckets_creation_and_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_new_bucket_creation_and_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_bucket_external_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_private_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_list_source
    tests/test_smoke.py::test_gcp_storage_mounts_with_stop
    tests/test_smoke.py::test_spot_storage
    tests/test_smoke.py::test_sky_bench
    tests/test_smoke.py::test_multiple_accelerators_unordered_with_default
    tests/test_smoke.py::test_file_mounts
    tests/test_smoke.py::TestStorageWithCredentials::test_excluded_file_cloud_storage_upload_copy
    tests/test_smoke.py::test_multiple_resources
    tests/test_smoke.py::TestStorageWithCredentials::test_public_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_public_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_copy_mount_existing_storage
    
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh (only with the cloud related dependencies installed)
    • master: sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; this branch: sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node
    • master: sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; this branch: sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node
  • Installation speed
    • master: pip install .[gcp] - 42.337s
    • current PR: pip install .[gcp] - 26.453s

@concretevitamin
Copy link
Member

Awesome!! May want to do an install speed comparison, and/or smoke tests?

@Michaelvll Michaelvll changed the title [Provisioner] Remove ray dependency for GCP [Provisioner] Remove ray dependency for GCP and move TPU node to new provisioner Jan 5, 2024
@Michaelvll
Copy link
Collaborator Author

Awesome!! May want to do an install speed comparison, and/or smoke tests?

Thanks for the comment @concretevitamin! Added the smoke tests and the speed comparison in the PR description.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @Michaelvll to see the installation speedups! Did a pass.

sky/backends/backend_utils.py Show resolved Hide resolved
sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/clouds/utils/gcp_utils.py Show resolved Hide resolved
sky/provision/gcp/constants.py Show resolved Hide resolved
sky/setup_files/setup.py Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @Michaelvll, some minor nits, thanks.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
# Delete TPU node.
"""Delete a TPU node with gcloud CLI.

This is used for both stopping and terminating a TPU node. It is ok to call
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case maybe best to name it stop_or_delete_tpu_node(). How does the CLI cmd determine when to stop and when to delete?

Copy link
Collaborator Author

@Michaelvll Michaelvll Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, maybe I am not clear enough in the docstr. This function always delete the tpu accelerator, no matter it is stopping or terminating the cluster, because the host VM will be correctly stopped or terminated. Whenever we restart a stopped TPU node cluster, we will create a new TPU accelerator to attach to the host VM.

Just updated the docstr. PTAL

@@ -44,7 +44,7 @@
# e.g., when we add new events to skylet, or we fix a bug in skylet.
#
# TODO(zongheng,zhanghao): make the upgrading of skylet automatic?
SKYLET_VERSION = '5'
SKYLET_VERSION = '6'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the comment above to add this case as a reason we must bump the version? For future guidance.

sky/provision/gcp/instance_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
@Michaelvll Michaelvll merged commit 71525cd into master Jan 12, 2024
19 checks passed
@Michaelvll Michaelvll deleted the gcp-remove-ray-dependency branch January 12, 2024 21:58
cg505 added a commit to cg505/skypilot that referenced this pull request Oct 31, 2024
This used to be true, but since skypilot-org#2943, 'ray' is the only provisioner.
cg505 added a commit to cg505/skypilot that referenced this pull request Nov 7, 2024
This used to be true, but since skypilot-org#2943, 'ray' is the only provisioner.
Add other keys that are now present instead.
cg505 added a commit that referenced this pull request Dec 4, 2024
* add config_dict['config_hash'] output to write_cluster_config

* fix docstring for write_cluster_config

This used to be true, but since #2943, 'ray' is the only provisioner.
Add other keys that are now present instead.

* when using --fast, check if config_hash matches, and if not, provision

* mock hashing method in unit test

This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.

* check config hash within provision with lock held

* address other PR review comments

* rename to skip_if_no_cluster_updates

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* add assert details

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* address PR comments and update docstrings

* fix test

* update docstrings

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* address PR comments

* fix lint and tests

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* refactor skip_if_no_cluster_update var

* clarify comment

* format exception

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Michaelvll added a commit that referenced this pull request Dec 9, 2024
* [perf] use uv for venv creation and pip install (#4414)

* Revert "remove `uv` from runtime setup due to azure installation issue (#4401)"

This reverts commit 0b20d56.

* on azure, use --prerelease=allow to install azure-cli

* use uv venv --seed

* fix backwards compatibility

* really fix backwards compatibility

* use uv to set up controller dependencies

* fix python 3.8

* lint

* add missing file

* update comment

* split out azure-cli dep

* fix lint for dependencies

* use runpy.run_path rather than modifying sys.path

* fix cloud dependency installation commands

* lint

* Update sky/utils/controller_utils.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* [Minor] README updates. (#4436)

* [Minor] README touches.

* update

* update

* make --fast robust against credential or wheel updates (#4289)

* add config_dict['config_hash'] output to write_cluster_config

* fix docstring for write_cluster_config

This used to be true, but since #2943, 'ray' is the only provisioner.
Add other keys that are now present instead.

* when using --fast, check if config_hash matches, and if not, provision

* mock hashing method in unit test

This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.

* check config hash within provision with lock held

* address other PR review comments

* rename to skip_if_no_cluster_updates

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* add assert details

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* address PR comments and update docstrings

* fix test

* update docstrings

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* address PR comments

* fix lint and tests

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* refactor skip_if_no_cluster_update var

* clarify comment

* format exception

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* format

* format

* format

* fix

---------

Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants