[Cudo] Add new cloud Cudo compute using Provisioner method #2975
Conversation
Thank you for submitting the PR for supporting Cudo Compute, @JungleCatSW! The code looks fantastic and very clean. Just left several comments. : )
…PUs, catalog fetch speed up.
Thanks for the quick update @JungleCatSW! Left some minor comments. I think it should be good to go once we have done some basic tests after we get access to the cloud.
Thanks for the quick update @JungleCatSW! I am trying out the PR with Cudo Compute. Left several comments for the fix of the memory filtering.
I also encountered the following issue:
sky launch -c test-cudo --gpus RTXA4000 --memory 10+ --num-nodes 4 echo hi
The cluster provisioning fails due to an error related to the disk id:
D 01-16 19:03:30 provisioner.py:170] provision_record = provision.run_instances(provider_name,
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/__init__.py", line 41, in _wrapper
D 01-16 19:03:30 provisioner.py:170] return impl(*args, **kwargs)
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/cudo/instance.py", line 87, in run_instances
D 01-16 19:03:30 provisioner.py:170] instance_id = cudo_wrapper.launch(
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/cudo/cudo_wrapper.py", line 44, in launch
D 01-16 19:03:30 provisioner.py:170] raise e
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/cudo/cudo_wrapper.py", line 41, in launch
D 01-16 19:03:30 provisioner.py:170] vm = api.create_vm(cudo().cudo_api.project_id(), request)
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api/virtual_machines_api.py", line 487, in create_vm
D 01-16 19:03:30 provisioner.py:170] (data) = self.create_vm_with_http_info(project_id, create_vm_body, **kwargs) # noqa: E501
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api/virtual_machines_api.py", line 557, in create_vm_with_http_info
D 01-16 19:03:30 provisioner.py:170] return self.api_client.call_api(
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 326, in call_api
D 01-16 19:03:30 provisioner.py:170] return self.__call_api(resource_path, method,
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 158, in __call_api
D 01-16 19:03:30 provisioner.py:170] response_data = self.request(
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 368, in request
D 01-16 19:03:30 provisioner.py:170] return self.rest_client.POST(url,
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/rest.py", line 269, in POST
D 01-16 19:03:30 provisioner.py:170] return self.request("POST", url,
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/rest.py", line 228, in request
D 01-16 19:03:30 provisioner.py:170] raise ApiException(http_resp=r)
D 01-16 19:03:30 provisioner.py:170] cudo_compute.rest.ApiException: (409)
D 01-16 19:03:30 provisioner.py:170] Reason: Conflict
D 01-16 19:03:30 provisioner.py:170] HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 16 Jan 2024 19:03:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '70', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=5iQ83c%2B9r7u6IVUB0XNJ8w6S%2FwN0Wyt%2F7G%2FSduK%2Fq16YJvwePFNSg%2FHD%2BooFs4uJkl9BxGSPH19jdHOyOfJbknvgFoXEx8diTVqSW2e4vO148p1QL%2BZYy7XraojND0QqT7YJID%2FozGk%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '84689e682d8e5dd6-HKG', 'alt-svc': 'h3=":443"; ma=86400'})
D 01-16 19:03:30 provisioner.py:170] HTTP response body: {"code":6,"message":"A disk with that id already exists","details":[]}
This should be fine if it is a transient issue with the cloud, but a more serious problem is the resource leakage caused by the failure of the auto-termination triggered by SkyPilot. After the error above is raised, SkyPilot calls sky.provision.cudo.instance::terminate_instance to terminate the partially ready VMs, and the following error occurs:
D 01-16 19:03:30 provisioner.py:179] Terminating the failed cluster.
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] Got error(s) in Cudo:[cudo_compute.rest.ApiException] (400)
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] Reason: Bad Request
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 16 Jan 2024 19:03:31 GMT', 'Content-Type': 'application/json', 'Content-Length': '211', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=PuvVoBDN1Qo8X3NHl0oqIc02Ldsh%2B1c2OeOgafxQhBCC8vg7fJlUJHoM%2Bh1DrdZHWgSsYnQWV4YrgkFal0FFih7PhyYr5TokcIpc1qjl6bwTSW51jBKF%2FXHnzkGJ4psfANopLiyGnUU%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '84689e6facad04ce-HKG', 'alt-svc': 'h3=":443"; ma=86400'})
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] HTTP response body: {"code":9,"message":"Invalid vm state (prol)","details":[{"@type":"type.googleapis.com/google.rpc.PreconditionFailure","violations":[{"type":"","subject":"","description":"vm cannot currently be terminated"}]}]}
It seems the problem is that termination cannot be called on a VM in the prol state. Should we wait until the instance gets out of the pending stage before we call the termination operation?
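A minimal sketch of what such a wait-before-terminate could look like; the `get_vm_state` / `terminate_vm` helpers and the state strings below are illustrative placeholders, not the actual Cudo SDK calls:

```python
import time

# Illustrative placeholders: the real Cudo SDK calls and state names
# (e.g. 'prol') may differ from what is sketched here.
PENDING_STATES = {'prol', 'pend'}  # states that reject termination


def wait_then_terminate(vm_id: str,
                        poll_interval: int = 10,
                        timeout: int = 600) -> None:
    """Poll until the VM leaves a pending state, then terminate it."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = get_vm_state(vm_id)  # hypothetical wrapper around the SDK
        if state not in PENDING_STATES:
            terminate_vm(vm_id)      # hypothetical wrapper around the SDK
            return
        time.sleep(poll_interval)
    raise TimeoutError(f'VM {vm_id} stayed in a pending state for {timeout}s')
```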
Another issue: it seems that when I try to launch a multi-node cluster, the internal IP is empty. Do we have an idea why the internal IP is empty and how we can fix that? If it is fine, we can also set the internal IP to be the same as the external IP, connecting the VMs directly using the external IPs.
Everything except the disk id issue is resolved; I need to make some changes elsewhere to fix that. I will let you know when it is ready.
Thank you for the quick fix @JungleCatSW! This is fantastic! I just tried launching single and multiple VMs with autodown and it seems to be working correctly. We are happy to merge this PR once the master branch is merged into this branch and the comments below are addressed.
After the PR is merged, there will be several TODOs:
- Add instructions for how to set up the Cudo credentials in our https://github.com/skypilot-org/skypilot/blob/master/docs/source/getting-started/installation.rst
- Add tests in our test suite: https://github.com/skypilot-org/skypilot/blob/master/tests/conftest.py#L23-L26
sky/provision/cudo/instance.py
Outdated
head_instance_id = instance_id

# Wait for instances to be ready.
retries = 12  # times 10 seconds
I tried sky launch --num-nodes 2 --gpus RTXA4000 --cloud cudo and it seems the VM is still in the starting state after more than 6 minutes. Probably the 120-second wait time is not enough? A similar thing happens with RTX3080.
Update: it seems to be a cloud backend issue, as sky launch --gpus RTXA4000 --cloud cudo also took a significant amount of time, but RTXA6000 seems fine.
I have adjusted it to 10 minutes, although they should usually launch in under a minute. I will have to do some investigation there. Let me know if 10 minutes is too long...
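For reference, a 10-minute budget roughly corresponds to bumping the retry count in the snippet above; a minimal sketch, assuming a hypothetical `get_vm_state` helper and an 'ACTIVE' state name (neither is confirmed by this PR):

```python
import time

POLL_INTERVAL = 10   # seconds between polls
MAX_RETRIES = 60     # 60 * 10s = 10 minutes


def wait_for_active(vm_id: str) -> None:
    """Poll until the VM reports a running state or the time budget runs out."""
    for _ in range(MAX_RETRIES):
        # 'ACTIVE' and get_vm_state() are illustrative placeholders.
        if get_vm_state(vm_id) == 'ACTIVE':
            return
        time.sleep(POLL_INTERVAL)
    raise TimeoutError(
        f'VM {vm_id} did not become active within '
        f'{MAX_RETRIES * POLL_INTERVAL} seconds.')
```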
Could we run the
# Conflicts:
#   sky/backends/backend_utils.py
#   sky/clouds/service_catalog/__init__.py
#   sky/setup_files/setup.py
#   tests/conftest.py
…esources_utils.DiskTier]
Refactored the Cudo Compute PR to use the v2 provisioner. There is also a fetch method, as our catalog changes regularly, so there will be a PR on the catalog repo too.
Update:
Fixed memory filtering, copied the external IP for multi-node launches, fixed the disk id issue, and added a wait for the correct state before terminating.
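As an illustration of the disk id fix, one common way to avoid the 409 "disk with that id already exists" conflict is to derive a per-node disk id that also varies across launch attempts; a minimal sketch under that assumption (the naming scheme actually used in this PR may differ):

```python
import uuid


def make_disk_id(cluster_name: str, node_index: int) -> str:
    """Build a boot-disk id that is unique per node and per launch attempt.

    Illustrative only; the real fix may use a different scheme.
    """
    return f'{cluster_name}-{node_index}-{uuid.uuid4().hex[:8]}'


# Example output: 'test-cudo-0-a1b2c3d4', 'test-cudo-1-9f8e7d6c', ...
```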
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_compatibility_tests.sh