[Cudo] Add new cloud Cudo compute using Provisioner method #2975
Conversation
Thank you for submitting the PR for supporting Cudo Compute, @JungleCatSW! The code looks fantastic and very clean. Just left several comments. : )
…PUs, catalog fetch speed up.
Thanks for the quick update @JungleCatSW! Left some minor comments. I think it should be good to go once we have done some basic tests after we get access to the cloud.
Thanks for the quick update @JungleCatSW! I am trying out the PR with Cudo Compute. Left several comments for the fix of the memory filtering.
I also encountered the following issue:
sky launch -c test-cudo --gpus RTXA4000 --memory 10+ --num-nodes 4 echo hi
The cluster provisioning fails due to an error related to the disk id:
D 01-16 19:03:30 provisioner.py:170] provision_record = provision.run_instances(provider_name,
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/__init__.py", line 41, in _wrapper
D 01-16 19:03:30 provisioner.py:170] return impl(*args, **kwargs)
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/cudo/instance.py", line 87, in run_instances
D 01-16 19:03:30 provisioner.py:170] instance_id = cudo_wrapper.launch(
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/cudo/cudo_wrapper.py", line 44, in launch
D 01-16 19:03:30 provisioner.py:170] raise e
D 01-16 19:03:30 provisioner.py:170] File "/home/gcpuser/skypilot/sky/provision/cudo/cudo_wrapper.py", line 41, in launch
D 01-16 19:03:30 provisioner.py:170] vm = api.create_vm(cudo().cudo_api.project_id(), request)
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api/virtual_machines_api.py", line 487, in create_vm
D 01-16 19:03:30 provisioner.py:170] (data) = self.create_vm_with_http_info(project_id, create_vm_body, **kwargs) # noqa: E501
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api/virtual_machines_api.py", line 557, in create_vm_with_http_info
D 01-16 19:03:30 provisioner.py:170] return self.api_client.call_api(
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 326, in call_api
D 01-16 19:03:30 provisioner.py:170] return self.__call_api(resource_path, method,
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 158, in __call_api
D 01-16 19:03:30 provisioner.py:170] response_data = self.request(
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 368, in request
D 01-16 19:03:30 provisioner.py:170] return self.rest_client.POST(url,
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/rest.py", line 269, in POST
D 01-16 19:03:30 provisioner.py:170] return self.request("POST", url,
D 01-16 19:03:30 provisioner.py:170] File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/rest.py", line 228, in request
D 01-16 19:03:30 provisioner.py:170] raise ApiException(http_resp=r)
D 01-16 19:03:30 provisioner.py:170] cudo_compute.rest.ApiException: (409)
D 01-16 19:03:30 provisioner.py:170] Reason: Conflict
D 01-16 19:03:30 provisioner.py:170] HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 16 Jan 2024 19:03:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '70', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=5iQ83c%2B9r7u6IVUB0XNJ8w6S%2FwN0Wyt%2F7G%2FSduK%2Fq16YJvwePFNSg%2FHD%2BooFs4uJkl9BxGSPH19jdHOyOfJbknvgFoXEx8diTVqSW2e4vO148p1QL%2BZYy7XraojND0QqT7YJID%2FozGk%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '84689e682d8e5dd6-HKG', 'alt-svc': 'h3=":443"; ma=86400'})
D 01-16 19:03:30 provisioner.py:170] HTTP response body: {"code":6,"message":"A disk with that id already exists","details":[]}
This should be fine if it is a transient issue with the cloud, but a more serious problem is the resource leakage caused by the failure of the auto-termination triggered by SkyPilot. After the error above is raised, SkyPilot calls sky.provision.cudo.instance::terminate_instance to terminate the partially ready VMs, and the following error occurs:
D 01-16 19:03:30 provisioner.py:179] Terminating the failed cluster.
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] Got error(s) in Cudo:[cudo_compute.rest.ApiException] (400)
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] Reason: Bad Request
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 16 Jan 2024 19:03:31 GMT', 'Content-Type': 'application/json', 'Content-Length': '211', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=PuvVoBDN1Qo8X3NHl0oqIc02Ldsh%2B1c2OeOgafxQhBCC8vg7fJlUJHoM%2Bh1DrdZHWgSsYnQWV4YrgkFal0FFih7PhyYr5TokcIpc1qjl6bwTSW51jBKF%2FXHnzkGJ4psfANopLiyGnUU%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '84689e6facad04ce-HKG', 'alt-svc': 'h3=":443"; ma=86400'})
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] HTTP response body: {"code":9,"message":"Invalid vm state (prol)","details":[{"@type":"type.googleapis.com/google.rpc.PreconditionFailure","violations":[{"type":"","subject":"","description":"vm cannot currently be terminated"}]}]}
It seems the problem is that termination cannot be called on a VM in the prol state. Should we wait until the instance gets out of the pending stage before we call the termination operation?
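A minimal sketch of what such a wait-before-terminate could look like; the `get_vm_state` / `terminate_vm` helpers and the state strings below are illustrative placeholders, not the actual Cudo SDK calls:

```python
import time

# Illustrative placeholders: the real Cudo SDK calls and state names
# (e.g. 'prol') may differ from what is sketched here.
PENDING_STATES = {'prol', 'pend'}  # states that reject termination


def wait_then_terminate(vm_id: str,
                        poll_interval: int = 10,
                        timeout: int = 600) -> None:
    """Poll until the VM leaves a pending state, then terminate it."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = get_vm_state(vm_id)  # hypothetical wrapper around the SDK
        if state not in PENDING_STATES:
            terminate_vm(vm_id)      # hypothetical wrapper around the SDK
            return
        time.sleep(poll_interval)
    raise TimeoutError(f'VM {vm_id} stayed in a pending state for {timeout}s')
```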
Another issue: it seems that when I try to launch a multi-node cluster, the internal IP is empty. Do we have an idea why the internal IP is empty and how we can fix that? If it is fine, we can also set the internal IP to be the same as the external IP, connecting the VMs directly using the external IPs.
Everything except the disk id issue is resolved; I need to make some changes elsewhere to fix that. I will let you know when it is ready.
Thank you for the quick fix @JungleCatSW! This is fantastic! I just tried launching single and multiple VMs with autodown and it seems to be working correctly. We are happy to merge this PR once the master branch is merged into this branch and the comments below are addressed.
After the PR is merged, there will be several TODOs:
- Add instructions for how to set up the Cudo credentials in our https://github.com/skypilot-org/skypilot/blob/master/docs/source/getting-started/installation.rst
- Add tests in our test suite: https://github.com/skypilot-org/skypilot/blob/master/tests/conftest.py#L23-L26
sky/provision/cudo/instance.py
Outdated
head_instance_id = instance_id

# Wait for instances to be ready.
retries = 12  # times 10 seconds
I tried sky launch --num-nodes 2 --gpus RTXA4000 --cloud cudo and it seems the VM is still in the starting state after more than 6 minutes. Probably the 120-second wait time is not enough? A similar thing happens with RTX3080.
Update: it seems to be a cloud backend issue, as sky launch --gpus RTXA4000 --cloud cudo also took a significant amount of time, but RTXA6000 seems fine.
I have adjusted it to 10 minutes, although they should usually launch in under a minute. I will have to do some investigation there. Let me know if 10 minutes is too long...
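For reference, a 10-minute budget roughly corresponds to bumping the retry count in the snippet above; a minimal sketch, assuming a hypothetical `get_vm_state` helper and an 'ACTIVE' state name (neither is confirmed by this PR):

```python
import time

POLL_INTERVAL = 10   # seconds between polls
MAX_RETRIES = 60     # 60 * 10s = 10 minutes


def wait_for_active(vm_id: str) -> None:
    """Poll until the VM reports a running state or the time budget runs out."""
    for _ in range(MAX_RETRIES):
        # 'ACTIVE' and get_vm_state() are illustrative placeholders.
        if get_vm_state(vm_id) == 'ACTIVE':
            return
        time.sleep(POLL_INTERVAL)
    raise TimeoutError(
        f'VM {vm_id} did not become active within '
        f'{MAX_RETRIES * POLL_INTERVAL} seconds.')
```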
Could we run the
# Conflicts:
#   sky/backends/backend_utils.py
#   sky/clouds/service_catalog/__init__.py
#   sky/setup_files/setup.py
#   tests/conftest.py
…esources_utils.DiskTier]
Refactored the Cudo Compute PR to use the v2 provisioner. There is also a fetch method, as our catalog changes regularly, so there will be a PR on the catalog repo too.
Update:
Fixed memory filtering, copied the external IP for multi-node launches, fixed the disk id issue, and added a wait for the correct state before terminating.
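As an illustration of the disk id fix, one common way to avoid the 409 "disk with that id already exists" conflict is to derive a per-node disk id that also varies across launch attempts; a minimal sketch under that assumption (the naming scheme actually used in this PR may differ):

```python
import uuid


def make_disk_id(cluster_name: str, node_index: int) -> str:
    """Build a boot-disk id that is unique per node and per launch attempt.

    Illustrative only; the real fix may use a different scheme.
    """
    return f'{cluster_name}-{node_index}-{uuid.uuid4().hex[:8]}'


# Example output: 'test-cudo-0-a1b2c3d4', 'test-cudo-1-9f8e7d6c', ...
```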
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_compatibility_tests.sh