Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Cudo] Add new cloud Cudo compute using Provisioner method #2975

Merged
merged 20 commits into from
Feb 12, 2024

Conversation

JungleCatSW
Copy link
Contributor

@JungleCatSW JungleCatSW commented Jan 11, 2024

Refactored to use the Cudo Compute PR into a v2 provisioner.
There is also a fetch method as our catalog changes reguarly so there will be a pr on the catalog repo too.

Update:

Fixed memory filtering, copied external for multi node launches, fixed disk id issue, added a wait for correct state for terminate.

Tested (run the relevant ones):

  • [x ] Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting the PR for support Cudo compute @JungleCatSW! The code looks fantastic and very clean. Just left several comments. : )

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/clouds/cudo.py Show resolved Hide resolved
sky/clouds/cudo.py Outdated Show resolved Hide resolved
sky/clouds/cudo.py Outdated Show resolved Hide resolved
sky/clouds/cudo.py Outdated Show resolved Hide resolved
sky/provision/cudo/config.py Outdated Show resolved Hide resolved
sky/provision/cudo/__init__.py Outdated Show resolved Hide resolved
sky/clouds/cudo.py Outdated Show resolved Hide resolved
sky/setup_files/MANIFEST.in Outdated Show resolved Hide resolved
sky/templates/cudo-ray.yml.j2 Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick update @JungleCatSW! Left some minor comments. I think it should be good to go once we have done some basic tests after we get the access to the cloud.

sky/clouds/cudo.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/data_fetchers/requirements.txt Outdated Show resolved Hide resolved
sky/clouds/cudo.py Outdated Show resolved Hide resolved
sky/clouds/cudo.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick update @JungleCatSW! I am trying out the PR with the Cudo compute. Left several compute for the fix of the memory filtering.

I also encountered the following issue:
sky launch -c test-cudo --gpus RTXA4000 --memory 10+ --num-nodes 4 echo hi
The cluster provisioning fails due to an error related to the disk id:

D 01-16 19:03:30 provisioner.py:170]     provision_record = provision.run_instances(provider_name,
D 01-16 19:03:30 provisioner.py:170]   File "/home/gcpuser/skypilot/sky/provision/__init__.py", line 41, in _wrapper
D 01-16 19:03:30 provisioner.py:170]     return impl(*args, **kwargs)
D 01-16 19:03:30 provisioner.py:170]   File "/home/gcpuser/skypilot/sky/provision/cudo/instance.py", line 87, in run_instances
D 01-16 19:03:30 provisioner.py:170]     instance_id = cudo_wrapper.launch(
D 01-16 19:03:30 provisioner.py:170]   File "/home/gcpuser/skypilot/sky/provision/cudo/cudo_wrapper.py", line 44, in launch
D 01-16 19:03:30 provisioner.py:170]     raise e
D 01-16 19:03:30 provisioner.py:170]   File "/home/gcpuser/skypilot/sky/provision/cudo/cudo_wrapper.py", line 41, in launch
D 01-16 19:03:30 provisioner.py:170]     vm = api.create_vm(cudo().cudo_api.project_id(), request)
D 01-16 19:03:30 provisioner.py:170]   File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api/virtual_machines_api.py", line 487, in create_vm
D 01-16 19:03:30 provisioner.py:170]     (data) = self.create_vm_with_http_info(project_id, create_vm_body, **kwargs)  # noqa: E501
D 01-16 19:03:30 provisioner.py:170]   File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api/virtual_machines_api.py", line 557, in create_vm_with_http_info
D 01-16 19:03:30 provisioner.py:170]     return self.api_client.call_api(
D 01-16 19:03:30 provisioner.py:170]   File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 326, in call_api
D 01-16 19:03:30 provisioner.py:170]     return self.__call_api(resource_path, method,
D 01-16 19:03:30 provisioner.py:170]   File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 158, in __call_api
D 01-16 19:03:30 provisioner.py:170]     response_data = self.request(
D 01-16 19:03:30 provisioner.py:170]   File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/api_client.py", line 368, in request
D 01-16 19:03:30 provisioner.py:170]     return self.rest_client.POST(url,
D 01-16 19:03:30 provisioner.py:170]   File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/rest.py", line 269, in POST
D 01-16 19:03:30 provisioner.py:170]     return self.request("POST", url,
D 01-16 19:03:30 provisioner.py:170]   File "/opt/conda/envs/sky/lib/python3.10/site-packages/cudo_compute/rest.py", line 228, in request
D 01-16 19:03:30 provisioner.py:170]     raise ApiException(http_resp=r)
D 01-16 19:03:30 provisioner.py:170] cudo_compute.rest.ApiException: (409)
D 01-16 19:03:30 provisioner.py:170] Reason: Conflict
D 01-16 19:03:30 provisioner.py:170] HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 16 Jan 2024 19:03:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '70', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=5iQ83c%2B9r7u6IVUB0XNJ8w6S%2FwN0Wyt%2F7G%2FSduK%2Fq16YJvwePFNSg%2FHD%2BooFs4uJkl9BxGSPH19jdHOyOfJbknvgFoXEx8diTVqSW2e4vO148p1QL%2BZYy7XraojND0QqT7YJID%2FozGk%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '84689e682d8e5dd6-HKG', 'alt-svc': 'h3=":443"; ma=86400'})
D 01-16 19:03:30 provisioner.py:170] HTTP response body: {"code":6,"message":"A disk with that id already exists","details":[]}

This should be fine if it is a transient issue with the cloud, but a more serious problem is the resource leakage caused by the failure of the auto termination triggered by SkyPilot. After the error is raised above, SkyPilot will call the sky.provision.cudo.instance::terminate_instance to terminate the partially ready VMs, and the following error occurs:

D 01-16 19:03:30 provisioner.py:179] Terminating the failed cluster.
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] Got error(s) in Cudo:[cudo_compute.rest.ApiException] (400)
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] Reason: Bad Request
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 16 Jan 2024 19:03:31 GMT', 'Content-Type': 'application/json', 'Content-Length': '211', 'Connection': 'keep-alive', 'vary': 'Origin', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=PuvVoBDN1Qo8X3NHl0oqIc02Ldsh%2B1c2OeOgafxQhBCC8vg7fJlUJHoM%2Bh1DrdZHWgSsYnQWV4YrgkFal0FFih7PhyYr5TokcIpc1qjl6bwTSW51jBKF%2FXHnzkGJ4psfANopLiyGnUU%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains; preload', 'X-Content-Type-Options': 'nosniff', 'Server': 'cloudflare', 'CF-RAY': '84689e6facad04ce-HKG', 'alt-svc': 'h3=":443"; ma=86400'})
D 01-16 19:03:31 cloud_vm_ray_backend.py:1152] HTTP response body: {"code":9,"message":"Invalid vm state (prol)","details":[{"@type":"type.googleapis.com/google.rpc.PreconditionFailure","violations":[{"type":"","subject":"","description":"vm cannot currently be terminated"}]}]}

It seems to be a problem for not being able to call termination on a VM with state prol. Should we wait until the instance to get out of the pending stage before we call termination operation?

sky/clouds/cudo.py Outdated Show resolved Hide resolved
sky/clouds/cudo.py Show resolved Hide resolved
sky/provision/cudo/instance.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator

Another issue: it seems when I try to launch a multi-node cluster with sky launch --num-nodes 2 --cloud cudo --gpus RTXA4000, SkyPilot gets empty internal IPs for the VM. This causes the failure for starting the ray cluster using the internal IPs. Here is an InstanceInfo for a VM printed by the provisioner {'status': 'runn', 'tags': {}, 'name': 'test-cudo-2514-head', 'ip': '149.36.1.9', 'external_ip': '149.36.1.9', 'internal_ip': ''}

Do we have an idea why the internal IP is empty and how can we fix that? If it is fine, we can also set the internal IP to be the same as external IP, connecting the VMs directly using the extneral IPs.

@JungleCatSW
Copy link
Contributor Author

Everything except the disk id issue is resolved, I need to make some changes elsewhere to fix that. I will let you know when it is ready.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick fix @JungleCatSW! This is fantastic! I just tried launching the single and multiple VMs with autodown and it seems working correctly. We are happy to merge this PR in first with the master branch merged into this branch and the comments below fixed.

After the PR is merged, there will be several TODOs:

  1. Add instruction for how to setup the cudo credentials in our https://github.com/skypilot-org/skypilot/blob/master/docs/source/getting-started/installation.rst
  2. Add tests in our tests: https://github.com/skypilot-org/skypilot/blob/master/tests/conftest.py#L23-L26

sky/provision/cudo/instance.py Show resolved Hide resolved
sky/setup_files/setup.py Outdated Show resolved Hide resolved
sky/templates/cudo-ray.yml.j2 Outdated Show resolved Hide resolved
sky/provision/cudo/cudo_wrapper.py Outdated Show resolved Hide resolved
sky/provision/cudo/instance.py Outdated Show resolved Hide resolved
sky/provision/cudo/instance.py Outdated Show resolved Hide resolved
head_instance_id = instance_id

# Wait for instances to be ready.
retries = 12 # times 10 second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried sky launch --num-nodes 2 --gpus RTXA4000 --cloud cudo and it seems the VM is still in starting state after more than 6 minutes. Probably the 120 seconds wait time is not enough? Similar thing happens to RTX3080

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: it seems to be a cloud backend issue as sky launch --gpus RTXA4000 --cloud cudo also took significant amount of time, but RTXA6000 seems fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have adjusted it to 10 minutes, although they should usually launch in under a minute. I will have to do some investigation there. Let me know if 10 minutes is too long...

@Michaelvll
Copy link
Collaborator

Could we run the format.sh to format the code?

# Conflicts:
#	sky/backends/backend_utils.py
#	sky/clouds/service_catalog/__init__.py
#	sky/setup_files/setup.py
#	tests/conftest.py
@Michaelvll Michaelvll merged commit a52712e into skypilot-org:master Feb 12, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants