
[Core][Provisioner] Support open ports on RunPod #3748

Merged (13 commits) Aug 9, 2024

Conversation

@cblmemo (Collaborator) commented Jul 12, 2024

This PR adds support for opening ports on RunPod. Updating the ports on an existing RunPod pod wipes all data stored on the pod (effectively a terminate-and-recreate cycle), so for now we only support opening ports at cluster creation and disallow updating ports on an existing cluster.
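To make the launch-only restriction concrete, here is a hedged Python sketch of how a per-cloud capability flag could gate port updates. The names `OpenPortsVersion` and `OPEN_ON_LAUNCH_ONLY` appear in this PR's diff; the helper `can_update_ports` and the surrounding logic are illustrative assumptions, not the PR's actual code.

```python
import enum


class OpenPortsVersion(enum.Enum):
    # Ports can only be opened when the cluster is first created
    # (RunPod-style clouds: changing ports wipes the pod's data).
    OPEN_ON_LAUNCH_ONLY = enum.auto()
    # Ports can also be opened or updated on a running cluster.
    OPEN_ON_EXISTING_CLUSTER = enum.auto()


def can_update_ports(version: OpenPortsVersion) -> bool:
    """Whether a cloud allows changing ports after the cluster exists."""
    return version is not OpenPortsVersion.OPEN_ON_LAUNCH_ONLY
```

With this flag, the backend can skip (or reject) a port update request for clouds like RunPod instead of silently destroying the pod.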

TODO:

  • Add smoke test like other clouds
  • Test SkyServe on RunPod e2e
    • RunPod as service replica
    • RunPod as controller

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud runpod -c rp-port --ports 8000, then ran python -m http.server 8000 on the cluster and tested it manually against the endpoint from sky status rp-port --endpoint 8000
    • sky launch --cloud runpod -c rp-port --ports 8001 on the above cluster, which failed as expected (ports cannot be updated on an existing RunPod cluster)
    • sky serve up examples/serve/http_server/task.yaml -n rp-svc --cloud runpod
      • Changed the port from 8081 to 8080, since an nginx server runs on port 8081 on RunPod clusters by default
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_runpod_http_server_with_custom_ports
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll self-requested a review July 16, 2024 17:56
@Michaelvll (Collaborator) left a comment

Awesome @cblmemo! It is great that we can expose ports on runpod. Left several comments below.

@@ -27,7 +27,7 @@ def do_GET(self):

 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description='SkyServe HTTP Test Server')
-    parser.add_argument('--port', type=int, required=False, default=8081)
+    parser.add_argument('--port', type=int, required=False, default=8080)
Collaborator:
Why do we change this port from 8081 to 8080?

Collaborator (Author):
The RunPod default image runs an nginx server on port 8081 at launch:

(base) root@2c7c82d210d5:~# lsof -i :8081
COMMAND PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
nginx    40 root    9u  IPv4 3927576185      0t0  TCP *:8081 (LISTEN)

Collaborator:

Can we add a comment somewhere (probably in the RunPod provisioner codebase) mentioning that port 8081 is taken by the RunPod image by default?

Collaborator (Author):

Added. Thanks!

examples/serve/http_server/task.yaml (resolved)
sky/backends/backend_utils.py (outdated, resolved)
Comment on lines 3071 to 3076
cloud = handle.launched_resources.cloud
if not (cloud.OPEN_PORTS_VERSION <=
clouds.OpenPortsVersion.OPEN_ON_LAUNCH_ONLY):
with rich_utils.safe_status(
'[bold cyan]Launching - Opening new ports'):
self._open_ports(handle)
Collaborator:
How about we let bulk_provision open ports for both non-existing and existing clusters? If the cluster already exists and the cloud cannot open new ports on an existing cluster, we can skip the port opening in bulk_provision and print a warning here.

Collaborator (Author):
Currently, we have some logic to calculate the newly-required ports (e.g. for a cluster launched with 8000-8010 that now requests 8005-8015, we only need to add 8011-8015). If we move the entire flow into bulk_provision, that logic has to move as well. I'm slightly concerned whether that is the best place for it, but open to discussion.
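The incremental-port computation described above reduces to a set difference. A minimal sketch (the helper name is hypothetical; the PR's actual logic lives in the backend and also handles port-range strings):

```python
def newly_required_ports(already_open: set, requested: set) -> set:
    """Return only the requested ports not yet opened on the cluster."""
    return requested - already_open


# Launched with 8000-8010, now requesting 8005-8015:
already_open = set(range(8000, 8011))
requested = set(range(8005, 8016))
# Only 8011-8015 still need to be opened.
print(sorted(newly_required_ports(already_open, requested)))
# -> [8011, 8012, 8013, 8014, 8015]
```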

Also, this would be a big refactoring for clouds that create a security group and update it later: we would need to create (or update to) a fully configured SG up front. Do you think we could do this in another PR?

sky/clouds/cloud.py (outdated, resolved)
sky/clouds/cloud.py (outdated, resolved)
 controller_process.start()
 serve_state.set_service_controller_port(service_name,
                                         controller_port)

 # TODO(tian): Support HTTPS.
-controller_addr = f'http://localhost:{controller_port}'
+controller_addr = f'http://{controller_host}:{controller_port}'
Collaborator:
Does this still work for Kubernetes with localhost changed to controller_host? cc @romilbhardwaj.

@cblmemo (Collaborator, Author) Jul 30, 2024:

k8s will still use 0.0.0.0 in this setting so it should be fine. Just tested examples/serve/http_server/task.yaml on a local k8s controller and it seems to be working well.

Side note: when I tried a k8s controller + k8s replica, the controller hit a mysterious bug where it could not detect the ports opened on the k8s replica. A k8s controller + AWS replica works well, so I don't think the problem is introduced by this PR. Will investigate more.

W 07-30 01:58:54 backend_utils.py:2809] Port 8080 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your LoadBalancer is configured correctly. 
W 07-30 01:58:54 backend_utils.py:2809] To debug, run: kubectl describe service

Collaborator (Author):
Just confirmed that it is not working on master as well. Just filed an issue #3798 to keep track of this 🫡

@Michaelvll (Collaborator) left a comment

Thanks for the update @cblmemo! It looks good to me. I think we should be able to let it in once all the tests have passed.

@@ -1468,6 +1468,10 @@ def _retry_zones(
assert to_provision.region == region.name, (to_provision,
region)
num_nodes = handle.launched_nodes
ports_to_open_on_launch = (
Collaborator:
Suggested change:

-            ports_to_open_on_launch = (
+            # Some clouds, like RunPod, only support exposing ports during launch.
+            # For those clouds, we pass the ports to open to `bulk_provision` so
+            # they are exposed during provisioning. If `bulk_provision` is applied
+            # to an existing cluster, the ports should be ignored by the underlying
+            # provisioner implementation.
+            ports_to_open_on_launch = (

Please double-check that the above comment is true: we do ignore the ports to open when it is an existing cluster, right?

@cblmemo (Collaborator, Author) Aug 7, 2024:

[screenshot of the raised error]

Currently it raises an error on RunPod if there are any new ports to open: e.g. provisioning with 8000-8010 and then launching with 8002,8009 will not raise, but 8005-8015 will. Do you think we should change it to a warning?
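That behavior (a subset of the already-open ports is fine; any genuinely new port raises) can be sketched as follows. The function name and error type are hypothetical stand-ins for the PR's actual check:

```python
def check_no_new_ports(already_open: set, requested: set) -> None:
    """Raise if the request needs ports beyond those opened at launch."""
    new_ports = requested - already_open
    if new_ports:
        raise RuntimeError(
            f'Cannot open new ports {sorted(new_ports)} on an existing '
            'RunPod cluster; ports can only be set at pod creation.')


already_open = set(range(8000, 8011))           # provisioned with 8000-8010
check_no_new_ports(already_open, {8002, 8009})  # subset: no error
try:
    check_no_new_ports(already_open, set(range(8005, 8016)))
except RuntimeError as e:
    print(e)  # 8011-8015 are new, so the check raises
```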

Collaborator (Author):

After an offline discussion, we've decided to raise the error here. Merging now 🫡

@cblmemo cblmemo enabled auto-merge August 9, 2024 06:35
@cblmemo cblmemo added this pull request to the merge queue Aug 9, 2024
Merged via the queue into master with commit 1e6db6e Aug 9, 2024
20 checks passed
@cblmemo cblmemo deleted the runpod-ports branch August 9, 2024 06:45