[Serve][k8s] K8s replica ports not detected #3798

Open
cblmemo opened this issue Jul 30, 2024 · 5 comments

@cblmemo
Collaborator

cblmemo commented Jul 30, 2024

commit 34f13a33d4c4036017ea3f8edb43bd1fa4e89eb8

On the latest master, running examples/serve/http_server/task.yaml with both the controller and the replica on k8s fails to detect the port on the k8s replica. get_endpoints seems to be the cause.

Controller log:

W 07-30 02:27:51 backend_utils.py:2809] Port 8081 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your LoadBalancer is configured correctly. 
W 07-30 02:27:51 backend_utils.py:2809] To debug, run: kubectl describe service
I 07-30 02:28:03 controller.py:97] Received 0 inflight requests.
I 07-30 02:28:03 autoscalers.py:236] Num of requests in the last 60 seconds: 0
I 07-30 02:28:03 httptools_impl.py:466] 127.0.0.1:52910 - "POST /controller/load_balancer_sync HTTP/1.1" 200
I 07-30 02:28:23 controller.py:97] Received 0 inflight requests.
I 07-30 02:28:23 autoscalers.py:236] Num of requests in the last 60 seconds: 0
I 07-30 02:28:23 httptools_impl.py:466] 127.0.0.1:36406 - "POST /controller/load_balancer_sync HTTP/1.1" 200
I 07-30 02:28:43 controller.py:97] Received 0 inflight requests.
I 07-30 02:28:43 autoscalers.py:236] Num of requests in the last 60 seconds: 0
I 07-30 02:28:43 httptools_impl.py:466] 127.0.0.1:44694 - "POST /controller/load_balancer_sync HTTP/1.1" 200
W 07-30 02:28:43 backend_utils.py:2809] Port 8081 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your LoadBalancer is configured correctly. 
W 07-30 02:28:43 backend_utils.py:2809] To debug, run: kubectl describe service
I 07-30 02:28:43 controller.py:60] All replica info: [ReplicaInfo(replica_id=1, cluster_name=k8s-svc-1, version=1, replica_port=8081, is_spot=False, status=ReplicaStatus.STARTING, launched_at=1722306132)]
I 07-30 02:28:43 replica_managers.py:265] Check replica unrecorverable: first_ready_time None, user_app_failed False, status ReplicaStatus.STARTING
I 07-30 02:28:43 autoscalers.py:461] No scaling needed.
W 07-30 02:28:51 backend_utils.py:2809] Port 8081 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your LoadBalancer is configured correctly. 
W 07-30 02:28:51 backend_utils.py:2809] To debug, run: kubectl describe service
I 07-30 02:28:51 replica_managers.py:509] Error when probing replica 1 with url None: Cannot get the endpoint.
I 07-30 02:28:51 replica_managers.py:1105] Replica 1 is not ready and exceeding initial delay seconds. Terminating the replica...
I 07-30 02:28:51 replica_managers.py:749] Terminating replica 1...
I 07-30 02:28:51 replica_managers.py:720] Syncing down logs for replica 1...
I 07-30 02:28:52 cloud_vm_ray_backend.py:3588] Job 1 logs: /home/sky/sky_logs/replica_jobs/sky-2024-07-30-02-21-40-731124
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.244.0.12']
(k8s-svc, pid=1119) serving at port 8081

I 07-30 02:28:52 replica_managers.py:736] 
I 07-30 02:28:52 replica_managers.py:736] == End of logs (Replica: 1) ==
I 07-30 02:28:52 replica_managers.py:757] preempted: False, replica_id: 1
I 07-30 02:29:02 replica_managers.py:1049] Replicas to probe: 
I 07-30 02:29:03 controller.py:97] Received 0 inflight requests.
I 07-30 02:29:03 autoscalers.py:236] Num of requests in the last 60 seconds: 0
I 07-30 02:29:03 httptools_impl.py:466] 127.0.0.1:44154 - "POST /controller/load_balancer_sync HTTP/1.1" 200
I 07-30 02:29:03 controller.py:60] All replica info: [ReplicaInfo(replica_id=1, cluster_name=k8s-svc-1, version=1, replica_port=8081, is_spot=False, status=ReplicaStatus.SHUTTING_DOWN, launched_at=None)]
@Michaelvll
Collaborator

> W 07-30 02:28:51 backend_utils.py:2809] Port 8081 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your LoadBalancer is configured correctly.

It takes a while for the port to become ready on a newly created k8s pod. We added a fix for this in #3634, but it seems there is still an issue. Could you help check why this happens?
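For illustration, waiting for a port that is not exposed yet usually amounts to polling with a deadline. The sketch below is a generic example of that pattern, not the change made in #3634; `wait_for_endpoint` and all of its parameters are hypothetical names, and `get_endpoint` stands in for whatever call returns the endpoint (or None while the port is not yet exposed).

```python
import time
from typing import Callable, Optional


def wait_for_endpoint(get_endpoint: Callable[[], Optional[str]],
                      timeout_seconds: float = 60.0,
                      interval_seconds: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic
                      ) -> Optional[str]:
    """Poll get_endpoint() until it returns a value or the timeout expires.

    Hypothetical helper for illustration only. Returns the endpoint string,
    or None if nothing was exposed before the deadline.
    """
    deadline = clock() + timeout_seconds
    while True:
        endpoint = get_endpoint()
        if endpoint is not None:
            return endpoint
        if clock() >= deadline:
            # Give up; the caller decides whether to fail or retry later.
            return None
        sleep(interval_seconds)
```

Injecting `sleep` and `clock` keeps the helper testable without real delays. Note that retrying only helps when the endpoint eventually becomes reachable from the caller's network location.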

@romilbhardwaj
Collaborator

Hmm, I'm not able to replicate this on GKE. Where was your Kubernetes cluster running?

(base) ➜  sky-experiments git:(master) ✗ sky serve status --endpoint http
35.225.92.178:30001
(base) ➜  sky-experiments git:(master) ✗ sky status
Clusters
NAME                           LAUNCHED  RESOURCES                                                                  STATUS  AUTOSTOP  COMMAND
test                           1 hr ago  1x Kubernetes(2CPU--2GB)                                                   UP      -         sky launch -c test --cloud...
sky-serve-controller-2ea485ea  1 hr ago  1x Kubernetes(5CPU--5GB, cpus=5+, mem=5+, disk_size=200, ports=['30001...  UP      -         sky serve up -n http http...

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
NAME  VERSION  UPTIME   STATUS  REPLICAS  ENDPOINT
http  1        57m 28s  READY   3/3       35.225.92.178:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                     LAUNCHED     RESOURCES              STATUS  REGION
http          1   1        http://104.197.226.107:8081  57 mins ago  1x Kubernetes(vCPU=1)  READY   kubernetes
http          2   1        http://34.72.10.83:8081      57 mins ago  1x Kubernetes(vCPU=1)  READY   kubernetes
http          3   1        http://34.16.83.15:8081      57 mins ago  1x Kubernetes(vCPU=1)  READY   kubernetes

* To see detailed service status: sky serve status -a

@cblmemo
Collaborator Author

cblmemo commented Aug 7, 2024

> Hmm, I'm not able to replicate this on GKE. Where was your Kubernetes cluster running?

On a local kind cluster created by sky local up.

@romilbhardwaj
Collaborator

Ah, I think I know what's going on. With sky local up, we force ingress port mode. However, fetching the ingress endpoint on a local kind cluster only works from the host machine, because the lookup returns localhost when no external IP is available:

if ingress_service.status.load_balancer.ingress is None:
    # We try to get an IP/host for the service in the following order:
    # 1. Try to use assigned external IP if it exists
    # 2. Use the skypilot.co/external-ip annotation in the service
    # 3. Otherwise return 'localhost'

This won't work when the endpoint is fetched from within the cluster (e.g., from the controller) since localhost will not point to the ingress service. Hence the readiness probe fails:

I 08-07 16:09:29 replica_managers.py:513] Probing replica 3 with url http://localhost:30100/skypilot/default/http-3-2ea4/8081 with http://localhost:30100/skypilot/default/http-3-2ea4/8081/health.
E 08-07 16:09:29 replica_managers.py:539] Error when probing replica 3 with url http://localhost:30100/skypilot/default/http-3-2ea4/8081: requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=30100): Max retries exceeded with url: /skypilot/default/http-3-2ea4/8081/health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x409111fca0>: Failed to establish a new connection: [Errno 111] Connection refused')).
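To make the failure mode concrete, the fallback order described in the quoted comment can be sketched as a small pure function. This is an illustrative sketch, not the SkyPilot implementation: the function name `resolve_ingress_host` and its parameters are hypothetical. Only the annotation key and the final 'localhost' fallback are taken from the comment above.

```python
from typing import Dict, List, Optional

# Annotation key taken from the comment quoted above.
EXTERNAL_IP_ANNOTATION = 'skypilot.co/external-ip'


def resolve_ingress_host(external_ips: Optional[List[str]],
                         annotations: Optional[Dict[str, str]]) -> str:
    """Pick a host for the ingress service (hypothetical helper).

    Mirrors the documented order: assigned external IP, then the
    skypilot.co/external-ip annotation, then 'localhost'.
    """
    if external_ips:
        return external_ips[0]
    if annotations and annotations.get(EXTERNAL_IP_ANNOTATION):
        return annotations[EXTERNAL_IP_ANNOTATION]
    # Only meaningful when the caller runs on the machine hosting the
    # kind cluster; inside a pod, 'localhost' is the pod itself.
    return 'localhost'
```

On a kind cluster with no external IP and no annotation, `resolve_ingress_host(None, None)` returns 'localhost', which is the host the controller then tries to probe from inside its own pod.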

We probably need to special-case ingress endpoint fetching: when running inside the controller in ingress mode, use the ingress controller service directly.
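One possible shape for that special case, as a rough sketch under stated assumptions rather than a proposed patch: detect that the code runs inside a pod (Kubernetes sets the KUBERNETES_SERVICE_HOST environment variable in pods) and build the base URL from the ingress controller's in-cluster DNS name instead of the host-facing endpoint. The function names are hypothetical, and the service name, namespace, and port passed in the usage note are assumptions based on a default ingress-nginx install, not values confirmed in this thread.

```python
import os
from typing import Mapping, Optional


def running_inside_cluster(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when running in a Kubernetes pod (hypothetical helper)."""
    env = os.environ if env is None else env
    return 'KUBERNETES_SERVICE_HOST' in env


def ingress_base_url(host_endpoint: str,
                     service_name: str,
                     namespace: str,
                     service_port: int = 80,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Choose the base URL for reaching the ingress (hypothetical helper).

    Inside the cluster, use the service's cluster-local DNS name; outside,
    keep the host-facing endpoint such as 'localhost:30100'.
    """
    if running_inside_cluster(env):
        return (f'http://{service_name}.{namespace}'
                f'.svc.cluster.local:{service_port}')
    return f'http://{host_endpoint}'
```

For example, `ingress_base_url('localhost:30100', 'ingress-nginx-controller', 'ingress-nginx')` would keep the localhost URL on the host machine and switch to the cluster-local name when called from the controller pod. The per-replica path (for example /skypilot/default/http-3-2ea4/8081 in the log above) would be appended unchanged.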

Contributor

github-actions bot commented Dec 6, 2024

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Dec 6, 2024
@cblmemo cblmemo removed the Stale label Dec 6, 2024
@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024
@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024