[SkyServe] Fix sky serve up with docker login config #2983

Merged
cblmemo merged 3 commits into master from fix-serve-up-with-docker-login-config
Feb 1, 2024

Conversation

cblmemo
Collaborator

@cblmemo cblmemo commented Jan 14, 2024

Previously, running sky serve up on a YAML with a docker login config threw an error saying that DockerLoginConfig is not JSON serializable, because we accidentally added the docker login config to the output of resources' to_yaml_config function. This PR addresses the problem and fixes #2982.

# docker_login_service.yaml
service:
  readiness_probe: /
  replicas: 1

resources:
  cloud: aws
  cpus: 2
  image_id: docker:txia-test-ecr:latest
  ports: 5000

envs:
  SKYPILOT_DOCKER_USERNAME: AWS
  SKYPILOT_DOCKER_PASSWORD: <password here>
  SKYPILOT_DOCKER_SERVER: <uid>.dkr.ecr.us-east-1.amazonaws.com

run: |
  echo $SKYPILOT_DOCKER_USERNAME
  python -m http.server 5000
$ sky serve up docker_login_service.yaml
Service from YAML spec: docker_login_service.yaml
Service Spec:
Readiness probe method:           GET /v1/models
Readiness initial delay seconds:  1200
Replica autoscaling policy:       Fixed 1 replica        
Each replica will use the following resources (estimated):
I 01-13 22:07:14 optimizer.py:694] == Optimizer ==
I 01-13 22:07:14 optimizer.py:706] Target: minimizing cost
I 01-13 22:07:14 optimizer.py:717] Estimated cost: $0.7 / hour
I 01-13 22:07:14 optimizer.py:717] 
TypeError: Object of type DockerLoginConfig is not JSON serializable
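
For readers unfamiliar with this failure mode, here is a minimal sketch of how such a TypeError can arise and go away. The DockerLoginConfig dataclass, the to_yaml_config function, and the include_login flag below are simplified, hypothetical stand-ins written for illustration; they are not SkyPilot's actual code, and the real fix in this PR may look different.

```python
import dataclasses
import json


@dataclasses.dataclass
class DockerLoginConfig:
    """Hypothetical stand-in for a docker login config object."""
    username: str
    password: str
    server: str


def to_yaml_config(image_id, docker_login=None, include_login=False):
    """Hypothetical, simplified resources-to-config dump."""
    config = {'image_id': image_id}
    if include_login and docker_login is not None:
        # Bug pattern: a non-primitive object leaks into the config dict.
        config['docker_login_config'] = docker_login
    return config


login = DockerLoginConfig(
    username='AWS',
    password='<password here>',
    server='<uid>.dkr.ecr.us-east-1.amazonaws.com',
)

# With the object in the config, json.dumps raises TypeError.
try:
    json.dumps(to_yaml_config('docker:txia-test-ecr:latest', login,
                              include_login=True))
    error = None
except TypeError as e:
    error = str(e)

# Leaving the object out keeps the config JSON serializable.
serialized = json.dumps(to_yaml_config('docker:txia-test-ecr:latest', login))
print(error)
print(serialized)
```

The sketch only shows the general pattern: anything that ends up in a dict later passed to json.dumps must be built from plain types (dict, list, str, numbers, bool, None).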

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky serve up docker_login_service.yaml and no error is thrown; also the service is working (one replica turns to READY).
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@cblmemo cblmemo requested a review from Michaelvll January 14, 2024 06:59
@benbot

benbot commented Jan 16, 2024

I must be missing something. I'm trying to run this branch locally, but sky serve isn't a valid command

@Michaelvll
Collaborator

I must be missing something. I'm trying to run this branch locally, but sky serve isn't a valid command

Hey @benbot, could you share the version of skypilot you are using locally with sky --version? It would be nice to try pip uninstall skypilot; pip install -U skypilot-nightly. : )

@Michaelvll
Collaborator

Thanks for submitting the PR @cblmemo! Could we also test GCP private docker repo?

@benbot

benbot commented Jan 16, 2024

@Michaelvll The skypilot-nightly from pip works fine. I'm trying to use this branch, so I can test out deploying a docker image from a private registry.

I'm seeing version skypilot, version 1.0.0.dev0 FWIW

@Michaelvll
Collaborator

@Michaelvll The skypilot-nightly from pip works fine. I'm trying to use this branch, so I can test out deploying a docker image from a private registry.

I'm seeing version skypilot, version 1.0.0.dev0 FWIW

Oops, sorry, I misunderstood. If you would like to try this branch, you could run pip uninstall skypilot skypilot-nightly; pip install -e . to install from the source locally. : )

@benbot

benbot commented Jan 18, 2024

Not sure if this is related, but on this branch I'm not able to run my docker images.

I'm being met with this error:

I 01-18 01:55:46 replica_managers.py:111] sky.exceptions.CommandError: Command docker exec sky_container /bin/bash -c 'bash --login -c -i '"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (sudo apt-get update; sudo apt-get install -y rsync curl wget patch openssh-server python3-pip;)'"'"''  failed with return code 127.
I 01-18 01:55:46 replica_managers.py:111] Failed to run docker setup commands

@cblmemo
Collaborator Author

cblmemo commented Jan 18, 2024

Not sure if this is related, but on this branch i'm not able to run my docker images.

I'm being met with this error:

I 01-18 01:55:46 replica_managers.py:111] sky.exceptions.CommandError: Command docker exec sky_container /bin/bash -c 'bash --login -c -i '"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (sudo apt-get update; sudo apt-get install -y rsync curl wget patch openssh-server python3-pip;)'"'"''  failed with return code 127.
I 01-18 01:55:46 replica_managers.py:111] Failed to run docker setup commands

Humm, could you share the whole log so we could identify the problem?

@benbot

benbot commented Jan 18, 2024

@cblmemo Here are the full logs from the failure before retrying.

Nothing jumps out at me as particularly useful :(

https://pastebin.com/guu4FV0D

https://pastebin.com/untZJcK4 <--- the same logs, but scrolled up more

@cblmemo
Collaborator Author

cblmemo commented Jan 19, 2024

https://pastebin.com/guu4FV0D

Humm, could you share the whole launch log with me? This seems strange and, IIUC, it is missing some lines... BTW, just want to make sure, are you at the latest commit of this branch?

@benbot

benbot commented Jan 19, 2024

Yeah it does seem like it's missing lines :/

The actual error is the 127 return code from the setup command, which is some kind of "Not Found" error.

I was on the latest branch last I tried, but I'll run it again today and send you the full logs.

It's mostly retrying the ssh connection until it connects then this failure error shows up moments later
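
As a quick illustration of the 127 return code mentioned above: POSIX shells exit with status 127 when a command cannot be found. The sketch below assumes a POSIX sh is available on the machine; the command name is deliberately made up.

```python
import subprocess

# 'definitely-not-a-real-command-xyz' is a made-up name that should not
# exist on any system; sh reports "not found" and exits with status 127.
result = subprocess.run(
    ['sh', '-c', 'definitely-not-a-real-command-xyz'],
    capture_output=True,
    text=True,
)
returncode = result.returncode
print(returncode)
print(result.stderr.strip())
```

So a 127 from a setup command usually means one of the programs it invokes is missing from the environment where it runs.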

@benbot

benbot commented Jan 22, 2024

@cblmemo Here is another attempt, different container image, same private registry, same issue.

https://pastebin.com/vwqfG55T

There's nothing else in the log, really. Above this log is the provision config which contains secrets, so I don't want to paste it here. Before that it's just skypilot attempting to find a region where I haven't hit quota yet

@benbot

benbot commented Jan 22, 2024

Oh weirdly enough, when it retries it has a different error log, but ends with the same 127 error.

https://pastebin.com/BAn7kuiK

@cblmemo
Collaborator Author

cblmemo commented Jan 23, 2024

Oh weirdly enough, when it retries it has a different error log, but ends with the same 127 error.

https://pastebin.com/BAn7kuiK

Oh I might know the problem... From the bash: apt-get: command not found output, IIUC this image is not Debian-based, and our system only supports Debian-based containers for now. Check this documentation for more details

@benbot

benbot commented Jan 23, 2024

No this is a rockylinux container.

Is there any way / are there any plans to use non Debian base images?

Also, I may be blind, but I don't see where on that doc page it says that only Debian images are supported

@cblmemo
Collaborator Author

cblmemo commented Jan 24, 2024

No this is a rockylinux container.

Is there any way / are there any plans to use non Debian base images?

Also, I may be blind, but I don't see where on that doc page it says that only Debian images are supported

The challenge in supporting non-Debian-based images is inferring the package manager in the container. We will try to resolve this 🫡 Here is a related issue: #2673

For the documentation, I attached a screenshot for your reference. Though arguably it is hard to find... I just raised a PR to highlight it. #3021

[image: screenshot of the relevant documentation section]
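
To make the "infer the package manager" idea concrete, here is one possible sketch: probe for well-known package manager binaries in order and use the first one found. The detect_package_manager helper and the candidate list are hypothetical and only for illustration; this is not how SkyPilot does (or will do) it, and inside a container the same probe would have to run in the container rather than on the host.

```python
import shutil
from typing import Callable, Optional

# Illustrative candidate list, checked in order; not exhaustive.
_CANDIDATES = ('apt-get', 'dnf', 'yum', 'apk', 'zypper')


def detect_package_manager(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first package manager found on PATH, or None.

    `which` is injectable so the lookup can be redirected, for example
    to a probe that runs inside a container.
    """
    for name in _CANDIDATES:
        if which(name) is not None:
            return name
    return None


# On the local machine this prints whichever manager is installed, if any.
print(detect_package_manager())
```

Picking the binary is only part of the problem: package names (for example openssh-server) also differ across distributions, so a real solution would need a per-manager package mapping as well.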

@dbuades

dbuades commented Jan 24, 2024

I confirm that this PR works properly when loading a private image from GHCR. Thanks!

@cblmemo
Collaborator Author

cblmemo commented Jan 25, 2024

@Michaelvll this is ready for another look 👀

@benbot

benbot commented Jan 30, 2024

Sorry again if this is unrelated, but I'm still seeing some errors with an ubuntu container.

Every serve up with a docker image is saying that the SQLite db is locked.

I 01-30 09:40:26 provisioner.py:73] Launching on GCP us-central1 (us-central1-a)
I 01-30 09:40:41 provisioner.py:372] Successfully provisioned or found existing instance.
I 01-30 09:40:58 provisioner.py:474] Successfully provisioned cluster: sky-serve-controller-e8e0e970
I 01-30 09:40:58 cloud_vm_ray_backend.py:4417] Processing file mounts.
I 01-30 09:40:58 cloud_vm_ray_backend.py:4449] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-01-30-09-40-16-390132/file_mounts.log
I 01-30 09:40:58 backend_utils.py:1287] Syncing (to 1 node): /tmp/service-task-gm-familiar-nogpu-test-s3j0c35v -> ~/.sky/serve/gm_familiar_nogpu_test/task.yaml.tmp
I 01-30 09:41:01 backend_utils.py:1287] Syncing (to 1 node): /tmp/tmpirt8r8eq -> ~/.sky/serve/gm_familiar_nogpu_test/config.yaml
I 01-30 09:41:04 cloud_vm_ray_backend.py:3168] Running setup on 1 node.
Warning: Permanently added '34.67.153.49' (ED25519) to the list of known hosts.
I 01-30 09:41:13 cloud_vm_ray_backend.py:3178] Setup completed.
I 01-30 09:41:24 cloud_vm_ray_backend.py:3275] Job submitted with Job ID: 8

E 01-30 09:41:42 subprocess_utils.py:73] Traceback (most recent call last):
E 01-30 09:41:42 subprocess_utils.py:73]   File "<string>", line 1, in <module>
E 01-30 09:41:42 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 420, in wait_service_initialization
E 01-30 09:41:42 subprocess_utils.py:73]     record = serve_state.get_service_from_name(service_name)
E 01-30 09:41:42 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/serve_state.py", line 299, in get_service_from_name
E 01-30 09:41:42 subprocess_utils.py:73]     rows = _DB.cursor.execute('SELECT * FROM services WHERE name=(?)',
E 01-30 09:41:42 subprocess_utils.py:73] sqlite3.OperationalError: database is locked
E 01-30 09:41:42 subprocess_utils.py:73] 
E 01-30 09:41:55 subprocess_utils.py:73] Traceback (most recent call last):
E 01-30 09:41:55 subprocess_utils.py:73]   File "<string>", line 1, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/__init__.py", line 41, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky import backends
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/__init__.py", line 4, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.backends.cloud_vm_ray_backend import CloudVmRayBackend
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 26, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky import cloud_stores
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/cloud_stores.py", line 16, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.clouds import gcp
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/clouds/__init__.py", line 16, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.clouds.kubernetes import Kubernetes
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/clouds/kubernetes.py", line 15, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.utils import kubernetes_utils
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/kubernetes_utils.py", line 16, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.backends import backend_utils
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 36, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky import serve as serve_lib
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/__init__.py", line 8, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.serve.core import down
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/core.py", line 12, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky import execution
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 25, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.utils import controller_utils
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/controller_utils.py", line 22, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.serve import serve_utils
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 26, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     from sky.serve import serve_state
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/serve_state.py", line 50, in <module>
E 01-30 09:41:55 subprocess_utils.py:73]     _DB = db_utils.SQLiteConn(_DB_PATH, create_table)
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/db_utils.py", line 86, in __init__
E 01-30 09:41:55 subprocess_utils.py:73]     create_table(self.cursor, self.conn)
E 01-30 09:41:55 subprocess_utils.py:73]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/serve_state.py", line 29, in create_table
E 01-30 09:41:55 subprocess_utils.py:73]     cursor.execute("""\
E 01-30 09:41:55 subprocess_utils.py:73] sqlite3.OperationalError: database is locked
E 01-30 09:41:55 subprocess_utils.py:73] 
sky.exceptions.CommandError: Command python3 -u -c 'from sky.serve import serve_state; from sky.serve import serve_utils; msg = serve_utils.wait_service_initialization('"'"'gm-familiar-nogpu-test'"'"', 8); print(msg, end="", flush=True)' failed with return code 1.
Failed to wait for service initialization

During handling of the above exception, another exception occurred:

sky.exceptions.CommandError: Command python3 -u -c 'import os;from sky.skylet import job_lib, log_lib;job_ids = [8] if [8] is not None else [job_lib.get_latest_job_id()];job_statuses = job_lib.get_statuses_payload(job_ids);print(job_statuses, flush=True)' failed with return code 1.
Failed to get job status.

I'd put this on pastebin again, but it seems to be down.

@cblmemo
Collaborator Author

cblmemo commented Jan 31, 2024

Sorry again if this is unrelated, but I'm still seeing some errors with an ubuntu container.

Every serve up with a docker image is saying that the SQLite db is locked.


Hey @benbot - Thanks for reporting this! The error message actually refers to the service controller, not the replicas. Are you reusing an existing controller that has some failure history? If so, running sky down sky-serve-controller-xxx and then sky serve up again will fix that for you.
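
For context on the sqlite3.OperationalError itself, here is a small standalone sketch of one common way "database is locked" shows up: another connection holds an exclusive lock and the waiting connection's timeout expires. This only illustrates SQLite's behavior in its default journal mode; the table name mirrors the traceback, but the file, the timeout value, and the scenario are assumptions, and it does not claim to be the root cause on the controller.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.db')

# isolation_level=None gives autocommit mode, so BEGIN/COMMIT are explicit.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute('CREATE TABLE services (name TEXT)')
writer.execute('BEGIN EXCLUSIVE')  # hold an exclusive lock on the file
writer.execute("INSERT INTO services VALUES ('demo')")

# A second connection gives up after `timeout` seconds of waiting.
reader = sqlite3.connect(path, timeout=0.1)
try:
    reader.execute('SELECT * FROM services').fetchall()
    locked = False
except sqlite3.OperationalError as e:
    locked = 'locked' in str(e)

# Once the writer commits, the lock is released and the read succeeds.
writer.execute('COMMIT')
rows = reader.execute('SELECT * FROM services').fetchall()
print(locked, rows)

writer.close()
reader.close()
```

If a stale process on a reused controller were still holding the database, tearing the controller down and starting fresh would clear it, which is consistent with the suggestion above.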

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for fixing this @cblmemo! LGTM.

@cblmemo cblmemo merged commit 25ad4a3 into master Feb 1, 2024
19 checks passed
@cblmemo cblmemo deleted the fix-serve-up-with-docker-login-config branch February 1, 2024 00:07
Successfully merging this pull request may close these issues.

Docker images not working when using serve up
4 participants