[ux] add sky jobs launch --fast #4231

Merged: 3 commits, Oct 31, 2024

@cg505 cg505 commented Oct 31, 2024

This flag will make the jobs controller launch use sky launch --fast. There
are a few known situations where this can cause misbehavior in the jobs
controller:

  • The SkyPilot wheel is outdated (due to changes in the SkyPilot code or a
    version upgrade).
  • The user's cloud credentials have changed. In this case the new credentials
    will not be synced, and if there are new clouds available in sky check, the
    cloud dependencies may not be correctly installed.

However, this does speed up jobs launch significantly, so we provide it as a
dangerous option. Soon we will add robustness checks to sky launch --fast that
will fix the above caveats, and then we can remove this flag and just enable
the behavior by default.
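
To make the two caveats above concrete, here is a minimal sketch of the kind of check that could gate the fast path. It is not the SkyPilot implementation: `SyncedState`, `fingerprint`, `stale_reasons`, and `can_use_fast` are hypothetical names, and hashing the wheel and credentials is an assumption about how staleness might be detected.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncedState:
    """Hypothetical record of what one side (local or controller) holds."""
    wheel_hash: str        # hash of the SkyPilot wheel
    credentials_hash: str  # hash of the cloud credentials


def fingerprint(data: bytes) -> str:
    """Content hash used to detect a changed wheel or credentials bundle."""
    return hashlib.sha256(data).hexdigest()


def stale_reasons(local: SyncedState, controller: SyncedState) -> list[str]:
    """List the caveats that apply; an empty list means neither applies."""
    reasons = []
    if local.wheel_hash != controller.wheel_hash:
        reasons.append("SkyPilot wheel is outdated")
    if local.credentials_hash != controller.credentials_hash:
        reasons.append("cloud credentials have changed")
    return reasons


def can_use_fast(local: SyncedState, controller: SyncedState) -> bool:
    """Take the fast path only when the controller matches local state."""
    return not stale_reasons(local, controller)
```

With a check along these lines, a launch would fall back to the full setup path whenever `stale_reasons` is non-empty, which is the situation where `--fast` is currently unsafe.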

Tested (run the relevant ones):

  • Code formatting: `bash format.sh`
  • Manual tests
  • Relevant individual smoke tests: `pytest tests/test_smoke.py::test_managed_jobs`
  • Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh`

@cg505 cg505 requested a review from romilbhardwaj October 31, 2024 18:36
@romilbhardwaj romilbhardwaj left a comment

Thanks @cg505!

sky/jobs/core.py
@@ -138,6 +143,7 @@ def launch(
idle_minutes_to_autostop=skylet_constants.
CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP,
retry_until_up=True,
fast=fast,
I tried with this script:

for i in {1..5}; do
  sky jobs launch -y --fast --cpus 2+ -- echo hi2 &
done
wait

The last job failed with FAILED_CONTROLLER. Have you seen this before? https://gist.github.com/romilbhardwaj/7d1871f1c18b3bb0ccd9141e14bd9fdd

@cg505 cg505 Oct 31, 2024

I did not see this. I was able to run `seq 100 | xargs -P 5 -n 1 bash -c 'sky jobs launch --fast -yd -n parallel-launch-$0 "echo $0"'` without any issue.
It kind of looks like the controller just died while starting the job. Not sure what would cause this.

romilbhardwaj commented Oct 31, 2024

Submitted 10 jobs in ~40s - nice!

for i in {1..10}; do
  sky jobs launch -d -y --fast --cpus 2+ -- echo hi2 &
done
wait

However, the controller runs only the first few jobs and then fails. Probably unrelated to this PR:

Managed jobs
No in-progress managed jobs.
ID  TASK  NAME     RESOURCES   SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
17  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 50s         -             0            FAILED_CONTROLLER
16  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 54s         -             0            FAILED_CONTROLLER
15  -     sky-cmd  1x[CPU:2+]  5 mins ago   5m 5s          -             0            FAILED_CONTROLLER
14  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 9s          -             0            FAILED_CONTROLLER
13  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 15s         -             0            FAILED_CONTROLLER
12  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 24s         -             0            FAILED_CONTROLLER
11  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 1s          5s            0            SUCCEEDED
10  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 2s          5s            0            SUCCEEDED
9   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 4s          6s            0            SUCCEEDED
8   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 13s         6s            0            SUCCEEDED
7   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 47s         -             0            FAILED_CONTROLLER
6   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 1s          5s            0            SUCCEEDED
5   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 2s          5s            0            SUCCEEDED
4   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 3s          5s            0            SUCCEEDED
3   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 5s          5s            0            SUCCEEDED
2   -     sky-cmd  1x[CPU:1+]  18 mins ago  58s            4s            0            SUCCEEDED
1   -     sky-cmd  1x[CPU:1+]  21 mins ago  1m 12s         4s            0            SUCCEEDED
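
For a quick tally of a table like this, a small parser is enough. The sketch below assumes the layout shown above (columns separated by two or more spaces, job rows starting with a numeric ID, STATUS as the last column); `tally_statuses` is a hypothetical helper, not part of SkyPilot.

```python
import re
from collections import Counter


def tally_statuses(queue_output: str) -> Counter:
    """Count the STATUS column of each job row in queue-style text."""
    counts = Counter()
    for line in queue_output.splitlines():
        # Columns are separated by runs of two or more spaces.
        columns = re.split(r"\s{2,}", line.strip())
        # Job rows start with a numeric ID; headers and banners do not.
        if columns[0].isdigit():
            counts[columns[-1]] += 1
    return counts


SAMPLE = """\
ID  TASK  NAME     RESOURCES   SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
12  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 24s         -             0            FAILED_CONTROLLER
11  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 1s          5s            0            SUCCEEDED
7   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 47s         -             0            FAILED_CONTROLLER
"""

if __name__ == "__main__":
    for status, count in tally_statuses(SAMPLE).items():
        print(status, count)
```

The full table above contains 7 FAILED_CONTROLLER rows and 10 SUCCEEDED rows.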

sky jobs logs --controller isn't very helpful:

(base) ➜  ~ sky jobs logs --controller 16
D 10-31 12:29:36 skypilot_config.py:228] Using config path: /Users/romilb/.sky/config.yaml
D 10-31 12:29:36 skypilot_config.py:233] Config loaded:
D 10-31 12:29:36 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
D 10-31 12:29:36 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
D 10-31 12:29:36 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
D 10-31 12:29:36 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
D 10-31 12:29:36 skypilot_config.py:245] Config syntax check passed.
D 10-31 12:29:37 backend_utils.py:1937] Refreshing status: Failed get the lock for cluster 'sky-jobs-controller-2ea485ea'. Using the cached status.
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] DAG:
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] [Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53]   resources: <Cloud>(cpus=2+)]
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:180] Submitted managed job 16 (task: 0, name: 'sky-cmd'); SKYPILOT_TASK_ID: sky-managed-2024-10-31-19-22-02-711936_sky-cmd_16-0
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:184] Started monitoring.
(sky-cmd, pid=15103) I 10-31 19:22:02 state.py:337] Launching the spot cluster...
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:146] User config: allowed_clouds -> ['aws', 'gcp']
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292] #### Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292]   resources: <Cloud>(cpus=2+) ####

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@romilbhardwaj romilbhardwaj left a comment

Discussed and tested offline with 100 jobs submitted in parallel. Discovered other bottlenecks unrelated to this PR, which we should file as issues + suggest best practices using xargs to limit parallelism.
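
As one possible shape for that best practice, here is a Python sketch of the same idea as `xargs -P 5`: submit many jobs, but keep at most a fixed number of submissions in flight. The command mirrors the one-liner used earlier in this thread; `build_command`, `run_command`, and `launch_all` are hypothetical helpers, and the limit of 5 is just the value used in that test, not a recommendation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Sequence


def build_command(index: int) -> list[str]:
    """Build one launch command, mirroring the xargs one-liner in this thread."""
    return ["sky", "jobs", "launch", "--fast", "-yd",
            "-n", f"parallel-launch-{index}", f"echo {index}"]


def run_command(command: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(command, check=False).returncode


def launch_all(indices: Iterable[int],
               max_parallel: int = 5,
               runner: Callable[[Sequence[str]], int] = run_command) -> list[int]:
    """Submit one job per index with at most `max_parallel` running at once.

    Returns the exit codes in the same order as `indices`.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda i: runner(build_command(i)), indices))
```

Calling `launch_all(range(1, 101), max_parallel=5)` corresponds to the `seq 100 | xargs -P 5 -n 1 ...` run above. The `runner` parameter exists so the submission logic can be exercised without a real `sky` installation.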

@cg505 cg505 enabled auto-merge October 31, 2024 21:04
@cg505 cg505 added this pull request to the merge queue Oct 31, 2024
Merged via the queue into skypilot-org:master with commit 599e155 Oct 31, 2024
20 checks passed
@cg505 cg505 deleted the fast-jobs-launch branch October 31, 2024 21:15
AlexCuadron pushed a commit to cblmemo/skypilot that referenced this pull request Nov 7, 2024
github-merge-queue bot pushed a commit that referenced this pull request Nov 11, 2024
…f options (#4061)
