
[Spot] Fix spot pending status #2044

Merged: 16 commits merged into master from fix-spot-pending on Jun 8, 2023

Conversation

@Michaelvll (Collaborator) commented Jun 7, 2023

This PR fixes a problem caused by #1636, where a spot job does not show up in the sky spot queue while it is in PENDING status. This was found when running the test in #2041.

Before this PR:
Only one pending spot job shows up in the sky spot queue; other, later pending jobs only appear once they start running.

After this PR:
All pending spot jobs appear in the sky spot queue with PENDING status.

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • Submit more than 16 spot jobs; all of them should show the correct status (see the reproduction sketch after this list).
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
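
For the manual test above, here is a minimal reproduction sketch. It is not part of the PR; the CLI flags (-y, --detach-run, -n) and the sleep workload are assumptions about the SkyPilot CLI of that era, so adjust them to your installed version.

import subprocess

NUM_JOBS = 20  # more than the ~16 controller processes that fit on the 8-CPU controller VM

for i in range(NUM_JOBS):
    # Each launch queues one spot controller job; later ones should sit in
    # PENDING until the earlier ones are scheduled.
    subprocess.run(
        ['sky', 'spot', 'launch', '-y', '--detach-run',
         '-n', f'pending-test-{i}', 'sleep 1000'],
        check=True)

# Before this PR only one not-yet-started job showed up here; after it, all
# jobs should be listed, the later ones with PENDING status.
subprocess.run(['sky', 'spot', 'queue'], check=True)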

@concretevitamin (Member) left a comment

Took a pass. Left some questions.

Review threads on sky/spot/spot_utils.py, sky/backends/cloud_vm_ray_backend.py, and sky/backends/backend_utils.py (resolved).
@@ -1821,7 +1810,7 @@ def _ensure_cluster_ray_started(self, handle: 'CloudVmRayResourceHandle',
     # At this state, an erroneous cluster may not have cached
     # handle.head_ip (global_user_state.add_or_update_cluster(...,
     # ready=True)).
-    use_cached_head_ip=False)
+    use_cached_head_ip=None)
@concretevitamin (Member):

There are a couple of places in this PR where we previously always queried the cloud for IPs, but now we try to use cached IPs first. Why is that?

@Michaelvll (Collaborator, Author):

Previously, the function argument had no way to express "use the cached IP first and fall back to querying". We had to be conservative and always query the IPs, since the cached IP may not exist.

Now, use_cached_head_ip=None means "use the cached IP first and fall back to querying", which avoids the overhead of querying the IP address multiple times when we already have it cached.
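
As a rough illustration of the three policies (this is not the actual SkyPilot code; the helper name and the stub cloud query below are made up):

from typing import Optional


def _resolve_head_ip(cached_head_ip: Optional[str],
                     use_cached_head_ip: Optional[bool]) -> str:
    """Illustrative resolution logic for the three values of use_cached_head_ip."""

    def query_head_ip_from_cloud() -> str:
        # Stand-in for the (slow) query to the cloud provider.
        return '203.0.113.10'

    if use_cached_head_ip is True:
        # Trust the cache unconditionally; error out if it is missing.
        if cached_head_ip is None:
            raise ValueError('Cached head IP requested but not available.')
        return cached_head_ip
    if use_cached_head_ip is False:
        # Conservative behavior: always re-query the cloud.
        return query_head_ip_from_cloud()
    # None: prefer the cached IP and fall back to querying only when it is
    # missing, so the IP is not queried repeatedly once it is cached.
    if cached_head_ip is not None:
        return cached_head_ip
    return query_head_ip_from_cloud()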

@concretevitamin (Member):

Looks like all callers now use None for this arg. Ok to eliminate the arg?

@Michaelvll (Collaborator, Author) commented Jun 7, 2023:

Good point! Just removed the argument. Testing:

  • pytest tests/test_smoke.py

"""Runs 'cmd' on the cluster's head node."""
"""Runs 'cmd' on the cluster's head node.

use_cached_head_ip: If True, use the cached head IP address. If False,
@concretevitamin (Member):

Can we add other Args too? (I know, code gardening...)

@Michaelvll (Collaborator, Author):

Added. The other arguments are listed in the docstrings of SSHCommandRunner.run and log_lib.run, so the documentation here is a bit duplicated. We could also change this to a sentence referring to the docstrings of those two functions. Wdyt?

Additional review threads on sky/backends/cloud_vm_ray_backend.py (resolved).
@concretevitamin (Member) left a comment

LGTM, thanks @Michaelvll.

We may want to add this flow somewhere in the code:

ray job submit to queue spot controller job (immediately returns) -> set underlying spot job to pending in spot_state; -> <how does our expanded job queue work?> -> Spot controller job starts running, codegen=[...; invoke the spot job; ...]
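
A self-contained toy sketch of that ordering (all names below are placeholders, not SkyPilot's actual APIs):

from enum import Enum


class SpotStatus(Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'


SPOT_STATE = {}  # toy stand-in for the spot_state table on the controller VM


def ray_job_submit(job_name: str) -> None:
    """Toy stand-in for `ray job submit`: queues the spot controller job and
    returns immediately, even if the controller VM has no free CPUs yet."""
    print(f'Queued controller job for {job_name}')


def launch_spot_job(job_name: str) -> None:
    # 1. Queue the spot controller job (returns immediately).
    ray_job_submit(job_name)
    # 2. Mark the underlying spot job PENDING so it already shows up in
    #    `sky spot queue` before the controller job gets scheduled.
    SPOT_STATE[job_name] = SpotStatus.PENDING
    # 3. Once the job queue schedules the controller job, its codegen invokes
    #    the actual spot job and the status later moves to RUNNING.


launch_spot_job('demo-job')
print(SPOT_STATE)  # the job is already recorded as PENDING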

Further review threads on sky/backends/backend_utils.py and sky/backends/cloud_vm_ray_backend.py (resolved).

# controller process jobs running on the controller VM with 8
# CPU cores.
# The spot job should be set to PENDING state after the
# controller process job has been queued, as our skylet on spot
@concretevitamin (Member):

How does our own job queue work? Does it automatically detect that there are too many ray job submit calls, or does it rely on some other API calls / tables?

@Michaelvll (Collaborator, Author):

It relies on a separate pending table. The ray job submit command is stored in the pending table; when the most recently scheduled job has acquired its required resources, our scheduler runs the ray job submit command for the first job in the pending table. In other words, each ray job submit is delayed until the prior job has been scheduled.
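
A simplified sketch of that pending-table mechanism (the table layout, function names, and example submit commands are illustrative, not the actual skylet scheduler):

from collections import deque

# Toy pending table: each entry holds a job id and its deferred
# `ray job submit` command.
pending_table = deque()


def queue_job(job_id: int, submit_cmd: str) -> None:
    """Record the submit command instead of running it right away."""
    pending_table.append((job_id, submit_cmd))


def schedule_next() -> None:
    """Called once the previously scheduled job has acquired its resources:
    pop the first pending job and actually run its submit command."""
    if pending_table:
        job_id, submit_cmd = pending_table.popleft()
        print(f'Scheduling job {job_id}: {submit_cmd}')
        # The real scheduler would shell out to run submit_cmd here.


queue_job(1, 'ray job submit ... -- <controller command for job 1>')
queue_job(2, 'ray job submit ... -- <controller command for job 2>')
schedule_next()  # job 1's submit command actually runs now
schedule_next()  # job 2 runs only after job 1 has been scheduled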

@Michaelvll Michaelvll merged commit 8b7139e into master Jun 8, 2023
@Michaelvll Michaelvll deleted the fix-spot-pending branch June 8, 2023 01:14