[Spot] Fix spot pending status #2044
Conversation
Took a pass. Left some questions.
sky/backends/cloud_vm_ray_backend.py
Outdated
```diff
@@ -1821,7 +1810,7 @@ def _ensure_cluster_ray_started(self, handle: 'CloudVmRayResourceHandle',
                 # At this state, an erroneous cluster may not have cached
                 # handle.head_ip (global_user_state.add_or_update_cluster(...,
                 # ready=True)).
-                use_cached_head_ip=False)
+                use_cached_head_ip=None)
```
There are a couple of places in this PR where previously we always queried the cloud for IPs, but with this PR we always try to use cached IPs first. Why is that?
Previously, the function argument did not offer a way to "use the cached IP first and fall back to querying". We had to be conservative and always query the IPs, since the cached IP may not exist.
Now, we set `use_cached_head_ip=None` to use the cached IP first and then fall back to querying, which reduces the overhead of querying the IP address multiple times when we already have the IP cached.
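The tri-state flag described above can be sketched as follows. This is a minimal illustration, not SkyPilot's actual implementation; the class and method names here are hypothetical stand-ins for `CloudVmRayResourceHandle` and its IP lookup:

```python
from typing import List, Optional


class ClusterHandle:
    """Hypothetical stand-in for a cluster handle that caches head IPs."""

    def __init__(self) -> None:
        self.cached_external_ips: Optional[List[str]] = None

    def _query_cloud_for_ips(self) -> List[str]:
        # Placeholder for a slow, networked cloud-provider query.
        return ['203.0.113.10']

    def external_ips(self, use_cached_ips: Optional[bool] = None) -> List[str]:
        # Tri-state flag mirroring the PR's use_cached_head_ip semantics:
        #   True  -> must use the cache (error if it is empty)
        #   False -> always query the cloud
        #   None  -> use the cache if populated, else fall back to querying
        if use_cached_ips is True:
            if self.cached_external_ips is None:
                raise ValueError('No cached IPs available.')
            return self.cached_external_ips
        if use_cached_ips is None and self.cached_external_ips is not None:
            return self.cached_external_ips
        # Query the cloud and refresh the cache.
        self.cached_external_ips = self._query_cloud_for_ips()
        return self.cached_external_ips
```

With `None`, repeated calls after the first one hit the cache instead of re-querying the cloud.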
Looks like all callers now use None for this arg. Ok to eliminate the arg?
Good point! Just removed the argument. Testing:
- `pytest tests/test_smoke.py`
sky/backends/cloud_vm_ray_backend.py
Outdated
"""Runs 'cmd' on the cluster's head node.""" | ||
"""Runs 'cmd' on the cluster's head node. | ||
|
||
use_cached_head_ip: If True, use the cached head IP address. If False, |
Can we add other Args too? (I know, code gardening...)
Added. The other arguments are listed in the docstrings of `SSHCommandRunner.run` and `log_lib.run`. The documentation here is a bit duplicated; we could also change it to a sentence referring to the docstrings of those two functions. Wdyt?
LGTM, thanks @Michaelvll.
We may want to add this somewhere in the code:
`ray job submit` to queue the spot controller job (returns immediately) -> set the underlying spot job to pending in spot_state -> <how does our expanded job queue work?> -> spot controller job starts running, codegen=[...; invoke the spot job; ...]
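The flow above can be sketched in a few lines. This is a hedged illustration only: `launch_spot_job`, `queue_controller_process`, and the `spot_state` dict are hypothetical stand-ins for the real controller submission path and the `spot_state` table:

```python
import enum


class SpotStatus(enum.Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'


# Hypothetical in-memory stand-in for the spot_state table.
spot_state = {}


def queue_controller_process(spot_job_id: int) -> None:
    # Placeholder for `ray job submit` on the controller VM. The real call
    # returns immediately; the expanded job queue may delay actual execution.
    pass


def launch_spot_job(spot_job_id: int) -> None:
    # 1. Queue the spot controller job (`ray job submit`, returns at once).
    queue_controller_process(spot_job_id)
    # 2. Immediately record the underlying spot job as PENDING in spot_state,
    #    so `sky spot queue` can show it before the controller starts running.
    spot_state[spot_job_id] = SpotStatus.PENDING
```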
```python
# controller process jobs running on the controller VM with 8
# CPU cores.
# The spot job should be set to PENDING state after the
# controller process job has been queued, as our skylet on spot
```
How does our own job queue work? Does it automatically detect that there are too many `ray job submit`s? Or does it rely on some other API calls / tables?
It relies on another `pending` table. The `ray job submit` command is stored in the `pending` table. When the first job in the pending table gets the required resources, our scheduler runs its `ray job submit` command, i.e. the `ray job submit` is delayed until the prior job has been scheduled.
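A minimal sketch of such a pending table, using an in-memory SQLite database. The schema and function names are assumptions for illustration, not SkyPilot's actual tables:

```python
import sqlite3

# Hypothetical pending table: each row stores a deferred `ray job submit`
# command keyed by job id.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE pending (job_id INTEGER PRIMARY KEY, submit_cmd TEXT)')


def enqueue(job_id: int, submit_cmd: str) -> None:
    # Instead of running `ray job submit` immediately, record it as pending.
    conn.execute('INSERT INTO pending VALUES (?, ?)', (job_id, submit_cmd))


def schedule_next(resources_available: bool):
    # Pop the earliest pending job, but only when resources are free;
    # a real scheduler would then exec the stored submit command.
    if not resources_available:
        return None
    row = conn.execute(
        'SELECT job_id, submit_cmd FROM pending ORDER BY job_id LIMIT 1'
    ).fetchone()
    if row is None:
        return None
    conn.execute('DELETE FROM pending WHERE job_id = ?', (row[0],))
    return row
```

Later submissions stay in the table (and the jobs stay PENDING) until the scheduler dequeues them in order.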
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
This PR fixes a problem caused by #1636, where a spot job does not show up in `sky spot queue` while it is in PENDING status. This was found when running the test in #2041.

Before this PR: only one pending spot job shows up in `sky spot queue`; other pending jobs only appear once they start running.

After this PR: all pending spot jobs also appear in `sky spot queue` with PENDING status.

Tested (run the relevant ones):
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `bash tests/backward_comaptibility_tests.sh`