
[Spot] Fix spot pending status #2044

Merged: 16 commits merged into master from fix-spot-pending on Jun 8, 2023

Conversation

@Michaelvll (Collaborator) commented Jun 7, 2023

This PR fixes a problem caused by #1636, where a spot job does not show up in the sky spot queue while it is in PENDING status. This was found when running the test in #2041.

Before this PR:
Only one pending spot job shows up in the sky spot queue; other, later pending jobs only appear once they start running.

After this PR:
All pending spot jobs appear in the sky spot queue with PENDING status.

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • Submit more than 16 spot jobs; all of them should show the correct status (see the reproduction sketch after this list).
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
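
For the manual test above, here is a minimal reproduction sketch. It is not part of the PR; the CLI flags (-y, --detach-run, -n) and the sleep workload are assumptions about the SkyPilot CLI of that era, so adjust them to your installed version.

import subprocess

NUM_JOBS = 20  # more than the ~16 controller processes that fit on the 8-CPU controller VM

for i in range(NUM_JOBS):
    # Each launch queues one spot controller job; later ones should sit in
    # PENDING until the earlier ones are scheduled.
    subprocess.run(
        ['sky', 'spot', 'launch', '-y', '--detach-run',
         '-n', f'pending-test-{i}', 'sleep 1000'],
        check=True)

# Before this PR only one not-yet-started job showed up here; after it, all
# jobs should be listed, the later ones with PENDING status.
subprocess.run(['sky', 'spot', 'queue'], check=True)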

@concretevitamin (Member) left a comment

Took a pass. Left some questions.

Review threads on sky/spot/spot_utils.py, sky/backends/cloud_vm_ray_backend.py, and sky/backends/backend_utils.py (resolved).
@@ -1821,7 +1810,7 @@ def _ensure_cluster_ray_started(self, handle: 'CloudVmRayResourceHandle',
     # At this state, an erroneous cluster may not have cached
     # handle.head_ip (global_user_state.add_or_update_cluster(...,
     # ready=True)).
-    use_cached_head_ip=False)
+    use_cached_head_ip=None)
@concretevitamin (Member):

There are a couple of places in this PR where we previously always queried the cloud for IPs, but now we try to use cached IPs first. Why is that?

@Michaelvll (Collaborator, Author):

Previously, the function argument had no way to express "use the cached IP first and fall back to querying". We had to be conservative and always query the IPs, since the cached IP may not exist.

Now, use_cached_head_ip=None means "use the cached IP first and fall back to querying", which avoids the overhead of querying the IP address multiple times when we already have it cached.
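
As a rough illustration of the three policies (this is not the actual SkyPilot code; the helper name and the stub cloud query below are made up):

from typing import Optional


def _resolve_head_ip(cached_head_ip: Optional[str],
                     use_cached_head_ip: Optional[bool]) -> str:
    """Illustrative resolution logic for the three values of use_cached_head_ip."""

    def query_head_ip_from_cloud() -> str:
        # Stand-in for the (slow) query to the cloud provider.
        return '203.0.113.10'

    if use_cached_head_ip is True:
        # Trust the cache unconditionally; error out if it is missing.
        if cached_head_ip is None:
            raise ValueError('Cached head IP requested but not available.')
        return cached_head_ip
    if use_cached_head_ip is False:
        # Conservative behavior: always re-query the cloud.
        return query_head_ip_from_cloud()
    # None: prefer the cached IP and fall back to querying only when it is
    # missing, so the IP is not queried repeatedly once it is cached.
    if cached_head_ip is not None:
        return cached_head_ip
    return query_head_ip_from_cloud()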

@concretevitamin (Member):

Looks like all callers now use None for this arg. Ok to eliminate the arg?

@Michaelvll (Collaborator, Author) commented Jun 7, 2023:

Good point! Just removed the argument. Testing:

  • pytest tests/test_smoke.py

"""Runs 'cmd' on the cluster's head node."""
"""Runs 'cmd' on the cluster's head node.

use_cached_head_ip: If True, use the cached head IP address. If False,
@concretevitamin (Member):

Can we add other Args too? (I know, code gardening...)

@Michaelvll (Collaborator, Author):

Added. The other arguments are listed in the docstrings of SSHCommandRunner.run and log_lib.run, so the documentation here is a bit duplicated. We could also change this to a sentence referring to the docstrings of those two functions. Wdyt?

Additional review threads on sky/backends/cloud_vm_ray_backend.py (resolved).
@concretevitamin (Member) left a comment

LGTM, thanks @Michaelvll.

We may want to add this flow somewhere in the code:

ray job submit to queue spot controller job (immediately returns) -> set underlying spot job to pending in spot_state; -> <how does our expanded job queue work?> -> Spot controller job starts running, codegen=[...; invoke the spot job; ...]
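
A self-contained toy sketch of that ordering (all names below are placeholders, not SkyPilot's actual APIs):

from enum import Enum


class SpotStatus(Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'


SPOT_STATE = {}  # toy stand-in for the spot_state table on the controller VM


def ray_job_submit(job_name: str) -> None:
    """Toy stand-in for `ray job submit`: queues the spot controller job and
    returns immediately, even if the controller VM has no free CPUs yet."""
    print(f'Queued controller job for {job_name}')


def launch_spot_job(job_name: str) -> None:
    # 1. Queue the spot controller job (returns immediately).
    ray_job_submit(job_name)
    # 2. Mark the underlying spot job PENDING so it already shows up in
    #    `sky spot queue` before the controller job gets scheduled.
    SPOT_STATE[job_name] = SpotStatus.PENDING
    # 3. Once the job queue schedules the controller job, its codegen invokes
    #    the actual spot job and the status later moves to RUNNING.


launch_spot_job('demo-job')
print(SPOT_STATE)  # the job is already recorded as PENDING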

Further review threads on sky/backends/backend_utils.py and sky/backends/cloud_vm_ray_backend.py (resolved).

# controller process jobs running on the controller VM with 8
# CPU cores.
# The spot job should be set to PENDING state after the
# controller process job has been queued, as our skylet on spot
@concretevitamin (Member):

How does our own job queue work? Does it automatically detect that there are too many ray job submit calls, or does it rely on some other API calls / tables?

@Michaelvll (Collaborator, Author):

It relies on a separate pending table. The ray job submit command is stored in the pending table; when the most recently scheduled job has acquired its required resources, our scheduler runs the ray job submit command for the first job in the pending table. In other words, each ray job submit is delayed until the prior job has been scheduled.
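
A simplified sketch of that pending-table mechanism (the table layout, function names, and example submit commands are illustrative, not the actual skylet scheduler):

from collections import deque

# Toy pending table: each entry holds a job id and its deferred
# `ray job submit` command.
pending_table = deque()


def queue_job(job_id: int, submit_cmd: str) -> None:
    """Record the submit command instead of running it right away."""
    pending_table.append((job_id, submit_cmd))


def schedule_next() -> None:
    """Called once the previously scheduled job has acquired its resources:
    pop the first pending job and actually run its submit command."""
    if pending_table:
        job_id, submit_cmd = pending_table.popleft()
        print(f'Scheduling job {job_id}: {submit_cmd}')
        # The real scheduler would shell out to run submit_cmd here.


queue_job(1, 'ray job submit ... -- <controller command for job 1>')
queue_job(2, 'ray job submit ... -- <controller command for job 2>')
schedule_next()  # job 1's submit command actually runs now
schedule_next()  # job 2 runs only after job 1 has been scheduled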

@Michaelvll Michaelvll merged commit 8b7139e into master Jun 8, 2023
@Michaelvll Michaelvll deleted the fix-spot-pending branch June 8, 2023 01:14