[core] Avoid terminating cluster for resources unavailability #2170

Merged

@Michaelvll Michaelvll commented Jul 2, 2023

Fixes #2169

Tested (run the relevant ones):

Member

Is this change for the same issue?

Collaborator Author

Oops, that change is from #2166. We can merge that PR first; otherwise, debugging is quite hard.

Collaborator Author

@Michaelvll Michaelvll Jul 3, 2023

I reverted the changes to make it easier to review.

# This is important for the case where an existing cluster is
# transitioned into INIT state due to key interruption during
# launching, with the following steps:
# (1) launch, after answering prompt immediately ctrl-c;

Member

What happens if we do these two steps for a new cluster name? I imagine with this PR, at step 2 we should not set it to STOPPED and we should do the provisioning loop as usual.

Collaborator Author

We won't set it to STOPPED for a new cluster, because a new cluster can only be in one of the following two cases:

  1. The cluster is provisioned but not yet correctly set up. The cluster will then be in INIT state, and our failover will still be triggered.
  2. The cluster is not provisioned. The cluster will then be removed from the cluster table when we refresh the status, so the failover will be correctly triggered as well.
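
The two cases above can be sketched as a small decision function. This is a minimal illustration and not SkyPilot's actual code: the `ClusterStatus` enum, the `should_failover` helper, and the convention that `None` means "record removed from the cluster table" are all assumptions made for the example.

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    # Hypothetical stand-in for the statuses discussed in this thread.
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def should_failover(refreshed_status: Optional[ClusterStatus]) -> bool:
    """Return True if the usual provisioning (failover) loop should run.

    `refreshed_status` is the status after a refresh. None is assumed to mean
    the record was removed from the cluster table because nothing exists on
    the cloud.
    """
    if refreshed_status is None:
        # Case 2: the cluster was never provisioned; its record is gone.
        return True
    if refreshed_status is ClusterStatus.INIT:
        # Case 1: provisioned, but setup did not finish.
        return True
    # Assumption: any other status belongs to a pre-existing cluster, which
    # this sketch does not fail over.
    return False
```

Under these assumptions, a brand-new cluster name always reaches one of the two `True` branches, so the provisioning loop runs as usual.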

Member

This repro gave

Running task on cluster dbg2...
I 07-03 09:45:21 cloud_vm_ray_backend.py:3788] The cluster 'dbg2' was autodowned or manually terminated on the cloud console. Using the same resources as the previously terminated one to provision a new cluster.
I 07-03 09:45:21 cloud_vm_ray_backend.py:3813] Creating a new cluster: "dbg2" [1x AWS(m6i.large)].

Maybe we should change L3788's logging to (or something more clear):

The cluster 'dbg2' (status: XXX) was not found on the cloud: it may be autodowned, manually terminated, or its launch never succeeded. Provisioning a new cluster by using the same resources as its original launch.
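
As a rough sketch, the proposed message could be built by a small helper like the one below. The function name `cluster_not_found_message` and its parameters are hypothetical; only the wording of the message comes from the suggestion above.

```python
def cluster_not_found_message(cluster_name: str, status: str) -> str:
    """Build the proposed log message (hypothetical helper, not SkyPilot API)."""
    # `!r` renders the name with surrounding single quotes, e.g. 'dbg2'.
    return (f'The cluster {cluster_name!r} (status: {status}) was not found '
            'on the cloud: it may be autodowned, manually terminated, or its '
            'launch never succeeded. Provisioning a new cluster by using the '
            'same resources as its original launch.')
```

Including the last known status in the message lets a reader tell a never-provisioned cluster apart from one that was terminated after a successful launch.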

Collaborator Author

@Michaelvll Michaelvll Jul 3, 2023

Good point! Updated the logging. Tested again with:

  • sky launch -c min --cloud gcp --cpus 2; manually terminate the cluster on the console; python -c 'import sky; sky.launch(sky.Task(), cluster_name="min")' again

Member

@concretevitamin concretevitamin left a comment

Thanks @Michaelvll ! Some questions.

@@ -749,13 +749,13 @@ def is_spot_controller_up(
     identity.
     """
     try:
-        # Set force_refresh=False to make sure the refresh only happens when the
+        # Set force_refresh=None to make sure the refresh only happens when the

Member

Q: why is setting it to None the same as “refresh only when the controller is init/up”?

Collaborator Author

It is because the spot controller always has autostop set up, which triggers the refresh in both the INIT and UP cases for the controller.
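
A minimal sketch of this reasoning, under stated assumptions: the `needs_refresh` function, its parameters, and the use of a set of status strings for `force_refresh` are hypothetical, and the default path is simplified to the autostop condition mentioned in the reply above.

```python
from typing import Optional, Set


def needs_refresh(status: str,
                  autostop_enabled: bool,
                  force_refresh: Optional[Set[str]] = None) -> bool:
    """Decide whether to query the cloud for the latest cluster status.

    Hypothetical simplification: the real default conditions may include
    more than the autostop check shown here.
    """
    if force_refresh is not None and status in force_refresh:
        # The caller explicitly asked to refresh clusters in this status.
        return True
    # Default path (force_refresh=None): refresh only if autostop is set up,
    # because the cluster may have stopped itself since the last check.
    return autostop_enabled
```

With `force_refresh=None`, a controller that always has autostop enabled is refreshed whether its recorded status is INIT or UP, which matches the explanation above.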

-        force_refresh: if True, refresh the cluster status even if it may be
-            skipped. Otherwise (the default), only refresh if the cluster:
+        force_refresh: if specified, refresh the cluster in the specified status
+            even if it may be skipped. Otherwise (the default), only refresh if

Member

Additionally, always refresh in either of these cases:

@Michaelvll Michaelvll force-pushed the avoid-terminate-cluster-for-resources-unavailability branch from 7264631 to 318c114 Compare July 3, 2023 05:27
Member

@concretevitamin concretevitamin left a comment

LGTM, thanks for identifying and fixing the critical issue @Michaelvll!

Michaelvll and others added 3 commits July 3, 2023 10:24
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…f github.com:skypilot-org/skypilot into avoid-terminate-cluster-for-resources-unavailability
@Michaelvll Michaelvll merged commit 484617a into master Jul 3, 2023
@Michaelvll Michaelvll deleted the avoid-terminate-cluster-for-resources-unavailability branch July 3, 2023 19:01
Development

Successfully merging this pull request may close these issues.

[core] Unexpected termination of user's previous cluster when resource capacity issue happens
2 participants