-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Avoid terminating cluster for resources unavailability #2170
[core] Avoid terminating cluster for resources unavailability #2170
Conversation
sky/skylet/log_lib.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change for the same issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, it is from #2166. We can merge that PR first, as otherwise, the debugging is quite hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the changes to make it easier to review.
# This is important for the case, where an existing is | ||
# transitioned into INIT state due to key interruption during | ||
# launching, with the following steps: | ||
# (1) launch, after answering prompt immediately ctrl-c; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we do these two steps for a new cluster name? I imagine with this PR, at step 2 we should not set it to STOPPED and we should do the provisioning loop as usual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't set it to stop for a new cluster, because the new cluster will only have the following two cases:
- the cluster is provisioned, but not correctly setup yet. Then the cluster will be in INIT state, and our failover will still be triggered.
- the cluster is not provisioned. Then the cluster will be removed from the cluster table when we refresh the status, so the failover will be correctly triggered as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This repro gave
Running task on cluster dbg2...
I 07-03 09:45:21 cloud_vm_ray_backend.py:3788] The cluster 'dbg2' was autodowned or manually terminated on the cloud console. Using the same resources as the previously terminated one to provision a new cluster.
I 07-03 09:45:21 cloud_vm_ray_backend.py:3813] Creating a new cluster: "dbg2" [1x AWS(m6i.large)].
Maybe we should change L3788's logging to (or something more clear):
The cluster 'dbg2' (status: XXX) was not found on the cloud: it may be autodowned, manually terminated, or its launch never succeeded. Provisioning a new cluster by using the same resources as its original launch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Updated the logging. Tested again with:
-
sky launch -c min --cloud gcp --cpus 2
; manually terminate the cluster on the console;python -c 'import sky; sky.launch(sky.Task(), cluster_name="min")'
again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll ! Some questions.
sky/spot/spot_utils.py
Outdated
@@ -749,13 +749,13 @@ def is_spot_controller_up( | |||
identity. | |||
""" | |||
try: | |||
# Set force_refresh=False to make sure the refresh only happens when the | |||
# Set force_refresh=None to make sure the refresh only happens when the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: why is setting it to None the same as “refresh only when the controller is init/up”?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is because the spot controller will always have the autostop setup, which will trigger the refresh for both init and up cases for the controller.
sky/backends/backend_utils.py
Outdated
force_refresh: if True, refresh the cluster status even if it may be | ||
skipped. Otherwise (the default), only refresh if the cluster: | ||
force_refresh: if specified, refresh the cluster in the specified status | ||
even if it may be skipped. Otherwise (the default), only refresh if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, always refresh in either of these cases:
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
7264631
to
318c114
Compare
…terminate-cluster-for-resources-unavailability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for identifying and fixing the critical issue @Michaelvll!
# This is important for the case, where an existing is | ||
# transitioned into INIT state due to key interruption during | ||
# launching, with the following steps: | ||
# (1) launch, after answering prompt immediately ctrl-c; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This repro gave
Running task on cluster dbg2...
I 07-03 09:45:21 cloud_vm_ray_backend.py:3788] The cluster 'dbg2' was autodowned or manually terminated on the cloud console. Using the same resources as the previously terminated one to provision a new cluster.
I 07-03 09:45:21 cloud_vm_ray_backend.py:3813] Creating a new cluster: "dbg2" [1x AWS(m6i.large)].
Maybe we should change L3788's logging to (or something more clear):
The cluster 'dbg2' (status: XXX) was not found on the cloud: it may be autodowned, manually terminated, or its launch never succeeded. Provisioning a new cluster by using the same resources as its original launch.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…f github.com:skypilot-org/skypilot into avoid-terminate-cluster-for-resources-unavailability
Fixes #2169
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py