-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve ssh key name handling for Lambda #1838
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for fixing the issues @ewzeng! It looks good to me. Left several questions. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes @ewzeng! LGTM. Left a nit for improving our robustness.
# Occasionally, the head node will continue running for a short | ||
# period after termination. This can lead to the following bug: | ||
# 1. Head node autodowns but continues running. | ||
# 2. The next autodown event is triggered, which executes ray up. | ||
# 3. Head node stops running. | ||
# In this case, a new head node is created after the cluster has | ||
# terminated. We avoid this with the following check: | ||
if self.on_head: | ||
raise lambda_utils.LambdaCloudError('Head already exists.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider exiting the skylet event or resetting the autodown config as well, to make sure this problem will not happen to the other clouds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. This might be a good idea. How about I add this in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me!
This pr fixes #1836 and #1837 by improving how we handle the ssh key name. We no longer use the ssh key name in
lambda_keys
and query Lambda Cloud instead (they have a new API for this). We also improve some error handling code when there is no ip address.Tested:
pytest tests/test_smoke.py --lambda
sky launch --cloud lambda --num-nodes 2
on a gcp clustersky launch --cloud lambda
after deletingssh key name
fromlambda_keys