Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ssh key name handling for Lambda #1838

Merged
merged 6 commits into from
Apr 11, 2023

Conversation

ewzeng
Copy link
Collaborator

@ewzeng ewzeng commented Apr 4, 2023

This pr fixes #1836 and #1837 by improving how we handle the ssh key name. We no longer use the ssh key name in lambda_keys and query Lambda Cloud instead (they have a new API for this). We also improve some error handling code when there is no ip address.

Tested:

  • pytest tests/test_smoke.py --lambda
  • sky launch --cloud lambda --num-nodes 2 on a gcp cluster
  • sky launch --cloud lambda after deleting ssh key name from lambda_keys

@ewzeng ewzeng changed the title Lambda recursive v2 Improve ssh key name handling for Lambda Apr 4, 2023
@ewzeng ewzeng requested a review from Michaelvll April 4, 2023 19:49
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing the issues @ewzeng! It looks good to me. Left several questions. : )

sky/skylet/providers/lambda_cloud/lambda_utils.py Outdated Show resolved Hide resolved
sky/skylet/providers/lambda_cloud/node_provider.py Outdated Show resolved Hide resolved
sky/skylet/providers/lambda_cloud/node_provider.py Outdated Show resolved Hide resolved
@ewzeng ewzeng requested a review from Michaelvll April 11, 2023 07:51
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes @ewzeng! LGTM. Left a nit for improving our robustness.

Comment on lines +259 to +267
# Occasionally, the head node will continue running for a short
# period after termination. This can lead to the following bug:
# 1. Head node autodowns but continues running.
# 2. The next autodown event is triggered, which executes ray up.
# 3. Head node stops running.
# In this case, a new head node is created after the cluster has
# terminated. We avoid this with the following check:
if self.on_head:
raise lambda_utils.LambdaCloudError('Head already exists.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider exiting the skylet event or resetting the autodown config as well, to make sure this problem will not happen to the other clouds?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. This might be a good idea. How about I add this in a separate PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Lambda] Launching Lambda cluster from SkyPilot cluster fails
2 participants