[Storage] Fix the storage cloud checking before sky.check is called. #2017

Merged
Michaelvll merged 9 commits into master from fix-gcs-not-enabled on Jun 3, 2023

Collaborator

@Michaelvll Michaelvll commented Jun 3, 2023

Fixes #2016.

This PR also fixes an issue where $USER is empty under the root account. That causes SKYPILOT_USER in spot-controller.yaml to be set to None, and the controller then fails to launch a GCP cluster:

Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/asia-south2-c/instances?alt=json returned "Invalid value for field 'resource.labels': ''. Label value 'None' violates format constraints. The value can only contain lowercase letters, numeric characters, underscores and dashes. The value can be at most 63 characters long. International characters are allowed.". Details: "[{'message': "Invalid value for field 'resource.labels': ''. Label value 'None' violates format constraints. The value can only contain lowercase letters, numeric characters, underscores and dashes. The value can be at most 63 characters long. International characters are allowed.", 'domain': 'global', 'reason': 'invalid'}]">
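
The constraint quoted in the error can be approximated locally. This is a minimal sketch, not GCP's actual validator: `is_valid_label_value` is a hypothetical helper, and its rules are taken only from the wording of the error message.

```python
def is_valid_label_value(value: str) -> bool:
    """Approximate the label-value rule quoted in the error message."""
    if len(value) > 63:
        return False
    for ch in value:
        # Underscores, dashes and digits are always accepted.
        if ch in '_-' or ch.isdigit():
            continue
        # Letters must not be uppercase; non-ASCII letters are accepted.
        if ch.isalpha() and not ch.isupper():
            continue
        return False
    return True


# str(None) is what ends up in the label when the user name is missing.
print(is_valid_label_value(str(None)))  # rejected: contains an uppercase 'N'
print(is_valid_label_value('root'))     # accepted
```

Under these assumptions, the string 'None' is rejected only because of its capital letter, which matches the failure shown in the log.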

Note that getpass.getuser() works correctly for the root user.
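
The difference between the two lookups can be sketched as follows. The environment is modified here only to make the demonstration deterministic (an assumption for illustration): `getpass.getuser()` consults LOGNAME, USER, LNAME and USERNAME in that order, and on Unix falls back to the password database when none is set.

```python
import getpass
import os

# Simulate a shell where $USER is unset. LOGNAME is set explicitly so the
# result does not depend on the machine running this sketch.
os.environ.pop('USER', None)
os.environ['LOGNAME'] = 'root'

user_from_env = os.environ.get('USER', None)  # old lookup
user_from_getpass = getpass.getuser()         # new lookup

print(repr(user_from_env))      # the old lookup finds nothing
print(repr(user_from_getpass))  # the new lookup still resolves a name
```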

Tested (run the relevant ones):

  • pytest tests/test_smoke.py::test_spot_storage with a new spot controller on AWS
  • pytest tests/test_smoke.py --managed-spot

@Michaelvll Michaelvll requested a review from romilbhardwaj June 3, 2023 02:01
@Michaelvll Michaelvll marked this pull request as draft June 3, 2023 02:02
@Michaelvll Michaelvll marked this pull request as ready for review June 3, 2023 03:27
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks for identifying and fixing this! 🙏 Left a comment; the rest of the code looks good if the tests pass.

    enabled_storage_clouds = global_user_state.get_enabled_storage_clouds()
    if cloud_name in enabled_storage_clouds:
        return True
    if try_fix_with_sky_check:
Collaborator

I'm not fully sure if it is a good idea to run sky check if the storage cloud is not already in get_enabled_storage_clouds(). Running this adds a latency of ~7s:

>>> s=time.time();sky.check.check(quiet=True);print(time.time()-s)
7.19357705116272

Though this latency hit would not be frequent, I wonder if there's some alternate solution. Can we run sky check on the spot controller at the end of its setup?

I guess this also relates to the semantics of sky check. Do we expect it to be run once by the user, or is it acceptable for our code to run it frequently? For example, I often need to disable a specific cloud. To do so, I temporarily move its credentials file from its path (e.g., mv ~/.aws/credentials /tmp/credentials) and then run sky check. That disables the cloud, and I can then move my credentials file back so I can continue using the aws cli tools if I need to. If SkyPilot calls sky.check() quietly in the code, aws would get enabled again (which is not what I would want to happen).
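
The concern can be illustrated with a toy model. This is a sketch under stated assumptions, not SkyPilot's implementation: `FakeCloudState` and its fields are hypothetical stand-ins for the persisted list of enabled clouds and for the presence of credential files.

```python
class FakeCloudState:
    """Toy stand-in for the persisted list of enabled clouds (hypothetical)."""

    def __init__(self, credentials_present):
        # Maps cloud name -> whether its credentials file is present right now.
        self.credentials_present = dict(credentials_present)
        self.enabled = set()

    def check(self):
        """Recompute the enabled set from the credentials that exist now."""
        self.enabled = {c for c, ok in self.credentials_present.items() if ok}


state = FakeCloudState({'aws': True, 'gcp': True})
state.check()

# The user hides the AWS credentials and runs the check to disable AWS.
state.credentials_present['aws'] = False
state.check()
enabled_after_manual_disable = set(state.enabled)

# The user restores the credentials for the AWS CLI; a later implicit
# check recomputes the set and AWS is enabled again.
state.credentials_present['aws'] = True
state.check()
enabled_after_implicit_check = set(state.enabled)

print(enabled_after_manual_disable)
print(enabled_after_implicit_check)
```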

Collaborator Author

@Michaelvll Michaelvll Jun 3, 2023

That is a good point!
Running sky check at the end of the controller's setup would incur the overhead every time a spot job is submitted, which could make the latency more significant.

Our optimizer already uses the same logic, re-checking the enabled clouds when the requested cloud is not enabled, so doing it here may not differ significantly from the existing behavior.

skypilot/sky/optimizer.py

Lines 943 to 949 in 072874b

    if resources.cloud is not None and not _cloud_in_list(
            resources.cloud, enabled_clouds):
        if try_fix_with_sky_check:
            # Explicitly check again to update the enabled cloud list.
            check.check(quiet=True)
            return _fill_in_launchable_resources(task, blocked_resources,
                                                 False)

An optimization we could do here might be just to check for the cloud that is specified, as added in the comment. Wdyt?
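
The per-cloud re-check suggested here could look roughly like the following. It is only a sketch: `ensure_cloud_enabled`, `make_checker` and the `checkers` mapping are hypothetical names introduced for illustration, not part of SkyPilot.

```python
from typing import Callable, Dict, List, Set


def ensure_cloud_enabled(cloud: str,
                         enabled: Set[str],
                         checkers: Dict[str, Callable[[], bool]],
                         try_fix: bool = True) -> bool:
    """Return True if `cloud` is enabled, re-checking only that cloud."""
    if cloud in enabled:
        return True
    if not try_fix:
        return False
    checker = checkers.get(cloud)
    if checker is not None and checker():
        # Record the result so later calls skip the credential check.
        enabled.add(cloud)
        return True
    return False


calls: List[str] = []


def make_checker(name: str, result: bool) -> Callable[[], bool]:
    """Build a fake credential check that records when it is invoked."""
    def _check() -> bool:
        calls.append(name)
        return result
    return _check


checkers = {'gcp': make_checker('gcp', True), 'aws': make_checker('aws', True)}
enabled: Set[str] = set()
print(ensure_cloud_enabled('gcp', enabled, checkers))
print(calls)  # only the requested cloud was checked
```

Checking a single cloud avoids paying for the credential checks of every other cloud, at the cost of leaving their cached status untouched.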

For the use case of moving the credentials away and back, I find the original method quite hacky already, since the user may not know about the credential files. For that, we may want to offer an explicit way for the user to disable some clouds.

Member

This code looks good to me, because:

  • If a user has not called sky.check() already
    • Before this PR: a sky.check() is automatically called by sky launch -> optimizer code path anyway
    • After this PR: a sky.check() is possibly called at YAML parsing time (if it specifies a store; storage.py is eager); if so, afterwards the sky launch -> optimizer code path doesn't need to re-incur this overhead

So it seems like the overhead is incurred once before & after this PR, if users have not called sky.check().
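
The counting argument above can be sketched with a toy model. The names `CheckCounter`, `storage_step` and `optimizer_step` are hypothetical, and the model assumes that a check succeeds in enabling the requested cloud.

```python
class CheckCounter:
    """Toy state that counts how often the expensive check runs."""

    def __init__(self):
        self.enabled = set()
        self.calls = 0

    def check(self):
        self.calls += 1
        # Assumption: valid credentials exist, so the check enables both clouds.
        self.enabled = {'aws', 'gcp'}


def storage_step(state, cloud):
    """After the PR: YAML parsing re-checks if the storage cloud is missing."""
    if cloud not in state.enabled:
        state.check()


def optimizer_step(state, cloud):
    """The optimizer re-checks if the requested cloud is missing."""
    if cloud not in state.enabled:
        state.check()


# Before the PR: only the optimizer path triggers the check.
before = CheckCounter()
optimizer_step(before, 'gcp')

# After the PR: the storage path checks first, so the optimizer finds the
# cloud already enabled and does not check again.
after = CheckCounter()
storage_step(after, 'gcp')
optimizer_step(after, 'gcp')

print(before.calls, after.calls)
```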

Please correct me if this is wrong, @Michaelvll.

Member

@concretevitamin concretevitamin left a comment

Some minor comments, thanks @Michaelvll!

sky/execution.py Outdated
     'is_dev': env_options.Options.IS_DEVELOPER.get(),
     'is_debug': env_options.Options.SHOW_DEBUG_INFO.get(),
     'disable_logging': env_options.Options.DISABLE_LOGGING.get(),
     'logging_user_hash': common_utils.get_user_hash(),
     'retry_until_up': retry_until_up,
-    'user': os.environ.get('USER', None),
+    'user': os.environ.get('SKYPILOT_USER', getpass.getuser()),
Member

Perhaps comment on why we try to get SKYPILOT_USER first?

Collaborator Author

Actually, it is not needed in the current stage, as we won't have recursive calls for spot_launch. Changed it to getpass.getuser() only.

@@ -94,7 +94,7 @@ class GcsCloudStorage(CloudStorage):
     # parellel workers on our end.
     # The gsutil command is part of the Google Cloud SDK, and we reuse
     # the installation logic here.
-    _GET_GSUTIL = gcp.GCLOUD_INSTALLATION_COMMAND
+    _GET_GSUTIL = gcp.GOOGLE_SDK_INSTALLATION_COMMAND
Member

Nit: is it ok to have extra pip commands if this caller only needs gsutil?

Collaborator Author

@Michaelvll Michaelvll Jun 3, 2023

Just removed the pip commands. It seems the gcloud installation installs the Python APIs as well, and we already install them in the preceding lines of the controller YAML:

echo Check and install AWS SDK
(pip list | grep boto3 > /dev/null 2>&1 && \
pip list | grep google-api-python-client > /dev/null 2>&1 ) || \
pip install boto3 awscli pycryptodome==3.12.0 google-api-python-client google-cloud-storage 2>&1 > /dev/null

@Michaelvll Michaelvll changed the title from "[GCS] Install google python SDK with the gcloud" to "[Storage] Fix the storage cloud checking before sky.check is called." Jun 3, 2023
@Michaelvll Michaelvll merged commit a076ed7 into master Jun 3, 2023
@Michaelvll Michaelvll deleted the fix-gcs-not-enabled branch June 3, 2023 06:05
concretevitamin pushed a commit that referenced this pull request Jun 3, 2023
…2017)

* Install google python sdk

* upload remaining files

* fix and / or logic

* update

* sky.check again if storage cloud not enabled

* fix

* use getpass instead of $USER

* remove pip install

* address comment
concretevitamin pushed a commit that referenced this pull request Jun 4, 2023
…2017)

Successfully merging this pull request may close these issues.

[Spot] Spot job with GCS store fails on a new controller