-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Storage] Fix the storage cloud checking before sky.check is called. #2017
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for identifying and fixing this! 🙏 Left a comment, rest of the code looks good if the test pass.
enabled_storage_clouds = global_user_state.get_enabled_storage_clouds() | ||
if cloud_name in enabled_storage_clouds: | ||
return True | ||
if try_fix_with_sky_check: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not fully sure if it is a good idea to run sky check if the storage cloud is not already in get_enabled_storage_clouds()
. Running this adds a latency of ~7s:
>>> s=time.time();sky.check.check(quiet=True);print(time.time()-s)
7.19357705116272
Though this latency hit would not be frequent, I wonder if there's some alternate solution. Can we run sky check
on the spot controller at the end of its setup?
I guess this also relates to semantics of sky check
. Do we expect it to be run once by the user, or is it acceptable to be run frequently by our code? For example, I often need to disable a specific cloud. To do so, I temporarily move it's credentials file from it's path (e.g., mv ~/.aws/credentials /tmp/credentials
) and then run sky check
. That disables the cloud, and now I can move my credentials file back so I can continue using aws cli tools if I need to. If SkyPilot calls sky.check() quietly in the code, aws would get enabled again (which is not what I would want to happen).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point!
Running the sky check
at the end of its setup will cause the overhead occurring every time the spot job is submitted, which could make the latency more significant.
Our optimizer has the same logic for re-checking the enabled cloud when the cloud is not enabled, so doing it here may not make a significant difference from before.
Lines 943 to 949 in 072874b
if resources.cloud is not None and not _cloud_in_list( | |
resources.cloud, enabled_clouds): | |
if try_fix_with_sky_check: | |
# Explicitly check again to update the enabled cloud list. | |
check.check(quiet=True) | |
return _fill_in_launchable_resources(task, blocked_resources, | |
False) |
An optimization we could do here might be just to check for the cloud that is specified, as added in the comment. Wdyt?
For the use case of moving the credential and back, I find the original method might be quite hacky already, since the user may not have the idea of the credential files. For that, we may want to offer an explicit way to allow the user to disable some clouds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks good to me, because:
- If a user has not called
sky.check()
already- Before this PR: a
sky.check()
is automatically called bysky launch -> optimizer
code path anyway - After this PR: a
sky.check()
is possibly called at YAML parsing time (if it specifies a store; storage.py is eager); if so, afterwards thesky launch -> optimizer
code path doesn't need to re-incur this overhead
- Before this PR: a
So it seems like the overhead is incurred once before & after this PR, if users have not called sky.check()
.
Please correct if this is wrong @Michaelvll.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, thanks @Michaelvll!
sky/execution.py
Outdated
'is_dev': env_options.Options.IS_DEVELOPER.get(), | ||
'is_debug': env_options.Options.SHOW_DEBUG_INFO.get(), | ||
'disable_logging': env_options.Options.DISABLE_LOGGING.get(), | ||
'logging_user_hash': common_utils.get_user_hash(), | ||
'retry_until_up': retry_until_up, | ||
'user': os.environ.get('USER', None), | ||
'user': os.environ.get('SKYPILOT_USER', getpass.getuser()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps comment on why we try to get SKYPILOT_USER first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it is not needed in the current stage, as we won't have recursive calls for spot_launch
. Changed it to getpass.getuser()
only.
@@ -94,7 +94,7 @@ class GcsCloudStorage(CloudStorage): | |||
# parellel workers on our end. | |||
# The gsutil command is part of the Google Cloud SDK, and we reuse | |||
# the installation logic here. | |||
_GET_GSUTIL = gcp.GCLOUD_INSTALLATION_COMMAND | |||
_GET_GSUTIL = gcp.GOOGLE_SDK_INSTALLATION_COMMAND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: is it ok to have extra pip commands if this caller only needs gsutil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removed the pip commands, as it seems the gcloud installation will install the python APIs as well. we already install it in the previous lines in the controller yaml
skypilot/sky/templates/spot-controller.yaml.j2
Lines 23 to 26 in a076ed7
echo Check and install AWS SDK | |
(pip list | grep boto3 > /dev/null 2>&1 && \ | |
pip list | grep google-api-python-client > /dev/null 2>&1 ) || \ | |
pip install boto3 awscli pycryptodome==3.12.0 google-api-python-client google-cloud-storage 2>&1 > /dev/null |
enabled_storage_clouds = global_user_state.get_enabled_storage_clouds() | ||
if cloud_name in enabled_storage_clouds: | ||
return True | ||
if try_fix_with_sky_check: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks good to me, because:
- If a user has not called
sky.check()
already- Before this PR: a
sky.check()
is automatically called bysky launch -> optimizer
code path anyway - After this PR: a
sky.check()
is possibly called at YAML parsing time (if it specifies a store; storage.py is eager); if so, afterwards thesky launch -> optimizer
code path doesn't need to re-incur this overhead
- Before this PR: a
So it seems like the overhead is incurred once before & after this PR, if users have not called sky.check()
.
Please correct if this is wrong @Michaelvll.
…2017) * Install google python sdk * upload remaining files * fix and / or logic * update * sky.check again if storage cloud not enabled * fix * use getpass instead of $USER * remove pip install * address comment
…2017) * Install google python sdk * upload remaining files * fix and / or logic * update * sky.check again if storage cloud not enabled * fix * use getpass instead of $USER * remove pip install * address comment
Fixes #2016.
This PR also fixes an issue that the
$USER
is empty when underroot
account. This will cause theSKYPILOT_USER
inspot-controller.yaml
to be set toNone
, causing the failure for the controller to launch GCP clusterNote that
getpass.getuser()
works correctly for theroot
user.Tested (run the relevant ones):
pytest tests/test_smoke.py::test_spot_storage
with a new spot controller on AWSpytest tests/test_smoke.py --managed-spot