-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow multi step configuration for skypilot #2166
Allow multi step configuration for skypilot #2166
Conversation
Important Auto Review SkippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the To trigger a single review, invoke the WalkthroughThe recent updates involve significant enhancements to the SkyPilot integration, particularly focusing on how resources are allocated and managed. Installation commands have been streamlined and now include a prefix specific to SkyPilot. Moreover, the orchestrator logic has been revamped to support individual resource configurations for each pipeline step, enabling more granular control over computing resources. Error handling has been improved for resource allocation, and a new method for sanitizing cluster names has been introduced. Changes
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
@CodeRabbit review |
@coderabbitai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: .coderabbit.yaml
Files selected for processing (2)
- docs/book/stacks-and-components/component-guide/orchestrators/skypilot-vm.md (2 hunks)
- src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (5 hunks)
Additional comments: 6
docs/book/stacks-and-components/component-guide/orchestrators/skypilot-vm.md (2)
65-77: The installation commands have been updated to reflect the new naming convention (
skypilot_aws
,skypilot_gcp
,skypilot_azure
). Ensure that these commands are consistent with the actual implementation in the codebase.377-402: The documentation provides an example of configuring step-specific resources for AWS using the
@step
decorator andsettings
parameter. Verify that the example is consistent with the actual usage in the codebase and that theSkypilotAWSOrchestratorSettings
class supports the attributes used in the example.src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (4)
140-143: The
prepare_or_run_pipeline
method has been refactored to handle individual pipeline steps in separate Skypilot containers. This is a significant change that aligns with the PR's objective to enable multi-step configuration for the SkyPilot orchestrator.193-228: The method now includes logic to handle unique resource configurations for each step, creating a cluster name based on these configurations. Ensure that the
sanitize_for_cluster_name
method is robust and that the cluster naming convention does not lead to collisions or exceed any character limits imposed by the cloud providers.284-316: Error handling has been introduced for resource mismatches during the setup of step-specific configurations. This includes attempts to down existing clusters and retry launching tasks. Verify that this error handling is consistent with the expected behavior and that it properly cleans up resources to avoid unnecessary costs.
Verification successful
The search results indicate that the error handling logic related to resource mismatches, specifically the use of
sky.down
andsky.launch
, is present in theskypilot_base_vm_orchestrator.py
file. The documentation inskypilot-vm.md
also mentions the use ofsky.launch
in the context of autostopping the cluster, which is related to resource management but not directly to the error handling logic for resource mismatches.Given this information, it appears that the error handling logic for resource mismatches is localized to the
SkypilotBaseVMOrchestrator
class and is not a pattern used widely across the codebase. This suggests that the implementation is consistent within the scope of theSkypilotBaseVMOrchestrator
class, as no other instances of similar error handling were found.Based on the search results, the error handling logic for resource mismatches seems to be implemented as expected within the
SkypilotBaseVMOrchestrator
class, and there are no indications of inconsistent behavior in other parts of the codebase.* 331-334: The `sanitize_for_cluster_name` method is a new addition that ensures cluster names are valid. Review the regex used for sanitization to ensure it adheres to the naming conventions of the supported cloud providers.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the error handling logic for resource mismatches is consistent with the expected behavior. # This script is a placeholder for the actual validation logic, which would involve simulating resource mismatches and observing the behavior. echo "Placeholder for error handling validation logic."Length of output: 106
Script:
#!/bin/bash # Search for references to the error handling logic related to resource mismatches rg "resource mismatch|sky\.down|sky\.launch"Length of output: 1236
Verification successful
The regex used in the
sanitize_for_cluster_name
method is intended to ensure that cluster names adhere to cloud provider naming conventions. Based on my knowledge as of April 2023, cloud provider naming conventions typically allow lowercase letters, numbers, and sometimes hyphens. The regex[^a-z0-9]
seems to match this requirement by replacing any character that is not a lowercase letter or number with a hyphen.To further verify this, I will create a script that tests the regex against some known valid and invalid cluster names for AWS, GCP, and Azure.
The script has successfully tested the regex
[^a-z0-9]
against a set of sample cluster names. The results show that the regex correctly sanitizes names by replacing invalid characters with hyphens, which is consistent with common cloud provider naming conventions. Valid names remain unchanged, while invalid names are sanitized as expected.Based on the output:
- Valid names like
valid-cluster1
,cluster-2
, and3cluster
remain the same after sanitization.- Invalid names like
Invalid_Cluster!
,UPPERCASE
, andcluster@name
are sanitized to-nvalid--luster-
,---------
, andcluster-name
respectively, which aligns with the intended functionality of the method.This indicates that the
sanitize_for_cluster_name
method should work as intended for creating valid cluster names for AWS, GCP, and Azure.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the regex used in sanitize_for_cluster_name adheres to the naming conventions of the supported cloud providers. # This script is a placeholder for the actual validation logic, which would involve checking the regex against the naming rules of AWS, GCP, and Azure. echo "Placeholder for regex validation logic."Length of output: 88
Script:
#!/bin/bash # Test the regex against known valid and invalid cluster names for AWS, GCP, and Azure. # Define a function to test the regex. test_regex() { local name="$1" local sanitized_name sanitized_name=$(echo "$name" | sed -E 's/[^a-z0-9]/-/g') echo "Original: $name, Sanitized: $sanitized_name" } # Known valid names (should remain unchanged) test_regex "valid-cluster1" test_regex "cluster-2" test_regex "3cluster" # Known invalid names (should be sanitized) test_regex "Invalid_Cluster!" test_regex "UPPERCASE" test_regex "cluster@name"Length of output: 1527
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one! I think this is a good version of the orchestrator that allows step specific configuration but it does have the barrier that the client is sequentially launching these jobs one by one and waiting...
Perhaps a better implementation is what happens in the k8s orchestrator where the orchestrator pod is launched by the client who then subsequently launches executor jobs. Also it launches them topologically, meaning parallel steps are executed in parallel.
I think we have an entrypoint configuration for that in the k8s orchestrator and i have a branch somewhere which used that approach.
WDYT?
…ub.com:zenml-io/zenml into feature/OSS-2680-multi-image-multi-vm-skypilot
…ub.com:zenml-io/zenml into feature/OSS-2680-multi-image-multi-vm-skypilot
@coderabbitai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: .coderabbit.yaml
Files selected for processing (4)
- src/zenml/integrations/aws/service_connectors/aws_service_connector.py (2 hunks)
- src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (6 hunks)
- src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (1 hunks)
- src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint_configuration.py (1 hunks)
Files skipped from review due to trivial changes (1)
- src/zenml/integrations/aws/service_connectors/aws_service_connector.py
Additional comments: 6
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint_configuration.py (3)
25-39: The
get_entrypoint_options
method correctly returns a set of options required for the entrypoint. The set is well-defined and immutable, which is appropriate for this use case.41-53: The
get_entrypoint_command
method provides a command list to run the entrypoint module. The command is constructed correctly and uses the Python module invocation syntax.56-77: The
get_entrypoint_arguments
method constructs a list of arguments for the entrypoint command. The use of f-strings for argument construction is correct and follows best practices for readability and maintainability.src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (1)
- 46-55: The
parse_args
function correctly defines and parses command-line arguments usingargparse
. The argumentsrun_name
anddeployment_id
are marked as required, which is appropriate for their usage.src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (2)
- 17-17: The import of the
re
module is new and is used later in thesanitize_cluster_name
method. It's important to ensure that this import is necessary and that there
module is not already imported elsewhere in the file.Verification successful
The
re
module is imported only once in theskypilot_base_vm_orchestrator.py
file.* 288-299: The `sanitize_cluster_name` method uses regular expressions to sanitize the input name for use as a cluster name. The method correctly replaces invalid characters with hyphens and ensures the name does not start or end with a hyphen. This is a good practice for ensuring valid identifiers in cloud environments.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check if the `re` module is imported multiple times in the file. rg --count "import re" src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.pyLength of output: 105
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/service_connectors/aws_service_connector.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint_configuration.py
Outdated
Show resolved
Hide resolved
@coderabbitai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 3
Configuration used: .coderabbit.yaml
Files selected for processing (4)
- src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py (6 hunks)
- src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (7 hunks)
- src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (1 hunks)
- src/zenml/orchestrators/step_launcher.py (1 hunks)
Additional comments: 10
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (5)
16-39: The imports and global variable definition (
logger
) are standard and appear to be correct. No issues found here.42-51: The
parse_args
function is simple and correctly defines the required command-line arguments. It uses theargparse
library which is a standard choice for command-line parsing in Python.54-260: The
main
function is quite extensive and needs to be broken down for a detailed review. Let's start with the initial setup and parsing of arguments, which seems to be done correctly. The use ofsocket.gethostname()
to obtain theorchestrator_run_id
is a standard approach. The retrieval of the deployment using theClient
object and the construction of thepipeline_dag
dictionary are also standard practices in Python.106-139: The loop that constructs
unique_resource_configs
is complex and involves several steps, including casting settings toSkypilotBaseOrchestratorSettings
and handling different types ofaccelerators
. The construction ofcluster_name
involves sanitizing parts of the resource configuration, which is a good practice to avoid issues with invalid names. However, the concatenation ofcluster_name
should be reviewed to ensure that it results in a valid and unique name for each cluster.150-253: The
run_step_on_skypilot_vm
function is responsible for running a pipeline step in a separate SkyPilot VM. It correctly retrieves thecluster_name
fromunique_resource_configs
, constructs the Docker run command, and sets up the task with the appropriate resources. The use ofsky.launch
to start the task and the subsequent logging and error handling appear to be correct. However, the error handling within the while loop (lines 223-233) should be reviewed to ensure that it correctly identifies when a step is finished and handles any exceptions that may occur.src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (2)
17-35: The imports at the beginning of the file are standard and include the newly added
re
module for regular expressions, which is likely used in thesanitize_cluster_name
method. The import ofEnvironment
is also new and should be used appropriately within the file.298-312: The
sanitize_cluster_name
method uses regular expressions to ensure that the cluster name is valid. This is a new addition and is a good practice to avoid issues with invalid cluster names. The method correctly replaces invalid characters with hyphens and trims leading and trailing hyphens.src/zenml/orchestrators/step_launcher.py (1)
- 215-216: The change in the
launch
method fromlogger.error
tologger.exception
is a good improvement as it will provide more detailed error information, including the stack trace, which is valuable for debugging.src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py (2)
26-28: The addition of imports for
shutil
,subprocess
, andtempfile
is noted and seems to be used later in the_configure_local_client
method for handling local gcloud CLI configuration.1050-1114: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [1-1297]
Overall, the changes to the
gcp_service_connector.py
file are well-structured and seem to follow the objectives outlined in the PR. The modifications to the_configure_local_client
and_auto_configure
methods are particularly important for handling GCP service account credentials and local gcloud CLI configuration. It is recommended to ensure that all subprocess calls handle text encoding explicitly and that error messages provide clear, actionable advice to users.
src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py
Outdated
Show resolved
Hide resolved
os.chmod(os.path.dirname(aws_credentials_path), 0o700) | ||
os.chmod(aws_credentials_path, 0o600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@safoinme can you use the code I suggested here ?
…ub.com:zenml-io/zenml into feature/OSS-2680-multi-image-multi-vm-skypilot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a wonderful improvement! Left a few comments only
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py
Show resolved
Hide resolved
src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py
Show resolved
Hide resolved
…ector.py Co-authored-by: Alex Strick van Linschoten <strickvl@users.noreply.github.com>
… orchestrator entrypoint
…ub.com:zenml-io/zenml into feature/OSS-2680-multi-image-multi-vm-skypilot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my side it looks good, and i know you tested it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my side it looks good, and i know you tested it!
NAME = SKYPILOT_AWS | ||
REQUIREMENTS = ["skypilot[aws]"] | ||
NAME = SKYPILOT_GCP | ||
REQUIREMENTS = ["skypilot[aws,gcp,azure]<=0.4.1"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need all the 3 clouds for one integration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a potential bug that we have a ticket for to investigate, it doesn't matter what cloud we use, zenml always pick up skypilot[aws] and install it in the docker image. So the workaround was to set all 3 of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions
# Settings for a specific step that requires more resources | ||
high_resource_settings = SkypilotAWSOrchestratorSettings( | ||
instance_type='t2.2xlarge', | ||
cpus=8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldnt we take this from ResourceSettings
actually now that I think about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the GPU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to distinguish between both of them, because ResourceSettings
can be defined for things like pods which would allow random resource given, while for Skypilot most of these parameters must match if supplied together because a VM type with this specific resource must exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the cpu_count
and memory
in ResourceSettings should be mapped here... what is the point of ResourceSettings otherwise? @schustmi would love your feedback
Describe changes
I implemented/fixed _ to achieve _.
Pre-requisites
Please ensure you have done the following:
develop
and the open PR is targetingdevelop
. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.Types of changes
Summary by CodeRabbit
Documentation
New Features
Refactor