Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow multi step configuration for skypilot #2166

Merged
merged 41 commits into from
Jan 12, 2024

Conversation

safoinme
Copy link
Contributor

@safoinme safoinme commented Dec 19, 2023

Describe changes

I implemented/fixed _ to achieve _.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • If I have added an integration, I have updated the integrations table and the corresponding website section.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Summary by CodeRabbit

  • Documentation

    • Updated installation commands for SkyPilot integrations.
    • Added guidance on configuring step-specific resources for AWS.
  • New Features

    • Enhanced Skypilot integrations to manage AWS, GCP, and Azure credentials.
    • Orchestrated pipeline steps and managed Skypilot VM cluster lifecycle.
  • Refactor

    • Transitioned from PipelineEntrypointConfiguration to StepEntrypointConfiguration.
    • Introduced a new method for sanitizing cluster names.

Copy link
Contributor

coderabbitai bot commented Dec 19, 2023

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

The recent updates involve significant enhancements to the SkyPilot integration, particularly focusing on how resources are allocated and managed. Installation commands have been streamlined and now include a prefix specific to SkyPilot. Moreover, the orchestrator logic has been revamped to support individual resource configurations for each pipeline step, enabling more granular control over computing resources. Error handling has been improved for resource allocation, and a new method for sanitizing cluster names has been introduced.

Changes

File Path Change Summary
docs/book/.../skypilot-vm.md
src/zenml/.../skypilot_base_vm_orchestrator.py
Updated installation commands with skypilot_ prefix and added instructions for configuring step-specific resources for AWS. Refactored prepare_or_run_pipeline for step-specific resource allocation in separate containers, added error handling for resource mismatches, and introduced sanitize_for_cluster_name method. Replaced PipelineEntrypointConfiguration with StepEntrypointConfiguration.
src/zenml/.../aws_service_connector.py
src/zenml/.../gcp_service_connector.py
src/zenml/.../skypilot/__init__.py
Introduced code to manage AWS and GCP credentials, added APT_PACKAGES attribute to Skypilot integrations, and modified GCP service connector for local client configuration.
src/zenml/.../skypilot_orchestrator_entrypoint.py Orchestrates the execution of pipeline steps and manages the lifecycle of the Skypilot VM clusters.
src/zenml/.../skypilot_orchestrator_entrypoint_configuration.py Introduces an entrypoint configuration for the Skypilot master/orchestrator VM.
src/zenml/.../step_launcher.py Modified exception handling logic in the launch method for potentially more detailed error information.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit-tests for this file.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository from git and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

@github-actions github-actions bot added internal To filter out internal PRs and issues enhancement New feature or request labels Dec 19, 2023
@htahir1
Copy link
Contributor

htahir1 commented Dec 19, 2023

@CodeRabbit review

@htahir1
Copy link
Contributor

htahir1 commented Dec 19, 2023

@coderabbitai review

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 0

Configuration used: .coderabbit.yaml

Commits Files that changed from the base of the PR and between 1097444 and 27fc323.
Files selected for processing (2)
  • docs/book/stacks-and-components/component-guide/orchestrators/skypilot-vm.md (2 hunks)
  • src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (5 hunks)
Additional comments: 6
docs/book/stacks-and-components/component-guide/orchestrators/skypilot-vm.md (2)
  • 65-77: The installation commands have been updated to reflect the new naming convention (skypilot_aws, skypilot_gcp, skypilot_azure). Ensure that these commands are consistent with the actual implementation in the codebase.

  • 377-402: The documentation provides an example of configuring step-specific resources for AWS using the @step decorator and settings parameter. Verify that the example is consistent with the actual usage in the codebase and that the SkypilotAWSOrchestratorSettings class supports the attributes used in the example.

src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (4)
  • 140-143: The prepare_or_run_pipeline method has been refactored to handle individual pipeline steps in separate Skypilot containers. This is a significant change that aligns with the PR's objective to enable multi-step configuration for the SkyPilot orchestrator.

  • 193-228: The method now includes logic to handle unique resource configurations for each step, creating a cluster name based on these configurations. Ensure that the sanitize_for_cluster_name method is robust and that the cluster naming convention does not lead to collisions or exceed any character limits imposed by the cloud providers.

  • 284-316: Error handling has been introduced for resource mismatches during the setup of step-specific configurations. This includes attempts to down existing clusters and retry launching tasks. Verify that this error handling is consistent with the expected behavior and that it properly cleans up resources to avoid unnecessary costs.

Verification successful

The search results indicate that the error handling logic related to resource mismatches, specifically the use of sky.down and sky.launch, is present in the skypilot_base_vm_orchestrator.py file. The documentation in skypilot-vm.md also mentions the use of sky.launch in the context of autostopping the cluster, which is related to resource management but not directly to the error handling logic for resource mismatches.

Given this information, it appears that the error handling logic for resource mismatches is localized to the SkypilotBaseVMOrchestrator class and is not a pattern used widely across the codebase. This suggests that the implementation is consistent within the scope of the SkypilotBaseVMOrchestrator class, as no other instances of similar error handling were found.

Based on the search results, the error handling logic for resource mismatches seems to be implemented as expected within the SkypilotBaseVMOrchestrator class, and there are no indications of inconsistent behavior in other parts of the codebase.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the error handling logic for resource mismatches is consistent with the expected behavior.
# This script is a placeholder for the actual validation logic, which would involve simulating resource mismatches and observing the behavior.
echo "Placeholder for error handling validation logic."

Length of output: 106



Script:

#!/bin/bash
# Search for references to the error handling logic related to resource mismatches
rg "resource mismatch|sky\.down|sky\.launch"

Length of output: 1236

* 331-334: The `sanitize_for_cluster_name` method is a new addition that ensures cluster names are valid. Review the regex used for sanitization to ensure it adheres to the naming conventions of the supported cloud providers.
Verification successful

The regex used in the sanitize_for_cluster_name method is intended to ensure that cluster names adhere to cloud provider naming conventions. Based on my knowledge as of April 2023, cloud provider naming conventions typically allow lowercase letters, numbers, and sometimes hyphens. The regex [^a-z0-9] seems to match this requirement by replacing any character that is not a lowercase letter or number with a hyphen.

To further verify this, I will create a script that tests the regex against some known valid and invalid cluster names for AWS, GCP, and Azure.


The script has successfully tested the regex [^a-z0-9] against a set of sample cluster names. The results show that the regex correctly sanitizes names by replacing invalid characters with hyphens, which is consistent with common cloud provider naming conventions. Valid names remain unchanged, while invalid names are sanitized as expected.

Based on the output:

  • Valid names like valid-cluster1, cluster-2, and 3cluster remain the same after sanitization.
  • Invalid names like Invalid_Cluster!, UPPERCASE, and cluster@name are sanitized to -nvalid--luster-, ---------, and cluster-name respectively, which aligns with the intended functionality of the method.

This indicates that the sanitize_for_cluster_name method should work as intended for creating valid cluster names for AWS, GCP, and Azure.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the regex used in sanitize_for_cluster_name adheres to the naming conventions of the supported cloud providers.
# This script is a placeholder for the actual validation logic, which would involve checking the regex against the naming rules of AWS, GCP, and Azure.
echo "Placeholder for regex validation logic."

Length of output: 88



Script:

#!/bin/bash
# Test the regex against known valid and invalid cluster names for AWS, GCP, and Azure.

# Define a function to test the regex.
test_regex() {
  local name="$1"
  local sanitized_name
  sanitized_name=$(echo "$name" | sed -E 's/[^a-z0-9]/-/g')
  echo "Original: $name, Sanitized: $sanitized_name"
}

# Known valid names (should remain unchanged)
test_regex "valid-cluster1"
test_regex "cluster-2"
test_regex "3cluster"

# Known invalid names (should be sanitized)
test_regex "Invalid_Cluster!"
test_regex "UPPERCASE"
test_regex "cluster@name"

Length of output: 1527

Copy link
Contributor

@htahir1 htahir1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! I think this is a good version of the orchestrator that allows step specific configuration but it does have the barrier that the client is sequentially launching these jobs one by one and waiting...

Perhaps a better implementation is what happens in the k8s orchestrator where the orchestrator pod is launched by the client who then subsequently launches executor jobs. Also it launches them topologically, meaning parallel steps are executed in parallel.

I think we have an entrypoint configuration for that in the k8s orchestrator and i have a branch somewhere which used that approach.

WDYT?

@strickvl
Copy link
Contributor

strickvl commented Jan 5, 2024

@coderabbitai review

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: .coderabbit.yaml

Commits Files that changed from the base of the PR and between 27fc323 and 7452667.
Files selected for processing (4)
  • src/zenml/integrations/aws/service_connectors/aws_service_connector.py (2 hunks)
  • src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (6 hunks)
  • src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (1 hunks)
  • src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint_configuration.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • src/zenml/integrations/aws/service_connectors/aws_service_connector.py
Additional comments: 6
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint_configuration.py (3)
  • 25-39: The get_entrypoint_options method correctly returns a set of options required for the entrypoint. The set is well-defined and immutable, which is appropriate for this use case.

  • 41-53: The get_entrypoint_command method provides a command list to run the entrypoint module. The command is constructed correctly and uses the Python module invocation syntax.

  • 56-77: The get_entrypoint_arguments method constructs a list of arguments for the entrypoint command. The use of f-strings for argument construction is correct and follows best practices for readability and maintainability.

src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (1)
  • 46-55: The parse_args function correctly defines and parses command-line arguments using argparse. The arguments run_name and deployment_id are marked as required, which is appropriate for their usage.
src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (2)
  • 17-17: The import of the re module is new and is used later in the sanitize_cluster_name method. It's important to ensure that this import is necessary and that the re module is not already imported elsewhere in the file.
Verification successful

The re module is imported only once in the skypilot_base_vm_orchestrator.py file.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Check if the `re` module is imported multiple times in the file.
rg --count "import re" src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py

Length of output: 105

* 288-299: The `sanitize_cluster_name` method uses regular expressions to sanitize the input name for use as a cluster name. The method correctly replaces invalid characters with hyphens and ensures the name does not start or end with a hyphen. This is a good practice for ensuring valid identifiers in cloud environments.

@zenml-io zenml-io deleted a comment from coderabbitai bot Jan 5, 2024
@zenml-io zenml-io deleted a comment from coderabbitai bot Jan 5, 2024
@strickvl
Copy link
Contributor

strickvl commented Jan 9, 2024

@coderabbitai review

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 3

Configuration used: .coderabbit.yaml

Commits Files that changed from the base of the PR and between f9d51d8 and ec7d09b.
Files selected for processing (4)
  • src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py (6 hunks)
  • src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (7 hunks)
  • src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (1 hunks)
  • src/zenml/orchestrators/step_launcher.py (1 hunks)
Additional comments: 10
src/zenml/integrations/skypilot/orchestrators/skypilot_orchestrator_entrypoint.py (5)
  • 16-39: The imports and global variable definition (logger) are standard and appear to be correct. No issues found here.

  • 42-51: The parse_args function is simple and correctly defines the required command-line arguments. It uses the argparse library which is a standard choice for command-line parsing in Python.

  • 54-260: The main function is quite extensive and needs to be broken down for a detailed review. Let's start with the initial setup and parsing of arguments, which seems to be done correctly. The use of socket.gethostname() to obtain the orchestrator_run_id is a standard approach. The retrieval of the deployment using the Client object and the construction of the pipeline_dag dictionary are also standard practices in Python.

  • 106-139: The loop that constructs unique_resource_configs is complex and involves several steps, including casting settings to SkypilotBaseOrchestratorSettings and handling different types of accelerators. The construction of cluster_name involves sanitizing parts of the resource configuration, which is a good practice to avoid issues with invalid names. However, the concatenation of cluster_name should be reviewed to ensure that it results in a valid and unique name for each cluster.

  • 150-253: The run_step_on_skypilot_vm function is responsible for running a pipeline step in a separate SkyPilot VM. It correctly retrieves the cluster_name from unique_resource_configs, constructs the Docker run command, and sets up the task with the appropriate resources. The use of sky.launch to start the task and the subsequent logging and error handling appear to be correct. However, the error handling within the while loop (lines 223-233) should be reviewed to ensure that it correctly identifies when a step is finished and handles any exceptions that may occur.

src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py (2)
  • 17-35: The imports at the beginning of the file are standard and include the newly added re module for regular expressions, which is likely used in the sanitize_cluster_name method. The import of Environment is also new and should be used appropriately within the file.

  • 298-312: The sanitize_cluster_name method uses regular expressions to ensure that the cluster name is valid. This is a new addition and is a good practice to avoid issues with invalid cluster names. The method correctly replaces invalid characters with hyphens and trims leading and trailing hyphens.

src/zenml/orchestrators/step_launcher.py (1)
  • 215-216: The change in the launch method from logger.error to logger.exception is a good improvement as it will provide more detailed error information, including the stack trace, which is valuable for debugging.
src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py (2)
  • 26-28: The addition of imports for shutil, subprocess, and tempfile is noted and seems to be used later in the _configure_local_client method for handling local gcloud CLI configuration.

  • 1050-1114: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [1-1297]

Overall, the changes to the gcp_service_connector.py file are well-structured and seem to follow the objectives outlined in the PR. The modifications to the _configure_local_client and _auto_configure methods are particularly important for handling GCP service account credentials and local gcloud CLI configuration. It is recommended to ensure that all subprocess calls handle text encoding explicitly and that error messages provide clear, actionable advice to users.

Comment on lines 1388 to 1389
os.chmod(os.path.dirname(aws_credentials_path), 0o700)
os.chmod(aws_credentials_path, 0o600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@safoinme can you use the code I suggested here ?

Copy link
Contributor

@htahir1 htahir1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a wonderful improvement! Left a few comments only

safoinme and others added 4 commits January 11, 2024 10:18
…ector.py

Co-authored-by: Alex Strick van Linschoten <strickvl@users.noreply.github.com>
…ub.com:zenml-io/zenml into feature/OSS-2680-multi-image-multi-vm-skypilot
Copy link
Contributor

@htahir1 htahir1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side it looks good, and i know you tested it!

Copy link
Contributor

@htahir1 htahir1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side it looks good, and i know you tested it!

NAME = SKYPILOT_AWS
REQUIREMENTS = ["skypilot[aws]"]
NAME = SKYPILOT_GCP
REQUIREMENTS = ["skypilot[aws,gcp,azure]<=0.4.1"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need all the 3 clouds for one integration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a potential bug that we have a ticket for to investigate, it doesn't matter what cloud we use, zenml always pick up skypilot[aws] and install it in the docker image. So the workaround was to set all 3 of them.

Copy link
Contributor

@htahir1 htahir1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions

# Settings for a specific step that requires more resources
high_resource_settings = SkypilotAWSOrchestratorSettings(
instance_type='t2.2xlarge',
cpus=8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldnt we take this from ResourceSettings actually now that I think about it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the GPU

Copy link
Contributor Author

@safoinme safoinme Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to distinguish between both of them, because ResourceSettings can be defined for things like pods which would allow random resource given, while for Skypilot most of these parameters must match if supplied together because a VM type with this specific resource must exist

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the cpu_count and memory in ResourceSettings should be mapped here... what is the point of ResourceSettings otherwise? @schustmi would love your feedback

@safoinme safoinme merged commit a0fd100 into develop Jan 12, 2024
11 of 22 checks passed
@safoinme safoinme deleted the feature/OSS-2680-multi-image-multi-vm-skypilot branch January 12, 2024 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request internal To filter out internal PRs and issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants