Allow multi step configuration for skypilot #2166

Merged 41 commits on Jan 12, 2024 (changes shown from 36 commits).

Commits:
27fc323  allow multi step configuration for skypilot (safoinme, Dec 19, 2023)
6f50333  fixes (safoinme, Dec 20, 2023)
f109795  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (strickvl, Dec 20, 2023)
11a18b9  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (safoinme, Jan 4, 2024)
db2c007  Merge branch 'feature/OSS-2680-multi-image-multi-vm-skypilot' of gith… (safoinme, Jan 4, 2024)
4471d48  Auto-update of Starter template (actions-user, Jan 4, 2024)
57ed6df  update skypilot to allow running using an orchestrating VM (safoinme, Jan 4, 2024)
0e02dee  Merge branch 'feature/OSS-2680-multi-image-multi-vm-skypilot' of gith… (safoinme, Jan 4, 2024)
7452667  add ability to create an aws config when configuring local client (safoinme, Jan 4, 2024)
32de4aa  Add advanced AWS configurations to config.yaml (safoinme, Jan 5, 2024)
0eb1f97  Fix cluster name concatenation in skypilot orchestrator entrypoint (safoinme, Jan 5, 2024)
4270edd  Fix AWS service connector and Skypilot orchestrator issues (safoinme, Jan 5, 2024)
dfdedf9  Add APT packages to Skypilot integrations and sanitize cluster name (safoinme, Jan 6, 2024)
b6c9f4c  Remove advanced AWS configurations from config.yaml (safoinme, Jan 6, 2024)
301bfa6  Update Skypilot orchestrator code (safoinme, Jan 6, 2024)
f9d51d8  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (safoinme, Jan 6, 2024)
559b710  Auto-update of Starter template (actions-user, Jan 6, 2024)
7476f85  Refactor cluster name sanitization in SkypilotBaseOrchestrator (safoinme, Jan 6, 2024)
09c368d  merge changes (safoinme, Jan 6, 2024)
4077eda  Refactor SkypilotBaseOrchestrator and SkypilotOrchestratorEntrypoint (safoinme, Jan 7, 2024)
5057bae  Remove commented out code and update Skypilot orchestrator entrypoint (safoinme, Jan 7, 2024)
b20b975  Fix GCP service connector support for local gcloud CLI login (stefannica, Jan 8, 2024)
8699b5a  Fix Skypilot orchestrator cluster name generation (safoinme, Jan 9, 2024)
9898aef  Merge branch 'feature/OSS-2680-multi-image-multi-vm-skypilot' of gith… (safoinme, Jan 9, 2024)
ec7d09b  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (safoinme, Jan 9, 2024)
21af350  Refactor cluster creation logic in SkypilotBaseOrchestrator (safoinme, Jan 10, 2024)
69b5924  Allow custom docker run args when using the Skypilot orchestrator (schustmi, Jan 10, 2024)
9964220  Handle exception and log error message in SkypilotBaseOrchestrator (safoinme, Jan 10, 2024)
f65e61d  Refactor AWS service connector to improve file handling (safoinme, Jan 10, 2024)
b33d4bb  Merge branch 'feature/OSS-2680-multi-image-multi-vm-skypilot' of gith… (safoinme, Jan 10, 2024)
daaaefc  Refactor Skypilot integrations for ZenML (safoinme, Jan 10, 2024)
dddd6fe  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (safoinme, Jan 10, 2024)
a3271a6  Add encoding parameter to subprocess call (safoinme, Jan 10, 2024)
794cba4  Update Skypilot integration for AWS, GCP, and Azure (safoinme, Jan 10, 2024)
f40796f  Add note about configuring pipeline resources for specific orchestrat… (safoinme, Jan 10, 2024)
46170b6  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (strickvl, Jan 11, 2024)
89903b8  Update src/zenml/integrations/gcp/service_connectors/gcp_service_conn… (safoinme, Jan 11, 2024)
44ea432  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (safoinme, Jan 11, 2024)
df16014  Add error handling for running code in a notebook and update Skypilot… (safoinme, Jan 11, 2024)
69a7ce4  Merge branch 'feature/OSS-2680-multi-image-multi-vm-skypilot' of gith… (safoinme, Jan 11, 2024)
7ef6179  Merge branch 'develop' into feature/OSS-2680-multi-image-multi-vm-sky… (safoinme, Jan 12, 2024)
@@ -65,16 +65,16 @@ To use the SkyPilot VM Orchestrator, you need:
* One of the SkyPilot integrations installed. You can install the SkyPilot integration for your cloud provider of choice using the following command:
```shell
# For AWS
-pip install "zenml[connectors-gcp]"
-zenml integration install aws vm_aws
+pip install "zenml[connectors-aws]"
+zenml integration install aws skypilot_aws

# for GCP
pip install "zenml[connectors-gcp]"
-zenml integration install gcp vm_gcp # for GCP
+zenml integration install gcp skypilot_gcp # for GCP

# for Azure
pip install "zenml[connectors-azure]"
-zenml integration install azure vm_azure # for Azure
+zenml integration install azure skypilot_azure # for Azure
```
* [Docker](https://www.docker.com) installed and running.
* A [remote artifact store](../artifact-stores/artifact-stores.md) as part of your stack.
@@ -91,7 +91,7 @@ We need first to install the SkyPilot integration for AWS and the AWS connectors

```shell
pip install "zenml[connectors-aws]"
-zenml integration install aws vm_aws
+zenml integration install aws skypilot_aws
```

To provision VMs on AWS, your VM Orchestrator stack component needs to be configured to authenticate with [AWS Service Connector](../../../stacks-and-components/auth-management/aws-service-connector.md).
@@ -141,7 +141,7 @@ We need first to install the SkyPilot integration for GCP and the GCP extra for

```shell
pip install "zenml[connectors-gcp]"
-zenml integration install gcp vm_gcp
+zenml integration install gcp skypilot_gcp
```

To provision VMs on GCP, your VM Orchestrator stack component needs to be configured to authenticate with [GCP Service Connector](../../../stacks-and-components/auth-management/gcp-service-connector.md)
@@ -199,7 +199,7 @@ We need first to install the SkyPilot integration for Azure and the Azure extra

```shell
pip install "zenml[connectors-azure]"
-zenml integration install azure vm_azure
+zenml integration install azure skypilot_azure
```

To provision VMs on Azure, your VM Orchestrator stack component needs to be configured to authenticate with [Azure Service Connector](../../../stacks-and-components/auth-management/azure-service-connector.md)
@@ -371,6 +371,46 @@ skypilot_settings = SkypilotAzureOrchestratorSettings(
{% endtab %}
{% endtabs %}


One of the key features of the SkyPilot VM Orchestrator is the ability to run each step of a pipeline on a separate VM with its own specific settings. This allows for fine-grained control over the resources allocated to each step, ensuring that each part of your pipeline has the necessary compute power while optimizing for cost and efficiency.

## Configuring Step-Specific Resources

The SkyPilot VM Orchestrator allows you to configure resources for each step individually. This means you can specify different VM types, CPU and memory requirements, and even use spot instances for certain steps while using on-demand instances for others.

To configure step-specific resources, you can pass a `SkypilotBaseOrchestratorSettings` object to the `settings` parameter of the `@step` decorator. This object allows you to define various attributes such as `instance_type`, `cpus`, `memory`, `use_spot`, `region`, and more.

Here's an example of how to configure specific resources for a step on AWS:

```python
from zenml.integrations.skypilot.flavors.skypilot_orchestrator_aws_vm_flavor import SkypilotAWSOrchestratorSettings

# Settings for a specific step that requires more resources
high_resource_settings = SkypilotAWSOrchestratorSettings(
    instance_type='t2.2xlarge',
    cpus=8,
    memory=32,
    use_spot=False,
    region='us-east-1',
    # ... other settings
)

@step(settings={"orchestrator.vm_aws": high_resource_settings})
def my_resource_intensive_step():
    # Step implementation
    pass
```

Review thread on the `cpus=8` line:

Contributor: Shouldn't we take this from `ResourceSettings`, actually, now that I think about it?

Contributor: This and the GPU.

Contributor Author (@safoinme, Jan 12, 2024): I think it would be better to distinguish between the two, because `ResourceSettings` can be defined for things like pods, which allows arbitrary resource values, while for SkyPilot most of these parameters must match when supplied together, because a VM type with exactly those resources has to exist.

Contributor: IMO the `cpu_count` and `memory` in `ResourceSettings` should be mapped here... what is the point of `ResourceSettings` otherwise? @schustmi would love your feedback.
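For context on the thread above: ZenML's generic `ResourceSettings` is attached to a step under the `resources` settings key rather than the orchestrator-flavor key. A minimal sketch, assuming the standard `zenml` import paths; the values are illustrative:

```python
from zenml import step
from zenml.config import ResourceSettings

# Generic, orchestrator-agnostic resource request (illustrative values).
@step(settings={"resources": ResourceSettings(cpu_count=8, memory="32GB")})
def my_step() -> None:
    ...
```

The open question in the thread is whether these generic fields should be mapped onto the SkyPilot-specific `cpus`/`memory` settings or kept separate, since SkyPilot needs the requested combination to correspond to a real VM type.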

{% hint style="warning" %}
When configuring pipeline- or step-specific resources, the key you pass to the
`settings` parameter must target the orchestrator *flavor*
(`orchestrator.STACK_COMPONENT_FLAVOR`), not the orchestrator component *name*
(`orchestrator.STACK_COMPONENT_NAME`). For example, to configure resources for
the `vm_gcp` flavor, use `settings={"orchestrator.vm_gcp": ...}`.
{% endhint %}

By using the `settings` parameter, you can tailor the resources for each step according to its specific needs. This flexibility allows you to optimize your pipeline execution for both performance and cost.
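For example, a lightweight step could run on a cheap spot instance while the resource-intensive step above stays on-demand. A minimal sketch reusing the same AWS settings class; the instance type, region, and step name are illustrative, and `step` is assumed to be importable from `zenml`:

```python
from zenml import step
from zenml.integrations.skypilot.flavors.skypilot_orchestrator_aws_vm_flavor import (
    SkypilotAWSOrchestratorSettings,
)

# Illustrative low-cost settings: a small VM on spot pricing.
low_cost_settings = SkypilotAWSOrchestratorSettings(
    instance_type="t2.medium",
    use_spot=True,
    region="us-east-1",
)

@step(settings={"orchestrator.vm_aws": low_cost_settings})
def my_lightweight_step() -> None:
    # Step implementation
    pass
```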

Check out
the [SDK docs](https://sdkdocs.zenml.io/latest/integration\_code\_docs/integrations-skypilot/#zenml.integrations.skypilot.flavors.skypilot\_orchestrator\_base\_vm\_flavor.SkypilotBaseOrchestratorSettings)
for a full list of available attributes and [this docs page](/docs/book/user-guide/advanced-guide/pipelining-features/configure-steps-pipelines.md) for more
@@ -27,6 +27,7 @@
import base64
import datetime
import json
import os
import re
from typing import Any, Dict, List, Optional, Tuple, cast

@@ -1371,6 +1372,21 @@ def _configure_local_client(
"aws_session_token"
] = credentials.token

aws_credentials_path = os.path.join(
users_home, ".aws", "credentials"
)

# Create the file as well as the parent dir if needed.
dirname = os.path.split(aws_credentials_path)[0]
if not os.path.isdir(dirname):
os.makedirs(dirname)
with os.fdopen(
os.open(aws_credentials_path, os.O_WRONLY | os.O_CREAT, 0o600),
"w",
):
pass

# Write the credentials to the file
common.rewrite_credentials_file(all_profiles, users_home)

logger.info(
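The hunk above first creates `~/.aws/credentials` with owner-only (0600) permissions before the shared `rewrite_credentials_file` helper writes the profiles into it. A standalone sketch of the same create-with-restrictive-permissions pattern; the path is illustrative:

```python
import os

def ensure_private_file(path: str) -> None:
    """Create `path` and its parent directory with 0600 permissions if missing."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # The 0o600 mode is applied only when os.O_CREAT actually creates the file;
    # an existing file keeps its current permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    os.close(fd)

ensure_private_file(os.path.expanduser("~/.aws/credentials"))
```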
108 changes: 88 additions & 20 deletions src/zenml/integrations/gcp/service_connectors/gcp_service_connector.py
@@ -23,6 +23,9 @@
import json
import os
import re
import shutil
import subprocess
import tempfile
from typing import Any, Dict, List, Optional, Tuple

import google.api_core.exceptions
@@ -1026,6 +1029,7 @@ def _configure_local_client(
NotImplementedError: If the connector instance does not support
local configuration for the configured resource type or
authentication method.registry
AuthorizationException: If the local client configuration fails.
"""
resource_type = self.resource_type

@@ -1034,16 +1038,8 @@

# There is no way to configure the local gcloud CLI to use
# temporary OAuth 2.0 tokens. However, we can configure it to use
# the user account credentials or service account credentials
if self.auth_method == GCPAuthenticationMethods.USER_ACCOUNT:
assert isinstance(self.config, GCPUserAccountConfig)
# Use the user account credentials JSON to configure the
# local gcloud CLI
gcloud_config_json = (
self.config.user_account_json.get_secret_value()
)

elif self.auth_method == GCPAuthenticationMethods.SERVICE_ACCOUNT:
# the service account credentials
if self.auth_method == GCPAuthenticationMethods.SERVICE_ACCOUNT:
assert isinstance(self.config, GCPServiceAccountConfig)
# Use the service account credentials JSON to configure the
# local gcloud CLI
@@ -1054,7 +1050,66 @@ def _configure_local_client(
if gcloud_config_json:
from google.auth import _cloud_sdk

# Dump the user account or service account credentials JSON to
if not shutil.which("gcloud"):
raise AuthorizationException(
"The local gcloud CLI is not installed. Please "
"install the gcloud CLI to use this feature."
)

# Write the credentials JSON to a temporary file
with tempfile.NamedTemporaryFile(
mode="w", suffix=".json", delete=True
) as f:
f.write(gcloud_config_json)
f.flush()
adc_path = f.name

try:
# Run the gcloud CLI command to configure the local
# gcloud CLI to use the credentials JSON
subprocess.run(
[
"gcloud",
"auth",
"login",
"--cred-file",
adc_path,
],
check=True,
stderr=subprocess.STDOUT,
encoding="utf-8",
stdout=subprocess.PIPE,
)
except subprocess.CalledProcessError as e:
raise AuthorizationException(
f"Failed to configure the local gcloud CLI to use "
f"the credentials JSON: {e}\n"
f"{e.stdout.decode()}"
)

try:
# Run the gcloud CLI command to configure the local gcloud
# CLI to use the credentials project ID
subprocess.run(
[
"gcloud",
"config",
"set",
"project",
self.config.project_id,
],
check=True,
stderr=subprocess.STDOUT,
stdout=subprocess.PIPE,
)
except subprocess.CalledProcessError as e:
raise AuthorizationException(
f"Failed to configure the local gcloud CLI to use "
f"the project ID: {e}\n"
f"{e.stdout.decode()}"
)

# Dump the service account credentials JSON to
# the local gcloud application default credentials file
adc_path = (
_cloud_sdk.get_application_default_credentials_path()
@@ -1063,16 +1118,15 @@ def _configure_local_client(
f.write(gcloud_config_json)

logger.info(
f"Updated the local gcloud default application "
f"credentials file at '{adc_path}'"
"Updated the local gcloud CLI and application default "
f"credentials file ({adc_path})."
)

return

raise NotImplementedError(
f"Local gcloud client configuration for resource type "
f"{resource_type} is only supported if the "
f"'{GCPAuthenticationMethods.USER_ACCOUNT}' or "
f"'{GCPAuthenticationMethods.SERVICE_ACCOUNT}' authentication "
f"method is used and only if the generation of temporary OAuth "
f"2.0 tokens is disabled by setting the "
@@ -1221,13 +1275,27 @@ def _auto_configure(
"GOOGLE_APPLICATION_CREDENTIALS"
)
if service_account_json_file is None:
# Shouldn't happen since google.auth.default() should
# already have loaded the credentials from the environment
# No explicit service account JSON file was specified in the
# environment, meaning that the credentials were loaded from
# the GCP application default credentials (ADC) file.
from google.auth import _cloud_sdk

# Use the location of the gcloud application default
# credentials file
service_account_json_file = (
_cloud_sdk.get_application_default_credentials_path()
)

if not service_account_json_file or not os.path.isfile(
service_account_json_file
):
raise AuthorizationException(
"No GCP service account credentials found in the "
"environment. Please set the "
"GOOGLE_APPLICATION_CREDENTIALS environment variable "
"to the path of the service account JSON file."
"No GCP service account credentials were found in the "
"environment or the application default credentials"
"path. Please set the GOOGLE_APPLICATION_CREDENTIALS "
"environment variable to the path of the service "
"account JSON file or run 'gcloud auth application-"
"default login' to generate a new ADC file."
)
with open(service_account_json_file, "r") as f:
service_account_json = f.read()
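For reference, the fallback added to `_auto_configure` resolves the credentials file the same way the sketch below does — first the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, then the gcloud application default credentials (ADC) path. This relies on the private `google.auth._cloud_sdk` helper that the connector itself uses:

```python
import os
from google.auth import _cloud_sdk  # private helper, also used by the connector above

# Prefer an explicitly configured service account file, otherwise fall back
# to the gcloud application default credentials (ADC) location.
credentials_file = os.environ.get(
    "GOOGLE_APPLICATION_CREDENTIALS"
) or _cloud_sdk.get_application_default_credentials_path()

if not os.path.isfile(credentials_file):
    raise RuntimeError(f"No GCP credentials file found at {credentials_file}")
```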
36 changes: 18 additions & 18 deletions src/zenml/integrations/skypilot/__init__.py
@@ -30,52 +30,52 @@
SKYPILOT_GCP_ORCHESTRATOR_FLAVOR = "vm_gcp"
SKYPILOT_AZURE_ORCHESTRATOR_FLAVOR = "vm_azure"

class SkypilotGCPIntegration(Integration):
"""Definition of Skypilot Integration for ZenML."""

class SkypilotAWSIntegration(Integration):
"""Definition of Skypilot AWS Integration for ZenML."""

NAME = SKYPILOT_AWS
REQUIREMENTS = ["skypilot[aws]"]
NAME = SKYPILOT_GCP
REQUIREMENTS = ["skypilot[aws,gcp,azure]<=0.4.1"]
Review thread on the `REQUIREMENTS` line:

Contributor: Why do we need all three clouds for one integration?

Contributor Author: This is a potential bug that we have a ticket to investigate — it doesn't matter which cloud we use, ZenML always picks up `skypilot[aws]` and installs it in the Docker image, so the workaround was to set all three of them.

APT_PACKAGES = ["openssh-client","rsync"]

@classmethod
def flavors(cls) -> List[Type[Flavor]]:
"""Declare the stack component flavors for the Skypilot AWS integration.
"""Declare the stack component flavors for the Skypilot GCP integration.

Returns:
List of stack component flavors for this integration.
"""
from zenml.integrations.skypilot.flavors import (
SkypilotAWSOrchestratorFlavor,
SkypilotGCPOrchestratorFlavor,
)

return [SkypilotAWSOrchestratorFlavor]

return [SkypilotGCPOrchestratorFlavor]

class SkypilotGCPIntegration(Integration):
"""Definition of Skypilot Integration for ZenML."""
class SkypilotAWSIntegration(Integration):
"""Definition of Skypilot AWS Integration for ZenML."""

NAME = SKYPILOT_GCP
REQUIREMENTS = ["skypilot[gcp]"]
NAME = SKYPILOT_AWS
REQUIREMENTS = ["skypilot[aws,gcp,azure]<=0.4.1"]
APT_PACKAGES = ["openssh-client","rsync"]

@classmethod
def flavors(cls) -> List[Type[Flavor]]:
"""Declare the stack component flavors for the Skypilot GCP integration.
"""Declare the stack component flavors for the Skypilot AWS integration.

Returns:
List of stack component flavors for this integration.
"""
from zenml.integrations.skypilot.flavors import (
SkypilotGCPOrchestratorFlavor,
SkypilotAWSOrchestratorFlavor,
)

return [SkypilotGCPOrchestratorFlavor]

return [SkypilotAWSOrchestratorFlavor]

class SkypilotAzureIntegration(Integration):
"""Definition of Skypilot Integration for ZenML."""

NAME = SKYPILOT_AZURE
REQUIREMENTS = ["skypilot[azure]"]
REQUIREMENTS = ["skypilot[aws,gcp,azure]<=0.4.1"]
APT_PACKAGES = ["openssh-client","rsync"]

@classmethod
def flavors(cls) -> List[Type[Flavor]]:
@@ -13,7 +13,7 @@
# permissions and limitations under the License.
"""Skypilot orchestrator base config and settings."""

from typing import Dict, Literal, Optional, Union
from typing import Dict, List, Literal, Optional, Union

from zenml.config.base_settings import BaseSettings
from zenml.logger import get_logger
@@ -107,6 +107,8 @@ class SkypilotBaseOrchestratorSettings(BaseSettings):
down: bool = True
stream_logs: bool = True

docker_run_args: List[str] = []


class SkypilotBaseOrchestratorConfig( # type: ignore[misc] # https://github.com/pydantic/pydantic/issues/4173
BaseOrchestratorConfig, SkypilotBaseOrchestratorSettings
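The new `docker_run_args` field, introduced in commit 69b5924 ("Allow custom docker run args when using the Skypilot orchestrator"), lets users pass extra arguments to the `docker run` invocation on the provisioned VM. A minimal usage sketch; the flavor-specific settings class is taken from the docs changes above, while the argument values and pipeline name are illustrative:

```python
from zenml import pipeline
from zenml.integrations.skypilot.flavors.skypilot_orchestrator_aws_vm_flavor import (
    SkypilotAWSOrchestratorSettings,
)

# Illustrative: inject an environment variable and a larger shared-memory size
# into the container that runs the pipeline steps on the VM.
settings = SkypilotAWSOrchestratorSettings(
    docker_run_args=["--env", "MY_VAR=value", "--shm-size=2g"],
)

@pipeline(settings={"orchestrator.vm_aws": settings})
def my_pipeline() -> None:
    ...
```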