
AWS do Kubeflow (aws-do-kubeflow)
Deploy and Manage Kubeflow on AWS using the do-framework

Overview

Kubeflow is an open-source project that deploys on Kubernetes and provides end-to-end ML platform and workflow capabilities. There are a number of ways to deploy Kubeflow, as well as many variations of Kubeflow that can be deployed. The goal of aws-do-kubeflow is to simplify the deployment and management of Kubeflow on AWS and to provide some useful ML examples. This project follows the principles of the Do Framework and the structure of the Depend on Docker template. It containerizes all the tools necessary to deploy and manage Kubeflow using Docker, then executes the deployment from within the container. All you need is an AWS account.

For a hands-on experience with Kubeflow and its application for distributed ML training workflows, please see our online workshop and walk through the self-paced workshop steps.

Below is an overview diagram that shows the general architecture of a Kubeflow deployment on Amazon EKS.


Fig.1 - Deployment Architecture

The deployment process is shown in Fig. 2 below:


Fig.2 - Kubeflow deployment process with aws-do-kubeflow

Prerequisites

  1. AWS Account - you will need an AWS account
  2. EKS Cluster - it is assumed that an EKS cluster already exists in the account. If a cluster is needed, one way to create it is by following the instructions in the aws-do-eks project. In that case it is recommended to use the cluster manifest /eks/eks-kubeflow.yaml, located within the aws-do-eks container.
  3. Optionally, you can create an Amazon SageMaker HyperPod cluster and deploy Kubeflow there. If a HyperPod cluster is needed, one way to create it is by following the instructions in the aws-do-hyperpod project.
  4. Default StorageClass - it is assumed that a default StorageClass already exists in the underlying EKS cluster when deploying Kubeflow. Some of the Kubeflow components require storage volumes and will create them using the default StorageClass. Please ensure a default StorageClass is set up before deploying Kubeflow. If you need to create one, you can follow the instructions in the section "Create Default StorageClass" below.

Create Default StorageClass

Some Kubeflow components require a persistent volume, which they attach to the corresponding pod. These components create the volumes automatically during the deployment of Kubeflow, but in order to do so they require a default StorageClass to be set up in your EKS cluster. Below we show how to set up a default StorageClass backed by FSx for Lustre. Note that you can use other storage options (e.g. EFS) and do not have to use FSx for Lustre.

To deploy a default StorageClass, you can either use our automatic deployment scripts or set up all the necessary resources yourself. A detailed step-by-step guide for each option is provided below.

Manual deployment

Install the Amazon FSx for Lustre CSI Driver

The Amazon FSx for Lustre Container Storage Interface (CSI) driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.

Create an IAM OIDC identity provider for your cluster with the following command:

eksctl utils associate-iam-oidc-provider --cluster $AWS_CLUSTER_NAME --approve
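
To verify the association, you can print your cluster's OIDC issuer URL with the AWS CLI:

aws eks describe-cluster --name $AWS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text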

Deploy the FSx for Lustre CSI driver:

helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver

helm repo update

helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
  --namespace kube-system
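
Once the chart is installed, you can confirm that the controller deployment exists:

kubectl get deployment fsx-csi-controller -n kube-system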

[!NOTE]
This Helm chart includes a service account named fsx-csi-controller-sa that gets deployed in the kube-system namespace.

Use the eksctl CLI to create an IAM role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS-managed policy:

eksctl create iamserviceaccount \
  --name fsx-csi-controller-sa \
  --override-existing-serviceaccounts \
  --namespace kube-system \
  --cluster $AWS_CLUSTER_NAME \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
  --approve \
  --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
  --region $AWS_REGION

[!NOTE]
The --override-existing-serviceaccounts flag lets eksctl know that the fsx-csi-controller-sa service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the current service account instead.

Annotate the driver's service account with the ARN of the AmazonEKSFSxLustreCSIDriverFullAccess IAM role that was created:

SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)

kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
  eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true

This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.
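
You can verify that the annotation is in place:

kubectl describe serviceaccount fsx-csi-controller-sa -n kube-system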

Restart the fsx-csi-controller deployment for the changes to take effect:

kubectl rollout restart deployment fsx-csi-controller -n kube-system
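
Optionally, wait for the restarted pods to become ready:

kubectl rollout status deployment fsx-csi-controller -n kube-system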

Create default StorageClass

Create the StorageClass for FSx for Lustre and ensure that it is annotated as default.
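
The manifest below references the environment variables $SUBNET_ID and $SECURITYGROUP_ID. As a minimal sketch, assuming you want to place the FSx for Lustre file system in the first subnet of the cluster's VPC and use the cluster security group (adjust both to match your networking and FSx requirements), they could be populated like this:

export SUBNET_ID=$(aws eks describe-cluster --name $AWS_CLUSTER_NAME --query "cluster.resourcesVpcConfig.subnetIds[0]" --output text)
export SECURITYGROUP_ID=$(aws eks describe-cluster --name $AWS_CLUSTER_NAME --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)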

cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: fsx.csi.aws.com
parameters:
  subnetId: $SUBNET_ID
  securityGroupIds: $SECURITYGROUP_ID
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.15"
mountOptions:
  - flock
EOF

Now apply this StorageClass for it to take effect:

kubectl apply -f storageclass.yaml
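
You can confirm that fsx-sc is now the default StorageClass:

kubectl get storageclass

The fsx-sc entry should show (default) next to its name.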

Automatic deployment
  1. Navigate into the deployments/fsx/ directory by using cd /kubeflow/deployments/fsx
  2. Execute the deploy-requirements.sh script
  3. Execute the create-storageclass.sh script, as shown in the consolidated snippet below
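
Assuming the scripts are executable, the three steps above consolidate to the following, run from a shell inside the aws-do-kubeflow container:

cd /kubeflow/deployments/fsx
./deploy-requirements.sh
./create-storageclass.sh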

[!NOTE] If you would like to use a different kind of storage for your default StorageClass, simply install the necessary CSI drivers and edit the storageclass.yaml accordingly

Configure

All configuration settings of the aws-do-kubeflow project are centralized in its .env file. To review or change any of the settings, simply execute ./config.sh. The AWS_CLUSTER_NAME setting must match the name of your existing EKS Cluster, and AWS_REGION should match the AWS Region where the cluster is deployed.

The aws-do-kubeflow project supports both the generic and the AWS-specific Kubeflow distributions. The distribution you wish to deploy can be configured via the KF_DISTRO setting. By default, the project deploys the AWS vanilla distribution.
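
For illustration, the relevant settings might look like the following (placeholder values; the exact format and the supported KF_DISTRO values are defined in the project's .env file):

export AWS_CLUSTER_NAME=my-eks-cluster  # must match the name of your existing EKS cluster
export AWS_REGION=us-west-2             # AWS Region where the cluster is deployed
# KF_DISTRO selects the Kubeflow distribution to deploy; see .env for supported values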

Important

Please note that the AWS-specific Kubeflow distribution is no longer actively maintained.

Build

Please execute the ./build.sh script to build the project. This will create the "aws-do-kubeflow" container image and tag it using the registry and version tag specified in the project configuration.

Run

Execute ./run.sh to bring up the Docker container.

Status

To check if the container is up, execute ./status.sh. If the container is in Exited state, it can be started with ./start.sh.

Exec

Executing the ./exec.sh script will open a bash shell inside the aws-do-kubeflow container.

Deploy Kubeflow

To deploy your configured distribution of Kubeflow, simply execute ./kubeflow-deploy.sh

The deployment creates several groups of pods in your EKS cluster. Upon successful deployment, all pods will be in Running state. To check the state of all pods in the cluster, use command: kubectl get pods -A.

Note

Please note that the complete deployment can take up to 30 minutes until all resources and pods are in Running state.

Access Kubeflow Dashboard

In order to access the Kubeflow Dashboard, the Istio Ingress Gateway service of this Kubeflow deployment needs to be exposed outside the cluster. In a production deployment this is typically done via an Application Load Balancer (ALB), however this requires a DNS domain registration and a matching SSL certificate.

For an easy way to expose the Kubeflow Dashboard, we can use kubectl port-forward from any machine that has a browser and kubectl access to the cluster. To start the port-forward, execute the script ./kubeflow-expose.sh.
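
Under the hood this is a standard kubectl port-forward. A roughly equivalent manual command, assuming the default Kubeflow Istio ingress gateway service name and namespace, would be:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80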

If you are on a machine with its own browser, just navigate to localhost:8080 to open the Kubeflow Dashboard.

Note

Kubeflow uses a default email (user@example.com) and password (12341234). For any production Kubeflow deployment, you should change the default password by following the official Kubeflow documentation.

Note

Please change the mount path for Notebook Volumes when creating a new notebook to avoid permission-denied errors. You can use e.g. /volume/.


Fig. 3 - Kubeflow Dashboard

Remove Kubeflow Deployment

To remove your Kubeflow deployment, simply execute ./kubeflow-remove.sh from within the aws-do-kubeflow container.

Command reference

  • ./config.sh - configure aws-do-kubeflow project settings interactively
  • ./build.sh - build aws-do-kubeflow container image
  • ./login.sh - login to the configured container registry
  • ./push.sh - push aws-do-kubeflow container image to configured registry
  • ./pull.sh - pull aws-do-kubeflow container image from a configured existing registry
  • ./prune.sh - delete all unused docker containers, networks and images from the local host
  • ./run.sh - run aws-do-kubeflow container
  • ./status.sh - show current aws-do-kubeflow container status
  • ./logs.sh - show logs of the running aws-do-kubeflow container
  • ./start.sh - start the aws-do-kubeflow container if it is currently in "Exited" status
  • ./exec.sh - execute a command inside the running aws-do-kubeflow container; the default command is bash
  • ./stop.sh - stop and remove the aws-do-kubeflow container
  • ./test.sh - run container unit tests

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Troubleshooting

  • Namespaces are left in Terminating state when removing a Kubeflow deployment - execute the script ./configure/ns-clear.sh (a manual alternative is sketched below)
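
For context, a common manual technique for a namespace stuck in Terminating (shown as a general sketch, not a description of what ns-clear.sh does) is to clear the namespace finalizers; requires jq:

NS=kubeflow  # hypothetical example namespace
kubectl get namespace $NS -o json | jq '.spec.finalizers=[]' | \
  kubectl replace --raw "/api/v1/namespaces/$NS/finalize" -f -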

Credits

  • Mark Vinciguerra - @mvincig
  • Jason Dang - @jndang
  • Florian Stahl - @flostahl
  • Tatsuo Azeyanagi - @tazeyana
  • Alex Iankoulski - @iankouls
  • Kanwaljit Khurmi - @kkhurmi
  • Milena Boytchef - @boytchef
  • Gautam Kumar - @gauta
