AWS do Kubeflow (aws-do-kubeflow)
Deploy and Manage Kubeflow on AWS using the do-framework
Kubeflow is an open source project that deploys on Kubernetes and provides end-to-end ML platform and workflow capabilities. There are a number of ways to deploy Kubeflow, as well as several variations of Kubeflow that can be deployed. The goal of aws-do-kubeflow is to simplify the deployment and management of Kubeflow on AWS, as well as to provide some useful ML examples. This project follows the principles of the Do Framework and the structure of the Depend on Docker template. It containerizes all the tools necessary to deploy and manage Kubeflow using Docker, then executes the deployment from within the container. All you need is an AWS account.
For a hands-on experience with Kubeflow and its application for distributed ML training workflows, please see our online workshop and walk through the self-paced workshop steps.
Below is an overview diagram that shows the general architecture of a Kubeflow deployment on Amazon EKS.
Fig.1 - Deployment Architecture
The deployment process is described on Fig. 2 below:
Fig.2 - Kubeflow deployment process with aws-do-kubeflow
- AWS Account - you will need an AWS account
- EKS Cluster - it is assumed that an EKS cluster already exists in the account. If a cluster is needed, one way to create it is by following the instructions in the aws-do-eks project. In that case, it is recommended to use the cluster manifest `/eks/eks-kubeflow.yaml`, located within the aws-do-eks container.
- Amazon SageMaker HyperPod Cluster (optional) - you can create an Amazon SageMaker HyperPod cluster and deploy Kubeflow there. If a HyperPod cluster is needed, one way to create it is by following the instructions in the aws-do-hyperpod project.
- Default StorageClass - it is assumed that a default StorageClass already exists in the underlying EKS cluster when deploying Kubeflow. Some of the Kubeflow components require storage volumes to be available and will create these using a default StorageClass. Please ensure a default StorageClass is set up before deploying Kubeflow. If you need to create one, you can follow the instructions below in the section "Create default StorageClass".
Some Kubeflow components require a persistent volume that they attach to their corresponding pods. These components create the volumes automatically during the deployment of Kubeflow, but in order to do so they require a default StorageClass to be set up in your EKS cluster. Below we show how to set up a default StorageClass backed by FSx for Lustre. Note that you can use other storage options (e.g. EFS) and do not have to use FSx for Lustre.
To deploy a default StorageClass, you can either use our automatic deployment scripts or set up all the necessary resources yourself. A detailed step-by-step guide for each option is provided below.
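You can check whether a default StorageClass already exists in your cluster with the command below; the default class is marked with "(default)" next to its name:
kubectl get storageclass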
Manual deployment
The Amazon FSx for Lustre Container Storage Interface (CSI) driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.
Create an IAM OIDC identity provider for your cluster with the following command:
eksctl utils associate-iam-oidc-provider --cluster $AWS_CLUSTER_NAME --approve
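If you are unsure whether an OIDC provider is already associated with your cluster, you can retrieve the cluster's OIDC issuer URL and compare it against the providers registered in your account:
# Show the OIDC issuer URL of the cluster
aws eks describe-cluster --name $AWS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text
# List the OIDC providers in the account
aws iam list-open-id-connect-providers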
Deploy the FSx for Lustre CSI driver:
helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
  --namespace kube-system
[!NOTE]
This Helm chart includes a service account named `fsx-csi-controller-sa` that gets deployed in the `kube-system` namespace.
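Before proceeding, you can verify that the driver pods have started; the controller and node pods should be in Running state:
kubectl get pods -n kube-system | grep fsx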
Use the eksctl CLI to create an IAM role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS-managed policy:
eksctl create iamserviceaccount \
--name fsx-csi-controller-sa \
--override-existing-serviceaccounts \
--namespace kube-system \
--cluster $AWS_CLUSTER_NAME \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
--approve \
--role-name AmazonEKSFSxLustreCSIDriverFullAccess \
--region $AWS_REGION
[!NOTE]
The `--override-existing-serviceaccounts` flag lets eksctl know that the `fsx-csi-controller-sa` service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the current service account instead.
Annotate the driver's service account with the ARN of the `AmazonEKSFSxLustreCSIDriverFullAccess` IAM role that was created:
SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)
kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true
This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.
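You can verify that the annotation is in place by describing the service account:
kubectl describe serviceaccount fsx-csi-controller-sa -n kube-system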
Restart the fsx-csi-controller deployment for the changes to take effect:
kubectl rollout restart deployment fsx-csi-controller -n kube-system
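To confirm that the restart has completed, wait for the rollout to finish:
kubectl rollout status deployment fsx-csi-controller -n kube-system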
Create the StorageClass for FSx for Lustre and ensure that it is annotated as default.
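The manifest below references the $SUBNET_ID and $SECURITYGROUP_ID environment variables. Set them to a subnet and a security group from your cluster's VPC before generating the file; the values shown here are placeholders for illustration only:
# Placeholder values - substitute a subnet and security group from your cluster's VPC
export SUBNET_ID=subnet-0123456789abcdef0
export SECURITYGROUP_ID=sg-0123456789abcdef0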
cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: fsx.csi.aws.com
parameters:
  subnetId: $SUBNET_ID
  securityGroupIds: $SECURITYGROUP_ID
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.15"
mountOptions:
  - flock
EOF
Now apply this StorageClass so it takes effect:
kubectl apply -f storageclass.yaml
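You can confirm that the new StorageClass exists and is marked as default:
kubectl get storageclass fsx-sc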
Automatic deployment
- Navigate into the `deployments/fsx/` directory by using `cd /kubeflow/deployments/fsx`
- Execute the `deploy-requirements.sh` script
- Execute the `create-storageclass.sh` script
[!NOTE]
If you would like to use a different kind of storage for your default StorageClass, simply install the necessary CSI drivers and edit `storageclass.yaml` accordingly.
All configuration settings of the aws-do-kubeflow project are centralized in its `.env` file. To review or change any of the settings, simply execute `./config.sh`. The `AWS_CLUSTER_NAME` setting must match the name of your existing EKS cluster, and `AWS_REGION` should match the AWS Region where the cluster is deployed.
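As an illustration, the relevant `.env` entries could look like the following; the cluster name and region values are placeholders, so adjust them to your environment:
# Placeholder values - set to your own cluster name and region
export AWS_CLUSTER_NAME=my-eks-cluster
export AWS_REGION=us-east-1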
The aws-do-kubeflow project supports both the generic and the AWS-specific Kubeflow distributions. Your desired distribution can be configured via the `KF_DISTRO` setting. By default, the project deploys the AWS vanilla distribution.
Important
Please note that the AWS-specific Kubeflow distribution is no longer actively maintained.
Please execute the `./build.sh` script to build the project. This will create the "aws-do-kubeflow" container image and tag it using the registry and version tag specified in the project configuration.
Execute `./run.sh` to bring up the Docker container.
To check if the container is up, execute `./status.sh`. If the container is in Exited state, it can be started with `./start.sh`.
Executing the `./exec.sh` script will open a bash shell inside the aws-do-kubeflow container.
To deploy your configured distribution of Kubeflow, simply execute `./kubeflow-deploy.sh`.
The deployment creates several groups of pods in your EKS cluster. Upon successful deployment, all pods will be in Running state. To check the state of all pods in the cluster, use the command `kubectl get pods -A`.
Note
Please note that the complete deployment can take up to 30 minutes before all resources and pods reach Running state.
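While you wait, you can watch the pods come up:
kubectl get pods -A --watch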
In order to access the Kubeflow Dashboard, the Istio Ingress Gateway service of this Kubeflow deployment needs to be exposed outside the cluster. In a production deployment this is typically done via an Application Load Balancer (ALB); however, this requires a DNS domain registration and a matching SSL certificate.
For an easy way to expose the Kubeflow Dashboard, we can use `kubectl port-forward` from any machine that has a browser and kubectl access to the cluster.
To start the port-forward, execute the script `./kubeflow-expose.sh`.
If you are on a machine with its own browser, just navigate to localhost:8080 to open the Kubeflow Dashboard.
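For reference, the port-forward started by the script is roughly equivalent to the following command, assuming the standard Kubeflow Istio Ingress Gateway service name and namespace:
# Forward local port 8080 to the Istio Ingress Gateway service port 80
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80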
Note
Kubeflow uses a default email (`user@example.com`) and password (`12341234`). For any production Kubeflow deployment, you should change the default password by following the official Kubeflow documentation.
Note
Please change the mount path for Notebook Volumes when creating a new notebook to avoid `permission denied` errors. For example, you can use `/volume/`.
Fig. 3 - Kubeflow Dashboard
To remove your Kubeflow deployment, simply execute `./kubeflow-remove.sh` from within the aws-do-kubeflow container.
- ./config.sh - configure aws-do-kubeflow project settings interactively
- ./build.sh - build aws-do-kubeflow container image
- ./login.sh - login to the configured container registry
- ./push.sh - push aws-do-kubeflow container image to configured registry
- ./pull.sh - pull aws-do-kubeflow container image from a configured existing registry
- ./prune.sh - delete all unused docker containers, networks and images from the local host
- ./run.sh - run aws-do-kubeflow container
- ./status.sh - show current aws-do-kubeflow container status
- ./logs.sh - show logs of the running aws-do-kubeflow container
- ./start.sh - start the aws-do-kubeflow container if it is currently in "Exited" status
- ./exec.sh - execute a command inside the running aws-do-kubeflow container, the default command is `bash`
- ./stop.sh - stop and remove the aws-do-kubeflow container
- ./test.sh - run container unit tests
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
- Namespaces are left in Terminating state when removing a Kubeflow deployment - execute the script `./configure/ns-clear.sh`
- Mark Vinciguerra - @mvincig
- Jason Dang - @jndang
- Florian Stahl - @flostahl
- Tatsuo Azeyanagi - @tazeyana
- Alex Iankoulski - @iankouls
- Kanwaljit Khurmi - @kkhurmi
- Milena Boytchef - @boytchef
- Gautam Kumar - @gauta
- Machine Learning Using Kubeflow
- Docker
- Kubernetes
- Kubeflow
- Amazon Web Services
- Depend On Docker Project
- AWS Do EKS Project
- Amazon SageMaker HyperPod
- AWS Do HyperPod Project
- Kubeflow on AWS
- AWS Kubeflow Deployment
- AWS Kubeflow Blog
- AWS Kubeflow Multitenancy
- Kubeflow Pipelines
- Kubeflow Training Operator
- EKS Distributed Training Workshop
- Kubeflow MPI Operator
- Distributed Training with Tensorflow and Kubeflow
- Distributed Training using Pytorch with Kubeflow
- Build Flexible and Scalable Distributed Training Architectures using Kubeflow on AWS and Amazon SageMaker