Kubeflow is an open source project that deploys on Kubernetes and provides an end-to-end ML platform with workflow capabilities. There are several ways to deploy Kubeflow, and several variations of Kubeflow that can be deployed. The goal of aws-do-kubeflow is to simplify the deployment and management of Kubeflow on AWS and to provide some useful ML examples. This project follows the principles of the Do Framework and the structure of the Depend on Docker template. It containerizes all the tools necessary to deploy and manage Kubeflow using Docker, then executes the deployment from within the container. All you need is an AWS Account.
For a hands-on experience with Kubeflow and its application for distributed ML training workflows, please see our online workshop and walk through the self-paced workshop steps.
Below is an overview diagram that shows the general architecture of a Kubeflow deployment on EKS.
Fig. 1 - Deployment Architecture
The deployment process is shown in Fig. 2 below:
Fig. 2 - Kubeflow deployment process with aws-do-kubeflow
- AWS Account - you will need an AWS account
- EKS Cluster - it is assumed that an EKS cluster already exists in the account. If a cluster is needed, one way to create it is by following the instructions in the aws-do-eks project. In that case, it is recommended to use the cluster manifest /eks/eks-kubeflow.yaml, located within the aws-do-eks container.
- Optionally, we recommend using AWS Cloud9 as a working environment. Instructions for setting up a Cloud9 IDE are available here
All configuration settings of the aws-do-kubeflow project are centralized in its .env file. To review or change any of the settings, execute ./config.sh. The AWS_CLUSTER_NAME setting must match the name of your existing EKS cluster, and AWS_REGION must match the AWS Region where the cluster is deployed.
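Before deploying, it can help to confirm that the two required settings are actually present. The following is a minimal sketch, not part of the project: the helper names get_setting and check_settings are hypothetical, and it assumes the .env file uses shell-style lines of the form KEY=value or export KEY=value.

```shell
# Hypothetical helper: print the value of KEY from an env-style file.
# Assumption: settings are written as KEY=value or export KEY=value.
get_setting() {
  local key="$1" file="$2"
  # Strip the optional "export" prefix and the "KEY=" part, keep the last match,
  # and drop surrounding double quotes if present.
  sed -n -E "s/^[[:space:]]*(export[[:space:]]+)?${key}=//p" "$file" | tail -n 1 | tr -d '"'
}

# Hypothetical helper: report required settings that are missing or empty.
# Returns 0 when both AWS_CLUSTER_NAME and AWS_REGION have a value, 1 otherwise.
check_settings() {
  local file="$1" missing=0 key
  for key in AWS_CLUSTER_NAME AWS_REGION; do
    if [ -z "$(get_setting "$key" "$file")" ]; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `check_settings .env && echo "settings look complete"`, run from the project root. The check only verifies that values exist; it does not confirm that they match a real cluster or region.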
The aws-do-kubeflow project supports both the generic and the AWS-specific Kubeflow distributions. The distribution to deploy is selected with the KF_DISTRO setting. By default, the project deploys the AWS vanilla distribution.
Execute the ./build.sh script to build the project. This creates the "aws-do-kubeflow" container image and tags it using the registry and version tag specified in the project configuration.
Execute ./run.sh to bring up the Docker container.
To check if the container is up, execute ./status.sh. If the container is in Exited state, it can be started with ./start.sh.
Executing the ./exec.sh script opens a bash shell inside the aws-do-kubeflow container.
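The run, status, start, and exec steps above amount to a small decision based on the container's state. The sketch below illustrates that decision only; next_step is a hypothetical helper, and it assumes the status text has the shape Docker prints in the STATUS column of docker ps -a (for example "Up 5 minutes" or "Exited (0) 2 hours ago").

```shell
# Hypothetical helper: suggest which project script to run next,
# given the container status text (empty if no container exists).
next_step() {
  case "$1" in
    Up*)     echo "./exec.sh" ;;   # container is running: open a shell in it
    Exited*) echo "./start.sh" ;;  # container exists but is stopped: start it
    "")      echo "./run.sh" ;;    # no container yet: create and run it
    *)       echo "./status.sh" ;; # any other state: inspect it first
  esac
}
```

One way to feed it, assuming the container is named aws-do-kubeflow (the name is an assumption here, check your configuration), is `next_step "$(docker ps -a --filter name=aws-do-kubeflow --format '{{.Status}}')"`.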
To deploy your configured distribution of Kubeflow, execute ./kubeflow-deploy.sh.
The deployment creates several groups of pods in your EKS cluster. Upon successful deployment, all pods will be in Running state. To check the state of all pods in the cluster, use the command kubectl get pods -A.
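With many pods in the cluster, it can be easier to list only those that are not yet Running. The following is a minimal sketch under that assumption; not_running_pods is a hypothetical helper that reads the output of kubectl get pods -A --no-headers on standard input, where the columns are namespace, name, ready count, and status.

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print "namespace/name status" for every pod whose status is not Running.
not_running_pods() {
  awk '$4 != "Running" { print $1 "/" $2 " " $4 }'
}
```

A possible invocation is `kubectl get pods -A --no-headers | not_running_pods`. Empty output means every pod reports Running. Pods that belong to finished jobs may legitimately show a status such as Completed, so a non-empty result is a prompt to look closer rather than proof of a failed deployment.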
To access the Kubeflow Dashboard, the Istio Ingress Gateway service of this Kubeflow deployment needs to be exposed outside the cluster. In a production deployment this is typically done via an Application Load Balancer (ALB); however, this requires a DNS domain registration and a matching SSL certificate.
For an easy way to expose the Kubeflow Dashboard, we can use kubectl port-forward from Cloud9 or from any machine that has a browser and kubectl access to the cluster.
To start the port-forward, execute the script ./kubeflow-expose.sh.
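For readers who want to see what such a port-forward looks like, here is a hedged sketch. It does not reproduce the contents of ./kubeflow-expose.sh. It assumes the service name istio-ingressgateway, the namespace istio-system, and service port 80, which are the names used by the upstream Kubeflow manifests and may differ in your deployment. The helper port_forward_cmd is hypothetical and only prints the command so it can be reviewed before running.

```shell
# Hypothetical helper: print a kubectl port-forward command for the
# Istio Ingress Gateway. Service name, namespace, and remote port 80
# are assumptions based on the upstream Kubeflow manifests.
port_forward_cmd() {
  local local_port="${1:-8080}"   # local port to listen on, default 8080
  echo "kubectl port-forward svc/istio-ingressgateway -n istio-system ${local_port}:80"
}
```

Calling `port_forward_cmd` prints the command for local port 8080, and `port_forward_cmd 9090` prints a variant for a different local port.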
If you are in Cloud9, select Preview->Preview Running Application. This will open a browser tab within Cloud9. You can expand that tab to a full browser window by clicking the icon in the upper-right corner.
If you are on a machine with its own browser, just navigate to localhost:8080 to open the Kubeflow Dashboard.
Fig. 3 - Kubeflow Dashboard
To remove your Kubeflow deployment, execute ./kubeflow-remove.sh from within the aws-do-kubeflow container.
- ./config.sh - configure aws-do-kubeflow project settings interactively
- ./build.sh - build aws-do-kubeflow container image
- ./login.sh - log in to the configured container registry
- ./push.sh - push aws-do-kubeflow container image to configured registry
- ./pull.sh - pull aws-do-kubeflow container image from a configured existing registry
- ./prune.sh - delete all unused docker containers, networks and images from the local host
- ./run.sh - run aws-do-kubeflow container
- ./status.sh - show current aws-do-kubeflow container status
- ./logs.sh - show logs of the running aws-do-kubeflow container
- ./start.sh - start the aws-do-kubeflow container if it is currently in "Exited" status
- ./exec.sh - execute a command inside the running aws-do-kubeflow container; the default command is bash
- ./stop.sh - stop and remove the aws-do-kubeflow container
- ./test.sh - run container unit tests
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
- Cloud9 instance running out of disk space - refer to the instructions for increasing the volume size here
- Errors regarding your permissions as a user in Cloud9 - refer to Create an IAM role for your Workspace.
- Namespaces are left in Terminating state when removing a Kubeflow deployment - execute script ./configure/ns-clear.sh
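To see whether the last issue applies, the namespaces stuck in Terminating state can be listed first. This is a minimal sketch and does not reproduce ./configure/ns-clear.sh; terminating_namespaces is a hypothetical helper that reads the output of kubectl get namespaces --no-headers on standard input, where the columns are name, status, and age.

```shell
# Hypothetical helper: read `kubectl get namespaces --no-headers` output on stdin
# and print the name of every namespace whose status is Terminating.
terminating_namespaces() {
  awk '$2 == "Terminating" { print $1 }'
}
```

A possible invocation is `kubectl get namespaces --no-headers | terminating_namespaces`. Empty output means no namespace is stuck.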
- Mark Vinciguerra - @mvincig
- Jason Dang - @jndang
- Tatsuo Azeyanagi - @tazeyana
- Alex Iankoulski - @iankouls
- Kanwaljit Khurmi - @kkhurmi
- Milena Boytchef - @boytchef
- Gautam Kumar - @gauta
- Machine Learning Using Kubeflow
- Docker
- Kubernetes
- Kubeflow
- Amazon Web Services
- Depend On Docker Project
- AWS Do EKS Project
- Kubeflow on AWS
- AWS Kubeflow Deployment
- AWS Kubeflow Blog
- AWS Kubeflow Multitenancy
- Kubeflow Pipelines
- Kubeflow Training Operator
- EKS Distributed Training Workshop
- Kubeflow MPI Operator
- Distributed Training with Tensorflow and Kubeflow
- Distributed Training using Pytorch with Kubeflow