Setup Kubeflow

Requirements

  • Kubernetes cluster
  • Access to a working kubectl (Kubernetes CLI)
  • Ksonnet CLI: ks

Setup

Refer to the getting started guide for instructions on how to set up Kubeflow on your Kubernetes cluster. Specifically, look at the quick start section. For this example, we will be using the ks nocloud environment (on-premises K8s). If you plan to use a cloud ks environment, make sure you follow the corresponding instructions in the Kubeflow getting started guide.
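
For reference, the nocloud quick start boils down to roughly the following. This is a condensed sketch only; registry paths and package versions change between releases, so treat the getting started guide as the authoritative source.

# Sketch of the nocloud quick start (verify against the getting started guide)
ks init my-kubeflow
cd my-kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks generate core kubeflow-core --name=kubeflow-core
kubectl create namespace kubeflow
ks env add nocloud
ks env set nocloud --namespace kubeflow
ks apply nocloud -c kubeflow-core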

After completing the steps in the Kubeflow getting started guide, you will have the following:

  • A ksonnet app directory called my-kubeflow
  • A new namespace in your K8s cluster called kubeflow
  • The following pods in your kubernetes cluster in the kubeflow namespace:
$ kubectl -n kubeflow get pods
NAME                                      READY     STATUS    RESTARTS   AGE
ambassador-7987df44b9-4pht8               2/2       Running   0          1m
ambassador-7987df44b9-dh5h6               2/2       Running   0          1m
ambassador-7987df44b9-qrgsm               2/2       Running   0          1m
tf-hub-0                                  1/1       Running   0          1m
tf-job-operator-v1alpha2-b76bfbdb-lgbjw   1/1       Running   0          1m

Overview

During the course of this tutorial you will apply a set of ksonnet components that will:

  1. Create a PersistentVolume to store our data and training results.
  2. Download the dataset, dataset annotations, a pre-trained model checkpoint, and the training pipeline configuration file.
  3. Decompress the downloaded dataset, pre-trained model, and dataset annotations.
  4. Create TensorFlow pet records, since we will be training a pet detector model.
  5. Execute a distributed TensorFlow object detection training job using the previous configurations.
  6. Export the trained pet detector model and serve it using TF-Serving.

We have prepared a ksonnet app, ks-app, with a set of components that will be used in this example. The components can be found in the ks-app/components directory in case you want to perform some customizations.

Let's make use of the app to continue with the tutorial.

cd ks-app
ENV=default
ks env add ${ENV} --context=`kubectl config current-context`
ks env set ${ENV} --namespace kubeflow
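
To confirm the environment was registered with the expected namespace, you can list the app's environments:

ks env list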

Preparing the training data

Note: TensorFlow works with many file systems, such as HDFS and S3. You can push the dataset and other configuration files to one of them and skip the Download and Decompress steps in this tutorial.

First let's create a PVC to store the data.

# First, let's configure and apply the pets-pvc component to create a PVC where the training data will be stored
ks param set pets-pvc accessMode "ReadWriteMany"
ks param set pets-pvc storage "20Gi"
ks apply ${ENV} -c pets-pvc

The command above will create a PVC with the ReadWriteMany access mode. If your Kubernetes cluster does not support this feature, you can change the accessMode value to ReadWriteOnce and, before you execute the tf-job to train the model, add a nodeSelector: configuration so that the pods are scheduled on the same node. You can find more about assigning pods to specific nodes here.
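
A minimal sketch of that approach, assuming a hypothetical pets-data=local label (the label name and node name are illustrative only):

# Label the node that will host the ReadWriteOnce volume
kubectl label nodes <your-node-name> pets-data=local

# Then add a matching nodeSelector to the pod templates used by the tf-job:
#   nodeSelector:
#     pets-data: local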

This step assumes that your K8s cluster has Dynamic Volume Provisioning enabled and a default StorageClass created. You can verify this as shown below (a StorageClass marked (default) must exist):

$ kubectl get storageclass
NAME                 PROVISIONER               AGE
standard (default)   kubernetes.io/gce-pd      1d
gold                 kubernetes.io/gce-pd      1d

Otherwise, the PVC will remain in Pending status.

$ kubectl get pvc pets-pvc -n kubeflow
NAME        STATUS    VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pets-pvc    Pending                                                      28s

If your cluster doesn't have a default StorageClass defined, you can create a PersistentVolume manually to make the PVC bind, as sketched below.
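
For example, here is a minimal hostPath PersistentVolume sketch. It is suitable only for single-node test clusters, and the name, capacity, and path are illustrative:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pets-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /data/pets
EOF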

Now we will get the data we need to prepare our training pipeline:

# Configure and apply the get-data-job component. This component will download
# the dataset, the annotations, the model we will use as the fine-tune
# checkpoint, and the pipeline configuration file.

PVC="pets-pvc"
MOUNT_PATH="/pets_data"
DATASET_URL="http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz"
ANNOTATIONS_URL="http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz"
MODEL_URL="http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_2018_01_28.tar.gz"
PIPELINE_CONFIG_URL="https://raw.githubusercontent.com/kubeflow/examples/master/object_detection/conf/faster_rcnn_resnet101_pets.config"

ks param set get-data-job mountPath ${MOUNT_PATH}
ks param set get-data-job pvc ${PVC}
ks param set get-data-job urlData ${DATASET_URL}
ks param set get-data-job urlAnnotations ${ANNOTATIONS_URL}
ks param set get-data-job urlModel ${MODEL_URL}
ks param set get-data-job urlPipelineConfig ${PIPELINE_CONFIG_URL}

ks apply ${ENV} -c get-data-job

The downloaded files will be stored under the MOUNT_PATH directory.

Here is a quick description for the get-data-job component parameters:

  • mountPath string, volume mount path.
  • pvc string, name of the PVC where the data will be stored.
  • urlData string, remote URL of the dataset that will be used for training.
  • urlAnnotations string, remote URL of the annotations that will be used for training.
  • urlModel string, remote URL of the model that will be used for fine tuning.
  • urlPipelineConfig string, remote URL of the pipeline configuration file to use.

NOTE: The annotations are the result of labeling your dataset using a manual labeling tool. For this example, we will use a set of annotations generated specifically for the dataset we are using for training.

Before moving on to the next set of commands, make sure all of the data-download jobs have completed. For example:
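
The exact job names come from the get-data-job component, but listing the jobs in the kubeflow namespace is a quick way to confirm they all report the desired completions:

# Each job created by the get-data-job component should report success
kubectl -n kubeflow get jobs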

Now we will configure and apply the decompress-data-job component:

ANNOTATIONS_PATH="${MOUNT_PATH}/annotations.tar.gz"
DATASET_PATH="${MOUNT_PATH}/images.tar.gz"
PRE_TRAINED_MODEL_PATH="${MOUNT_PATH}/faster_rcnn_resnet101_coco_2018_01_28.tar.gz"

ks param set decompress-data-job mountPath ${MOUNT_PATH}
ks param set decompress-data-job pvc ${PVC}
ks param set decompress-data-job pathToAnnotations ${ANNOTATIONS_PATH}
ks param set decompress-data-job pathToDataset ${DATASET_PATH}
ks param set decompress-data-job pathToModel ${PRE_TRAINED_MODEL_PATH}

ks apply ${ENV} -c decompress-data-job

Here is a quick description for the decompress-data-job component parameters:

  • mountPath string, volume mount path.
  • pvc string, name of the PVC where the data is located.
  • pathToAnnotations string, file system path to the annotations .tar.gz file.
  • pathToDataset string, file system path to the dataset .tar.gz file.
  • pathToModel string, file system path to the pre-trained model .tar.gz file.
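
If you want to verify that decompression produced the expected directories, one option is a short-lived pod that mounts pets-pvc and lists its contents. This is only a sketch; the pod name and the busybox image are illustrative:

kubectl -n kubeflow apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pvc-check
spec:
  restartPolicy: Never
  containers:
  - name: pvc-check
    image: busybox
    command: ["ls", "-l", "/pets_data"]
    volumeMounts:
    - name: data
      mountPath: /pets_data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pets-pvc
EOF

kubectl -n kubeflow logs pvc-check   # run once the pod has completed
kubectl -n kubeflow delete pod pvc-check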

Finally, since the TensorFlow Object Detection API uses the TFRecord format, we need to create the TF pet records. For that, we will configure and apply the create-pet-record-job component:

OBJ_DETECTION_IMAGE="lcastell/pets_object_detection"
DATA_DIR_PATH="${MOUNT_PATH}"
OUTPUT_DIR_PATH="${MOUNT_PATH}"

ks param set create-pet-record-job image ${OBJ_DETECTION_IMAGE}
ks param set create-pet-record-job dataDirPath ${DATA_DIR_PATH}
ks param set create-pet-record-job outputDirPath ${OUTPUT_DIR_PATH}
ks param set create-pet-record-job mountPath ${MOUNT_PATH}
ks param set create-pet-record-job pvc ${PVC}

ks apply ${ENV} -c create-pet-record-job

Here is a quick description for the create-pet-record-job component parameters:

  • mountPath string, volume mount path.
  • pvc string, name of the PVC where the data is located.
  • image string, name of the docker image to use.
  • dataDirPath string, the directory containing the images.
  • outputDirPath string, the output directory for the pet records.
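
As with the earlier jobs, wait for create-pet-record-job to finish before training. One way to check on it (assuming the job's pods carry the component name, which is an assumption about how this app names its jobs):

kubectl -n kubeflow get pods | grep create-pet-record
# Inspect the logs of the pod listed above to see the record files being written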

To see the default values of the components used in these steps, look at params.libsonnet.

Next

Submit the TF Job