Deploy the TfJob operator using a helm chart.

* The chart includes a basic test to make sure the operator is working.
kubeflow · Jul 12, 2017 · 3fea440
1 parent d235a03

Showing 5 changed files with 92 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -67,7 +67,7 @@ Leader election allows a K8s deployment resource to be used to upgrade the opera
 1. Deploy the operator
 
    ```
-   kubectl create -f ./images/tf_operator/tf_job_operator_deployment.yaml 
+   helm install tf-job-chart/ -n tf-job --wait --replace
    ```
 
 1. Make sure the operator is running
@@ -80,6 +80,14 @@ Leader election allows a K8s deployment resource to be used to upgrade the opera
 
     ```
 
+1. Run the helm tests
+
+    ```
+    helm test tf-job
+    RUNNING: tf-job-tfjob-test-pqxkwk
+    PASSED: tf-job-tfjob-test-pqxkwk
+    ```
+    
 ## Run the example
 
 A simplistic TF program is in the directory tf_sample. 
@@ -148,6 +156,14 @@ There is a lot of code from earlier versions (including the ETCD operator) that
 
 There is minimal testing.
 
+#### Unittests
+
+There are some unittests.
+
+#### E2E tests
+
+The helm package provides some basic E2E tests.
+
 ### TensorBoard Integration
 
 What's the best way to integrate TensorBoard?

diff --git a/tf-job-chart/Chart.yaml b/tf-job-chart/Chart.yaml
@@ -0,0 +1,8 @@
+name: tensorflow
+home: https://github.com/jlewi/mlkube.io
+version: 0.1.0
+appVersion: 0.1.0
+description: K8s Third Party Resource and Operator For TensorFlow jobs
+icon: https://raw.githubusercontent.com/hashicorp/consul/bce3809dfca37b883828c3715b84143dd71c0f85/website/source/assets/images/favicons/android-chrome-512x512.png
+sources:
+  - https://github.com/jlewi/mlkube.io
diff --git a/tf-job-chart/templates/tests/basic-test-config.yaml b/tf-job-chart/templates/tests/basic-test-config.yaml
@@ -0,0 +1,30 @@
+# This ConfigMap is used by the basic-test helm chart to define the python script to use to run the tests.
+#
+# TODO(jlewi): Is it a common convention to use a ConfigMap to define the tests? I think one advantage of this
+# approach is that you don't have to push the test code anywhere. If we pulled down the python file from somewhere
+# else (e.g. github or as a Docker image) we'd have to push the code somewhere first.
+# However, the test is already pulling tf_job.yaml from github, so there is already some mismatch between
+# the code and the test. The helm package also doesn't deploy the TfJob operator. So arguably we already have to
+# build and deploy various artifacts in order to run the test and we can probably reuse those mechanisms to deploy
+# the actual python test files.
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: tfjob-tests
+data:
+  run.py: |-
+    #! /usr/bin/python
+    from subprocess import call
+    def test_trivial():
+        assert "a" == "a"
+
+    def test_create():
+        # TODO(jlewi): This is just an initial hack. The job is deleted in case there is a previous run lying around.
+        # delete will return an error if the resource doesn't exist.
+        # A better solution is probably to give job a unique id so that different runs don't interfere.
+        return_code = call("kubectl delete -f https://raw.githubusercontent.com/jlewi/mlkube.io/master/examples/tf_job.yaml", shell=True)
+
+        return_code = call("kubectl create -f https://raw.githubusercontent.com/jlewi/mlkube.io/master/examples/tf_job.yaml", shell=True)
+        assert(return_code == 0)
+
+    # more tests here
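
The TODO in `test_create` suggests giving each job a unique id so that separate runs don't interfere. A minimal sketch of that idea follows, assuming Python 3 (as provided by the `python:latest` image) and a manifest that has already been parsed into a dict. The helpers `unique_job_name`, `with_unique_name`, and `create_job` are hypothetical names introduced here for illustration; they are not part of this commit.

```python
import json
import subprocess
import uuid


def unique_job_name(prefix="tfjob-test"):
    """Return a lowercase name with a random 8-character hex suffix."""
    return "{}-{}".format(prefix, uuid.uuid4().hex[:8])


def with_unique_name(manifest, name):
    """Return a copy of a parsed manifest with metadata.name replaced.

    The input dict is left unmodified.
    """
    updated = dict(manifest)
    metadata = dict(updated.get("metadata") or {})
    metadata["name"] = name
    updated["metadata"] = metadata
    return updated


def create_job(manifest):
    """Submit the manifest by piping JSON to `kubectl create -f -`.

    Assumes kubectl is on PATH and already configured for the cluster.
    Returns kubectl's exit code.
    """
    result = subprocess.run(
        ["kubectl", "create", "-f", "-"],
        input=json.dumps(manifest).encode("utf-8"),
    )
    return result.returncode
```

With a fresh name per run, the test would probably no longer need the preliminary `kubectl delete` to avoid name collisions, although jobs left over from earlier runs would still have to be cleaned up some other way.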
diff --git a/tf-job-chart/templates/tests/basic-test.yaml b/tf-job-chart/templates/tests/basic-test.yaml
@@ -0,0 +1,37 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  # We give the pod a random name so that we can run helm test multiple times and not
+  # have issues because the pod already exists.
+  name: "{{.Release.Name}}-tfjob-test-{{randAlphaNum 6 | lower }}"
+  annotations:
+    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
+    "helm.sh/hook": test-success
+spec:
+  containers:
+    - name: basic-test
+      # TODO(jlewi): Should we use an IMAGE that contains the relevant python test code? The example (i.e. the
+      # TensorFlow code used by examples/tf_job.yaml) is already pushed to a registry and therefore not the code
+      # pulled from the source tree.
+      image: python:latest
+      command: ["/bin/sh","-c"]
+      # TODO(jlewi): We download kubectl because the test uses kubectl to submit TfJobs to be used in the tests.
+      # We should probably use the Python API in the test (or maybe switch to go?) and then we don't need to
+      # download kubectl.
+      args: ["wget -NP /usr/bin https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
+            && chmod 755 /usr/bin/kubectl
+            && pip install nose-tap
+            && nosetests --with-tap /tests/run.py"]
+      volumeMounts:
+      - mountPath: /tests
+        name: tests
+        readOnly: true
+      - mountPath: /tools
+        name: tools
+  volumes:
+  - name: tests
+    configMap:
+      name: tfjob-tests
+  - name: tools
+    emptyDir: {}
+  restartPolicy: Never
diff --git a/..._operator/tf_job_operator_deployment.yaml b/...templates/tf_job_operator_deployment.yaml