Deploy the TfJob operator using a helm chart.
  * The chart includes a basic test to make sure the operator is working.
Jeremy Lewi committed Jul 12, 2017
1 parent d235a03 commit 3fea440
Showing 5 changed files with 92 additions and 1 deletion.
18 changes: 17 additions & 1 deletion README.md
@@ -67,7 +67,7 @@ Leader election allows a K8s deployment resource to be used to upgrade the opera
1. Deploy the operator
```
- kubectl create -f ./images/tf_operator/tf_job_operator_deployment.yaml
+ helm install tf-job-chart/ -n tf-job --wait --replace
```
1. Make sure the operator is running
@@ -80,6 +80,14 @@ Leader election allows a K8s deployment resource to be used to upgrade the opera
```
1. Run the helm tests
```
helm test tf-job
RUNNING: tf-job-tfjob-test-pqxkwk
PASSED: tf-job-tfjob-test-pqxkwk
```
## Run the example
A simplistic TF program is in the directory tf_sample.
@@ -148,6 +156,14 @@ There is a lot of code from earlier versions (including the ETCD operator) that
There is minimal testing.
#### Unittests
There are some unittests.
#### E2E tests
The helm package provides some basic E2E tests.
### TensorBoard Integration
What's the best way to integrate TensorBoard?
8 changes: 8 additions & 0 deletions tf-job-chart/Chart.yaml
@@ -0,0 +1,8 @@
name: tensorflow
home: https://github.com/jlewi/mlkube.io
version: 0.1.0
appVersion: 0.1.0
description: K8s Third Party Resource and Operator For TensorFlow jobs
icon: https://raw.githubusercontent.com/hashicorp/consul/bce3809dfca37b883828c3715b84143dd71c0f85/website/source/assets/images/favicons/android-chrome-512x512.png
sources:
- https://github.com/jlewi/mlkube.io
30 changes: 30 additions & 0 deletions tf-job-chart/templates/tests/basic-test-config.yaml
@@ -0,0 +1,30 @@
# This ConfigMap is used by the basic-test helm chart to define the python script to use to run the tests.
#
# TODO(jlewi): Is it a common convention to use a ConfigMap to define the tests? I think one advantage of this
# approach is that you don't have to push the test code anywhere. If we pulled down the python file from somewhere
# else (e.g. github or as a Docker image) we'd have to push the code somewhere first.
# However, the test is already pulling tf_job.yaml from github, so there is already some mismatch between
# the code and the test. The helm package also doesn't deploy the TfJob operator. So arguably we already have to
# build and deploy various artifacts in order to run the test and we can probably reuse those mechanisms to deploy
# the actual python test files.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tfjob-tests
data:
  run.py: |-
    #! /usr/bin/python
    from subprocess import call
    def test_trivial():
      assert "a" == "a"
    def test_create():
      # TODO(jlewi): This is just an initial hack. The job is deleted in case there is a previous run lying around.
      # delete will return an error if the resource doesn't exist.
      # A better solution is probably to give job a unique id so that different runs don't interfere.
      return_code = call("kubectl delete -f https://raw.githubusercontent.com/jlewi/mlkube.io/master/examples/tf_job.yaml", shell=True)
      return_code = call("kubectl create -f https://raw.githubusercontent.com/jlewi/mlkube.io/master/examples/tf_job.yaml", shell=True)
      assert(return_code == 0)
    # more tests here
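
The TODO in `test_create` suggests giving each job a unique id so that repeated runs do not interfere. One possible way to do that, shown here only as a rough sketch, is to append a random suffix to `metadata.name` before the manifest is submitted. The helper name `with_unique_name` and the example manifest (including its `apiVersion` value) are hypothetical and are not part of this commit.

```python
import copy
import uuid


def with_unique_name(manifest, suffix_len=6):
    """Return a copy of a job manifest whose metadata.name has a random suffix."""
    job = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    suffix = uuid.uuid4().hex[:suffix_len]  # lowercase hex, valid in K8s names
    job["metadata"]["name"] = "{}-{}".format(job["metadata"]["name"], suffix)
    return job


# Hypothetical manifest standing in for examples/tf_job.yaml (values assumed).
example = {
    "apiVersion": "mlkube.io/v1beta1",
    "kind": "TfJob",
    "metadata": {"name": "example-job"},
}
first = with_unique_name(example)
second = with_unique_name(example)
```

With names like these, the test would no longer need the initial `kubectl delete`, although it would then have to clean up the job it created.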
37 changes: 37 additions & 0 deletions tf-job-chart/templates/tests/basic-test.yaml
@@ -0,0 +1,37 @@
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  name: "{{.Release.Name}}-tfjob-test-{{randAlphaNum 6 | lower }}"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
  - name: basic-test
    # TODO(jlewi): Should we use an IMAGE that contains the relevant python test code? The example (i.e. the
    # TensorFlow code used by examples/tf_job.yaml) is already pushed to a registry and therefore not the code
    # pulled from the source tree.
    image: python:latest
    command: ["/bin/sh","-c"]
    # TODO(jlewi): We download kubectl because the test uses kubectl to submit TfJobs to be used in the tests.
    # We should probably use the Python API in the test (or maybe switch to go?) and then we don't need to
    # download kubectl.
    args: ["wget -NP /usr/bin https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
            && chmod 755 /usr/bin/kubectl
            && pip install nose-tap
            && nosetests --with-tap /tests/run.py"]
    volumeMounts:
    - mountPath: /tests
      name: tests
      readOnly: true
    - mountPath: /tools
      name: tools
  volumes:
  - name: tests
    configMap:
      name: tfjob-tests
  - name: tools
    emptyDir: {}
  restartPolicy: Never
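
The second TODO in this pod spec raises the idea of using the Python API instead of downloading kubectl. A minimal sketch of what that might look like with the `kubernetes` Python client follows. The helper names `create_tf_job` and `in_cluster_api` are hypothetical, and the `group`, `version`, and `plural` defaults are assumptions about how the TfJob resource is registered; none of this is taken from the commit.

```python
def create_tf_job(api, body, namespace="default",
                  group="mlkube.io", version="v1beta1", plural="tfjobs"):
    """Create a TfJob through a CustomObjectsApi-like client instead of kubectl.

    The group/version/plural defaults are assumptions, not values from the chart.
    """
    return api.create_namespaced_custom_object(
        group=group, version=version, namespace=namespace,
        plural=plural, body=body)


def in_cluster_api():
    """Build a real client; imported lazily so create_tf_job stays easy to test."""
    from kubernetes import client, config  # requires: pip install kubernetes
    config.load_incluster_config()  # authenticates with the pod's service account
    return client.CustomObjectsApi()
```

Inside the test pod this could be called as `create_tf_job(in_cluster_api(), job_manifest)`, where `job_manifest` is the parsed contents of a TfJob YAML file. Passing the client in as an argument keeps the helper usable with a stub in unit tests.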
