Create a helm chart for deploying the TfJob operator #1

Merged
merged 1 commit into from
Jul 12, 2017
18 changes: 17 additions & 1 deletion README.md
@@ -67,7 +67,7 @@ Leader election allows a K8s deployment resource to be used to upgrade the operator
1. Deploy the operator

```
-kubectl create -f ./images/tf_operator/tf_job_operator_deployment.yaml
+helm install tf-job-chart/ -n tf-job --wait --replace
```

1. Make sure the operator is running

@@ -80,6 +80,14 @@

```
```
1. Run the helm tests

```
helm test tf-job
RUNNING: tf-job-tfjob-test-pqxkwk
PASSED: tf-job-tfjob-test-pqxkwk
```
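For automation, the manual `helm test tf-job` step above can be wrapped in a small script that fails loudly when the release's tests don't pass. This is only a sketch of such a wrapper (the helper names are assumptions, not part of this PR); it reuses the `subprocess.call` style the chart's own test script uses:

```python
from subprocess import call

def helm_test_cmd(release):
    # Build the command used above to exercise the chart's test hooks.
    return "helm test %s" % release

def run_helm_tests(release="tf-job"):
    # `helm test` exits non-zero if any test pod fails, so the return
    # code alone is enough to gate a CI pipeline on these tests.
    return_code = call(helm_test_cmd(release), shell=True)
    assert return_code == 0, "helm tests failed for release %s" % release
```
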

## Run the example

A simplistic TF program is in the directory tf_sample.
@@ -148,6 +156,14 @@ There is a lot of code from earlier versions (including the ETCD operator) that

There is minimal testing.

#### Unittests

There are some unit tests.

#### E2E tests

The helm package provides some basic E2E tests.

### TensorBoard Integration

What's the best way to integrate TensorBoard?
8 changes: 8 additions & 0 deletions tf-job-chart/Chart.yaml
@@ -0,0 +1,8 @@
name: tensorflow
home: https://github.com/jlewi/mlkube.io
version: 0.1.0
appVersion: 0.1.0
description: K8s Third Party Resource and Operator For TensorFlow jobs
icon: https://raw.githubusercontent.com/hashicorp/consul/bce3809dfca37b883828c3715b84143dd71c0f85/website/source/assets/images/favicons/android-chrome-512x512.png
sources:
- https://github.com/jlewi/mlkube.io
30 changes: 30 additions & 0 deletions tf-job-chart/templates/tests/basic-test-config.yaml
@@ -0,0 +1,30 @@
# This ConfigMap is used by the basic-test helm chart to define the python script to use to run the tests.
#
# TODO(jlewi): Is it a common convention to use a ConfigMap to define the tests? I think one advantage of this
# approach is that you don't have to push the test code anywhere. If we pulled down the python file from somewhere
# else (e.g. github or as a Docker image) we'd have to push the code somewhere first.
# However, the test is already pulling tf_job.yaml from github, so there is already some mismatch between
# the code and the test. The helm package also doesn't deploy the TfJob operator. So arguably we already have to
# build and deploy various artifacts in order to run the test and we can probably reuse those mechanisms to deploy
# the actual python test files.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tfjob-tests
data:
  run.py: |-
    #! /usr/bin/python
    from subprocess import call

    def test_trivial():
      assert "a" == "a"

    def test_create():
      # TODO(jlewi): This is just an initial hack. The job is deleted in case there is a previous run lying around.
      # delete will return an error if the resource doesn't exist.
      # A better solution is probably to give the job a unique id so that different runs don't interfere.
      return_code = call("kubectl delete -f https://raw.githubusercontent.com/jlewi/mlkube.io/master/examples/tf_job.yaml", shell=True)

      return_code = call("kubectl create -f https://raw.githubusercontent.com/jlewi/mlkube.io/master/examples/tf_job.yaml", shell=True)
      assert(return_code == 0)

      # more tests here
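The TODO in the test above suggests giving each TfJob a unique id so that different runs don't interfere. A minimal sketch of that idea (the helper name `make_job_name` and its prefix are assumptions, not part of this chart); it mirrors the `randAlphaNum 6 | lower` trick the chart uses for the test pod's own name:

```python
import random
import string

def make_job_name(prefix="tfjob-test"):
    # Append a 6-character random lowercase suffix so repeated or
    # concurrent test runs never collide on the TfJob's name.
    suffix = "".join(random.choice(string.ascii_lowercase) for _ in range(6))
    return "%s-%s" % (prefix, suffix)

# The test would then render tf_job.yaml with this generated name before
# calling `kubectl create`, instead of reusing the fixed name in the example.
```
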
37 changes: 37 additions & 0 deletions tf-job-chart/templates/tests/basic-test.yaml
@@ -0,0 +1,37 @@
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  name: "{{.Release.Name}}-tfjob-test-{{randAlphaNum 6 | lower }}"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
    - name: basic-test
      # TODO(jlewi): Should we use an IMAGE that contains the relevant python test code? The example (i.e. the
      # TensorFlow code used by examples/tf_job.yaml) is already pushed to a registry and therefore not the code
      # pulled from the source tree.
      image: python:latest
      command: ["/bin/sh", "-c"]
      # TODO(jlewi): We download kubectl because the test uses kubectl to submit TfJobs to be used in the tests.
      # We should probably use the Python API in the test (or maybe switch to go?) and then we don't need to
      # download kubectl.
      args: ["wget -NP /usr/bin https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
        && chmod 755 /usr/bin/kubectl
        && pip install nose-tap
        && nosetests --with-tap /tests/run.py"]
      volumeMounts:
        - mountPath: /tests
          name: tests
          readOnly: true
        - mountPath: /tools
          name: tools
  volumes:
    - name: tests
      configMap:
        name: tfjob-tests
    - name: tools
      emptyDir: {}
  restartPolicy: Never