Add Support for Argo Workflows #1605

Merged
4 changes: 3 additions & 1 deletion README.md
@@ -127,7 +127,9 @@ Katib has these CRD examples in upstream:

- [Kubeflow `XGBoostJob`](https://github.com/kubeflow/xgboost-operator)

-- [Tekton `Pipeline`](https://github.com/tektoncd/pipeline)
+- [Tekton `Pipelines`](./examples/v1beta1/tekton)
+
+- [Argo `Workflows`](./examples/v1beta1/argo)

Thus, Katib supports multiple frameworks with the help of different job kinds.

2 changes: 1 addition & 1 deletion docs/new-algorithm-service.md
@@ -150,7 +150,7 @@ You can setup the GRPC server using `grpc_testing`, then define your own test ca
#### E2E Test (Optional)

E2e tests help Katib verify that the algorithm works well.
-Follow bellow steps to add your algorithm (Suggestion) to the Katib CI
+Follow below steps to add your algorithm (Suggestion) to the Katib CI
(replace `<name>` with your Suggestion name):

1. Submit a PR to add a new ECR private registry to the AWS
2 changes: 1 addition & 1 deletion docs/presentations.md
@@ -1,6 +1,6 @@
# Katib Presentations and Demos

-Bellow are the list of Katib presentations and demos. If you want to add your
+Below are the list of Katib presentations and demos. If you want to add your
presentation or demo in this list please send a pull request. Please keep the
list in reverse chronological order.

113 changes: 113 additions & 0 deletions examples/v1beta1/argo/README.md
@@ -0,0 +1,113 @@
# Katib Examples with Argo Workflows Integration

Here you can find examples of using Katib with [Argo Workflows](https://github.com/argoproj/argo-workflows).

**Note:** You have to install `Argo >= v3.1.3` to use it in Katib Experiments.

## Installation

### Argo Workflows

To deploy Argo Workflows `v3.1.3`, run the following commands:

```bash
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.1.3/install.yaml
```

Check that the Argo Workflows components are running:

```bash
$ kubectl get pods -n argo

NAME                                  READY   STATUS    RESTARTS   AGE
argo-server-5bbd69cc6b-6nvb6          1/1     Running   0          20s
workflow-controller-5f48fb7c8-vw9bp   1/1     Running   0          20s
```

After that, run the command below to enable
[Katib Metrics Collector sidecar injection](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector):

```bash
kubectl patch namespace argo -p '{"metadata":{"labels":{"katib-metricscollector-injection":"enabled"}}}'
```

**Note:** Argo Workflows uses `docker` as its
[default container runtime executor](https://argoproj.github.io/argo-workflows/workflow-executors/#workflow-executors).
Since Katib uses a Metrics Collector sidecar container and the Argo Workflows controller
must not kill sidecar containers, you have to change this
executor to [`emissary`](https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary).

Run the following command to change the `containerRuntimeExecutor` to `emissary` in the
Argo `workflow-controller-configmap`:

```bash
kubectl patch ConfigMap -n argo workflow-controller-configmap --type='merge' -p='{"data":{"containerRuntimeExecutor":"emissary"}}'
```
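
The `--type='merge'` flag requests a JSON Merge Patch, so only the `containerRuntimeExecutor` key is written and the other ConfigMap entries are kept. The following is a minimal Python sketch of that merge behaviour (RFC 7386), not the Kubernetes implementation; the ConfigMap contents, including `someOtherKey`, are made up for illustration.

```python
import json


def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return a new value."""
    if not isinstance(patch, dict):
        # A non-object patch simply replaces the target.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in a merge patch means "delete this key".
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical ConfigMap, for illustration only.
configmap = {
    "metadata": {"name": "workflow-controller-configmap"},
    "data": {"someOtherKey": "kept"},
}
patch = json.loads('{"data":{"containerRuntimeExecutor":"emissary"}}')

patched = merge_patch(configmap, patch)
print(json.dumps(patched["data"], sort_keys=True))
```

Keys that the patch does not mention are left alone, which is why the command can be re-run safely.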

Verify that `containerRuntimeExecutor` has been modified:

```bash
$ kubectl get ConfigMap -n argo workflow-controller-configmap -o yaml | grep containerRuntimeExecutor

containerRuntimeExecutor: emissary
```

### Katib Controller

To run Argo Workflows within Katib Trials, you have to update the Katib
[ClusterRole's rules](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/rbac.yaml#L5)
with the appropriate permission:

```yaml
- apiGroups:
    - argoproj.io
  resources:
    - workflows
  verbs:
    - "*"
```

Run the following command to update Katib ClusterRole:

```bash
kubectl patch ClusterRole katib-controller -n kubeflow --type=json \
-p='[{"op": "add", "path": "/rules/-", "value": {"apiGroups":["argoproj.io"],"resources":["workflows"],"verbs":["*"]}}]'
```
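
In the command above, the `add` operation with a path ending in `/-` appends a new element to the `rules` array rather than replacing it (JSON Patch, RFC 6902). Here is a minimal Python sketch that handles only this append case; the existing rule in `cluster_role` is a hypothetical placeholder, not the real Katib ClusterRole.

```python
import copy


def json_patch_append(document, path, value):
    """Apply a JSON Patch `add` whose path ends in '/-' (append to a list)."""
    parts = path.strip("/").split("/")
    if parts[-1] != "-":
        raise ValueError("this sketch only supports paths ending in '/-'")
    result = copy.deepcopy(document)
    node = result
    for part in parts[:-1]:
        # Numeric segments index into lists, other segments into dicts.
        node = node[int(part)] if isinstance(node, list) else node[part]
    node.append(value)
    return result


# Hypothetical ClusterRole with one pre-existing rule.
cluster_role = {
    "kind": "ClusterRole",
    "metadata": {"name": "katib-controller"},
    "rules": [
        {"apiGroups": ["kubeflow.org"], "resources": ["experiments"], "verbs": ["*"]},
    ],
}
argo_rule = {"apiGroups": ["argoproj.io"], "resources": ["workflows"], "verbs": ["*"]}

patched_role = json_patch_append(cluster_role, "/rules/-", argo_rule)
print(len(patched_role["rules"]))
```

The same append semantics apply to the `/spec/template/spec/containers/0/args/-` path used for the Controller args.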

In addition to that, you have to modify Katib
[Controller args](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/controller.yaml#L27)
with the new flag `--trial-resources`.

Run the following command to update Katib Controller args:

```bash
kubectl patch Deployment katib-controller -n kubeflow --type=json \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--trial-resources=Workflow.v1alpha1.argoproj.io"}]'
```
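
The flag value appears to follow a `Kind.version.group` layout, judging by the flag values and the controller log fields in this change. The Python sketch below splits a value under that assumption; `parse_trial_resource` is a hypothetical helper, and Katib's own parsing is not shown in this document.

```python
from typing import NamedTuple


class TrialResource(NamedTuple):
    kind: str
    version: str
    group: str


def parse_trial_resource(value: str) -> TrialResource:
    """Split 'Kind.version.group'; the group itself may contain dots."""
    parts = value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {value!r}")
    return TrialResource(*parts)


resource = parse_trial_resource("Workflow.v1alpha1.argoproj.io")
print(resource.kind, resource.version, resource.group)
```

Splitting at most twice keeps dotted API groups such as `argoproj.io` intact.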

Check that Katib Controller's pod was restarted:

```bash
$ kubectl get pods -n kubeflow

NAME                                READY   STATUS      RESTARTS   AGE
katib-cert-generator-hnv6q          0/1     Completed   0          6m12s
katib-controller-784994d449-9bgj9   1/1     Running     0          28s
katib-db-manager-78697c7bd4-ck7l8   1/1     Running     0          6m13s
katib-mysql-854cdb87c4-krcm9        1/1     Running     0          6m13s
katib-ui-57b9d7f6dd-cv6gn           1/1     Running     0          6m13s
```

Check logs from Katib Controller to verify Argo Workflow integration:

```bash
$ kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow | grep '"CRD Kind":"Workflow"'

{"level":"info","ts":1628032648.6285546,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"argoproj.io","CRD Version":"v1alpha1","CRD Kind":"Workflow"}
```
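
Because the controller log line is JSON, the same check can be done structurally instead of with `grep`. This is a small Python sketch that uses the log line shown above as input; `watch_added` is a hypothetical helper name.

```python
import json

# Log line copied from the controller output above.
log_line = (
    '{"level":"info","ts":1628032648.6285546,"logger":"trial-controller",'
    '"msg":"Job watch added successfully","CRD Group":"argoproj.io",'
    '"CRD Version":"v1alpha1","CRD Kind":"Workflow"}'
)


def watch_added(line: str, kind: str) -> bool:
    """Return True if the line reports a successful watch for the given kind."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return False  # Skip lines that are not JSON.
    return (
        isinstance(entry, dict)
        and entry.get("msg") == "Job watch added successfully"
        and entry.get("CRD Kind") == kind
    )


print(watch_added(log_line, "Workflow"))
```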

If you ran the above steps successfully, you should be able to run Argo Workflow examples.

Learn more about using a custom Kubernetes resource as a Trial template in the
[official Kubeflow guides](https://www.kubeflow.org/docs/components/katib/trial-template/#use-custom-kubernetes-resource-as-a-trial-template).
83 changes: 83 additions & 0 deletions examples/v1beta1/argo/argo-workflow.yaml
@@ -0,0 +1,83 @@
# This example shows how you can use Argo Workflows in Katib, transfer parameters from one Step to another, and run an HP job.
# It uses a simple random algorithm and tunes only the learning rate.
# The Workflow contains 2 Steps: the first is data-preprocessing, the second is model-training.
# The first Step shows how you can prepare your training data (here: simply divide the number of training examples) before running the HP job.
# The number of training examples is transferred to the second Step.
# The second Step is the actual training, into which the metrics collector sidecar is injected.
# Note that for this example the Argo Container Runtime Executor must be "emissary".
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: argo
  name: katib-argo-workflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 5
  maxFailedTrialCount: 1
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
  trialTemplate:
    retain: true
    primaryPodLabels:
      katib.kubeflow.org/model-training: "true"
    primaryContainerName: main
    successCondition: status.[@this].#(phase=="Succeeded")#
    failureCondition: status.[@this].#(phase=="Failed")#
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
    trialSpec:
      apiVersion: argoproj.io/v1alpha1
      kind: Workflow
      spec:
        serviceAccountName: argo
        entrypoint: hp-workflow
        templates:
          - name: hp-workflow
            steps:
              - - name: data-preprocessing
                  template: gen-num-examples
              - - name: model-training
                  template: model-training
                  arguments:
                    parameters:
                      - name: num-examples
                        value: "{{steps.data-preprocessing.outputs.result}}"

          - name: gen-num-examples
            script:
              image: python:alpine3.6
              command:
                - python
              source: |
                import random
                print(60000//random.randint(10, 100))

          - name: model-training
            metadata:
              labels:
                katib.kubeflow.org/model-training: "true"
            inputs:
              parameters:
                - name: num-examples
            container:
              name: model-training
              image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
              command:
                - "python3"
                - "/opt/mxnet-mnist/mnist.py"
                - "--lr=${trialParameters.learningRate}"
                - "--num-examples={{inputs.parameters.num-examples}}"
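
The training command above mixes two placeholder styles: `${trialParameters.learningRate}` is a Katib placeholder, while `{{inputs.parameters.num-examples}}` belongs to Argo. The Python sketch below illustrates the first substitution only. It is an assumption-laden illustration, not Katib's implementation; `render_command` is a hypothetical helper and the assigned value `0.02` is an arbitrary point inside the `lr` feasible space.

```python
import re

# Matches Katib-style placeholders such as ${trialParameters.learningRate}.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_-]+)\}")


def render_command(command, trial_parameters, assignments):
    """Replace ${trialParameters.<name>} using each parameter's `reference`."""
    references = {p["name"]: p["reference"] for p in trial_parameters}

    def substitute(match):
        return str(assignments[references[match.group(1)]])

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


trial_parameters = [{"name": "learningRate", "reference": "lr"}]
assignments = {"lr": "0.02"}  # Hypothetical value chosen by the algorithm.
command = [
    "python3",
    "/opt/mxnet-mnist/mnist.py",
    "--lr=${trialParameters.learningRate}",
    "--num-examples={{inputs.parameters.num-examples}}",
]

rendered = render_command(command, trial_parameters, assignments)
print(rendered)
```

The Argo placeholder is left untouched at this stage, since Argo resolves it later from the output of the `data-preprocessing` Step.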
2 changes: 1 addition & 1 deletion examples/v1beta1/nas/darts-cnn-cifar10/architect.py
@@ -33,7 +33,7 @@ def virtual_step(self, train_x, train_y, xi, w_optim):
         gradients = torch.autograd.grad(loss, self.model.getWeights())

         # Do virtual step (Update gradient)
-        # Bellow opeartions do not need gradient tracking
+        # Below opeartions do not need gradient tracking
         with torch.no_grad():
             # dict key is not the value, but the pointer. So original network weight have to
             # be iterated also.
102 changes: 90 additions & 12 deletions examples/v1beta1/tekton/README.md
@@ -1,40 +1,118 @@
-# Katib examples with Tekton integration
+# Katib Examples with Tekton Pipelines Integration

Here you can find examples of using Katib with [Tekton](https://github.com/tektoncd/pipeline).

-Check [here](https://github.com/tektoncd/pipeline/blob/master/docs/install.md#installing-tekton-pipelines-on-kubernetes)
-how to install Tekton on your cluster.
## Installation

-**Note** that you must modify Tekton [`nop`](https://github.com/tektoncd/pipeline/tree/master/cmd/nop)
-image to run Tekton pipelines. `Nop` image is used to stop sidecar containers after main container
-is completed. Metrics collector should not be stopped after training container is finished.
-To avoid this problem, set `nop` image to metrics collector sidecar image.
### Tekton Pipelines

To deploy Tekton Pipelines `v0.26.0`, run the following command:

```bash
kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/previous/v0.26.0/release.yaml
```

Check that Tekton Pipelines components are running:

```bash
$ kubectl get pods -n tekton-pipelines

NAME                                           READY   STATUS    RESTARTS   AGE
tekton-pipelines-controller-799cdc78fc-sm4vl   1/1     Running   0          50s
tekton-pipelines-webhook-79d8f4f9bc-qmk97      1/1     Running   0          50s
```

**Note:** You must modify the Tekton [`nop`](https://github.com/tektoncd/pipeline/tree/master/cmd/nop)
image to run Tekton Pipelines. The `nop` image is used to stop sidecar containers after the main container
completes. Since Katib uses a Metrics Collector sidecar container
and the Tekton Pipelines Controller must not kill sidecar containers, you have to
set this `nop` image to the Metrics Collector image.

For example, if you are using
-[StdOut](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector) metrics collector,
+[StdOut](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector) Metrics Collector,
`nop` image must be equal to `docker.io/kubeflowkatib/file-metrics-collector`.

-After deploying Tekton on your cluster, run bellow command to modify `nop` image:
+Run the following command to modify the `nop` image:

```bash
kubectl patch deploy tekton-pipelines-controller -n tekton-pipelines --type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/9", "value": "docker.io/kubeflowkatib/file-metrics-collector"}]'
```
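
The patch above addresses the `nop` image by its position (`args/9`), which only works if the controller's argument order matches the release this README targets. As a hedged alternative, the Python sketch below derives the index from the flag name. It assumes the image is passed as the value following a `-nop-image` flag; the `args` list is a shortened, hypothetical example, so check the deployed manifest before relying on it.

```python
import json


def build_nop_image_patch(args, new_image, flag="-nop-image"):
    """Build a JSON Patch that replaces the value following `flag` in args."""
    index = args.index(flag) + 1  # Raises ValueError if the flag is missing.
    if index >= len(args):
        raise ValueError(f"{flag} has no value in the args list")
    path = f"/spec/template/spec/containers/0/args/{index}"
    return [{"op": "replace", "path": path, "value": new_image}]


# Hypothetical, shortened args list for illustration only.
args = ["-entrypoint-image", "example.com/entrypoint", "-nop-image", "example.com/nop"]

patch = build_nop_image_patch(args, "docker.io/kubeflowkatib/file-metrics-collector")
print(json.dumps(patch))
```

The printed JSON can be passed to `kubectl patch ... --type='json' -p=...` in place of the hard-coded index.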

-Check that Tekton controller's pod was restarted:
+Check that Tekton Pipelines Controller's pod was restarted:

```bash
$ kubectl get pods -n tekton-pipelines

NAME                                           READY   STATUS    RESTARTS   AGE
tekton-pipelines-controller-7fcb6c6cd4-p8zf2   1/1     Running   0          2m2s
-tekton-pipelines-webhook-7f9888f9b-7d6mr       1/1     Running   0          12h
+tekton-pipelines-webhook-7f9888f9b-7d6mr       1/1     Running   0          3m
```

-Check that `nop` image was modified:
+Verify that `nop` image was modified:

```bash
$ kubectl get $(kubectl get pods -o name -n tekton-pipelines | grep tekton-pipelines-controller) -n tekton-pipelines -o yaml | grep katib

- docker.io/kubeflowkatib/file-metrics-collector
```

### Katib Controller

To run Tekton Pipelines within Katib Trials, you have to update the Katib
[ClusterRole's rules](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/rbac.yaml#L5)
with the appropriate permission:

```yaml
- apiGroups:
    - tekton.dev
  resources:
    - pipelineruns
    - taskruns
  verbs:
    - "*"
```

Run the following command to update Katib ClusterRole:

```bash
kubectl patch ClusterRole katib-controller -n kubeflow --type=json \
-p='[{"op": "add", "path": "/rules/-", "value": {"apiGroups":["tekton.dev"],"resources":["pipelineruns", "taskruns"],"verbs":["*"]}}]'
```

In addition to that, you have to modify Katib
[Controller args](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/controller.yaml#L27)
with the new flag `--trial-resources`.

Run the following command to update Katib Controller args:

```bash
kubectl patch Deployment katib-controller -n kubeflow --type=json \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--trial-resources=PipelineRun.v1beta1.tekton.dev"}]'
```

Check that Katib Controller's pod was restarted:

```bash
$ kubectl get pods -n kubeflow

NAME                                READY   STATUS      RESTARTS   AGE
katib-cert-generator-hnv6q          0/1     Completed   0          6m12s
katib-controller-784994d449-9bgj9   1/1     Running     0          28s
katib-db-manager-78697c7bd4-ck7l8   1/1     Running     0          6m13s
katib-mysql-854cdb87c4-krcm9        1/1     Running     0          6m13s
katib-ui-57b9d7f6dd-cv6gn           1/1     Running     0          6m13s
```

Check logs from Katib Controller to verify Tekton Pipelines integration:

```bash
$ kubectl logs $(kubectl get pods -n kubeflow -o name | grep katib-controller) -n kubeflow | grep '"CRD Kind":"PipelineRun"'

{"level":"info","ts":1628032648.6285546,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
```

If you ran the above steps successfully, you should be able to run Tekton Pipelines examples.

Learn more about using a custom Kubernetes resource as a Trial template in the
[official Kubeflow guides](https://www.kubeflow.org/docs/components/katib/trial-template/#use-custom-kubernetes-resource-as-a-trial-template).
1 change: 0 additions & 1 deletion manifests/v1beta1/components/controller/controller.yaml
@@ -32,7 +32,6 @@ spec:
             - "--trial-resources=MPIJob.v1.kubeflow.org"
             # TODO (andreyvelich): Change to v1.kubeflow.org once all-in-one operator is finished.
             - "--trial-resources=XGBoostJob.v1.xgboostjob.kubeflow.org"
-            - "--trial-resources=PipelineRun.v1beta1.tekton.dev"
           ports:
             - containerPort: 8443
               name: webhook
7 changes: 0 additions & 7 deletions manifests/v1beta1/components/controller/rbac.yaml
@@ -62,13 +62,6 @@ rules:
- xgboostjobs
verbs:
- "*"
-  - apiGroups:
-      - tekton.dev
-    resources:
-      - pipelineruns
-      - taskruns
-    verbs:
-      - "*"
---
apiVersion: v1
kind: ServiceAccount