Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Support custom CRD in Trial Job #1273

Merged
merged 9 commits into from
Aug 14, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
189 changes: 189 additions & 0 deletions docs/proposals/trial-custom-crd.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,189 @@
# Support custom CRD in Trial Job proposal

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

## Table of Contents

- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Implementation](#implementation)
- [API](#api)
- [Trial controller watchers](#trial-controller-watchers)
- [Primary pod label location](#primary-pod-label-location)
- [Training container name](#training-container-name)
- [Start metrics collector parser](#start-metrics-collector-parser)
- [Succeeded status of running CRD](#succeeded-status-of-running-crd)
- [Istio sidecar container](#istio-sidecar-container)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

Created by [doctoc](https://github.com/thlorenz/doctoc).

## Motivation

Running trial is one of the essential steps of executing Katib experiments.
We have implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202)
and [katib/pull#1215](https://github.com/kubeflow/katib/pull/1215)) to make
experiments valid YAML and make Katib more Kubernetes native.
After migrating to the new API, users still can run only [BatchJob](https://kubernetes.io/docs/concepts/workloads/controllers/job/),
[TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) in trial job.
If we want to support new CRD, we need to manually change Katib controller source code.

This approach makes impossible to use other CRDs in trial template, even if they satisfy trial job design.
The number of various Kubernetes CRDs grows significantly and many users would like to use them in Katib
(e.g, [katib/issue#1081, support Argo Workflow](https://github.com/kubeflow/katib/issues/1081)).
Another reason to design unified approach is that CRD controller can have `Go` package versions
that Katib controller doesn't support (e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)).

That is why we propose a new controller design to support custom CRD in trial job and make Katib usable for various Kubernetes resources.
To make this possible, we are changing API, trial controller, job provider, mutation webhook, metrics collector.

## Goals

1. Allow dynamic watchers for the custom CRD.
2. Inject Katib sidecar container on training pod.
3. Indicate training container for metrics collector execution.
4. Run metrics collector parser after all pod processes completion.
5. Get succeeded condition of running CRD.
6. Verify that `sidecar.istio.io/inject: false` label is added.

## Non-Goals

1. Inject Katib sidecar container on more than one pod simultaneously.
2. Specify list of succeeded conditions for the custom CRD.
3. Dynamically add new trial watcher for the custom CRD without Katib restart.

## Implementation

During implementation this feature, we should not brake current Katib controller logic.
Also, we need to make sure that CI is stable and it does not block other Katib work tasks.
After completion, we can clean-up redundant code.

### API

To achieve above goals, we introduce these `TrialTemplate` API changes.

```go

// TrialTemplate describes structure of Trial template
type TrialTemplate struct {
// Retain indicates that Trial resources must be not cleanup
Retain bool `json:"retain,omitempty"`

// Source for Trial template (unstructured structure or config map)
TrialSource `json:",inline"`

// List of parameters that are used in Trial template
TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`

// Label that determines if pod needs to be injected by Katib sidecar container
PrimaryPodLabel map[string]string `json:"primaryPodLabel,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about how will these extra fields be filled ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this design it will look like this:

PrimaryPodLabel:
  "label-key": "label-value"

Not sure if it is the best design.
We can follow the same API as metricStrategies:

. . .
PrimaryPodLabel *PrimaryPodLabel
. . .

type PrimaryPodLabel struct {
	Name  string `json:"name,omitempty"`
	Value string `json:"value,omitempty"`
}

Does it make sense, if we have can set 1 label currently?

WDYT @gaocegege @johnugeorge ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should provide a map or a slice here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// Name of training container where training is running
PrimaryContainerName string `json:"primaryContainerName,omitempty"`

// Name of condition when Trial custom resource is succeeded
SucceededCondition string `json:"succeededCondition,omitempty"`

}
```

### Trial controller watchers

In the current design trial controller watches
[three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125).
To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use configMap to define these trial-resources ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, in katib-config ConfigMap we can set only Suggestion and Metrics collector settings.
Also, we added Watch only when controller starts: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L79.
I think later when we implement dynamically wathers update, we can think about better design how we can send these resources to the Controller.

to Katib controller, which represents resources that can be used in trial template.
This flag contains custom CRD's `Group`, `Version`, `Kind` in `kind.version.group` format which needs to create controller watchers.
Trial controller iterates over these parameters and creates watchers.

For example, if trial can run TFJob, Argo Workflow and Kubernetes Batch Jobs, Katib controller flags must be:

```yaml
. . .
args:
- "-webhook-port=8443"
- "-trial-resource=TFJob.v1.kubeflow.org".
- "-trial-resource=Workflow.v1alpha1.argoproj.io"
- "-trial-resource=Job.v1.batch"
. . .
```

### Primary pod label location

Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72)
metrics collector for TFJob and PyTorchJob only for _master_ pods with labels previously saved in controller constants.

We added a new `PrimaryPodLabel` parameter in `TrialTemplate` API to find primary pod that needs to be injected by Katib sidecar container.
User can define the key and value of the pod label which Katib must inject with sidecar container.

For example, for TFJob:

```yaml
. . .
PrimaryPodLabel:
"job-role": "master"
. . .
```

### Training container name

In the current design we compare container name with
[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob
to find pod container where actual training is happening and metrics collector must parse metrics.

We introduce a new `PrimaryContainerName` field, where user can set container name with running training program to find proper training container.

For example, if training is running on container with `pytorch` name:

```yaml
. . .
PrimaryContainerName: "pytorch"
. . .
```

### Start metrics collector parser

As discussed in [katib/issue#1214](https://github.com/kubeflow/katib/issues/1214#issuecomment-642168716),
metrics collector starts parsing metrics only after all injected pod processes were finished.
That can avoid problems with other sidecar containers that various CRD can have.

We need to verify that [distributive training](https://docs.fast.ai/distributed.html#launch-your-training)
with more than one active process also works with this approach.

### Succeeded condition of running CRD

We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60)
to check succeeded conditions for the TFJob and PyTorchJob as `unstructured` objects by
[comparing](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161)
`.status.conditions[x].type` value with `Succeeded` value.

Different CRD can have unique status design (e.g, Kubernetes batch job succeeded condition is
[`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)).
We add a new parameter `SucceededCondition` to get CRD succeeded condition value and trigger trial controller.
Trial controller checks all running job conditions and verifies that running job has appropriate `type`
in `.status.conditions` with `status=True`.
We also should transform `reason` and `message` from custom CRD to the trial conditions, if it is available.

For example, for TFJob:

```yaml
. . .
SucceededCondition: Succeeded
. . .
```

### Istio sidecar container

Previously, we had problems with Istio sidecar containers,
check [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742).
In some cases, it is unable to properly download datasets in training pod.
It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job in Katib controller.

Various CRD can have unified design and it is hard to understand where annotation must be specified
to disable Istio injection for the running pods.
We need to update all Katib examples manually and add this annotation to every trial template.

This exception has to be documented and new Katib examples have to include this annotation in templates.