-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Support custom CRD in Trial Job #1273
Changes from all commits
ce632c5
18b51fb
c8ea425
fb0e862
2bc51db
224829b
b2fc7d9
67156c6
96d4e54
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,189 @@ | ||
# Support custom CRD in Trial Job proposal | ||
|
||
<!-- START doctoc generated TOC please keep comment here to allow auto update --> | ||
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> | ||
|
||
## Table of Contents | ||
|
||
- [Motivation](#motivation) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Implementation](#implementation) | ||
- [API](#api) | ||
- [Trial controller watchers](#trial-controller-watchers) | ||
- [Primary pod label location](#primary-pod-label-location) | ||
- [Training container name](#training-container-name) | ||
- [Start metrics collector parser](#start-metrics-collector-parser) | ||
- [Succeeded status of running CRD](#succeeded-status-of-running-crd) | ||
- [Istio sidecar container](#istio-sidecar-container) | ||
|
||
<!-- END doctoc generated TOC please keep comment here to allow auto update --> | ||
|
||
Created by [doctoc](https://github.com/thlorenz/doctoc). | ||
|
||
## Motivation | ||
|
||
Running trial is one of the essential steps of executing Katib experiments. | ||
We have implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202) | ||
and [katib/pull#1215](https://github.com/kubeflow/katib/pull/1215)) to make | ||
experiments valid YAML and make Katib more Kubernetes native. | ||
After migrating to the new API, users still can run only [BatchJob](https://kubernetes.io/docs/concepts/workloads/controllers/job/), | ||
[TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) in trial job. | ||
If we want to support new CRD, we need to manually change Katib controller source code. | ||
|
||
This approach makes impossible to use other CRDs in trial template, even if they satisfy trial job design. | ||
The number of various Kubernetes CRDs grows significantly and many users would like to use them in Katib | ||
(e.g, [katib/issue#1081, support Argo Workflow](https://github.com/kubeflow/katib/issues/1081)). | ||
Another reason to design unified approach is that CRD controller can have `Go` package versions | ||
that Katib controller doesn't support (e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)). | ||
|
||
That is why we propose a new controller design to support custom CRD in trial job and make Katib usable for various Kubernetes resources. | ||
To make this possible, we are changing API, trial controller, job provider, mutation webhook, metrics collector. | ||
|
||
## Goals | ||
|
||
1. Allow dynamic watchers for the custom CRD. | ||
2. Inject Katib sidecar container on training pod. | ||
3. Indicate training container for metrics collector execution. | ||
4. Run metrics collector parser after all pod processes completion. | ||
5. Get succeeded condition of running CRD. | ||
6. Verify that `sidecar.istio.io/inject: false` label is added. | ||
|
||
## Non-Goals | ||
|
||
1. Inject Katib sidecar container on more than one pod simultaneously. | ||
2. Specify list of succeeded conditions for the custom CRD. | ||
3. Dynamically add new trial watcher for the custom CRD without Katib restart. | ||
|
||
## Implementation | ||
|
||
During implementation this feature, we should not brake current Katib controller logic. | ||
Also, we need to make sure that CI is stable and it does not block other Katib work tasks. | ||
After completion, we can clean-up redundant code. | ||
|
||
### API | ||
|
||
To achieve above goals, we introduce these `TrialTemplate` API changes. | ||
|
||
```go | ||
|
||
// TrialTemplate describes structure of Trial template | ||
type TrialTemplate struct { | ||
// Retain indicates that Trial resources must be not cleanup | ||
Retain bool `json:"retain,omitempty"` | ||
|
||
// Source for Trial template (unstructured structure or config map) | ||
TrialSource `json:",inline"` | ||
|
||
// List of parameters that are used in Trial template | ||
TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"` | ||
|
||
// Label that determines if pod needs to be injected by Katib sidecar container | ||
PrimaryPodLabel map[string]string `json:"primaryPodLabel,omitempty"` | ||
|
||
// Name of training container where training is running | ||
PrimaryContainerName string `json:"primaryContainerName,omitempty"` | ||
|
||
// Name of condition when Trial custom resource is succeeded | ||
SucceededCondition string `json:"succeededCondition,omitempty"` | ||
|
||
} | ||
``` | ||
|
||
### Trial controller watchers | ||
|
||
In the current design trial controller watches | ||
[three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). | ||
To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we can use configMap to define these trial-resources ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently, in |
||
to Katib controller, which represents resources that can be used in trial template. | ||
This flag contains custom CRD's `Group`, `Version`, `Kind` in `kind.version.group` format which needs to create controller watchers. | ||
Trial controller iterates over these parameters and creates watchers. | ||
|
||
For example, if trial can run TFJob, Argo Workflow and Kubernetes Batch Jobs, Katib controller flags must be: | ||
|
||
```yaml | ||
. . . | ||
args: | ||
- "-webhook-port=8443" | ||
- "-trial-resource=TFJob.v1.kubeflow.org". | ||
- "-trial-resource=Workflow.v1alpha1.argoproj.io" | ||
- "-trial-resource=Job.v1.batch" | ||
. . . | ||
``` | ||
|
||
### Primary pod label location | ||
|
||
Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72) | ||
metrics collector for TFJob and PyTorchJob only for _master_ pods with labels previously saved in controller constants. | ||
|
||
We added a new `PrimaryPodLabel` parameter in `TrialTemplate` API to find primary pod that needs to be injected by Katib sidecar container. | ||
User can define the key and value of the pod label which Katib must inject with sidecar container. | ||
|
||
For example, for TFJob: | ||
|
||
```yaml | ||
. . . | ||
PrimaryPodLabel: | ||
"job-role": "master" | ||
. . . | ||
``` | ||
|
||
### Training container name | ||
|
||
In the current design we compare container name with | ||
[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob | ||
to find pod container where actual training is happening and metrics collector must parse metrics. | ||
|
||
We introduce a new `PrimaryContainerName` field, where user can set container name with running training program to find proper training container. | ||
|
||
For example, if training is running on container with `pytorch` name: | ||
|
||
```yaml | ||
. . . | ||
PrimaryContainerName: "pytorch" | ||
. . . | ||
``` | ||
|
||
### Start metrics collector parser | ||
|
||
As discussed in [katib/issue#1214](https://github.com/kubeflow/katib/issues/1214#issuecomment-642168716), | ||
metrics collector starts parsing metrics only after all injected pod processes were finished. | ||
That can avoid problems with other sidecar containers that various CRD can have. | ||
|
||
We need to verify that [distributive training](https://docs.fast.ai/distributed.html#launch-your-training) | ||
with more than one active process also works with this approach. | ||
|
||
### Succeeded condition of running CRD | ||
|
||
We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60) | ||
to check succeeded conditions for the TFJob and PyTorchJob as `unstructured` objects by | ||
[comparing](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161) | ||
`.status.conditions[x].type` value with `Succeeded` value. | ||
|
||
Different CRD can have unique status design (e.g, Kubernetes batch job succeeded condition is | ||
[`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). | ||
We add a new parameter `SucceededCondition` to get CRD succeeded condition value and trigger trial controller. | ||
Trial controller checks all running job conditions and verifies that running job has appropriate `type` | ||
in `.status.conditions` with `status=True`. | ||
We also should transform `reason` and `message` from custom CRD to the trial conditions, if it is available. | ||
|
||
For example, for TFJob: | ||
|
||
```yaml | ||
. . . | ||
SucceededCondition: Succeeded | ||
. . . | ||
``` | ||
|
||
### Istio sidecar container | ||
|
||
Previously, we had problems with Istio sidecar containers, | ||
check [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742). | ||
In some cases, it is unable to properly download datasets in training pod. | ||
It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job in Katib controller. | ||
|
||
Various CRD can have unified design and it is hard to understand where annotation must be specified | ||
to disable Istio injection for the running pods. | ||
We need to update all Katib examples manually and add this annotation to every trial template. | ||
|
||
This exception has to be documented and new Katib examples have to include this annotation in templates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about how will these extra fields be filled ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this design it will look like this:
Not sure if it is the best design.
We can follow the same API as
metricStrategies
:Does it make sense, if we have can set 1 label currently?
WDYT @gaocegege @johnugeorge ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should provide a map or a slice here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Map looks good to me, since it follow k8s way: https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/types.go#L226-L231.