From ce632c590ecf76cc3b77952cef6f48954420508f Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 02:56:36 +0100 Subject: [PATCH 1/9] Add proposal for custom CRD in Trial Template --- docs/proposals/trial-custom-crd.md | 149 +++++++++++++++++++++++++++++ 1 file changed, 149 insertions(+) create mode 100644 docs/proposals/trial-custom-crd.md diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md new file mode 100644 index 00000000000..78e8df3e271 --- /dev/null +++ b/docs/proposals/trial-custom-crd.md @@ -0,0 +1,149 @@ +# Support custom CRD as Trial Template proposal + + + + +**Table of Contents** _generated with [DocToc](https://github.com/thlorenz/doctoc)_ + +- [Motivation](#motivation) +- [Goals](#goals) +- [Non-Goals](#non-goals) +- [Implementation](#implementation) + - [API](#api) + - [Trial controller watchers](#trial-controller-watchers) + - [Primary pod label location](#primary-pod-label-location) + - [Training container name](#training-container-name) + - [Start metrics collector parser](#start-metrics-collector-parser) + - [Succeeded status of running CRD](#succeeded-status-of-running-crd) + - [Istio sidecar container](#istio-sidecar-container) + + + +## Motivation + +Running trial is one of the essential step of running Katib experiments. We implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202) and [katib/pull#1215](https://github.com/kubeflow/katib/pull/1215)) to have valid YAML in experiments and make Katib more Kubernetes native. +After migrating to the new API, users still can run only [BatchJob](https://kubernetes.io/docs/concepts/workloads/controllers/job/), [TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) as trial. If we want to support new CRD, we need to manually change Katib controller source code. This approach makes it impossible to use other CRDs as trial job, even if they can be used in trial design. Number of various Kubernetes CRDs grows significantly and many users wants to use them in Katib (e.g, [katib/issue#1081 (Support Argo Workflow)](https://github.com/kubeflow/katib/issues/1081)). +Another reason to design unify approach is that CRDs can be depended on go package versions that Katib controller doesn't support and make it not possible to build the controller (e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)). + +Thus we propose new design to support custom CRDs in template and make Katib usable for various Kubernetes resources. To make this possible we make changes in API, trial controller, job provider, mutation webhook, metrics collector. + +## Goals + +1. Setup trial controller to watch for the custom CRD changes. +2. Inject Katib sidecar container on training pod. +3. Find the container for Katib injection in list of training pod containers. +4. Run metrics collector parser after completion of all pod processes. +5. Get succeeded status of running CRD. +6. Verify that `sidecar.istio.io/inject: false` label is added. + +## Non-Goals + +1. Support to inject Katib sidecar container on more than one pod simultaneously. +2. Support list of succeeded conditions for the custom CRD. +3. Dynamically add new trial watcher for the custom CRD without Katib restart. + +## Implementation + +During implementation this feature we should not brake current Katib controller logic to make sure that CI is stable and it not blocks other Katib related tasks. +After completion, we can clean-up redundant code. + +### API + +To achieve above goals, we introduce these `TrialTemplate` API changes. + +```go + +// TrialTemplate describes structure of Trial template +type TrialTemplate struct { + // Retain indicates that Trial resources must be not cleanup + Retain bool `json:"retain,omitempty"` + + // Source for Trial template (unstructured structure or config map) + TrialSource `json:",inline"` + + // List of parameters that are used in Trial template + TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"` + + // Label that determines if pod needs to be injected by Katib sidecar container + PrimaryPodLabel map[string]string `json:"primaryPodLabel,omitempty"` + + // Name of training container where training is running + PrimaryContainerName string `json:"primaryContainerName,omitempty"` + + // Name of training container where training is running + PrimaryContainerName string `json:"primaryContainerName,omitempty"` + + // Name of condition when Trial custom resource is succeeded + SucceededCondition string `json:"succeededCondition,omitempty"` + +} +``` + +### Trial controller watchers + +Currently, trial controller watches for [the three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`) to Katib controller, which represents resources that can be used in Trial template. This flag contains `Group`, `Version`, `Kind` of custom CRD which needs to create controller watchers. Trial controller iterates over these params and create watchers. + +For example, if Trial can run TFJob, Argo Workflow and k8s Batch Jobs, Katib controller flags must be: + +```yaml +. . . +args: + - "-webhook-port=8443" + - "-trial-resource=TFJob.v1.kubeflow.org". + - "-trial-resource=Workflow.v1alpha1.argoproj.io" + - "-trial-resource=Job.v1.batch" +. . . +``` + +### Primary pod label location + +Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72) metrics collector for TFJob and PyTorchJob only for master pods with labels previously saved in controller constants. To find primary pod that needs to be injected by Katib sidecar container, we added new `PrimaryPodLabel` parameter in `TrialTemplate` API. User can define the key and value of the pod label where Katib must inject sidecar container. + +For example, for TFJob: + +```yaml +. . . +PrimaryPodLabel: + "job-role": "master" +. . . +``` + +### Training container name + +In current design, to find pod container where actual training is happening and metrics collector must parse metrics, we compare compare name with [default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob. To find training container we introduce new `PrimaryContainerName` field, where user can set container name with running training program. + +For example, in training is running on container with "pytorch" name: + +```yaml +. . . +PrimaryContainerName: "pytorch" +. . . +``` + +### Start metrics collector parser + +As discussed in [katib/issue#1214](https://github.com/kubeflow/katib/issues/1214#issuecomment-642168716), metrics collector must start parsing metrics only after all injected pod processes are finished. That can avoid problems with other sidecar containers that various CRD can have. + +We need to verify that [distributive training](https://docs.fast.ai/distributed.html#launch-your-training) with more than one active process also works with this approach. + +### Succeeded status of running CRD + +We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60) to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by [compare](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161) `.status` value with `succeeded` value. + +Different CRD can have unique status design (e.g, Kubernetes batch job succeeded status is [`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). To get CRD succeeded status value and trigger trial controller, we add new parameters `SucceededCondition`. Trial controller checks all running job conditions and verifies that running job has appropriate `type` in `.status.conditions` with `status=True`. We also should transform `reason` and `message`, if it is available` to trial conditions. + +For example for TFJob + +```yaml +. . . +SucceededCondition: Succeeded +. . . +``` + +### Istio sidecar container + +Previously, we had problems with Istio sidecar containers, see [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742). In some cases, it is unable to properly download datasets in training pod. It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job. + +Various CRD can have unify design and it is hard to understand where annotation must be set to disable Istio injection for the running pods. We should manually update all Katib examples and add this annotation to every trial template. + +This exception must be documented and new Katib examples must have this annotation in templates. From 18b51fb567a4a1d3d85d053582cfde5e3d42f5df Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 03:15:24 +0100 Subject: [PATCH 2/9] Fix --- docs/proposals/trial-custom-crd.md | 72 +++++++++++++++++++++++------- 1 file changed, 56 insertions(+), 16 deletions(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index 78e8df3e271..02d7d55669b 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -1,4 +1,4 @@ -# Support custom CRD as Trial Template proposal +# Support custom CRD as a Trial Job proposal @@ -21,11 +21,22 @@ ## Motivation -Running trial is one of the essential step of running Katib experiments. We implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202) and [katib/pull#1215](https://github.com/kubeflow/katib/pull/1215)) to have valid YAML in experiments and make Katib more Kubernetes native. -After migrating to the new API, users still can run only [BatchJob](https://kubernetes.io/docs/concepts/workloads/controllers/job/), [TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) as trial. If we want to support new CRD, we need to manually change Katib controller source code. This approach makes it impossible to use other CRDs as trial job, even if they can be used in trial design. Number of various Kubernetes CRDs grows significantly and many users wants to use them in Katib (e.g, [katib/issue#1081 (Support Argo Workflow)](https://github.com/kubeflow/katib/issues/1081)). -Another reason to design unify approach is that CRDs can be depended on go package versions that Katib controller doesn't support and make it not possible to build the controller (e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)). - -Thus we propose new design to support custom CRDs in template and make Katib usable for various Kubernetes resources. To make this possible we make changes in API, trial controller, job provider, mutation webhook, metrics collector. +Running trial is one of the essential step of running Katib experiments. +We implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202) +and [katib/pull#1215](https://github.com/kubeflow/katib/pull/1215)) to make +experiments valid YAMLand make Katib more Kubernetes native. +After migrating to the new API, users still can run only [BatchJob](https://kubernetes.io/docs/concepts/workloads/controllers/job/), +[TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) as a trial job. +If we want to support new CRD, we need to manually change Katib controller source code. +This approach makes impossible to use other CRDs in trial template, even if they can be used in trial design. +Number of various Kubernetes CRDs grows significantly and many users wants to use them in Katib +(e.g, [katib/issue#1081 (Support Argo Workflow)](https://github.com/kubeflow/katib/issues/1081)). +Another reason to design unify approach is that CRD controller can have go package dependencies versions +that Katib controller doesn't support and make it not possible to build the controller +(e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)). + +Thus we propose new design to support custom CRDs in template and make Katib usable for various Kubernetes resources. +To make this possible we make changes in API, trial controller, job provider, mutation webhook, metrics collector. ## Goals @@ -44,7 +55,8 @@ Thus we propose new design to support custom CRDs in template and make Katib usa ## Implementation -During implementation this feature we should not brake current Katib controller logic to make sure that CI is stable and it not blocks other Katib related tasks. +During implementation this feature we should not brake current Katib controller logic to +make sure that CI is stable and it not blocks other Katib related tasks. After completion, we can clean-up redundant code. ### API @@ -81,7 +93,12 @@ type TrialTemplate struct { ### Trial controller watchers -Currently, trial controller watches for [the three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`) to Katib controller, which represents resources that can be used in Trial template. This flag contains `Group`, `Version`, `Kind` of custom CRD which needs to create controller watchers. Trial controller iterates over these params and create watchers. +Currently, trial controller watches for +[the three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). +To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`) +to Katib controller, which represents resources that can be used in Trial template. +This flag contains `Group`, `Version`, `Kind` of custom CRD which needs to create controller watchers. +Trial controller iterates over these params and create watchers. For example, if Trial can run TFJob, Argo Workflow and k8s Batch Jobs, Katib controller flags must be: @@ -97,7 +114,11 @@ args: ### Primary pod label location -Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72) metrics collector for TFJob and PyTorchJob only for master pods with labels previously saved in controller constants. To find primary pod that needs to be injected by Katib sidecar container, we added new `PrimaryPodLabel` parameter in `TrialTemplate` API. User can define the key and value of the pod label where Katib must inject sidecar container. +Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72) +metrics collector for TFJob and PyTorchJob only for master pods with labels previously saved in controller constants. +To find primary pod that needs to be injected by Katib sidecar container, +we added new `PrimaryPodLabel` parameter in `TrialTemplate` API. +User can define the key and value of the pod label where Katib must inject sidecar container. For example, for TFJob: @@ -110,7 +131,10 @@ PrimaryPodLabel: ### Training container name -In current design, to find pod container where actual training is happening and metrics collector must parse metrics, we compare compare name with [default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob. To find training container we introduce new `PrimaryContainerName` field, where user can set container name with running training program. +In current design, we compare compare name with +[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob, +to find pod container where actual training is happening and metrics collector must parse metrics. +To find training container, we introduce new `PrimaryContainerName` field, where user can set container name with running training program. For example, in training is running on container with "pytorch" name: @@ -122,15 +146,26 @@ PrimaryContainerName: "pytorch" ### Start metrics collector parser -As discussed in [katib/issue#1214](https://github.com/kubeflow/katib/issues/1214#issuecomment-642168716), metrics collector must start parsing metrics only after all injected pod processes are finished. That can avoid problems with other sidecar containers that various CRD can have. +As discussed in [katib/issue#1214](https://github.com/kubeflow/katib/issues/1214#issuecomment-642168716), +metrics collector must start parsing metrics only after all injected pod processes are finished. +That can avoid problems with other sidecar containers that various CRD can have. -We need to verify that [distributive training](https://docs.fast.ai/distributed.html#launch-your-training) with more than one active process also works with this approach. +We need to verify that [distributive training](https://docs.fast.ai/distributed.html#launch-your-training) +with more than one active process also works with this approach. ### Succeeded status of running CRD -We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60) to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by [compare](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161) `.status` value with `succeeded` value. +We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60) +to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by +[compare](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161) +`.status` value with `succeeded` value. -Different CRD can have unique status design (e.g, Kubernetes batch job succeeded status is [`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). To get CRD succeeded status value and trigger trial controller, we add new parameters `SucceededCondition`. Trial controller checks all running job conditions and verifies that running job has appropriate `type` in `.status.conditions` with `status=True`. We also should transform `reason` and `message`, if it is available` to trial conditions. +Different CRD can have unique status design (e.g, Kubernetes batch job succeeded status is +[`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). +To get CRD succeeded status value and trigger trial controller, we add new parameters `SucceededCondition`. +Trial controller checks all running job conditions and verifies that running job has appropriate `type` +in `.status.conditions` with `status=True`. +We also should transform `reason` and `message`, if it is available` to trial conditions. For example for TFJob @@ -142,8 +177,13 @@ SucceededCondition: Succeeded ### Istio sidecar container -Previously, we had problems with Istio sidecar containers, see [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742). In some cases, it is unable to properly download datasets in training pod. It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job. +Previously, we had problems with Istio sidecar containers, +see [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742). +In some cases, it is unable to properly download datasets in training pod. +It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job. -Various CRD can have unify design and it is hard to understand where annotation must be set to disable Istio injection for the running pods. We should manually update all Katib examples and add this annotation to every trial template. +Various CRD can have unify design and it is hard to understand where annotation must be set +to disable Istio injection for the running pods. +We should manually update all Katib examples and add this annotation to every trial template. This exception must be documented and new Katib examples must have this annotation in templates. From c8ea425ca51c08f3225e1a66755b44fb89baa4ab Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 03:22:36 +0100 Subject: [PATCH 3/9] Modify doctoc --- docs/proposals/trial-custom-crd.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index 02d7d55669b..23d8b39d0b9 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -2,8 +2,7 @@ - -**Table of Contents** _generated with [DocToc](https://github.com/thlorenz/doctoc)_ +## Table of Contents - [Motivation](#motivation) - [Goals](#goals) @@ -19,6 +18,8 @@ +Created by [doctoc](https://github.com/thlorenz/doctoc) + ## Motivation Running trial is one of the essential step of running Katib experiments. From fb0e86262fe4facc6a57ce21dba7666bdadfe134 Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 03:59:32 +0100 Subject: [PATCH 4/9] Doc fixes --- docs/proposals/trial-custom-crd.md | 80 +++++++++++++++--------------- 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index 23d8b39d0b9..c1818c7ba18 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -2,6 +2,7 @@ + ## Table of Contents - [Motivation](#motivation) @@ -18,33 +19,34 @@ -Created by [doctoc](https://github.com/thlorenz/doctoc) +Created by [doctoc](https://github.com/thlorenz/doctoc). ## Motivation Running trial is one of the essential step of running Katib experiments. We implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202) and [katib/pull#1215](https://github.com/kubeflow/katib/pull/1215)) to make -experiments valid YAMLand make Katib more Kubernetes native. +experiments valid YAML and make Katib more Kubernetes native. After migrating to the new API, users still can run only [BatchJob](https://kubernetes.io/docs/concepts/workloads/controllers/job/), [TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) as a trial job. -If we want to support new CRD, we need to manually change Katib controller source code. -This approach makes impossible to use other CRDs in trial template, even if they can be used in trial design. -Number of various Kubernetes CRDs grows significantly and many users wants to use them in Katib -(e.g, [katib/issue#1081 (Support Argo Workflow)](https://github.com/kubeflow/katib/issues/1081)). -Another reason to design unify approach is that CRD controller can have go package dependencies versions +If we would like to support new CRD, we need to manually change Katib controller source code. + +This approach makes impossible to use other CRDs in trial template, even if they can satisfies trial job design. +Number of various Kubernetes CRDs grows significantly and many users would like to use them in Katib +(e.g, [katib/issue#1081, support Argo Workflow)](https://github.com/kubeflow/katib/issues/1081)). +Another reason to design unify approach, is that CRD controller can have go package dependencies versions that Katib controller doesn't support and make it not possible to build the controller (e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)). -Thus we propose new design to support custom CRDs in template and make Katib usable for various Kubernetes resources. -To make this possible we make changes in API, trial controller, job provider, mutation webhook, metrics collector. +Thus we propose new controller design to support custom CRD as Trial job and make Katib usable for various Kubernetes resources. +To make this possible, we make changes in API, trial controller, job provider, mutation webhook, metrics collector. ## Goals -1. Setup trial controller to watch for the custom CRD changes. +1. Watch for the custom CRD in trial controller. 2. Inject Katib sidecar container on training pod. -3. Find the container for Katib injection in list of training pod containers. -4. Run metrics collector parser after completion of all pod processes. +3. Indicate training container for metrics collector execution. +4. Run metrics collector parser after all pod processes completion. 5. Get succeeded status of running CRD. 6. Verify that `sidecar.istio.io/inject: false` label is added. @@ -56,8 +58,8 @@ To make this possible we make changes in API, trial controller, job provider, mu ## Implementation -During implementation this feature we should not brake current Katib controller logic to -make sure that CI is stable and it not blocks other Katib related tasks. +During implementation this feature, we should not brake current Katib controller logic. +And we need to make sure that CI is stable and it not blocks other Katib work tasks. After completion, we can clean-up redundant code. ### API @@ -68,13 +70,13 @@ To achieve above goals, we introduce these `TrialTemplate` API changes. // TrialTemplate describes structure of Trial template type TrialTemplate struct { - // Retain indicates that Trial resources must be not cleanup - Retain bool `json:"retain,omitempty"` + // Retain indicates that Trial resources must be not cleanup + Retain bool `json:"retain,omitempty"` - // Source for Trial template (unstructured structure or config map) - TrialSource `json:",inline"` + // Source for Trial template (unstructured structure or config map) + TrialSource `json:",inline"` - // List of parameters that are used in Trial template + // List of parameters that are used in Trial template TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"` // Label that determines if pod needs to be injected by Katib sidecar container @@ -83,9 +85,6 @@ type TrialTemplate struct { // Name of training container where training is running PrimaryContainerName string `json:"primaryContainerName,omitempty"` - // Name of training container where training is running - PrimaryContainerName string `json:"primaryContainerName,omitempty"` - // Name of condition when Trial custom resource is succeeded SucceededCondition string `json:"succeededCondition,omitempty"` @@ -94,14 +93,14 @@ type TrialTemplate struct { ### Trial controller watchers -Currently, trial controller watches for -[the three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). +Currently, trial controller watches for the +[three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`) to Katib controller, which represents resources that can be used in Trial template. -This flag contains `Group`, `Version`, `Kind` of custom CRD which needs to create controller watchers. +This flag contains custom CRD `Group`, `Version`, `Kind` in format `kind.version.group` which needs to create controller watchers. Trial controller iterates over these params and create watchers. -For example, if Trial can run TFJob, Argo Workflow and k8s Batch Jobs, Katib controller flags must be: +For example, if Trial can run TFJob, Argo Workflow and Kubernetes Batch Jobs, Katib controller flags must be: ```yaml . . . @@ -116,10 +115,11 @@ args: ### Primary pod label location Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72) -metrics collector for TFJob and PyTorchJob only for master pods with labels previously saved in controller constants. +metrics collector for TFJob and PyTorchJob only for _master_ pods with labels previously saved in controller constants. + To find primary pod that needs to be injected by Katib sidecar container, we added new `PrimaryPodLabel` parameter in `TrialTemplate` API. -User can define the key and value of the pod label where Katib must inject sidecar container. +User can define the key and value of the pod label which Katib must inject with sidecar container. For example, for TFJob: @@ -132,12 +132,12 @@ PrimaryPodLabel: ### Training container name -In current design, we compare compare name with -[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob, -to find pod container where actual training is happening and metrics collector must parse metrics. -To find training container, we introduce new `PrimaryContainerName` field, where user can set container name with running training program. +In current design, to find pod container where actual training is happening and metrics collector must parse metrics, we compare container name with +[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob. + +To find proper training container, we introduce new `PrimaryContainerName` field, where user can set container name with running training program. -For example, in training is running on container with "pytorch" name: +For example, if training is running on container with `pytorch` name: ```yaml . . . @@ -158,17 +158,17 @@ with more than one active process also works with this approach. We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60) to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by -[compare](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161) -`.status` value with `succeeded` value. +[comparing](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161) +`.status.conditions[x].type` value with `Succeeded` value. Different CRD can have unique status design (e.g, Kubernetes batch job succeeded status is [`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). To get CRD succeeded status value and trigger trial controller, we add new parameters `SucceededCondition`. Trial controller checks all running job conditions and verifies that running job has appropriate `type` in `.status.conditions` with `status=True`. -We also should transform `reason` and `message`, if it is available` to trial conditions. +We also should transform `reason` and `message`, if it is available, to trial conditions. -For example for TFJob +For example for TFJob: ```yaml . . . @@ -181,10 +181,10 @@ SucceededCondition: Succeeded Previously, we had problems with Istio sidecar containers, see [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742). In some cases, it is unable to properly download datasets in training pod. -It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job. +It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job in Katib controller. -Various CRD can have unify design and it is hard to understand where annotation must be set +Various CRD can have unify design and it is hard to understand where annotation must be specified to disable Istio injection for the running pods. We should manually update all Katib examples and add this annotation to every trial template. -This exception must be documented and new Katib examples must have this annotation in templates. +This exception must be documented and new Katib examples must include this annotation in templates. From 2bc51db72ac62f37d239cbe692550afab38425d7 Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 04:02:53 +0100 Subject: [PATCH 5/9] Rename header --- docs/proposals/trial-custom-crd.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index c1818c7ba18..808a20adc9c 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -1,4 +1,4 @@ -# Support custom CRD as a Trial Job proposal +# Support custom CRD in Trial Job proposal From 224829b566c93d3ed1978dc94e2600a17b908876 Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 05:11:15 +0100 Subject: [PATCH 6/9] Fixes --- docs/proposals/trial-custom-crd.md | 56 ++++++++++++++---------------- 1 file changed, 27 insertions(+), 29 deletions(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index 808a20adc9c..db5463c60f1 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -23,43 +23,42 @@ Created by [doctoc](https://github.com/thlorenz/doctoc). ## Motivation -Running trial is one of the essential step of running Katib experiments. -We implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202) +Running trial is one of the essential steps of executing Katib experiments. +We have implemented new trial template design in Katib v1beta1 ([katib/pull#1202](https://github.com/kubeflow/katib/pull/1202) and [katib/pull#1215](https://github.com/kubeflow/katib/pull/1215)) to make experiments valid YAML and make Katib more Kubernetes native. After migrating to the new API, users still can run only [BatchJob](https://kubernetes.io/docs/concepts/workloads/controllers/job/), -[TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) as a trial job. -If we would like to support new CRD, we need to manually change Katib controller source code. +[TFJob](https://github.com/kubeflow/tf-operator) or [PyTorchJob](https://github.com/kubeflow/pytorch-operator) in trial job. +If we want to support new CRD, we need to manually change Katib controller source code. -This approach makes impossible to use other CRDs in trial template, even if they can satisfies trial job design. -Number of various Kubernetes CRDs grows significantly and many users would like to use them in Katib -(e.g, [katib/issue#1081, support Argo Workflow)](https://github.com/kubeflow/katib/issues/1081)). -Another reason to design unify approach, is that CRD controller can have go package dependencies versions -that Katib controller doesn't support and make it not possible to build the controller -(e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)). +This approach makes impossible to use other CRDs in trial template, even if they satisfy trial job design. +The number of various Kubernetes CRDs grows significantly and many users would like to use them in Katib +(e.g, [katib/issue#1081, support Argo Workflow](https://github.com/kubeflow/katib/issues/1081)). +Another reason to design unified approach is that CRD controller can have `Go` package versions +that Katib controller doesn't support (e.g, [katib/issue#1081](https://github.com/kubeflow/katib/issues/1081#issuecomment-635338276)). -Thus we propose new controller design to support custom CRD as Trial job and make Katib usable for various Kubernetes resources. -To make this possible, we make changes in API, trial controller, job provider, mutation webhook, metrics collector. +That is why we propose a new controller design to support custom CRD in trial job and make Katib usable for various Kubernetes resources. +To make this possible, we are changing API, trial controller, job provider, mutation webhook, metrics collector. ## Goals -1. Watch for the custom CRD in trial controller. +1. Allow dynamic watchers for the custom CRD. 2. Inject Katib sidecar container on training pod. 3. Indicate training container for metrics collector execution. 4. Run metrics collector parser after all pod processes completion. -5. Get succeeded status of running CRD. +5. Get succeeded condition of running CRD. 6. Verify that `sidecar.istio.io/inject: false` label is added. ## Non-Goals -1. Support to inject Katib sidecar container on more than one pod simultaneously. -2. Support list of succeeded conditions for the custom CRD. +1. Inject Katib sidecar container on more than one pod simultaneously. +2. Specify list of succeeded conditions for the custom CRD. 3. Dynamically add new trial watcher for the custom CRD without Katib restart. ## Implementation During implementation this feature, we should not brake current Katib controller logic. -And we need to make sure that CI is stable and it not blocks other Katib work tasks. +Also, we need to make sure that CI is stable and it does not block other Katib work tasks. After completion, we can clean-up redundant code. ### API @@ -93,14 +92,14 @@ type TrialTemplate struct { ### Trial controller watchers -Currently, trial controller watches for the +In the current design trial controller watches [three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). -To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`) -to Katib controller, which represents resources that can be used in Trial template. -This flag contains custom CRD `Group`, `Version`, `Kind` in format `kind.version.group` which needs to create controller watchers. -Trial controller iterates over these params and create watchers. +We add additional flag (`-trial-resource`) +to Katib controller, which represents resources that can be used in trial template, to generate these parameters dynamically when Katib starts. +This flag contains custom CRD's `Group`, `Version`, `Kind` in `kind.version.group` format which needs to create controller watchers. +Trial controller iterates over these parameters and creates watchers. -For example, if Trial can run TFJob, Argo Workflow and Kubernetes Batch Jobs, Katib controller flags must be: +For example, if trial can run TFJob, Argo Workflow and Kubernetes Batch Jobs, Katib controller flags must be: ```yaml . . . @@ -117,8 +116,7 @@ args: Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72) metrics collector for TFJob and PyTorchJob only for _master_ pods with labels previously saved in controller constants. -To find primary pod that needs to be injected by Katib sidecar container, -we added new `PrimaryPodLabel` parameter in `TrialTemplate` API. +We added a new `PrimaryPodLabel` parameter in `TrialTemplate` API, to find primary pod that needs to be injected by Katib sidecar container. User can define the key and value of the pod label which Katib must inject with sidecar container. For example, for TFJob: @@ -132,10 +130,10 @@ PrimaryPodLabel: ### Training container name -In current design, to find pod container where actual training is happening and metrics collector must parse metrics, we compare container name with +In the current design, to find pod container where actual training is happening and metrics collector must parse metrics, we compare container name with [default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob. -To find proper training container, we introduce new `PrimaryContainerName` field, where user can set container name with running training program. +We introduce new `PrimaryContainerName` field, where user can set container name with running training program, to find proper training container. For example, if training is running on container with `pytorch` name: @@ -154,7 +152,7 @@ That can avoid problems with other sidecar containers that various CRD can have. We need to verify that [distributive training](https://docs.fast.ai/distributed.html#launch-your-training) with more than one active process also works with this approach. -### Succeeded status of running CRD +### Succeeded condition of running CRD We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60) to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by @@ -163,7 +161,7 @@ to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by Different CRD can have unique status design (e.g, Kubernetes batch job succeeded status is [`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). -To get CRD succeeded status value and trigger trial controller, we add new parameters `SucceededCondition`. +We add new parameters `SucceededCondition`, to get CRD succeeded status value and trigger trial controller. Trial controller checks all running job conditions and verifies that running job has appropriate `type` in `.status.conditions` with `status=True`. We also should transform `reason` and `message`, if it is available, to trial conditions. From b2fc7d92ce98a25d37b2b741aa7fa35c0b9f66cd Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 05:15:07 +0100 Subject: [PATCH 7/9] Change doc --- docs/proposals/trial-custom-crd.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index db5463c60f1..17c039b3cfd 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -94,8 +94,8 @@ type TrialTemplate struct { In the current design trial controller watches [three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125). -We add additional flag (`-trial-resource`) -to Katib controller, which represents resources that can be used in trial template, to generate these parameters dynamically when Katib starts. +To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`) +to Katib controller, which represents resources that can be used in trial template. This flag contains custom CRD's `Group`, `Version`, `Kind` in `kind.version.group` format which needs to create controller watchers. Trial controller iterates over these parameters and creates watchers. @@ -130,8 +130,9 @@ PrimaryPodLabel: ### Training container name -In the current design, to find pod container where actual training is happening and metrics collector must parse metrics, we compare container name with -[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob. +In the current design, we compare container name with +[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob, +to find pod container where actual training is happening and metrics collector must parse metrics. We introduce new `PrimaryContainerName` field, where user can set container name with running training program, to find proper training container. From 67156c63a70c7453063f55eeaedb2ebb25074290 Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 05:18:18 +0100 Subject: [PATCH 8/9] Remove comma --- docs/proposals/trial-custom-crd.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index 17c039b3cfd..554833d4c9c 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -116,7 +116,7 @@ args: Right now, we [inject](https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L58-L72) metrics collector for TFJob and PyTorchJob only for _master_ pods with labels previously saved in controller constants. -We added a new `PrimaryPodLabel` parameter in `TrialTemplate` API, to find primary pod that needs to be injected by Katib sidecar container. +We added a new `PrimaryPodLabel` parameter in `TrialTemplate` API to find primary pod that needs to be injected by Katib sidecar container. User can define the key and value of the pod label which Katib must inject with sidecar container. For example, for TFJob: @@ -130,11 +130,11 @@ PrimaryPodLabel: ### Training container name -In the current design, we compare container name with -[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob, +In the current design we compare container name with +[default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob to find pod container where actual training is happening and metrics collector must parse metrics. -We introduce new `PrimaryContainerName` field, where user can set container name with running training program, to find proper training container. +We introduce new `PrimaryContainerName` field, where user can set container name with running training program to find proper training container. For example, if training is running on container with `pytorch` name: @@ -162,10 +162,10 @@ to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by Different CRD can have unique status design (e.g, Kubernetes batch job succeeded status is [`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). -We add new parameters `SucceededCondition`, to get CRD succeeded status value and trigger trial controller. +We add new parameters `SucceededCondition` to get CRD succeeded status value and trigger trial controller. Trial controller checks all running job conditions and verifies that running job has appropriate `type` in `.status.conditions` with `status=True`. -We also should transform `reason` and `message`, if it is available, to trial conditions. +We also should transform `reason` and `message`, if it is available to trial conditions. For example for TFJob: From 96d4e540afb2a3854cf1732eb256732d4b1f5a52 Mon Sep 17 00:00:00 2001 From: avelichk Date: Sat, 18 Jul 2020 05:36:07 +0100 Subject: [PATCH 9/9] Fix Implementation --- docs/proposals/trial-custom-crd.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/trial-custom-crd.md index 554833d4c9c..e3e5082a15d 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/trial-custom-crd.md @@ -134,7 +134,7 @@ In the current design we compare container name with [default value](https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L63-L78) for TFJob and PyTorchJob to find pod container where actual training is happening and metrics collector must parse metrics. -We introduce new `PrimaryContainerName` field, where user can set container name with running training program to find proper training container. +We introduce a new `PrimaryContainerName` field, where user can set container name with running training program to find proper training container. For example, if training is running on container with `pytorch` name: @@ -147,7 +147,7 @@ PrimaryContainerName: "pytorch" ### Start metrics collector parser As discussed in [katib/issue#1214](https://github.com/kubeflow/katib/issues/1214#issuecomment-642168716), -metrics collector must start parsing metrics only after all injected pod processes are finished. +metrics collector starts parsing metrics only after all injected pod processes were finished. That can avoid problems with other sidecar containers that various CRD can have. We need to verify that [distributive training](https://docs.fast.ai/distributed.html#launch-your-training) @@ -156,18 +156,18 @@ with more than one active process also works with this approach. ### Succeeded condition of running CRD We have already [designed Kubeflow provider](https://github.com/kubeflow/katib/blob/master/pkg/job/v1alpha3/kubeflow.go#L27-L60) -to check succeeded status for TFJob and PyTorchJob as `unstructured` objects by +to check succeeded conditions for the TFJob and PyTorchJob as `unstructured` objects by [comparing](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller_util.go#L161) `.status.conditions[x].type` value with `Succeeded` value. -Different CRD can have unique status design (e.g, Kubernetes batch job succeeded status is +Different CRD can have unique status design (e.g, Kubernetes batch job succeeded condition is [`Complete`](https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L167-L173)). -We add new parameters `SucceededCondition` to get CRD succeeded status value and trigger trial controller. +We add a new parameter `SucceededCondition` to get CRD succeeded condition value and trigger trial controller. Trial controller checks all running job conditions and verifies that running job has appropriate `type` in `.status.conditions` with `status=True`. -We also should transform `reason` and `message`, if it is available to trial conditions. +We also should transform `reason` and `message` from custom CRD to the trial conditions, if it is available. -For example for TFJob: +For example, for TFJob: ```yaml . . . @@ -178,12 +178,12 @@ SucceededCondition: Succeeded ### Istio sidecar container Previously, we had problems with Istio sidecar containers, -see [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742). +check [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742). In some cases, it is unable to properly download datasets in training pod. It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job in Katib controller. -Various CRD can have unify design and it is hard to understand where annotation must be specified +Various CRD can have unified design and it is hard to understand where annotation must be specified to disable Istio injection for the running pods. -We should manually update all Katib examples and add this annotation to every trial template. +We need to update all Katib examples manually and add this annotation to every trial template. -This exception must be documented and new Katib examples must include this annotation in templates. +This exception has to be documented and new Katib examples have to include this annotation in templates.