Proposal: Support custom CRD in Trial Job #1273

andreyvelich · 2020-07-18T04:36:48Z

See comment: #1214 (comment).
I added a proposal for supporting any kind of CRD in the Trial Spec.

Please take a look @gaocegege @johnugeorge @czheng94

/cc @sperlingxx @jlewi @nielsmeima @terrykong

k8s-ci-robot · 2020-07-18T04:36:52Z

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: nielsmeima, terrykong.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

gaocegege

/lgtm

sperlingxx

I think the new design is supposed to replace the current jobProvider interface. With the new design, we can support arbitrary CRDs without adding Go code (implementing/registering a new jobProvider).
It looks fantastic to me, except for how we support Provider.MutateJob under the new design.

sperlingxx · 2020-07-20T02:10:12Z

docs/proposals/trial-custom-crd.md

+
+In the current design trial controller watches
+[three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125).
+To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`)


Maybe we can use a ConfigMap to define these trial-resources?

andreyvelich

Currently, in the katib-config ConfigMap we can set only Suggestion and Metrics Collector settings.
Also, we add the watches only when the controller starts: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L79.
I think later, when we implement dynamic watcher updates, we can think about a better design for how to send these resources to the controller.
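For illustration, a minimal sketch of how a repeatable `-trial-resource` flag could be parsed when the controller starts. The value format `<kind>.<version>.<group>`, the `trialResource` type, and the `parseTrialResources` helper are assumptions made for this example; the thread does not specify them, and this is not Katib's actual implementation.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// trialResource identifies one resource kind the trial controller should watch.
// Hypothetical type for this sketch, not part of Katib's API.
type trialResource struct {
	Kind, Version, Group string
}

// trialResources implements flag.Value so the flag can be given several times.
type trialResources []trialResource

func (r *trialResources) String() string { return fmt.Sprint(*r) }

// Set parses one flag value of the assumed form <kind>.<version>.<group>.
func (r *trialResources) Set(value string) error {
	parts := strings.SplitN(value, ".", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return fmt.Errorf("invalid trial resource %q, want <kind>.<version>.<group>", value)
	}
	*r = append(*r, trialResource{Kind: parts[0], Version: parts[1], Group: parts[2]})
	return nil
}

// parseTrialResources collects every -trial-resource value from args.
func parseTrialResources(args []string) ([]trialResource, error) {
	var resources trialResources
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the output in this sketch
	fs.Var(&resources, "trial-resource", "resource to watch; may be repeated")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return resources, nil
}

func main() {
	resources, err := parseTrialResources([]string{
		"-trial-resource=Job.v1.batch",
		"-trial-resource=TFJob.v1.kubeflow.org",
	})
	if err != nil {
		panic(err)
	}
	for _, r := range resources {
		fmt.Printf("watch kind=%s version=%s group=%s\n", r.Kind, r.Version, r.Group)
	}
}
```

Because the values are read once at startup, this matches the constraint mentioned above that watches are added only when the controller starts.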

sperlingxx · 2020-07-20T02:16:32Z

docs/proposals/trial-custom-crd.md

+  TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`
+
+  // Label that determines if pod needs to be injected by Katib sidecar container
+  PrimaryPodLabel map[string]string `json:"primaryPodLabel,omitempty"`


I am curious how these extra fields will be filled.

andreyvelich

In this design it will look like this:

PrimaryPodLabel: "label-key": "label-value"

Not sure if it is the best design.
We can follow the same API as metricStrategies:

    . . .
    PrimaryPodLabel *PrimaryPodLabel
    . . .

    type PrimaryPodLabel struct {
        Name  string `json:"name,omitempty"`
        Value string `json:"value,omitempty"`
    }

Does that make sense if we can currently set only one label?

WDYT @gaocegege @johnugeorge ?

gaocegege

I think we should provide a map or a slice here.

andreyvelich

A map looks good to me, since it follows the Kubernetes way: https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/types.go#L226-L231.
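As a rough sketch of what the map form implies for sidecar injection, a pod could be treated as primary when its labels contain every key/value pair in `primaryPodLabel`. The `isPrimaryPod` helper, the sample labels, and the choice to match nothing when the map is empty are assumptions for this example, not behavior taken from the proposal.

```go
package main

import "fmt"

// isPrimaryPod reports whether podLabels contains every key/value pair
// listed in primaryPodLabels. Hypothetical helper for illustration only.
func isPrimaryPod(podLabels, primaryPodLabels map[string]string) bool {
	if len(primaryPodLabels) == 0 {
		// Assumption: with no labels configured, no pod is selected.
		return false
	}
	for key, want := range primaryPodLabels {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	primary := map[string]string{"job-role": "master"}
	master := map[string]string{"job-role": "master", "app": "training"}
	worker := map[string]string{"job-role": "worker", "app": "training"}

	fmt.Println("master is primary:", isPrimaryPod(master, primary))
	fmt.Println("worker is primary:", isPrimaryPod(worker, primary))
}
```

With a map, requiring a second label later is only a data change, whereas the single name/value struct would need an API change.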

andreyvelich · 2020-07-20T12:15:23Z

> I think the new design is supposed to replace the current jobProvider interface. With the new design, we can support arbitrary CRDs without adding Go code (implementing/registering a new jobProvider).
> It looks fantastic to me, except for how we support Provider.MutateJob under the new design.

Thanks. Yes, we should refactor jobProvider, and later we can leave only one unified provider for all CRDs.

@sperlingxx Do we have a use case where Job-level injection might be useful?
I believe Job-level injection was introduced here: https://github.com/kubeflow/katib/blob/master/docs/proposals/metrics-collector.md#mutating-webhook.
I can't see that we use ObjectSelector for the injection webhook: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L103-L114.

Also, MutateJob is currently empty for the Kubeflow and Job providers: https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L80-L82 and
https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/job.go#L72-L74.

sperlingxx · 2020-07-20T13:15:51Z

@andreyvelich In fact, we use MutateJob to do some adaptation work in our organization. It seems Job-level injection can do the same job.

andreyvelich · 2020-07-20T13:33:26Z

In that case, I think we have two options:

1. Add Mutate and Create functions to our dynamic provider and leave them empty. If an end user needs to do some Job-level mutation, they can modify those functions manually.
2. Create two providers: default and custom. The default provider does not need Job-level mutation, but in the custom provider the user can modify the Mutate function in the controller. Add a new flag to the Katib controller that indicates which provider to use.

Do you have any other ideas @sperlingxx ?
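The second option could look roughly like the sketch below. Everything here is hypothetical and simplified for illustration: the `provider` interface models only the Mutate hook, `job` stands in for an unstructured Kubernetes object, and the `-job-provider` flag name and the label written by the custom provider are invented for the example.

```go
package main

import (
	"flag"
	"fmt"
)

// job is a stand-in for an unstructured Kubernetes object.
type job map[string]interface{}

// provider is a hypothetical interface that models only the Mutate hook.
type provider interface {
	Mutate(j job) error
}

// defaultProvider performs no Job-level mutation.
type defaultProvider struct{}

func (defaultProvider) Mutate(job) error { return nil }

// customProvider is where an organization could place its own Job-level changes.
type customProvider struct{}

func (customProvider) Mutate(j job) error {
	// Example mutation (an assumption): add a label under metadata.labels.
	meta, ok := j["metadata"].(map[string]interface{})
	if !ok {
		meta = map[string]interface{}{}
		j["metadata"] = meta
	}
	labels, ok := meta["labels"].(map[string]interface{})
	if !ok {
		labels = map[string]interface{}{}
		meta["labels"] = labels
	}
	labels["example.com/mutated-by"] = "custom-provider"
	return nil
}

// newProvider selects a provider by name, as a controller flag might.
func newProvider(name string) (provider, error) {
	switch name {
	case "default":
		return defaultProvider{}, nil
	case "custom":
		return customProvider{}, nil
	default:
		return nil, fmt.Errorf("unknown provider %q", name)
	}
}

func main() {
	name := flag.String("job-provider", "default", "provider to use: default or custom (hypothetical flag)")
	flag.Parse()

	p, err := newProvider(*name)
	if err != nil {
		panic(err)
	}
	j := job{"kind": "TFJob"}
	if err := p.Mutate(j); err != nil {
		panic(err)
	}
	fmt.Println(j)
}
```

In this shape the default path stays free of Job-level mutation, and custom behavior is isolated in one type that users can edit or replace.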

jlewi · 2020-07-20T14:03:07Z

@andreyvelich Could you present this at an upcoming community meeting and we can do a design review?

andreyvelich · 2020-07-20T15:34:35Z

Sure, thanks @jlewi.

sperlingxx · 2020-07-21T01:27:32Z

I prefer the second way, which seems to be more extensible. I suppose the dynamic provider will be the default one, and we also support adding custom providers. I am not sure whether I have the right understanding.

jlewi · 2020-07-21T14:40:14Z

Thanks @andreyvelich A couple questions

Have you asked users of Katib for feedback?

How does this proposal compare to the approach Tekton is taking with custom tasks?

> Verify that sidecar.istio.io/inject: false label is added.

Why are Katib controllers getting involved here? Could a user control this by directly setting labels on their resource? E.g. for TFJob they could add the labels to the PodTemplateSpec in TFJob.

Would the proposal be different if we (Kubeflow/Kubernetes whatever) had a first class concept of inputs and outputs?

The Kubernetes resource model is defined here
The fact that all Kubernetes objects have those fields enables a lot of functionality
- e.g. the reason you can do bulk operations (kubectl apply -f ${DIR}) is because every YAML has apiVersion and kind so kubectl knows what type of object every object is
What if we defined an extension of the KRM; call it KFRM which included fields like inputs and outputs so that outputs could be used to report things like metrics. Would that simplify things?

andreyvelich · 2020-07-21T20:37:35Z

Thanks for the review @jlewi.

> Thanks @andreyvelich A couple questions
>
> Have you asked users of Katib for feedback?

We have a couple of requests to support Argo and Kubeflow operators. @czheng94 has many different CRDs that his team wants to use in Katib. The idea behind this proposal is definitely in high demand.

> How does this proposal compare to the approach Tekton is taking with custom tasks?

From my understanding, the custom task controller author is responsible for processing the Run object and then executing their custom resource. The user's controller should watch for the Run object.
I didn't find in the Tekton proposal that the user can specify the whole custom specification in the reference, only the type of object:

ref:
    apiVersion: example.dev/v0
    kind: Example
    name: my-example

In contrast, here the trial controller will be responsible for submitting custom jobs, and the user doesn't need to modify source code or add watchers for the Katib Trial CR.
Also, we have to properly inject the Katib sidecar container, which is the Katib controller's responsibility.

> Verify that sidecar.istio.io/inject: false label is added.
>
> Why are Katib controllers getting involved here? Could a user control this by directly setting labels on their resource? E.g. for TFJob they could add the labels to the PodTemplateSpec in TFJob.

You are right. This is exactly what I meant in this proposal: the user has to specify this annotation in the TrialSpec.

> Would the proposal be different if we (Kubeflow/Kubernetes whatever) had a first class concept of inputs and outputs?
>
> The Kubernetes resource model is defined here
>
> The fact that all Kubernetes objects have those fields enables a lot of functionality
>
> - e.g. the reason you can do bulk operations (kubectl apply -f ${DIR}) is because every YAML has apiVersion and kind so kubectl knows what type of object every object is
>
> What if we defined an extension of the KRM; call it KFRM which included fields like inputs and outputs so that outputs could be used to report things like metrics. Would that simplify things?

Yes, I think that could simplify things, because the metrics collector would not need to parse StdOut or a file from the training container, and metrics could be pushed directly to the DB. However, we have several metrics collectors (e.g. TFEvent) for different kinds of metrics.

This KRM extension would have to handle all of this functionality for users' custom jobs.

Also, we should think about how we can implement Early Stopping (#692) in that approach without the Katib sidecar container and without a Katib SDK for early stopping, such as other optimization frameworks use.

jlewi · 2020-07-22T23:53:07Z

Thanks. I don't consider myself an approver; just a passer by so this PR can be merged whenever the approvers LGTM it.

> From my understanding, the custom task controller author is responsible for processing the Run object and then executing their custom resource. The user's controller should watch for the Run object.

I believe the way custom tasks work is analogous to how Tasks and TaskRuns work today in Tekton:

- Your pipeline contains references to a resource that acts as a template; e.g. we might have TFJobTemplate
  - The template would define inputs and outputs to be passed at runtime
- At runtime Tekton creates an instance of the run object, e.g. TFJobRun
  - A run is a combination of a template (e.g. TFJobTemplate) and specific values to be passed as inputs
  - The run controller then executes the job using those parameters; e.g. the run controller could substitute the parameters into the TFJobTemplate to create a TFJob
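The template-plus-run flow described above can be sketched as plain parameter substitution. This is an illustration only: the `${name}` placeholder syntax, the `renderTemplate` helper, and the abbreviated template text are assumptions for the example, and it does not reproduce Tekton's or Katib's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// tfJobTemplate is a deliberately abbreviated stand-in for a TFJobTemplate,
// not a complete TFJob specification.
const tfJobTemplate = `kind: TFJob
args: ["--lr=${lr}", "--layers=${layers}"]`

// renderTemplate replaces each ${name} placeholder with the run's input value.
// It returns an error if any placeholder is left unresolved.
func renderTemplate(template string, params map[string]string) (string, error) {
	pairs := make([]string, 0, 2*len(params))
	for name, value := range params {
		pairs = append(pairs, "${"+name+"}", value)
	}
	out := strings.NewReplacer(pairs...).Replace(template)
	if i := strings.Index(out, "${"); i >= 0 {
		return "", fmt.Errorf("unresolved placeholder near %q", out[i:])
	}
	return out, nil
}

func main() {
	// A "run" combines the template with specific input values.
	runInputs := map[string]string{"lr": "0.01", "layers": "3"}

	rendered, err := renderTemplate(tfJobTemplate, runInputs)
	if err != nil {
		panic(err)
	}
	fmt.Println(rendered)
}
```

A simplification to note: an input value that itself contains `${` would be reported as unresolved by this check.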

andreyvelich · 2020-08-06T12:03:04Z

> Your pipeline contains references to a resource that acts as a template; e.g. we might have TFJobTemplate

@jlewi If the pipeline contains only a reference to a template and input params, where do you define the template specification?
How can Tekton submit a Job at runtime if we pass only a reference to the object?

I haven't seen examples in Tekton where the template spec is defined, or maybe I misunderstand something?

andreyvelich · 2020-08-06T12:03:56Z

@gaocegege @johnugeorge @sperlingxx Do you have any other comments on this proposal, or can we merge it and start the implementation?

johnugeorge · 2020-08-13T14:46:03Z

It looks good to me.

@gaocegege

gaocegege · 2020-08-14T02:14:31Z

/lgtm

andreyvelich

/approve

k8s-ci-robot · 2020-08-14T14:39:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

andreyvelich added 9 commits July 18, 2020 02:56

- Add proposal for custom CRD in Trial Template (ce632c5)
- Fix (18b51fb)
- Modify doctoc (c8ea425)
- Doc fixes (fb0e862)
- Rename header (2bc51db)
- Fixes (224829b)
- Change doc (b2fc7d9)
- Remove comma (67156c6)
- Fix Implementation (96d4e54)

k8s-ci-robot requested review from sperlingxx and jlewi July 18, 2020 04:36

k8s-ci-robot added the size/L label Jul 18, 2020

gaocegege reviewed Jul 19, 2020

k8s-ci-robot assigned gaocegege Jul 19, 2020

k8s-ci-robot added the lgtm label Jul 19, 2020

sperlingxx reviewed Jul 20, 2020

jlewi mentioned this pull request Jul 23, 2020

feat(wg): Add WG Training kubeflow/community#356

Merged

jlewi removed their request for review August 10, 2020 14:25

andreyvelich commented Aug 14, 2020

k8s-ci-robot added the approved label Aug 14, 2020

k8s-ci-robot merged commit 051d1de into kubeflow:master Aug 14, 2020

andreyvelich mentioned this pull request Aug 20, 2020

Custom CRD: Set dynamic watch from controller flags #1302

Merged

andreyvelich mentioned this pull request Sep 2, 2020

Refactoring Supported Job List #1320

Closed

YuxiJin-tobeyjin mentioned this pull request Sep 23, 2020

feature: add support for mpijob in katib #1183

Closed

andreyvelich deleted the custom-crd-trial-proposal branch October 2, 2021 23:36
