Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Support custom CRD in Trial Job #1273

Merged
merged 9 commits into from
Aug 14, 2020

Conversation

andreyvelich
Copy link
Member

See comment: #1214 (comment).
I added proposal for supporting any kind of CRD in Trial Spec.

Please take a look @gaocegege @johnugeorge @czheng94

/cc @sperlingxx @jlewi @nielsmeima @terrykong

@k8s-ci-robot
Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: nielsmeima, terrykong.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

See comment: #1214 (comment).
I added proposal for supporting any kind of CRD in Trial Spec.

Please take a look @gaocegege @johnugeorge @czheng94

/cc @sperlingxx @jlewi @nielsmeima @terrykong

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubeflow-bot
Copy link

This change is Reviewable

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Member

@sperlingxx sperlingxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new design is supposed to replace current jobPrivider intereface. With the help of new design, we can support arbitrary CRDs without adding go codes (implementing/registering a new jobPrivider).
It looks fantastic to me except how we support Provider.MutateJob under new design.


In the current design trial controller watches
[three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125).
To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use configMap to define these trial-resources ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, in katib-config ConfigMap we can set only Suggestion and Metrics collector settings.
Also, we added Watch only when controller starts: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L79.
I think later when we implement dynamically wathers update, we can think about better design how we can send these resources to the Controller.

TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`

// Label that determines if pod needs to be injected by Katib sidecar container
PrimaryPodLabel map[string]string `json:"primaryPodLabel,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about how will these extra fields be filled ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this design it will look like this:

PrimaryPodLabel:
  "label-key": "label-value"

Not sure if it is the best design.
We can follow the same API as metricStrategies:

. . .
PrimaryPodLabel *PrimaryPodLabel
. . .

type PrimaryPodLabel struct {
	Name  string `json:"name,omitempty"`
	Value string `json:"value,omitempty"`
}

Does it make sense, if we have can set 1 label currently?

WDYT @gaocegege @johnugeorge ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should provide a map or a slice here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich
Copy link
Member Author

I think the new design is supposed to replace current jobPrivider intereface. With the help of new design, we can support arbitrary CRDs without adding go codes (implementing/registering a new jobPrivider).
It looks fantastic to me except how we support Provider.MutateJob under new design.

Thanks, yes we should refactor jobPrivider and later we can left only 1 unify provider for every CRDs.

@sperlingxx Do we have a use-case when JobLevel injection might be useful?
I believe Job-level injection was introduced here: https://github.com/kubeflow/katib/blob/master/docs/proposals/metrics-collector.md#mutating-webhook.
I can't see that we use ObjectSelector for injection webhook: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L103-L114.

Also, MutateJob for Kubeflow and Job providers are currently empty: https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L80-L82 and
https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/job.go#L72-L74.

@sperlingxx
Copy link
Member

I think the new design is supposed to replace current jobPrivider intereface. With the help of new design, we can support arbitrary CRDs without adding go codes (implementing/registering a new jobPrivider).
It looks fantastic to me except how we support Provider.MutateJob under new design.

Thanks, yes we should refactor jobPrivider and later we can left only 1 unify provider for every CRDs.

@sperlingxx Do we have a use-case when JobLevel injection might be useful?
I believe Job-level injection was introduced here: https://github.com/kubeflow/katib/blob/master/docs/proposals/metrics-collector.md#mutating-webhook.
I can't see that we use ObjectSelector for injection webhook: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L103-L114.

Also, MutateJob for Kubeflow and Job providers are currently empty: https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/kubeflow.go#L80-L82 and
https://github.com/kubeflow/katib/blob/master/pkg/job/v1beta1/job.go#L72-L74.

@andreyvelich In fact, we use MutateJob to do some adaption work in our origization. And it seems Job-Level injection can do the same job.

@andreyvelich
Copy link
Member Author

@andreyvelich In fact, we use MutateJob to do some adaption work in our origization. And it seems Job-Level injection can do the same job.

In that case, I think we can follow 2 ways:

  1. Add Mutate and Create functions to our dynamic provider and left it empty. If end-user needs to do some Job level mutation, he/she can modify that functions manually.

  2. Create 2 providers: default and custom. In default provider we don't need to add JobLevel mutation, but in custom user can modify Mutate function in controller. Add new flag to Katib controller, which indicate which provider you want to use.

Do you have any other ideas @sperlingxx ?

@jlewi
Copy link
Contributor

jlewi commented Jul 20, 2020

@andreyvelich Could you present this at an upcoming community meeting and we can do a design review?

@andreyvelich
Copy link
Member Author

@andreyvelich Could you present this at an upcoming community meeting and we can do a design review?

Sure, thanks @jlewi.

@sperlingxx
Copy link
Member

@andreyvelich In fact, we use MutateJob to do some adaption work in our origization. And it seems Job-Level injection can do the same job.

In that case, I think we can follow 2 ways:

  1. Add Mutate and Create functions to our dynamic provider and left it empty. If end-user needs to do some Job level mutation, he/she can modify that functions manually.
  2. Create 2 providers: default and custom. In default provider we don't need to add JobLevel mutation, but in custom user can modify Mutate function in controller. Add new flag to Katib controller, which indicate which provider you want to use.

Do you have any other ideas @sperlingxx ?

I prefer the second way, which seems to be more extensible. I suppose the dynamic provider will be the default one, and we also support adding custom providers. I am not sure whether I have the right understanding.

@jlewi
Copy link
Contributor

jlewi commented Jul 21, 2020

Thanks @andreyvelich A couple questions

Have you asked users of Katib for feedback?

How does this proposal compare to the approach Tekton is taking with custom tasks?

Verify that sidecar.istio.io/inject: false label is added.

Why is Katib cotrollers getting involved here? Could a user control this by directly setting labels on their resource? e.g. for TFJob they could add the labels to the PodTemplateSpec in TFJob?

Would the proposal be different if we (Kubeflow/Kubernetes whatever) had a first class concept of inputs and outputs?

  • The Kubernetes resource model is defined here
  • The fact that all Kubernetes objects have those fields enables a lot of functionality
    • e.g. the reason you can do bulk operations (kubectl apply -f ${DIR}) is because every YAML has apiVersion and kind so kubectl knows what type of object every object is
  • What if we defined an extension of the KRM; call it KFRM which included fields like inputs and outputs so that outputs could be used to report things like metrics. Would that simplify things?

@andreyvelich
Copy link
Member Author

Thanks for the review @jlewi.

Thanks @andreyvelich A couple questions

Have you asked users of Katib for feedback?

We have couple of requests to support Argo, Kubeflow operators. @czheng94 has many various CRD that his team want to use in Katib. Idea of this proposal is definitely on high demand.

How does this proposal compare to the approach Tekton is taking with custom tasks?

From my understanding, custom tasks controller author is responsible to proceed Run object and then execute their custom resource. User's controller should watch for the Run Object.
I didn't find in Tekton proposal that user can specify whole custom specification in Reference, only type of object:

ref:
    apiVersion: example.dev/v0
    kind: Example
    name: my-example

Otherwise, here trial controller will be responsible to submit custom jobs and user doesn't need to modify source code and add watchers for the Katib Trial CR.
As well, we have to properly inject Katib sidecar container which is Katib controller responsibility.

Verify that sidecar.istio.io/inject: false label is added.

Why is Katib cotrollers getting involved here? Could a user control this by directly setting labels on their resource? e.g. for TFJob they could add the labels to the PodTemplateSpec in TFJob?

You are right. This is exactly what I want to say in this proposal. User has to specify this annotation in TrialSpec.

Would the proposal be different if we (Kubeflow/Kubernetes whatever) had a first class concept of inputs and outputs?

  • The Kubernetes resource model is defined here

  • The fact that all Kubernetes objects have those fields enables a lot of functionality

    • e.g. the reason you can do bulk operations (kubectl apply -f ${DIR}) is because every YAML has apiVersion and kind so kubectl knows what type of object every object is
  • What if we defined an extension of the KRM; call it KFRM which included fields like inputs and outputs so that outputs could be used to report things like metrics. Would that simplify things?

Yes, I think that can simplify things, because metrics collector doesn't need to parse StdOut/File for the training container and metrics can be directly pushed to the DB, but we have various metrics collector (e.g. TFEvent) to get various metrics.

This KRM should handle all of these functionality for the custom user's jobs.

Also, we should think how we can implement Early Stopping (#692) in that approach without Katib sidecar container and without Katib SDK for early stopping, like in other opt framework.

@jlewi
Copy link
Contributor

jlewi commented Jul 22, 2020

Thanks. I don't consider myself an approver; just a passer by so this PR can be merged whenever the approvers LGTM it.

From my understanding, custom tasks controller author is responsible to proceed Run object and then execute their custom resource. User's controller should watch for the Run Object.

I believe the way custom tasks work is as analogous to how Tasks and TaskRuns work today in Tekton

  1. Your pipeline contains references to a resource that acts as a template; e.g. we might have TFJobTemplate

    • The template would define inputs and outputs to be passed at runtime
  2. At runtime Tekton creates an instance of the run object e.g. TFJobRun

    • A run is a combination of a Template (e.g. TFJobTemplate) and specific values to be passed to be passed as inputs
    • The run controller then executes the job using those paramters; e.g. the run controller could substitute the parameters into the TFJobTemplate and to create a TFJob

@andreyvelich
Copy link
Member Author

  1. Your pipeline contains references to a resource that acts as a template; e.g. we might have TFJobTemplate

@jlewi If pipeline contains only reference on a template and input params, where you define template specification?
How Tekton can submit Job in runtime, if we pass only Reference to the object?

I haven't seen examples in Tekton, when we define template spec or maybe I misunderstand something ?

@andreyvelich
Copy link
Member Author

andreyvelich commented Aug 6, 2020

@gaocegege @johnugeorge @sperlingxx Do you have any other comment on this proposal or we can merge it and start the implementation ?

@jlewi jlewi removed their request for review August 10, 2020 14:25
@johnugeorge
Copy link
Member

it looks good to me.

@gaocegege

@gaocegege
Copy link
Member

/lgtm

Copy link
Member Author

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants