Proposal: Support custom CRD in Trial Job #1273
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: nielsmeima, terrykong. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
/lgtm
I think the new design is supposed to replace the current jobProvider interface. With the new design, we can support arbitrary CRDs without adding Go code (implementing/registering a new jobProvider).
It looks fantastic to me, except for how we would support Provider.MutateJob under the new design.
In the current design trial controller watches
[three supported resource](https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L94-L125).
To generate these parameters dynamically when Katib starts, we add additional flag (`-trial-resource`)
Maybe we can use a ConfigMap to define these trial resources?
Currently, in the katib-config ConfigMap we can set only Suggestion and Metrics collector settings.
Also, we add the Watch only when the controller starts: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/trial/trial_controller.go#L79.
I think that later, when we implement dynamic watcher updates, we can think about a better design for how to send these resources to the controller.
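As a rough illustration of the `-trial-resource` flag discussed in this thread, here is a minimal sketch of collecting a repeatable flag into a list of resource kinds at controller start-up. The `Kind.version.group` value layout, the `trialResources` type, and the `parseTrialResource` helper are assumptions made for this example and are not taken from the proposal; `GroupVersionKind` is a local stand-in for the apimachinery type so that the sketch needs only the standard library.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// GroupVersionKind is a local stand-in for schema.GroupVersionKind from
// k8s.io/apimachinery, so that this sketch has no external dependencies.
type GroupVersionKind struct {
	Group, Version, Kind string
}

// trialResources (hypothetical name) collects every occurrence of the flag.
type trialResources []GroupVersionKind

// String implements flag.Value.
func (t *trialResources) String() string {
	if t == nil {
		return ""
	}
	return fmt.Sprint(*t)
}

// Set implements flag.Value and is called once per flag occurrence.
func (t *trialResources) Set(value string) error {
	gvk, err := parseTrialResource(value)
	if err != nil {
		return err
	}
	*t = append(*t, gvk)
	return nil
}

// parseTrialResource splits "Kind.version.group" into its parts.
// The layout is an assumption for this sketch, not the proposal's format.
func parseTrialResource(value string) (GroupVersionKind, error) {
	parts := strings.SplitN(value, ".", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return GroupVersionKind{}, fmt.Errorf("invalid trial resource %q, expected Kind.version.group", value)
	}
	return GroupVersionKind{Kind: parts[0], Version: parts[1], Group: parts[2]}, nil
}

func main() {
	var resources trialResources
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	fs.Var(&resources, "trial-resource", "resource kind the trial controller should watch (repeatable)")

	// Example arguments; a real binary would pass os.Args[1:] instead.
	args := []string{"-trial-resource=Job.v1.batch", "-trial-resource=TFJob.v1.kubeflow.org"}
	if err := fs.Parse(args); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, gvk := range resources {
		// A real controller would register a watch for each kind here.
		fmt.Printf("watch group=%q version=%q kind=%q\n", gvk.Group, gvk.Version, gvk.Kind)
	}
}
```

Because the list is read only once during start-up, this matches the constraint mentioned above that watches are added only when the controller starts. Reading the same list from a ConfigMap would only change where the strings come from.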
TrialParameters []TrialParameterSpec `json:"trialParameters,omitempty"`

// Label that determines if pod needs to be injected by Katib sidecar container
PrimaryPodLabel map[string]string `json:"primaryPodLabel,omitempty"`
I am curious how these extra fields will be filled.
In this design it will look like this:
PrimaryPodLabel:
"label-key": "label-value"
Not sure if it is the best design.
We can follow the same API as metricStrategies:
. . .
PrimaryPodLabel *PrimaryPodLabel
. . .
type PrimaryPodLabel struct {
Name string `json:"name,omitempty"`
Value string `json:"value,omitempty"`
}
Does that make sense, given that we can currently set only one label?
WDYT @gaocegege @johnugeorge ?
I think we should provide a map or a slice here.
A map looks good to me, since it follows the k8s way: https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/types.go#L226-L231.
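To make the map option concrete, here is a minimal sketch of how a `map[string]string` primary pod label could be matched against a pod's labels before deciding on sidecar injection. The `isPrimaryPod` helper is hypothetical and not part of the proposal. Treating an empty map as "every pod matches" is an assumption borrowed from how empty Kubernetes label selectors behave.

```go
package main

import "fmt"

// isPrimaryPod (hypothetical helper) reports whether podLabels contains every
// key/value pair listed in primaryPodLabels.
// Assumption: an empty primaryPodLabels map matches every pod.
func isPrimaryPod(podLabels, primaryPodLabels map[string]string) bool {
	for key, want := range primaryPodLabels {
		got, ok := podLabels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Value a user might set in the trial template under this design.
	primary := map[string]string{"label-key": "label-value"}

	master := map[string]string{"label-key": "label-value", "app": "training"}
	worker := map[string]string{"label-key": "other-value", "app": "training"}

	fmt.Println("inject into master:", isPrimaryPod(master, primary))
	fmt.Println("inject into worker:", isPrimaryPod(worker, primary))
}
```

With a map, requiring more than one label later needs no API change, which is the extensibility argument made in this thread. A single name/value struct would have to be replaced by a slice to allow that.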
Thanks, yes we should refactor @sperlingxx Do we have a use-case when JobLevel injection might be useful? Also, |
@andreyvelich In fact, we use |
In that case, I think we can follow 2 ways:
Do you have any other ideas @sperlingxx ? |
@andreyvelich Could you present this at an upcoming community meeting and we can do a design review? |
Sure, thanks @jlewi. |
I prefer the second way, which seems to be more extensible. I suppose the dynamic provider will be the |
Thanks @andreyvelich. A couple of questions:
Have you asked users of Katib for feedback?
How does this proposal compare to the approach Tekton is taking with custom tasks?
Why are the Katib controllers getting involved here? Could a user control this by directly setting labels on their resource? E.g., for TFJob they could add the labels to the PodTemplateSpec in TFJob.
Would the proposal be different if we (Kubeflow/Kubernetes, whatever) had a first-class concept of inputs and outputs?
Thanks for the review @jlewi.
We have a couple of requests to support Argo and Kubeflow operators. @czheng94 has many different CRDs that his team wants to use in Katib. The idea behind this proposal is definitely in high demand.
From my understanding, the author of a custom task controller is responsible for processing the ref:
apiVersion: example.dev/v0
kind: Example
name: my-example
Here, in contrast, the trial controller will be responsible for submitting custom jobs, and the user doesn't need to modify source code or add watchers for the Katib Trial CR.
You are right. This is exactly what I want to say in this proposal. User has to specify this annotation in TrialSpec.
Yes, I think that can simplify things, because the metrics collector wouldn't need to parse StdOut/File for the training container and metrics could be pushed directly to the DB. However, we have various metrics collectors (e.g. TFEvent) for different kinds of metrics, and this KRM should handle all of that functionality for custom user jobs. Also, we should think about how to implement Early Stopping (#692) in that approach without the Katib sidecar container and without a Katib SDK for early stopping like the ones other optimization frameworks use.
Thanks. I don't consider myself an approver; just a passer by so this PR can be merged whenever the approvers LGTM it.
I believe the way custom tasks work is analogous to how Tasks and TaskRuns work today in Tekton.
@jlewi If the pipeline contains only a reference to a template and input params, where do you define the template specification? I haven't seen examples in Tekton where the template spec is defined, or maybe I misunderstand something?
@gaocegege @johnugeorge @sperlingxx Do you have any other comments on this proposal, or can we merge it and start the implementation?
It looks good to me.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
See comment: #1214 (comment).
I added a proposal for supporting any kind of CRD in the Trial Spec.
Please take a look @gaocegege @johnugeorge @czheng94
/cc @sperlingxx @jlewi @nielsmeima @terrykong