-
Notifications
You must be signed in to change notification settings - Fork 459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Studyctl crd #141
Studyctl crd #141
Conversation
@gaocegege It looks so many files are changed. But almost are vendoring pkg. |
@YujiOshima Awesome work! Could you please split the commit? I think we could put all hand written code in one commit. Then it is friendly to reviewers. BTW, I will travel to Japan in 8.10-17. 😄 Then I will be offline in that time. |
@gaocegege I split it!
Great! Have a nice trip! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution! Generally LGTM with some nits.
cmd/studycontroller/main.go
Outdated
package main | ||
|
||
import ( | ||
"log" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recommend logrus or zap if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to separate the change introducing zap
from this PR.
I will use zap for all components.
names: | ||
kind: StudyController | ||
singular: studycontroller | ||
plural: studycontroller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the plural should be studycontrollers
spec: | ||
serviceAccountName: study-controller | ||
imagePullSecrets: | ||
- name: gitlabregcred |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary for the example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
examples/hypb-example.yaml
Outdated
@@ -0,0 +1,57 @@ | |||
apiVersion: "kubeflow.org/v1alpha1" | |||
kind: StudyController |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we name it Study?
I think studycontroller will confuse users. Our controller is also study controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to distinguish between resource and process.
The Study and Trial are resources that should be saved to DB persistently.
The Worker and StudyController(now) are a process of Trial and Study each other.
Since the process is ephemeral, I want to make them CRD.
I agree StudyController is confusing. We should rename it.
Do you have any idea for the name?
For example, Experiments, StudyJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed StudyController to StudyJob and StudyJob-Controller.
manifests/studycontroller/crd.yml
Outdated
names: | ||
kind: StudyController | ||
singular: studycontroller | ||
plural: studycontroller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maintain one copy for the crd yaml? I think we have two now.
pkg/apis/apis.go
Outdated
@@ -0,0 +1,32 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we name the package to api? I think it is a convension in k8s community. WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kubebuilder automatically generate apis dir. I will fix it.
// It is represented in RFC3339 form and is in UTC. | ||
LastReconcileTime *metav1.Time `json:"lastReconcileTime,omitempty"` | ||
|
||
State State `json:"state,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In k8s community I think we prefer conditions instead of phase. PTAL kubernetes/kubernetes#7856
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will fix it.
type WorkerSpec struct { | ||
Image string `json:"image,omitempty"` | ||
Command []string `json:"command,omitempty"` | ||
Gpu int `json:"gpu,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPU is better, IMO
return false | ||
} | ||
|
||
func (r *ReconcileStudyController) controllerloop(instance *katibv1alpha1.StudyController) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could add some comments for the func, since it is the main control loop.
@gaocegege Thank you for your review. I updated. PTAL. |
/retest |
2 similar comments
/retest |
/retest |
/retest |
@gaocegege I updated. Could you review it? |
I will do it today or tomorrow. |
Woo Hoo! A couple high level questions. I'll take a look at the code to see if I can answer these but it would be good to include the answers directly in the PR description as this would help speed up the review process. (There are so many files its crashing reviewable and I'm having a hard time looking at the code)
2.Will each new job type require explicit support in the controller?
My expectation is that the user provides the following
The CRD controller should create the job by getting a set of hyperparameters and substituting them into the template. So the user can use any K8s resource they want and use any features they want (e.g. PVs, environment variables, resource specs) just by supplying the appropriate template. |
/assign jlewi |
Looking at your example It looks like the WorkerSpec is how user specifies the job to launch. This seems limited
Why not
If we start with a config map then the CRD could just use the K8s client libraries to fetch the config and we don't need to worry about mounting PVs. What do you think? |
@jlewi Thanks for your comment.
The workerSpec limitation you pointed is right. And I agree it is inconvenient. |
Your CRD is defining a new type of object "WorkerSpec" Worker Spec: This is limiting and in my opinion limiting I'd like to be able to specify the fully spec for any K8s resource and the CRD controller should just create it. So the "value" of WorkerSpec should just be a K8s resource. The CRD can inspect metadata to figure out how to create it (i.e. API Endpoint). However, I need to provide a template not a fully specified K8s resource because we need to be able to substitute in the hyperparameters. So why I want to supply to the CRD controller is a template for K8s resource which will be filled in with the actual parameters. Right now you are basically defining your own template engine that is very limited. i.e. it makes assumptions about what type of object and only allows certain places for the parameters to be supplied. Those restrictions are unnecessary. We already have a variety of really powerful and well understood template engines (jinja, go templates, etc...) So I think the CRD should let users pick a supported template engine and provide a template that is parameterized with a set of named parameters. The CRD can then substitute in the parameters to create I don't care how exactly you provide the template. A ConfigMap was one suggestion. You could also just store the template in the spec as a string field. |
@jlewi Thanks. OK, I understand. |
@YujiOshima Can we use the same idea to generalize metrics collection? i.e as part of the StudyCrd controller I specify a resource (or service) to collect metrics. For example, for a TensorFlow job we have a couple options
I think a good starting point would be add a metricsCollector resource to the CRD. This could be a job that will get launched after the training job. This should be a job that reports metrics to Katib using the Katib API. |
@jlewi I try to implement the first step as below.
Why CronJob: Katib should collect metrics during training for Early stopping not only the evaluation step. After that, we can add more flexible metrics collecter for TF-Job or other specific Job. |
This sounds good to me but let me make sure I understand correctly.
A Report metrics API sounds great. Is that already in this PR? Can you provide a link to the API definition. Side note; I don't know that metrics should be reported by the operator. That would require us to build metrics reporting into the operator and that might be too limiting.
Why is it a CronJob? What is MetricsCollector doing? Is this just collecting metrics from stdout and reporting them via the API? Is the idea that StudJobController creates a MetricsCollector cron job for each HP tuning job. This job periodically collects the metrics and reports them via the ReportMetrics APi? |
Not yet. I will update soon.
Yes. CRD controller runs event-driven. I want to collect and report the metrics even no events happened. |
} | ||
} | ||
} | ||
return &pb.GetMetricsReply{MetricsLogSets: mls}, nil | ||
} | ||
|
||
func (s *server) ReportMetrics(ctx context.Context, in *pb.ReportMetricsRequest) (*pb.ReportMetricsReply, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this RPC new in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
if err != nil { | ||
return &pb.ReportMetricsReply{}, err | ||
} | ||
for _, mls := range in.MetricsLogSets { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this RPC taking as input the logs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked at the API looks like this is not the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Katib can store every log of metrics.
I would change the name of RPC to ReportMetricsLogs
.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. Lets not change it in this PR. If you think its worth changing lets file an issue and do it in a follow on PR.
string trial_id = 2; | ||
string runtime = 3; | ||
WorkerConfig worker_config = 4; | ||
message CreateWorkerReauest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the CreateWorkerRequest for? Isn't the StudyController creating the workers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #150
Would be good to document the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can store the worker info to DB and generate workerId by CreateWorker rpc.
I would change rename the RPC to RegisterWorker
.
Would be good to document the API.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. I don't think this needs to change in this PR. I'd suggest filing an issue if you think its worth fixing.
|
||
// StudyJobSpec defines the desired state of StudyJob | ||
type StudyJobSpec struct { | ||
StudySpec *StudySpec `json:"studySpec,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name StudySpec seems confusing; I would think the whole thing is the StudySpec.
What's the distinction between StudySpec and SuggestionSpec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SuggestionSpec is parameters that related to only suggestion service.
How about merge StudySpec to StudyJobSpec like below.
type StudyJobSpec struct {
Name string `json:"name,omitempty"`
Owner string `json:"owner,omitempty"`
OptimizationType OptimizationType `json:"optimizationtype,omitempty"`
OptimizationGoal *float64 `json:"optimizationgoal,omitempty"`
....
WorkerSpec *WorkerSpec `json:"workerSpec,omitempty"`
SuggestionSpec *SuggestionSpec `json:"suggestionSpec,omitempty"`
EarlyStoppingSpec *EarlyStoppingSpec `json:"earlyStoppingSpec,omitempty"`
MetricsCollectorSpec *MetricsCollectorSpec `json:"metricsCollectorSpec,omitempty"`
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
I reviewed the commits And left some comments. The 7727 files still make it really difficult to review everything; (e.g. because I can't use reviewable to track review threads). Should we just go ahead and submit this and fix issues in follow on PRs? |
Hmm, I thought It needs to specify the object type before loading the template. But it may not be necessary. I'm going to try. |
Do you mean we should delete vendored code from this repo as @vinaykakade said? Let's open new issue about this. |
@YujiOshima I'm pretty happy with this PR as is. Is there any work that you think should be accomplished in this PR as opposed to a follow on PR? e.g. should we try wait for a follow on PR to remove the type? This PR is already pretty massive so I think its better if we defer additional changes to a follow on PR. Hopefully, once we have the initial PR is committed; maybe we can split it up the remaining work more easily? /lgtm /hold |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I will work on a new PR. |
/cancel hold |
@YujiOshima - given that the public API will change with this PR, should we create a 0.2 branch from current master which will have stable 0.2 code, or check this in a 0.3 branch, or any other proposal you may have in mind to minimize the migration pain for current customers? Overall, would be great if we are thinking about customers who are dependent on current public API while working through backwards incompatible changes. |
@vinaykakade Thank you for your advice. |
https://github.com/kubeflow/katib/tree/0.2 I created one, thanks for the advice |
/retest |
I'll merge this manually. It looks like the reviewable status check is choking on the large PR size and that's blocking the merge. |
…ges. * Related to kubeflow#141 katib releaser * Related to kubeflow/kubeflow#1574 use prow to build our images * We are moving to using prow to run our release workflows and treating them just like regular workflows. * We are doing this because we need to get regular signal about whether the image builds are succeeding by running on postsubmit. * We also want to run them on presubmit so that we can verify any changes to the workflwo don't break the workflow. * For this reason we also want to move the workflows into the repository that contains the source code for the images being built rather than having them all in kubeflow/kubeflow.
* Related to kubeflow#141 katib releaser * Related to kubeflow/kubeflow#1574 use prow to build our images * We are moving to using prow to run our release workflows and treating them just like regular workflows. * We are doing this because we need to get regular signal about whether the image builds are succeeding by running on postsubmit. * We also want to run them on presubmit so that we can verify any changes to the workflwo don't break the workflow. * Rather than define a new workflow to build the images; we can just reuse the existing E2E workflow which already builds all the images. We just change postsubmit to push to kubeflow-images-public. * Delete the releaser app; we will just the existing E2E test workflow and have that push to gcr.io/kubeflow-images-public on postsubmit.
* Related to #141 katib releaser * Related to kubeflow/kubeflow#1574 use prow to build our images * We are moving to using prow to run our release workflows and treating them just like regular workflows. * We are doing this because we need to get regular signal about whether the image builds are succeeding by running on postsubmit. * We also want to run them on presubmit so that we can verify any changes to the workflwo don't break the workflow. * Rather than define a new workflow to build the images; we can just reuse the existing E2E workflow which already builds all the images. We just change postsubmit to push to kubeflow-images-public. * Delete the releaser app; we will just the existing E2E test workflow and have that push to gcr.io/kubeflow-images-public on postsubmit.
Add StudyController
CRD: studycontroller.kubeflow.org
Operator: StudyController
Update examples.
This implementation is polling workers status in go process of StudyController.
Though I understand this is not an elegant implementation, this is the least impact to existing codes.
Next step we should make worker CRD and its controller and support multi-type jobs (k8s, TF-Job..).
Assign @gaocegege
This change isdata:image/s3,"s3://crabby-images/d0bb7/d0bb7f7625ca5bf5c3cf7a2b7a514cf841ab8395" alt="Reviewable"