Studyctl crd #141

Merged
merged 9 commits into from
Aug 21, 2018

YujiOshima
Contributor

@YujiOshima YujiOshima commented Jul 26, 2018

Add StudyController
CRD: studycontroller.kubeflow.org
Operator: StudyController

Update examples.
This implementation polls worker status in the StudyController's Go process.
I understand this is not an elegant implementation, but it has the least impact on the existing code.

As a next step, we should add a Worker CRD and its controller, and support multiple job types (k8s Job, TF-Job, ...).
Assign @gaocegege
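
Editor's sketch: the polling approach described above can be illustrated roughly as follows. This is a minimal sketch and not the code from this PR; `WorkerState`, `statusFunc`, `pollWorker`, and `errGaveUp` are hypothetical names, and the status lookup is passed in as a function so the loop runs without a cluster.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// WorkerState is a hypothetical status value; the real Katib types may differ.
type WorkerState string

const (
	WorkerRunning   WorkerState = "Running"
	WorkerCompleted WorkerState = "Completed"
	WorkerFailed    WorkerState = "Failed"
)

// errGaveUp is returned when the worker is still running after maxPolls checks.
var errGaveUp = errors.New("worker did not finish within the polling budget")

// statusFunc stands in for whatever call returns a worker's current state.
type statusFunc func(workerID string) (WorkerState, error)

// pollWorker checks the worker up to maxPolls times, sleeping for interval
// between checks, and returns as soon as a terminal state is observed.
func pollWorker(workerID string, interval time.Duration, maxPolls int, getStatus statusFunc) (WorkerState, error) {
	for i := 0; i < maxPolls; i++ {
		state, err := getStatus(workerID)
		if err != nil {
			return "", err
		}
		if state == WorkerCompleted || state == WorkerFailed {
			return state, nil
		}
		time.Sleep(interval)
	}
	return WorkerRunning, errGaveUp
}

func main() {
	calls := 0
	// Fake status source: the worker finishes on the third check.
	fake := func(string) (WorkerState, error) {
		calls++
		if calls < 3 {
			return WorkerRunning, nil
		}
		return WorkerCompleted, nil
	}
	state, err := pollWorker("worker-1", 10*time.Millisecond, 10, fake)
	fmt.Println(state, err, calls)
}
```

A loop like this keeps a goroutine busy per study, which is the trade-off the description acknowledges; a dedicated Worker CRD with its own controller would replace the sleep with watch events.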



@YujiOshima
Contributor Author

YujiOshima commented Jul 26, 2018

@gaocegege It looks like many files are changed, but almost all of them are vendored packages.
The main changes for the controller are under pkg/apis and pkg/controller.
The getting-started doc is https://github.com/YujiOshima/hp-tuning/blob/studyctlCRD/examples/MinikubeDemo.md .
Please take a look.

@gaocegege
Member

@YujiOshima Awesome work!

Could you please split the commit? I think we could put all hand written code in one commit. Then it is friendly to reviewers.

BTW, I will travel to Japan on 8.10-17. 😄 I will be offline during that time.

@YujiOshima
Contributor Author

@gaocegege I split it!

> travel to Japan

Great! Have a nice trip!

Member

@gaocegege gaocegege left a comment

Thanks for your contribution! Generally LGTM with some nits.

package main

import (
"log"
Member

We recommend logrus or zap if you like.

Contributor Author

I want to separate the change introducing zap from this PR.
I will use zap for all components.

names:
kind: StudyController
singular: studycontroller
plural: studycontroller
Member

I think the plural should be studycontrollers

spec:
serviceAccountName: study-controller
imagePullSecrets:
- name: gitlabregcred
Member

Is it necessary for the example?

Contributor Author

I will fix it.

@@ -0,0 +1,57 @@
apiVersion: "kubeflow.org/v1alpha1"
kind: StudyController
Member

Could we name it Study?
I think studycontroller will confuse users. Our controller is also study controller.

Contributor Author

I want to distinguish between resource and process.
The Study and Trial are resources that should be saved to DB persistently.
The Worker and (for now) the StudyController are the processes for a Trial and a Study, respectively.
Since these processes are ephemeral, I want to make them CRDs.

I agree StudyController is confusing. We should rename it.
Do you have any idea for the name?
For example, Experiments, StudyJob.

Contributor Author

I renamed StudyController to StudyJob and StudyJob-Controller.

names:
kind: StudyController
singular: studycontroller
plural: studycontroller
Member

Could we maintain one copy for the crd yaml? I think we have two now.

pkg/apis/apis.go Outdated
@@ -0,0 +1,32 @@
/*
Member

Could we name the package api? I think it is a convention in the k8s community. WDYT

Contributor Author

kubebuilder automatically generates the apis dir. I will fix it.

// It is represented in RFC3339 form and is in UTC.
LastReconcileTime *metav1.Time `json:"lastReconcileTime,omitempty"`

State State `json:"state,omitempty"`
Member

In k8s community I think we prefer conditions instead of phase. PTAL kubernetes/kubernetes#7856
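
Editor's sketch: a conditions-based status, as suggested above, might look roughly like this. It is an illustration under assumptions, not the types merged in this PR; `Condition`, `ConditionType`, `StudyJobStatus`, `setCondition`, and `isTrue` are hypothetical names that loosely mirror the Kubernetes convention of a list of typed conditions with a "True"/"False"/"Unknown" status.

```go
package main

import (
	"fmt"
	"time"
)

// ConditionType and ConditionStatus are hypothetical; they follow the usual
// Kubernetes convention rather than the actual Katib API.
type ConditionType string
type ConditionStatus string

const (
	ConditionRunning   ConditionType = "Running"
	ConditionCompleted ConditionType = "Completed"

	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Condition records one observed aspect of the resource's state.
type Condition struct {
	Type               ConditionType   `json:"type"`
	Status             ConditionStatus `json:"status"`
	Reason             string          `json:"reason,omitempty"`
	LastTransitionTime time.Time       `json:"lastTransitionTime,omitempty"`
}

// StudyJobStatus holds a list of conditions instead of a single State field.
type StudyJobStatus struct {
	Conditions []Condition `json:"conditions,omitempty"`
}

// setCondition updates the condition of the given type, or appends it.
// LastTransitionTime only changes when the status value itself changes.
func setCondition(st *StudyJobStatus, t ConditionType, s ConditionStatus, reason string) {
	for i := range st.Conditions {
		c := &st.Conditions[i]
		if c.Type != t {
			continue
		}
		if c.Status != s {
			c.Status = s
			c.LastTransitionTime = time.Now()
		}
		c.Reason = reason
		return
	}
	st.Conditions = append(st.Conditions, Condition{
		Type: t, Status: s, Reason: reason, LastTransitionTime: time.Now(),
	})
}

// isTrue reports whether a condition of the given type exists and is "True".
func isTrue(st StudyJobStatus, t ConditionType) bool {
	for _, c := range st.Conditions {
		if c.Type == t {
			return c.Status == ConditionTrue
		}
	}
	return false
}

func main() {
	var st StudyJobStatus
	setCondition(&st, ConditionRunning, ConditionTrue, "WorkersStarted")
	setCondition(&st, ConditionRunning, ConditionFalse, "WorkersFinished")
	setCondition(&st, ConditionCompleted, ConditionTrue, "GoalReached")
	fmt.Println(isTrue(st, ConditionRunning), isTrue(st, ConditionCompleted))
}
```

Unlike a single phase field, several conditions can hold at once, and new ones can be added later without breaking clients that switch on a fixed set of phases.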

Contributor Author

Thanks, I will fix it.

type WorkerSpec struct {
Image string `json:"image,omitempty"`
Command []string `json:"command,omitempty"`
Gpu int `json:"gpu,omitempty"`
Member

GPU is better, IMO

return false
}

func (r *ReconcileStudyController) controllerloop(instance *katibv1alpha1.StudyController) {
Member

I think we could add some comments for the func, since it is the main control loop.

@YujiOshima
Contributor Author

@gaocegege Thank you for your review. I updated. PTAL.

@YujiOshima
Contributor Author

/retest

@YujiOshima
Contributor Author

@gaocegege I updated. Could you review it?

@gaocegege
Member

I will do it today or tomorrow.
Thanks for your contribution!

@jlewi
Contributor

jlewi commented Aug 1, 2018

Woo Hoo!

A couple of high-level questions. I'll take a look at the code to see if I can answer these, but it would be good to include the answers directly in the PR description, as this would help speed up the review process. (There are so many files that it's crashing Reviewable and I'm having a hard time looking at the code.)

  1. Are you using a kit like kubebuilder (https://github.com/kubernetes-sigs/kubebuilder) to help build the CRD and handle code gen?

  2. Will each new job type require explicit support in the controller?

  3. What is the interface between the CRD controller and the user's code? For example, suppose I want to use Katib to optimize learning rate and suppose that corresponds to a command line argument "--learning_rate", how do I tell Katib that?

  4. What if my job requires certain PVs to be mounted or certain environment variables to be set? How do I specify that?

My expectation is that the user provides the following

  1. A template (e.g. jinja/go/helm)
  2. A list of parameters taken by the template that correspond to hyperparameters.

The CRD controller should create the job by getting a set of hyperparameters and substituting them into the template.

So the user can use any K8s resource they want and use any features they want (e.g. PVs, environment variables, resource specs) just by supplying the appropriate template.
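
Editor's sketch: the template-plus-parameters flow described above could be prototyped with Go's standard `text/template` package. This is only an illustration under assumptions: `HyperParameter`, `renderWorker`, and the field names `WorkerID` and `HyperParameters` are hypothetical, and the Job manifest is a made-up example that reuses the image and command from the MinikubeDemo.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// HyperParameter is a hypothetical name/value pair produced by a suggestion.
type HyperParameter struct {
	Name  string
	Value string
}

// workerTemplate is a user-supplied manifest with placeholders. Any K8s
// resource could be used here; a batch/v1 Job is just an example.
const workerTemplate = `apiVersion: batch/v1
kind: Job
metadata:
  name: {{.WorkerID}}
spec:
  template:
    spec:
      containers:
      - name: {{.WorkerID}}
        image: katib/mxnet-mnist-example
        command:
        - "python"
        - "/mxnet/example/image-classification/train_mnist.py"
        {{- range .HyperParameters}}
        - "{{.Name}}={{.Value}}"
        {{- end}}
      restartPolicy: Never
`

// renderWorker substitutes the worker ID and hyperparameters into tmpl and
// returns the resulting manifest text.
func renderWorker(tmpl, workerID string, params []HyperParameter) (string, error) {
	t, err := template.New("worker").Parse(tmpl)
	if err != nil {
		return "", fmt.Errorf("parse worker template: %w", err)
	}
	data := struct {
		WorkerID        string
		HyperParameters []HyperParameter
	}{workerID, params}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("render worker template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	manifest, err := renderWorker(workerTemplate, "worker-1", []HyperParameter{
		{Name: "--lr", Value: "0.01"},
		{Name: "--batch-size", Value: "64"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(manifest)
}
```

Because the controller only fills in placeholders, the same function would work for a template that passes hyperparameters as environment variables or mounts PVs, which is the flexibility the comment asks for.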

@jlewi
Contributor

jlewi commented Aug 1, 2018

/assign jlewi

@jlewi
Contributor

jlewi commented Aug 1, 2018

Looking at your example
https://github.com/YujiOshima/hp-tuning/blob/6ad60bcdec83e75183ab61a3f9cc0537c3793105/examples/MinikubeDemo.md

It looks like the WorkerSpec is how the user specifies the job to launch. This seems limited:

  1. Users can only control things exposed in WorkerSpec

  2. Users cannot reuse existing templates for their jobs

  3. WorkerSpec must explicitly support new K8s resource types.

  4. CRD makes strong assumptions about how HP parameters are passed to the job

    • e.g. you assume parameters are passed as command line arguments; what if user uses environment variables?

Why not

  1. Take a template as a file path

    • We can start with whichever template you want and is easy to implement (e.g. go templates)
  2. User can use PV/ConfigMap or other means to make the template accessible

  3. CRD creates the spec and resource by substituting in the hyperparameters into the template

If we start with a config map then the CRD could just use the K8s client libraries to fetch the config and we don't need to worry about mounting PVs.

What do you think?

@gaocegege
Member

gaocegege commented Aug 2, 2018

@jlewi We can review from 5ac22f9. The first two commits are auto-generated.

@YujiOshima
Contributor Author

@jlewi Thanks for your comment.

> Are you using a kit like kubebuilder (https://github.com/kubernetes-sigs/kubebuilder) to help build the CRD and handle code gen?

Yes, I used kubebuilder.

The WorkerSpec limitation you pointed out is right, and I agree it is inconvenient.
But I'm not sure I understand your suggestion.
Is this what you expect?
The user creates a ConfigMap for the worker template (e.g. a pod spec).
Then the controller reads the ConfigMap, embeds the hyperparameters into the template, and creates resources from it.
Is my understanding correct?

@jlewi
Contributor

jlewi commented Aug 3, 2018

@YujiOshima

Your CRD is defining a new type of object "WorkerSpec"

Worker Spec:
Command:
python
/mxnet/example/image-classification/train_mnist.py
--batch-size=64
Image: katib/mxnet-mnist-example
Mountconf:

This is limiting, in my opinion. I'd like to be able to specify the full spec for any K8s resource, and the CRD controller should just create it.

So the "value" of WorkerSpec should just be a K8s resource. The CRD can inspect metadata to figure out how to create it (i.e. API Endpoint).

However, I need to provide a template not a fully specified K8s resource because we need to be able to substitute in the hyperparameters.

So what I want to supply to the CRD controller is a template for a K8s resource, which will be filled in with the actual parameters.

Right now you are basically defining your own template engine that is very limited. i.e. it makes assumptions about what type of object and only allows certain places for the parameters to be supplied.

Those restrictions are unnecessary. We already have a variety of really powerful and well understood template engines (jinja, go templates, etc...)

So I think the CRD should let users pick a supported template engine and provide a template that is parameterized with a set of named parameters. The CRD can then substitute in the parameters to create the actual object.

I don't care how exactly you provide the template. A ConfigMap was one suggestion.

You could also just store the template in the spec as a string field.
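
Editor's sketch: the idea above of inspecting a rendered resource's metadata to work out its API endpoint could look roughly like this. It is a deliberately naive illustration, not how this PR or client-go resolves resources: `collectionPath` is a hypothetical helper, the manifest is assumed to be JSON, the resource is assumed to be namespaced, and the plural is guessed by lowercasing the kind and appending "s", where a real controller would rely on API discovery.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// typeMeta holds only the fields needed to locate the resource collection.
type typeMeta struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Metadata   struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
	} `json:"metadata"`
}

// collectionPath returns the REST path to POST the manifest to, following
// the usual Kubernetes URL layout for namespaced resources.
func collectionPath(manifest []byte) (string, error) {
	var m typeMeta
	if err := json.Unmarshal(manifest, &m); err != nil {
		return "", err
	}
	if m.APIVersion == "" || m.Kind == "" {
		return "", fmt.Errorf("manifest is missing apiVersion or kind")
	}
	ns := m.Metadata.Namespace
	if ns == "" {
		ns = "default"
	}
	// Assumption: naive pluralization; real code should use API discovery.
	resource := strings.ToLower(m.Kind) + "s"
	// Named groups ("batch/v1") live under /apis, the core group ("v1") under /api.
	prefix := "/apis/" + m.APIVersion
	if !strings.Contains(m.APIVersion, "/") {
		prefix = "/api/" + m.APIVersion
	}
	return prefix + "/namespaces/" + ns + "/" + resource, nil
}

func main() {
	job := []byte(`{"apiVersion":"batch/v1","kind":"Job","metadata":{"name":"worker-1"}}`)
	path, err := collectionPath(job)
	if err != nil {
		panic(err)
	}
	fmt.Println(path)
}
```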

@YujiOshima
Contributor Author

@jlewi Thanks. OK, I understand.
Using a powerful template engine SGTM since it is flexible.
But I wonder how to collect metrics from user-defined workers.
Currently, Katib uses the k8s log API to collect metrics from pods.
How about we allow only a k8s Job or a Kubeflow job (TF-Job, Chainer-Job, ...) as a Worker?
Users would need to upload templates for the k8s Job or Kubeflow job.
When the worker is a TF-Job or another Kubeflow job, I don't know the best way to collect metrics, since it depends on the operator.
What do you think?

@jlewi
Contributor

jlewi commented Aug 5, 2018

@YujiOshima Can we use the same idea to generalize metrics collection? I.e., as part of the StudyCrd controller, I specify a resource (or service) to collect metrics.

For example, for a TensorFlow job we have a couple of options:

  1. After the training job finishes, we could run an evaluation job that would then report metrics back to Katib using the Katib API.

    • This job might just open up the TFEvents file, get metrics, and report them back to Katib
  2. Distributed TF jobs have eval workers; so in the eval workers the user could run code that reports metrics to Katib.

I think a good starting point would be to add a metricsCollector resource to the CRD. This could be a job that gets launched after the training job and reports metrics to Katib using the Katib API.

@YujiOshima
Contributor Author

@jlewi I will try to implement the first step as below.

  • Add a Report metrics API. (For generalization: all job operators can report metrics in the same way.)
  • Add MetricsCollector as a CronJob. It will be created by Studyjobcontroller.
  • As the default worker, users can push a k8s Job config as a worker template. The MetricsCollector will collect logs from the stdout of the job.

Why CronJob: Katib should collect metrics during training for early stopping, not only at the evaluation step.

After that, we can add more flexible metrics collectors for TF-Job or other specific jobs.
WDYT?
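
Editor's sketch: the stdout-based collection step proposed above might parse worker logs along these lines. This is an assumption-laden illustration, not the collector from this PR: `parseMetrics` is a hypothetical helper, and the log format (whitespace-separated `name=value` tokens) is assumed rather than taken from the Katib code.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics scans log text line by line and collects, in order of
// appearance, the numeric values of the requested metric names.
// Tokens that are not "name=value", or whose value is not a number,
// are ignored.
func parseMetrics(logText string, names []string) map[string][]float64 {
	wanted := make(map[string]bool, len(names))
	for _, n := range names {
		wanted[n] = true
	}
	out := make(map[string][]float64)
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		for _, tok := range strings.Fields(sc.Text()) {
			kv := strings.SplitN(tok, "=", 2)
			if len(kv) != 2 || !wanted[kv[0]] {
				continue
			}
			v, err := strconv.ParseFloat(kv[1], 64)
			if err != nil {
				continue
			}
			out[kv[0]] = append(out[kv[0]], v)
		}
	}
	return out
}

func main() {
	logs := "epoch 1 accuracy=0.91 loss=0.30\nepoch 2 accuracy=0.95 loss=0.21\n"
	metrics := parseMetrics(logs, []string{"accuracy", "loss"})
	fmt.Println(metrics["accuracy"], metrics["loss"])
}
```

A collector run on a schedule would call something like this on the latest logs and forward the values through the Report metrics API, so intermediate values are available for early stopping.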

@jlewi
Contributor

jlewi commented Aug 6, 2018

This sounds good to me but let me make sure I understand correctly.

> Add Report metrics API. (for generalization. All job operator can report metrics in the same way)

A Report metrics API sounds great. Is that already in this PR? Can you provide a link to the API definition?

Side note: I don't know that metrics should be reported by the operator. That would require us to build metrics reporting into the operator, and that might be too limiting.

> Add MetricsCollector as a CronJob. It will be created by Studyjobcontroller.

Why is it a CronJob? What is MetricsCollector doing? Is this just collecting metrics from stdout and reporting them via the API?

Is the idea that StudyJobController creates a MetricsCollector cron job for each HP tuning job, and that this job periodically collects the metrics and reports them via the ReportMetrics API?

@YujiOshima
Contributor Author

@jlewi

> Is that already in this PR?

Not yet. I will update soon.

> Is the idea that StudyJobController creates a MetricsCollector cron job for each HP tuning job, and that this job periodically collects the metrics and reports them via the ReportMetrics API?

Yes. The CRD controller is event-driven. I want to collect and report the metrics even when no events have happened.

}
}
}
return &pb.GetMetricsReply{MetricsLogSets: mls}, nil
}

func (s *server) ReportMetrics(ctx context.Context, in *pb.ReportMetricsRequest) (*pb.ReportMetricsReply, error) {
Contributor

Is this RPC new in this PR?

Contributor Author

Yes

if err != nil {
return &pb.ReportMetricsReply{}, err
}
for _, mls := range in.MetricsLogSets {
Contributor

Why is this RPC taking as input the logs?

Contributor

I looked at the API; it looks like this is not the logs.

Contributor Author

Katib can store every log of metrics.
I would change the name of the RPC to ReportMetricsLogs.
WDYT?

Contributor

Thanks for the explanation. Let's not change it in this PR. If you think it's worth changing, let's file an issue and do it in a follow-on PR.

string trial_id = 2;
string runtime = 3;
WorkerConfig worker_config = 4;
message CreateWorkerReauest {
Contributor

What is the CreateWorkerRequest for? Isn't the StudyController creating the workers?

Contributor

Filed #150
Would be good to document the API.

Contributor Author

The CreateWorker RPC can store the worker info in the DB and generate a workerId.
I would rename the RPC to RegisterWorker.

> Would be good to document the API.

OK.

Contributor

Thanks for the explanation. I don't think this needs to change in this PR. I'd suggest filing an issue if you think it's worth fixing.


// StudyJobSpec defines the desired state of StudyJob
type StudyJobSpec struct {
StudySpec *StudySpec `json:"studySpec,omitempty"`
Contributor

The name StudySpec seems confusing; I would think the whole thing is the StudySpec.

What's the distinction between StudySpec and SuggestionSpec?

Contributor Author

SuggestionSpec holds the parameters that relate only to the suggestion service.
How about merging StudySpec into StudyJobSpec like below?

type StudyJobSpec struct {
	Name               string            `json:"name,omitempty"`
	Owner              string            `json:"owner,omitempty"`
	OptimizationType   OptimizationType  `json:"optimizationtype,omitempty"`
	OptimizationGoal   *float64          `json:"optimizationgoal,omitempty"`
   ....
	WorkerSpec           *WorkerSpec           `json:"workerSpec,omitempty"`
	SuggestionSpec       *SuggestionSpec       `json:"suggestionSpec,omitempty"`
	EarlyStoppingSpec    *EarlyStoppingSpec    `json:"earlyStoppingSpec,omitempty"`
	MetricsCollectorSpec *MetricsCollectorSpec `json:"metricsCollectorSpec,omitempty"`
}

Contributor

This looks good to me.

@jlewi
Contributor

jlewi commented Aug 20, 2018

I reviewed the commits
https://github.com/kubeflow/katib/pull/141/files/5429a4dd121ccd08d3989edd90476e2cbf954c41..842ee429692cbdced85f7a6f47e295cd60f113ec

And left some comments.

The 7727 files still make it really difficult to review everything; (e.g. because I can't use reviewable to track review threads).

Should we just go ahead and submit this and fix issues in follow on PRs?

@YujiOshima
Contributor Author

YujiOshima commented Aug 20, 2018

@jlewi

> Why do you need WorkerType?

Hmm, I thought it was necessary to specify the object type before loading the template. But it may not be necessary. I'm going to try.

@YujiOshima
Contributor Author

@jlewi

> Should we just go ahead and submit this and fix issues in follow on PRs?

Do you mean we should delete vendored code from this repo as @vinaykakade said?
#142 (comment)

Let's open a new issue about this.

@jlewi
Contributor

jlewi commented Aug 21, 2018

@YujiOshima I'm pretty happy with this PR as is. Is there any work that you think should be accomplished in this PR as opposed to a follow-on PR? E.g., should we wait for a follow-on PR to remove the type?

This PR is already pretty massive, so I think it's better if we defer additional changes to a follow-on PR. Hopefully, once the initial PR is committed, we can split up the remaining work more easily.

/lgtm
/approve

/hold

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

@YujiOshima
Contributor Author

I will work on a new PR.
/hold cancel

@jlewi
Contributor

jlewi commented Aug 21, 2018

/cancel hold

@vinaykakade
Contributor

@YujiOshima - given that the public API will change with this PR, should we create a 0.2 branch from current master which will have stable 0.2 code, or check this in a 0.3 branch, or any other proposal you may have in mind to minimize the migration pain for current customers? Overall, would be great if we are thinking about customers who are dependent on current public API while working through backwards incompatible changes.

@YujiOshima
Contributor Author

@vinaykakade Thank you for your advice.
@gaocegege @jlewi I don't have permission to create a new branch. Could you create a new branch to keep backward compatibility?

@gaocegege
Member

https://github.com/kubeflow/katib/tree/0.2

I created one, thanks for the advice

@YujiOshima
Contributor Author

/retest

@jlewi
Contributor

jlewi commented Aug 21, 2018

I'll merge this manually. It looks like the reviewable status check is choking on the large PR size and that's blocking the merge.

@jlewi jlewi merged commit 089cd6a into kubeflow:master Aug 21, 2018
jlewi added a commit to jlewi/katib that referenced this pull request Sep 21, 2018
…ges.

* Related to kubeflow#141 katib releaser
* Related to kubeflow/kubeflow#1574 use prow to build our images

* We are moving to using prow to run our release workflows and treating them
 just like regular workflows.

* We are doing this because we need to get regular signal about whether
  the image builds are succeeding by running on postsubmit.

* We also want to run them on presubmit so that we can verify any changes
  to the workflow don't break the workflow.

* For this reason we also want to move the workflows into the repository
  that contains the source code for the images being built rather than
  having them all in kubeflow/kubeflow.
jlewi added a commit to jlewi/katib that referenced this pull request Sep 21, 2018
* Related to kubeflow#141 katib releaser
* Related to kubeflow/kubeflow#1574 use prow to build our images

* We are moving to using prow to run our release workflows and treating them
 just like regular workflows.

* We are doing this because we need to get regular signal about whether
  the image builds are succeeding by running on postsubmit.

* We also want to run them on presubmit so that we can verify any changes
  to the workflow don't break the workflow.

* Rather than define a new workflow to build the images; we can just reuse the
  existing E2E workflow which already builds all the images. We just
  change postsubmit to push to kubeflow-images-public.

* Delete the releaser app; we will just use the existing E2E test workflow
  and have that push to gcr.io/kubeflow-images-public on postsubmit.
k8s-ci-robot pushed a commit that referenced this pull request Sep 21, 2018
* Related to #141 katib releaser
* Related to kubeflow/kubeflow#1574 use prow to build our images

* We are moving to using prow to run our release workflows and treating them
 just like regular workflows.

* We are doing this because we need to get regular signal about whether
  the image builds are succeeding by running on postsubmit.

* We also want to run them on presubmit so that we can verify any changes
  to the workflow don't break the workflow.

* Rather than define a new workflow to build the images; we can just reuse the
  existing E2E workflow which already builds all the images. We just
  change postsubmit to push to kubeflow-images-public.

* Delete the releaser app; we will just use the existing E2E test workflow
  and have that push to gcr.io/kubeflow-images-public on postsubmit.