Adding pytorch v1alpha2 controller #786

Closed
wants to merge 2 commits into from

Conversation

johnugeorge
Member

johnugeorge commented Aug 20, 2018

This contains the pytorch v1alpha2 code that is consistent with TF. I verified it with a sample pytorch example. It is adapted from the TF controller, and more code can be shared in the future. Currently, the CRDs are kept completely independent.

#785



@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: djangopeng

If they are not already assigned, you can assign the PR to them by writing /assign @djangopeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@TravisBuddy

Travis tests have failed

Hey @johnugeorge,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

3rd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/pytorch/job.go:128:9:warning: pc.jobClientSet.Pytorch is deprecated: please explicitly pick a version if possible.  (SA1019) (staticcheck)
pkg/controller.v2/pytorch/status.go:123:12:warning: pc.jobClientSet.Pytorch is deprecated: please explicitly pick a version if possible.  (SA1019) (staticcheck)
pkg/apis/pytorch/validation/validation.go:46:6:warning: should omit comparison to bool constant, can be simplified to !defaultContainerPresent (S1002) (gosimple)
pkg/controller.v2/pytorch/controller.go:429:9:warning: pc.jobClientSet.Pytorch is deprecated: please explicitly pick a version if possible.  (SA1019) (staticcheck)

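As an aside, the S1002 warning above flags a comparison against a boolean constant in validation.go, and the SA1019 warnings ask the calls to go through an explicitly versioned client rather than the deprecated version-picking accessor. A minimal sketch of the S1002-style fix is shown below; the function name and error text are illustrative, not the actual file contents.

```go
package validation

import "fmt"

// validateDefaultContainer illustrates the gosimple S1002 fix: instead of
// writing `if defaultContainerPresent == false`, the comparison to the bool
// constant is dropped and plain negation is used. The function name and the
// error message here are hypothetical, not copied from this PR.
func validateDefaultContainer(defaultContainerPresent bool) error {
	if !defaultContainerPresent {
		return fmt.Errorf("PyTorchJobSpec is not valid: the default container is missing")
	}
	return nil
}
```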

@coveralls

coveralls commented Aug 20, 2018


Coverage decreased (-15.1%) to 42.832% when pulling f050cc8 on johnugeorge:pytorchimpl into 78cfba1 on kubeflow:master.

@johnugeorge
Member Author

Unit tests and examples have to be added. Tracked in #785.

@johnugeorge
Member Author

/cc @jlewi @gaocegege

@johnugeorge
Member Author

/assign @gaocegege @jlewi

@johnugeorge
Member Author

/cc @jose5918

@k8s-ci-robot requested a review from jose5918 on August 20, 2018 18:01
@jlewi
Contributor

jlewi commented Aug 21, 2018

So is the plan to retire the pytorch repository after this PR?

@jlewi
Contributor

jlewi commented Aug 21, 2018

/assign @richardsliu

@johnugeorge
Member Author

@jlewi This was based on your suggestion.

As per the Slack group discussions, we had three options:

  1. Use pytorch-operator repo
  2. Use tf-operator repo
  3. Use a common repo for all operators.

For option 1, we will have to duplicate the code (the current shared implementation) for each operator. However, adding tests and examples is easy for option 1, and presubmit workflows are already present.

For option 2, there is no duplication of code. However, there will be some effort needed in adding tests, examples, and presubmits; this is the next task. Also, IMO we will have to rename this repo in the near future to avoid confusion for the public. The pytorch repo can be archived after this.

For option 3, we have to ensure that the individual repos do not diverge while the operators are being stabilized. All individual operator repos can be archived after this.

Contributor

@jlewi left a comment

Reviewed 8 of 56 files at r1, 3 of 4 files at r2.
Reviewable status: 11 of 56 files reviewed, 1 unresolved discussion (waiting on @johnugeorge, @willb, @ddysher, and @jose5918)


pkg/controller.v2/pytorch/controller.go, line 321 at r2 (raw file):

}

func (pc *PyTorchController) GetTotalReplicas(obj metav1.Object) int32 {

What is this function used for?

@jlewi
Contributor

jlewi commented Aug 21, 2018

Why does #1 entail duplicating the code for each operator?
I would expect the shared code to be in a Go package that could be imported into other controllers, e.g.
https://github.com/kubeflow/tf-operator/tree/master/pkg/controller.v2/jobcontroller

So couldn't other operators just go import github.com/kubeflow/tf-operator/pkg/controller.v2?

I'm not suggesting you can't move the code into tf-operator. Just trying to understand.
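As a rough sketch of the pattern described here (hedged: it assumes the shared package exposes an embeddable JobController type, which the linked code and the comment below suggest), another operator could simply import the shared package:

```go
package pytorch

import (
	// Shared job-controller code imported straight from the tf-operator repo.
	"github.com/kubeflow/tf-operator/pkg/controller.v2/jobcontroller"
)

// PyTorchController embeds the shared JobController and layers only the
// PyTorch-specific behaviour on top of it. The field set shown here is
// illustrative; the real controller also carries clientsets and informers.
type PyTorchController struct {
	jobcontroller.JobController

	// PyTorch-specific fields (CRD clientset, job informer, etc.) go here.
}
```

With this layout, fixes to the shared code land in tf-operator, and the PyTorch controller could pick them up by bumping its pinned/vendored dependency.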

@gaocegege
Member

@jlewi

Personally, I think maintaining pytorch and tensorflow in one repository is easier than keeping them separate. And @johnugeorge's implementation uses JobController as the base interface (see https://github.com/kubeflow/tf-operator/blob/f050cc86bfda663928574e2356dd55fddea57efa/pkg/controller.v2/pytorch/controller.go#L70).

The implementation generally LGTM, though I hope you can split the PR into small commits: for example, one commit for handwritten code and one for codegen. That would be friendlier to reviewers.

@gaocegege
Member

/cc @codeflitting

I am not sure if you are interested in this PR.

@johnugeorge
Member Author

@gaocegege Sorry about that. I tried to get everything into the initial PR so as to avoid delay. Since the changes have already been added to the branch, splitting them into small PRs will be a little tedious at this point.
pkg/client and zz_generated_* are the generated files.

If you still think that it is difficult to manage and that it would be better to split, let me know and I will close this PR.

@jlewi That's an option we can try. Though it can be implemented, I don't know if it makes sense to keep jobcontroller as a go-importable package. If we keep it as a separate package in a separate repo, we have to ensure that changes in the base package do not break the individual operators. We may have to trigger presubmits of the individual operators (in different repos) for each base controller change.

Also, in the long run, isn't it better to have all operators in a single repo? It would be easier to maintain.
Hence I feel that, as a first step, keeping all operators together in the tf-operator repo for now and renaming it later might be a good option.

WDYT? Shall we decide soon so that there is no delay in our release schedule?

@johnugeorge
Member Author


pkg/controller.v2/pytorch/controller.go, line 321 at r2 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

What is this function used for?

Given a job, it returns the total number of replicas (pods). It is used for gang scheduling (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/jobcontroller/jobcontroller.go#L206).
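For context, a hedged sketch of what such a function typically looks like follows; the import paths and field names are assumed from the v1alpha2 API and may differ slightly from the actual PR.

```go
package pytorch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	v1alpha2 "github.com/kubeflow/tf-operator/pkg/apis/pytorch/v1alpha2"
)

// GetTotalReplicas returns the total number of pods a PyTorchJob will create,
// i.e. the sum of the Replicas counts across all replica specs (master,
// workers, ...). The gang-scheduling code uses this total to know how many
// pods to wait for. Type and field names are assumed, not copied from the PR.
func (pc *PyTorchController) GetTotalReplicas(obj metav1.Object) int32 {
	job := obj.(*v1alpha2.PyTorchJob)
	total := int32(0)
	for _, spec := range job.Spec.PyTorchReplicaSpecs {
		if spec != nil && spec.Replicas != nil {
			total += *spec.Replicas
		}
	}
	return total
}
```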

@jlewi
Contributor

jlewi commented Aug 21, 2018

@johnugeorge In the near term, what is the least disruptive option? It seems like moving the PyTorch code into the TFJob repo in this PR is more disruptive: additional work is needed to get the presubmit tests up and running again for PyTorch, whereas if you go import it into the existing code in the PyTorch repo, then the presubmit will immediately check that the code is working.

You can even edit EXTRA_REPOS in the presubmit test so that you can check out the tf-operator code from a pending PR and verify that the PyTorch job will work before the upstream changes to tf-operator are submitted.

Why not continue to use the PyTorch repo until we have a clear indication that using a separate repo is a pain point?

In terms of avoiding breakages if we use multiple repos:

  • PyTorch can pin a specific version of the shared library, so it doesn't get broken just because of an upstream change
  • We can trigger tests in other repos
    • Our E2E tests already support checking out multiple repos, so there's no reason we can't check out the PyTorch repo from the TFOperator pre/postsubmit and build and run the tests using the code in the tf-operator repo.

My expectation is that, long-term, there should be a clean separation boundary between the shared implementation and the operator-specific bits. If there is, then maintaining separate repos for each operator should be quite manageable. If there isn't, then that's a problem that needs to be fixed, and it is not directly addressed by putting all the code in a single repository.

As an example, look at CRD frameworks like KubeBuilder. One doesn't put all CRDs based on that framework into that repo.
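To make the "clean separation boundary" idea concrete, here is one hypothetical sketch (only GetTotalReplicas is taken from this PR; the other names are invented for illustration and are not the actual tf-operator API): the shared package would define a small set of operator-specific hooks, and each operator would implement only those.

```go
package jobcontroller

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// OperatorHooks is a hypothetical interface sketching the boundary described
// above: the shared job controller calls back into these methods for anything
// framework-specific. Only GetTotalReplicas appears in this PR; the other
// method names are illustrative.
type OperatorHooks interface {
	// GetTotalReplicas returns the total number of pods a job will create;
	// the shared gang-scheduling code uses it to size the pod group.
	GetTotalReplicas(obj metav1.Object) int32

	// GetJobFromInformerCache looks up the framework-specific job object.
	GetJobFromInformerCache(namespace, name string) (metav1.Object, error)

	// UpdateJobStatus writes framework-specific status back to the API server.
	UpdateJobStatus(job metav1.Object) error
}
```

With such a boundary in place, each operator repo would only depend on the interface, and the shared implementation could evolve behind it without pulling every operator into one repository.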

@johnugeorge
Member Author

@jlewi The current job controller code is a shared implementation and doesn't have a clear boundary between the common code and the operator implementation. Hence, I am not sure if it yet qualifies as a separate Go library. However, I agree that adding to the pytorch repo is the least disruptive option for now. As you said, once we get more clarity, we can create a true controller interface.

I will move this code into the pytorch repo for now.

@jlewi
Contributor

jlewi commented Aug 21, 2018

@johnugeorge Makes sense. I would expect that at this point in time the boundary isn't particularly clear because this is our first attempt at unification.

@johnugeorge
Member Author

Moving the code to the pytorch repo.
/close
