Adding pytorch v1alpha2 controller #786

Closed
wants to merge 2 commits into from

Conversation

johnugeorge
Member

johnugeorge commented Aug 20, 2018

This contains the pytorch v1alpha2 code that is consistent with TF. I verified it with a sample pytorch example. It is adapted from the TF controller, and more code can be shared in the future. Currently, the CRDs are kept completely independent.

#785



@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: djangopeng

If they are not already assigned, you can assign the PR to them by writing /assign @djangopeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@TravisBuddy

Travis tests have failed

Hey @johnugeorge,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

3rd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/pytorch/job.go:128:9:warning: pc.jobClientSet.Pytorch is deprecated: please explicitly pick a version if possible.  (SA1019) (staticcheck)
pkg/controller.v2/pytorch/status.go:123:12:warning: pc.jobClientSet.Pytorch is deprecated: please explicitly pick a version if possible.  (SA1019) (staticcheck)
pkg/apis/pytorch/validation/validation.go:46:6:warning: should omit comparison to bool constant, can be simplified to !defaultContainerPresent (S1002) (gosimple)
pkg/controller.v2/pytorch/controller.go:429:9:warning: pc.jobClientSet.Pytorch is deprecated: please explicitly pick a version if possible.  (SA1019) (staticcheck)

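As an aside, the S1002 warning above flags a comparison against a boolean constant in validation.go, and the SA1019 warnings ask the calls to go through an explicitly versioned client rather than the deprecated version-picking accessor. A minimal sketch of the S1002-style fix is shown below; the function name and error text are illustrative, not the actual file contents.

```go
package validation

import "fmt"

// validateDefaultContainer illustrates the gosimple S1002 fix: instead of
// writing `if defaultContainerPresent == false`, the comparison to the bool
// constant is dropped and plain negation is used. The function name and the
// error message here are hypothetical, not copied from this PR.
func validateDefaultContainer(defaultContainerPresent bool) error {
	if !defaultContainerPresent {
		return fmt.Errorf("PyTorchJobSpec is not valid: the default container is missing")
	}
	return nil
}
```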

@coveralls

coveralls commented Aug 20, 2018


Coverage decreased (-15.1%) to 42.832% when pulling f050cc8 on johnugeorge:pytorchimpl into 78cfba1 on kubeflow:master.

@johnugeorge
Member Author

Unit tests and examples have to be added. Tracked in #785.

@johnugeorge
Member Author

/cc @jlewi @gaocegege

@johnugeorge
Member Author

/assign @gaocegege @jlewi

@johnugeorge
Member Author

/cc @jose5918

@k8s-ci-robot requested a review from jose5918 on August 20, 2018 18:01
@jlewi
Contributor

jlewi commented Aug 21, 2018

So is the plan to retire the pytorch repository after this PR?

@jlewi
Contributor

jlewi commented Aug 21, 2018

/assign @richardsliu

@johnugeorge
Member Author

@jlewi This was based on your suggestion.

As per the Slack group discussions, we had three options:

  1. Use pytorch-operator repo
  2. Use tf-operator repo
  3. Use a common repo for all operators.

For option 1, we will have to duplicate the code (the current shared implementation) for each operator. However, adding tests and examples is easy for option 1, and presubmit workflows are already present.

For option 2, there is no duplication of code. However, there will be some effort needed in adding tests, examples, and presubmits; this is the next task. Also, IMO we will have to rename this repo in the near future to avoid confusion for the public. The pytorch repo can be archived after this.

For option 3, we have to ensure that the individual repos do not diverge while the operators are being stabilized. All individual operator repos can be archived after this.

Contributor

@jlewi left a comment

Reviewed 8 of 56 files at r1, 3 of 4 files at r2.
Reviewable status: 11 of 56 files reviewed, 1 unresolved discussion (waiting on @johnugeorge, @willb, @ddysher, and @jose5918)


pkg/controller.v2/pytorch/controller.go, line 321 at r2 (raw file):

}

func (pc *PyTorchController) GetTotalReplicas(obj metav1.Object) int32 {

What is this function used for?

@jlewi
Contributor

jlewi commented Aug 21, 2018

Why does #1 entail duplicating the code for each operator?
I would expect the shared code to be in a Go package that could be imported into other controllers, e.g.
https://github.com/kubeflow/tf-operator/tree/master/pkg/controller.v2/jobcontroller

So couldn't other operators just go import github.com/kubeflow/tf-operator/pkg/controller.v2?

I'm not suggesting you can't move the code into tf-operator. Just trying to understand.
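As a rough sketch of the pattern described here (hedged: it assumes the shared package exposes an embeddable JobController type, which the linked code and the comment below suggest), another operator could simply import the shared package:

```go
package pytorch

import (
	// Shared job-controller code imported straight from the tf-operator repo.
	"github.com/kubeflow/tf-operator/pkg/controller.v2/jobcontroller"
)

// PyTorchController embeds the shared JobController and layers only the
// PyTorch-specific behaviour on top of it. The field set shown here is
// illustrative; the real controller also carries clientsets and informers.
type PyTorchController struct {
	jobcontroller.JobController

	// PyTorch-specific fields (CRD clientset, job informer, etc.) go here.
}
```

With this layout, fixes to the shared code land in tf-operator, and the PyTorch controller could pick them up by bumping its pinned/vendored dependency.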

@gaocegege
Member

@jlewi

Personally, I think maintaining pytorch and tensorflow in one repository is easier than keeping them separate. And @johnugeorge's implementation uses JobController as the base interface (see https://github.com/kubeflow/tf-operator/blob/f050cc86bfda663928574e2356dd55fddea57efa/pkg/controller.v2/pytorch/controller.go#L70).

The implementation generally LGTM, though I hope you can split the PR into small commits: for example, one commit for handwritten code and one for codegen. That would be friendlier to reviewers.

@gaocegege
Member

/cc @codeflitting

I am not sure if you are interested in this PR.

@johnugeorge
Member Author

@gaocegege Sorry about that. I tried to get everything into the initial PR so as to avoid delay. Since the changes have already been added to the branch, splitting them into small PRs will be a little tedious at this point.
pkg/client and zz_generated_* are the generated files.

If you still think that it is difficult to manage and that it would be better to split, let me know and I will close this PR.

@jlewi That's an option we can try. Though it can be implemented, I don't know if it makes sense to keep jobcontroller as a go-importable package. If we keep it as a separate package in a separate repo, we have to ensure that changes in the base package do not break the individual operators. We may have to trigger presubmits of the individual operators (in different repos) for each base controller change.

Also, in the long run, isn't it better to have all operators in a single repo? It would be easier to maintain.
Hence I feel that, as a first step, keeping all operators together in the tf-operator repo for now and renaming it later might be a good option.

WDYT? Shall we decide soon so that there is no delay in our release schedule?

@johnugeorge
Member Author


pkg/controller.v2/pytorch/controller.go, line 321 at r2 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

What is this function used for?

Given a job, it returns the total number of replicas (pods). It is used for gang scheduling (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/jobcontroller/jobcontroller.go#L206).
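For context, a hedged sketch of what such a function typically looks like follows; the import paths and field names are assumed from the v1alpha2 API and may differ slightly from the actual PR.

```go
package pytorch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	v1alpha2 "github.com/kubeflow/tf-operator/pkg/apis/pytorch/v1alpha2"
)

// GetTotalReplicas returns the total number of pods a PyTorchJob will create,
// i.e. the sum of the Replicas counts across all replica specs (master,
// workers, ...). The gang-scheduling code uses this total to know how many
// pods to wait for. Type and field names are assumed, not copied from the PR.
func (pc *PyTorchController) GetTotalReplicas(obj metav1.Object) int32 {
	job := obj.(*v1alpha2.PyTorchJob)
	total := int32(0)
	for _, spec := range job.Spec.PyTorchReplicaSpecs {
		if spec != nil && spec.Replicas != nil {
			total += *spec.Replicas
		}
	}
	return total
}
```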

@jlewi
Contributor

jlewi commented Aug 21, 2018

@johnugeorge In the near term, what is the least disruptive option? It seems like moving the PyTorch code into the TFJob repo in this PR is more disruptive: additional work is needed to get the presubmit tests up and running again for PyTorch, whereas if you go import it into the existing code in the PyTorch repo, then the presubmit will immediately check that the code is working.

You can even edit EXTRA_REPOS in the presubmit test so that you can check out the tf-operator code from a pending PR and verify that the PyTorch job will work before the upstream changes to tf-operator are submitted.

Why not continue to use the PyTorch repo until we have a clear indication that using a separate repo is a pain point?

In terms of avoiding breakages if we use multiple repos:

  • PyTorch can pin a specific version of the shared library, so it doesn't get broken just because of an upstream change
  • We can trigger tests in other repos
    • Our E2E tests already support checking out multiple repos, so there's no reason we can't check out the PyTorch repo from the TFOperator pre/postsubmit and build and run the tests using the code in the tf-operator repo.

My expectation is that, long-term, there should be a clean separation boundary between the shared implementation and the operator-specific bits. If there is, then maintaining separate repos for each operator should be quite manageable. If there isn't, then that's a problem that needs to be fixed, and it is not directly addressed by putting all the code in a single repository.

As an example, look at CRD frameworks like KubeBuilder. One doesn't put all CRDs based on that framework into that repo.
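To make the "clean separation boundary" idea concrete, here is one hypothetical sketch (only GetTotalReplicas is taken from this PR; the other names are invented for illustration and are not the actual tf-operator API): the shared package would define a small set of operator-specific hooks, and each operator would implement only those.

```go
package jobcontroller

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// OperatorHooks is a hypothetical interface sketching the boundary described
// above: the shared job controller calls back into these methods for anything
// framework-specific. Only GetTotalReplicas appears in this PR; the other
// method names are illustrative.
type OperatorHooks interface {
	// GetTotalReplicas returns the total number of pods a job will create;
	// the shared gang-scheduling code uses it to size the pod group.
	GetTotalReplicas(obj metav1.Object) int32

	// GetJobFromInformerCache looks up the framework-specific job object.
	GetJobFromInformerCache(namespace, name string) (metav1.Object, error)

	// UpdateJobStatus writes framework-specific status back to the API server.
	UpdateJobStatus(job metav1.Object) error
}
```

With such a boundary in place, each operator repo would only depend on the interface, and the shared implementation could evolve behind it without pulling every operator into one repository.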

@johnugeorge
Member Author

@jlewi The current job controller code is a shared implementation and doesn't have a clear boundary between the common code and the operator implementation. Hence, I am not sure if it yet qualifies as a separate Go library. However, I agree that adding to the pytorch repo is the least disruptive option for now. As you said, once we get more clarity, we can create a true controller interface.

I will move this code into the pytorch repo for now.

@jlewi
Contributor

jlewi commented Aug 21, 2018

@johnugeorge Makes sense. I would expect that at this point in time the boundary isn't particularly clear because this is our first attempt at unification.

@johnugeorge
Member Author

Moving the code to the pytorch repo.
/close
