Adding pytorch v1alpha2 controller #786
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull request has been approved by: (no approvers yet). If they are not already assigned, you can assign the PR to them by writing an assign command. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing an approval comment.
Travis tests have failed. Hey @johnugeorge, the 3rd build failed during gometalinter --config=linter_config.json --vendor ./...
Unit tests and examples have to be added. Tracked in #785
/cc @jlewi @gaocegege
/assign @gaocegege @jlewi
/cc @jose5918
So is the plan to retire the pytorch repository after this PR?
/assign @richardsliu
@jlewi This was based on your suggestion, as per the Slack group discussions.
For option 1, we will have to duplicate the code (the current shared implementation) for each operator. However, adding tests and examples is easy for option 1, and presubmit workflows are already present.
For option 2, there is no duplication of code. However, there will be some effort needed in adding tests, examples, and presubmits; this is the next task. Also, we will have to rename the repo in the near future, IMO, to avoid confusion for the public. The pytorch repo can be archived after this.
For option 3, we have to ensure that the individual repos do not diverge while stabilizing the operator. All individual operator repos can be archived after this.
Reviewed 8 of 56 files at r1, 3 of 4 files at r2.
Reviewable status: 11 of 56 files reviewed, 1 unresolved discussion (waiting on @johnugeorge, @willb, @ddysher, and @jose5918)
pkg/controller.v2/pytorch/controller.go, line 321 at r2 (raw file):
func (pc *PyTorchController) GetTotalReplicas(obj metav1.Object) int32 {
What is this function used for?
Why does #1 entail duplicating the code for each operator? Couldn't other operators just go-import the shared code? I'm not suggesting you can't move the code into tf-operator. Just trying to understand.
Personally, I think maintaining pytorch and tensorflow in one repository is easier than keeping them separate. And @johnugeorge's implementation uses JobController as the base interface (see https://github.com/kubeflow/tf-operator/blob/f050cc86bfda663928574e2356dd55fddea57efa/pkg/controller.v2/pytorch/controller.go#L70). The implementation generally LGTM, though I hope you can split the PR into small commits, for example one commit for handwritten code and one commit for codegen. That would be friendlier to reviewers.
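The shared-controller pattern being discussed (an operator-specific controller reusing a common JobController as its base) can be sketched roughly as follows. This is a hedged illustration only: the type and method names here (ControllerInterface, JobController, Describe) are hypothetical stand-ins chosen for the example, not the actual tf-operator API.

```go
package main

import "fmt"

// ControllerInterface is a hypothetical contract for the operator-specific
// bits; each operator (TF, PyTorch, ...) supplies its own implementation.
type ControllerInterface interface {
	ControllerName() string
	GetAPIGroupVersion() string
}

// JobController holds the shared logic and delegates operator-specific
// behavior back through the interface.
type JobController struct {
	Controller ControllerInterface
}

// Describe is a stand-in for shared logic that calls back into the
// operator-specific implementation.
func (jc *JobController) Describe() string {
	return fmt.Sprintf("%s (%s)", jc.Controller.ControllerName(), jc.Controller.GetAPIGroupVersion())
}

// PyTorchController embeds the shared JobController and implements the
// operator-specific methods.
type PyTorchController struct {
	JobController
}

func (pc *PyTorchController) ControllerName() string     { return "pytorch-operator" }
func (pc *PyTorchController) GetAPIGroupVersion() string { return "kubeflow.org/v1alpha2" }

func main() {
	pc := &PyTorchController{}
	pc.JobController = JobController{Controller: pc}
	fmt.Println(pc.Describe()) // pytorch-operator (kubeflow.org/v1alpha2)
}
```

With this shape, the shared code can live in one package while each operator only implements the small interface, which is the separation boundary the later comments in this thread debate.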
/cc @codeflitting I am not sure if you are interested in this PR. |
@gaocegege Sorry about that. I tried to get everything into the initial PR so as to avoid delay. Since the changes are already added to the branch, splitting them into small PRs would be a little tedious at this point. If you still think it is difficult to manage and better to split, let me know and I will close this PR.
@jlewi That's an option we can try. Though it can be implemented, I don't know if it makes sense to keep the job controller as a go-importable package. If we keep it as a separate package in a separate repo, we have to ensure that changes in the base package do not break the individual operators; we may have to trigger presubmits of the individual operators (in different repos) for each base controller change. Also, in the long run, isn't it better to have all operators in a single repo? It would be easier to maintain. WDYT? Shall we decide soon so that there is no delay in our release schedule?
pkg/controller.v2/pytorch/controller.go, line 321 at r2 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Given a job, it returns the total number of replicas (pods). It is used for gang scheduling (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/jobcontroller/jobcontroller.go#L206)
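A minimal sketch of what such a replica-counting helper might look like, assuming a simplified job spec; the types and field names here (JobSpec, ReplicaSpec) are illustrative stand-ins, not the actual tf-operator definitions. Gang scheduling needs this total so the scheduler only starts the job once all of its pods can be placed together.

```go
package main

import "fmt"

// ReplicaSpec is a hypothetical simplified replica spec: just a desired
// replica count, which may be unset (nil).
type ReplicaSpec struct {
	Replicas *int32
}

// JobSpec maps replica-type names (e.g. "Master", "Worker") to their specs.
type JobSpec struct {
	ReplicaSpecs map[string]*ReplicaSpec
}

// GetTotalReplicas sums the desired replicas across all replica types,
// skipping unset entries. The result is the pod count a gang scheduler
// would wait for before starting any pod of the job.
func GetTotalReplicas(spec JobSpec) int32 {
	var total int32
	for _, r := range spec.ReplicaSpecs {
		if r != nil && r.Replicas != nil {
			total += *r.Replicas
		}
	}
	return total
}

func main() {
	one, three := int32(1), int32(3)
	spec := JobSpec{ReplicaSpecs: map[string]*ReplicaSpec{
		"Master": {Replicas: &one},
		"Worker": {Replicas: &three},
	}}
	fmt.Println(GetTotalReplicas(spec)) // prints 4
}
```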
@johnugeorge In the near term, what is the least disruptive option? It seems like moving the PyTorch code into the TFJob repo in this PR is more disruptive: additional work is needed to get the presubmit tests up and running again for PyTorch. Whereas if you go-import it into the existing code in the PyTorch repo, the presubmit will immediately check that the code is working. You can even edit EXTRA_REPOS in the presubmit test to check out the tf-operator code from a pending PR, so you can verify that the PyTorch job works before the upstream changes to tf-operator are submitted. Why not continue to use the PyTorch repo until we have a clear indication that using a separate repo is a pain point? In terms of avoiding breakages if we use multiple repos:
My expectation is that long-term there should be a clean separation boundary between the shared implementation and the operator-specific bits. If there is, then maintaining separate repos for each operator should be quite manageable. If there isn't, then that's a problem that needs to be fixed, and it is not directly addressed by putting all the code in a single repository. As an example, look at CRD frameworks like Kubebuilder: one doesn't put all CRDs based on that framework into the framework's repo.
@jlewi The current job controller code is a shared implementation and doesn't have a clear boundary between the common code and the operator implementations. Hence, I am not sure it yet qualifies as a separate Go library. However, I agree that adding to the pytorch repo is the least disruptive option for now. As you said, once we get more clarity, we can create a true controller interface. I will move this code into the pytorch repo for now.
@johnugeorge Makes sense. I would expect that at this point in time the boundary isn't particularly clear because this is our first attempt at unification. |
Moving the code to the pytorch repo.
This contains the pytorch v1alpha2 code that is consistent with TF. I verified it with a sample pytorch example. It is adapted from the TF controller, and more code can be shared in the future. Currently, the CRDs are kept completely independent.
#785