-
Notifications
You must be signed in to change notification settings - Fork 712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for a Common Operator #960
Comments
Looks great to me in general! I really like the new introduced type |
@terrytangyuan
|
Thanks. Sounds good then. |
SGTM I think it will help us a lot. One thing that I care about is if the common operator will be a real operator or just a repository to store the common-maintained CRD APIs. |
@johnugeorge This will be after 0.5, and after those 2 PRs are merged. @gaocegege I think it is cleaner for it to be just a repository for common APIs. What do you think? |
@richardsliu Since it would be a breaking API change, won't this need one more release before v1? |
@richardsliu SGTM, this is much cleaner. |
+1 to this proposal! Actually, we already have such a common operator for those framework; and we plan to open source it recently (I can give a demo in weekly meeting if it's ok). It will try to support other frameworks; not only ML frameworks, but also some others, e.g. bigdata. And this operator will work with kube-batch for batch scheduling capability. I'm thihking whether we can work together on that. |
+1 for common operator, and for common API, can we abstract more? such as define a common master/slave model for distributed jobs. |
@k82cn @ScorpioCPH I have considered further abstractions to make the operator common for distributed jobs, but it can become difficult to keep one implementation across different frameworks. For example the MPI operator has framework-specific details that do not apply to TF or PyTorch. I think it will be cleaner and safer to just refactor the most shared and least controversial elements (such as RunPolicy) for now. However I would be interested to see how the common operator works. |
We already have something there, run TF and MPI job by one operator/controller; maybe I can give a demo later, and discuss our next step. WDYT? |
@richardsliu SGTM, let's achieve the final goal step by step :) |
@k82cn Thanks for the information you shared, we would be glad to see a demo. Will you be available on Mar 27 at 8:30 a.m. Beijing time (our community call in a week)? |
@k82cn My current thought is that for the upcoming v1.0 release, we (kubeflow) will limit the scope of changes to just abstracting the common APIs and types. This will help stabilize our patterns across different training frameworks without introducing significant changes. Meanwhile, I would like to see how volcano develops in parallel, in particular its batch-scheduling functionalities. Post v1, we can look at options for using volcano to introduce batch scheduling support to our common operators. How does that sound? |
The common repository is created now: https://github.com/kubeflow/common. I thought about naming it Directory structure will be mapped as follows:
[1] This is originally its own package called |
@johnugeorge @terrytangyuan @gaocegege
|
Yes v1 sounds good to me.
I am fine with Go modules. It seems like the majority of kubeflow projects doesn't use Go modules yet (only kfp uses Go modules from a quick glance) though. MPI operator should be fine with Golang 1.11+. |
Sounds good to me. |
That's great to work together on the batch scheduling part :) |
@richardsliu kubebuilder provides powerful libraries and tools to build operator, we have use it to build some operators internal :) |
This would be very useful for other types of operators. I have one question here, can we provide the document or example for other operator to use the common operator? this would help new users to begin. |
@merlintang Agree, we should create an example to demonstrate how to use the common APIs. Meanwhile, @jian-he has committed this PR which defined the common interfaces to be implemented by custom operators: kubeflow/common#12. |
@richardsliu What is the remaining work to close out this issue? |
@jlewi |
We also need a plan to migrate the existing operators (tf, pytorch, mxnet, etc) to the common library. |
After discussion with contributors, we are postponing the migration of TF and Pytorch to the common library. Instead, a new operator (e.g. XGBoost or MPI) can start using it first, which gives us sufficient time to find all the issue. So this issue will no longer be blocking for TFJob 1.0. |
XGBoost Operator based on common lib would be a better way to debug and
demonstrate.
…On Thu, May 9, 2019 at 10:53 AM Richard Liu ***@***.***> wrote:
After discussion with contributors, we are postponing the migration of TF
and Pytorch to the common library. Instead, a new operator (e.g. XGBoost or
MPI) can start using it first, which gives us sufficient time to find all
the issue. So this issue will no longer be blocking for TFJob 1.0.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#960 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAK5R6PS777EOB6EQ5HA66TPURQJJANCNFSM4G6V2D5A>
.
|
I think we can close the issue now. We already have the repo for the common operator. |
Proposal
Move common API types and lower-level libraries to a new, common repository.
Motivation
TFJob is currently in v1beta1 (v1beta2 after 0.5.0) and is fairly stable. Its common types and libraries are being used in other operators like pytorch and MPI.
As we continue to grow and support other distributed training frameworks, it makes sense to refactor and make these common types available in a standalone repository. This has the following advantages:
Details
Create a common-operator repository, consisting of the following directories from tf-operator:
If possible,
py/kubeflow
(containing test methods) can also be moved to common.For the common API, we can introduce a new type:
This will be included as part of Job specs. So after this refactoring, a job spec should contain just replica types and other framework-specific details:
This way the operators do not need to duplicate common functionalities. But the operators are still loosely-coupled enough such that they do not have to rely on a single implementation.
How does this sound?
@gaocegege @johnugeorge @terrytangyuan @cheyang @k82cn
The text was updated successfully, but these errors were encountered: