Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for a Common Operator #960

Closed
richardsliu opened this issue Mar 15, 2019 · 29 comments
Closed

Proposal for a Common Operator #960

richardsliu opened this issue Mar 15, 2019 · 29 comments

Comments

@richardsliu
Copy link
Contributor

Proposal

Move common API types and lower-level libraries to a new, common repository.

Motivation

TFJob is currently in v1beta1 (v1beta2 after 0.5.0) and is fairly stable. Its common types and libraries are being used in other operators like pytorch and MPI.

As we continue to grow and support other distributed training frameworks, it makes sense to refactor and make these common types available in a standalone repository. This has the following advantages:

  • Reducing code duplication
  • Other operators do not need to import tf-operator as a dependency
  • Providing a cleaner pattern for future operators
  • More convenient for long-term API governance
  • Preserves the independence of each operator (as opposed to introducing a single monolithic operator for all frameworks)

Details

Create a common-operator repository, consisting of the following directories from tf-operator:

pkg/api/common
pkg/common
pkg/control
pkg/logger
pkg/util

If possible, py/kubeflow (containing test methods) can also be moved to common.

For the common API, we can introduce a new type:

type RunPolicy struct {
	// CleanPodPolicy defines the policy to kill pods after TFJob is
	// succeeded.
	// Default to Running.
	CleanPodPolicy *CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

	// TTLSecondsAfterFinished is the TTL to clean up tf-jobs (temporary
	// before kubernetes adds the cleanup controller).
	// It may take extra ReconcilePeriod seconds for the cleanup, since
	// reconcile gets called periodically.
	// Default to infinite.
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
 
	// Specifies the duration in seconds relative to the startTime that the job may be active
	// before the system tries to terminate it; value must be positive integer
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`

	// Optional number of retries before marking this job failed.
	// Defaults to 6
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

        // Not implemented today - see https://github.com/kubeflow/tf-operator/issues/916#issuecomment-458729706
        SchedulingPolicy *SchedulingPolicy `json:"schedulingPolicy,omitempty"`
}

This will be included as part of Job specs. So after this refactoring, a job spec should contain just replica types and other framework-specific details:

type TFJobSpec struct {
	// Common run policy
	RunPolicy *common.RunPolicy `json:"runPolicy,omitempty"`

	// TFReplicaSpecs is map of TFReplicaType and ReplicaSpec
	// specifies the TF replicas to run.
	// For example,
	//   {
	//     "PS": ReplicaSpec,
	//     "Worker": ReplicaSpec,
	//   }
	TFReplicaSpecs map[TFReplicaType]*common.ReplicaSpec `json:"tfReplicaSpecs"`
}

This way the operators do not need to duplicate common functionalities. But the operators are still loosely-coupled enough such that they do not have to rely on a single implementation.

How does this sound?

@gaocegege @johnugeorge @terrytangyuan @cheyang @k82cn

@terrytangyuan
Copy link
Member

Looks great to me in general! I really like the new introduced type RunPolicy. Also are we missing JobStatus here which is what initially triggered our further discussion on the common operator?

@richardsliu
Copy link
Contributor Author

@terrytangyuan JobStatus is already in TFJob, so there should not be any changes:

type TFJob struct {
	metav1.TypeMeta `json:",inline"`

	// Standard object's metadata.
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Specification of the desired behavior of the TFJob.
	Spec TFJobSpec `json:"spec,omitempty"`

	// Most recently observed status of the TFJob.
	// This data may not be up to date.
	// Populated by the system.
	// Read-only.
	Status common.JobStatus `json:"status,omitempty"`
}

@terrytangyuan
Copy link
Member

terrytangyuan commented Mar 15, 2019

Thanks. Sounds good then.

@johnugeorge
Copy link
Member

Yes. As we discussed, we can wait till
#958 and
#954 are merged.

Since this would be a breaking API change, should we target in 0.5 itself? We can refactor API but code can moved later?

@gaocegege
Copy link
Member

SGTM

I think it will help us a lot. One thing that I care about is if the common operator will be a real operator or just a repository to store the common-maintained CRD APIs.

@richardsliu
Copy link
Contributor Author

@johnugeorge This will be after 0.5, and after those 2 PRs are merged.

@gaocegege I think it is cleaner for it to be just a repository for common APIs. What do you think?

@johnugeorge
Copy link
Member

@richardsliu Since it would be a breaking API change, won't this need one more release before v1?

@gaocegege
Copy link
Member

@richardsliu SGTM, this is much cleaner.

@k82cn
Copy link
Collaborator

k82cn commented Mar 15, 2019

+1 to this proposal!

Actually, we already have such a common operator for those framework; and we plan to open source it recently (I can give a demo in weekly meeting if it's ok). It will try to support other frameworks; not only ML frameworks, but also some others, e.g. bigdata. And this operator will work with kube-batch for batch scheduling capability. I'm thihking whether we can work together on that.

@ScorpioCPH
Copy link
Member

+1 for common operator, and for common API, can we abstract more? such as define a common master/slave model for distributed jobs.

@richardsliu
Copy link
Contributor Author

@k82cn @ScorpioCPH I have considered further abstractions to make the operator common for distributed jobs, but it can become difficult to keep one implementation across different frameworks. For example the MPI operator has framework-specific details that do not apply to TF or PyTorch. I think it will be cleaner and safer to just refactor the most shared and least controversial elements (such as RunPolicy) for now. However I would be interested to see how the common operator works.

@k82cn
Copy link
Collaborator

k82cn commented Mar 16, 2019

but it can become difficult to keep one implementation across different frameworks

We already have something there, run TF and MPI job by one operator/controller; maybe I can give a demo later, and discuss our next step. WDYT?

@ScorpioCPH
Copy link
Member

@richardsliu SGTM, let's achieve the final goal step by step :)

@richardsliu
Copy link
Contributor Author

@k82cn Thanks for the information you shared, we would be glad to see a demo. Will you be available on Mar 27 at 8:30 a.m. Beijing time (our community call in a week)?

@richardsliu
Copy link
Contributor Author

richardsliu commented Mar 18, 2019

@k82cn My current thought is that for the upcoming v1.0 release, we (kubeflow) will limit the scope of changes to just abstracting the common APIs and types. This will help stabilize our patterns across different training frameworks without introducing significant changes.

Meanwhile, I would like to see how volcano develops in parallel, in particular its batch-scheduling functionalities. Post v1, we can look at options for using volcano to introduce batch scheduling support to our common operators. How does that sound?

@richardsliu
Copy link
Contributor Author

The common repository is created now: https://github.com/kubeflow/common. I thought about naming it api to follow the conventions of Kubernetes, but unlike https://github.com/kubernetes/api/ which contains only types and generated Go files, we also have common libraries like the job controller.

Directory structure will be mapped as follows:

pkg/api/common/<version> -->   common/operator/<version>
pkg/control              -->   common/job_controller [1]
pkg/common/jobcontroller -->   common/job_controller
pkg/common/util/<version>/unstructured --> common/unstructured/<version>
pkg/common/util/<version>/testutil     --> common/test_util/<version>
pkg/common/logger         -->     common/util [2]
pkg/common/util           -->     common/util

[1] This is originally its own package called control. But I only see it being used by jobcontroller, so I think it can be merged.
[2] Combined into util since there's not much on its own.

@richardsliu
Copy link
Contributor Author

@johnugeorge @terrytangyuan @gaocegege
A couple of open questions:

  1. What should the initial version be? TFJob and PytorchJob are current in v1beta2 while MPI is in v1alpha2. Should we just start with v1?
  2. What to do about dependency management? Other kubeflow components like kfctl are moving towards using Go modules (https://github.com/golang/go/wiki/Modules) but it requires Golang 1.11+.

@terrytangyuan
Copy link
Member

terrytangyuan commented Mar 29, 2019

@johnugeorge @terrytangyuan @gaocegege
A couple of open questions:

  1. What should the initial version be? TFJob and PytorchJob are current in v1beta2 while MPI is in v1alpha2. Should we just start with v1?

Yes v1 sounds good to me.

  1. What to do about dependency management? Other kubeflow components like kfctl are moving towards using Go modules (https://github.com/golang/go/wiki/Modules) but it requires Golang 1.11+.

I am fine with Go modules. It seems like the majority of kubeflow projects doesn't use Go modules yet (only kfp uses Go modules from a quick glance) though. MPI operator should be fine with Golang 1.11+.

@johnugeorge
Copy link
Member

Sounds good to me.

@k82cn
Copy link
Collaborator

k82cn commented Mar 30, 2019

Post v1, we can look at options for using volcano to introduce batch scheduling support to our common operators. How does that sound?

That's great to work together on the batch scheduling part :)

@ScorpioCPH
Copy link
Member

@richardsliu kubebuilder provides powerful libraries and tools to build operator, we have use it to build some operators internal :)

@merlintang
Copy link

This would be very useful for other types of operators. I have one question here, can we provide the document or example for other operator to use the common operator? this would help new users to begin.

@richardsliu
Copy link
Contributor Author

@merlintang Agree, we should create an example to demonstrate how to use the common APIs. Meanwhile, @jian-he has committed this PR which defined the common interfaces to be implemented by custom operators: kubeflow/common#12.

@jlewi
Copy link
Contributor

jlewi commented Apr 22, 2019

@richardsliu What is the remaining work to close out this issue?

@jian-he
Copy link
Contributor

jian-he commented Apr 22, 2019

@jlewi
The common repo still needs a couple of patches to go in and also requires documentation and examples.

@richardsliu
Copy link
Contributor Author

We also need a plan to migrate the existing operators (tf, pytorch, mxnet, etc) to the common library.

@richardsliu
Copy link
Contributor Author

After discussion with contributors, we are postponing the migration of TF and Pytorch to the common library. Instead, a new operator (e.g. XGBoost or MPI) can start using it first, which gives us sufficient time to find all the issue. So this issue will no longer be blocking for TFJob 1.0.

@merlintang
Copy link

merlintang commented May 9, 2019 via email

@gaocegege
Copy link
Member

I think we can close the issue now. We already have the repo for the common operator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants