MPI Operator v1alpha2 API Design Proposal #92

Closed · terrytangyuan opened this issue Feb 11, 2019 · 10 comments

@terrytangyuan
Member

terrytangyuan commented Feb 11, 2019

Hi community,

I am proposing the design for the v1alpha2 API version of the MPI Operator. You are very welcome to join the discussion here if you have any questions, comments, concerns, or suggestions. Once we have a consensus from the community, we can start working on the individual items.

Here are the main API changes before we dive into the detailed API spec (not including specific implementations):

Below is the proposed API spec for v1alpha2:

type MPIJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              MPIJobSpec   `json:"spec,omitempty"`
	Status            MPIJobStatus `json:"status,omitempty"`
}

type MPIJobList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata"`
	Items           []MPIJob `json:"items"`
}

type MPIJobSpec struct {

	// Specifies the desired number of processing units the MPIJob should run on.
	// Mutually exclusive with the `Replicas` field.
	// +optional
	ProcessingUnits *int32 `json:"processingUnits,omitempty"`

	// The maximum number of processing units available per node.
	// Note that this will be ignored if the processing resources are explicitly
	// specified in the MPIJob pod spec.
	// +optional
	ProcessingUnitsPerNode *int32 `json:"processingUnitsPerNode,omitempty"`

	// The processing resource type, e.g. 'nvidia.com/gpu' or 'cpu'.
	// Defaults to 'nvidia.com/gpu'.
	// +optional
	ProcessingResourceType string `json:"processingResourceType,omitempty"`

	// Specifies the number of slots per worker used in hostfile.
	// Defaults to the number of processing units per worker.
	// +optional
	SlotsPerWorker *int32 `json:"slotsPerWorker,omitempty"`

	// Run the launcher on the master.
	// Defaults to false.
	// +optional
	LauncherOnMaster bool `json:"launcherOnMaster,omitempty"`

	// Specifies the number of retries before marking this job failed.
	// Defaults to 6.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

	// Specifies the duration in seconds relative to the start time that
	// the job may be active before the system tries to terminate it.
	// Note that this takes precedence over `BackoffLimit` field.
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`

	// Specifies the desired number of replicas the MPIJob should run on.
	// The `PodSpec` should specify the number of processing units.
	// Mutually exclusive with the `ProcessingUnits` field.
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// Describes the launcher pod that will be created when executing an MPIJob.
	LauncherSpec corev1.PodTemplateSpec `json:"launcherSpec,omitempty"`

	// Describes the worker pods that will be created when executing an MPIJob.
	WorkerSpec corev1.PodTemplateSpec `json:"workerSpec,omitempty"`
}

type MPIJobPodStatusType string

// The current observed state of the corresponding pod (either launcher or worker pods).
const (
	// Active means the corresponding pod is actively running.
	Active MPIJobPodStatusType = "Active"
	// Succeeded means the corresponding pod has succeeded.
	Succeeded MPIJobPodStatusType = "Succeeded"
	// Failed means the corresponding pod has failed its execution.
	Failed MPIJobPodStatusType = "Failed"
)


type MPIJobStatus struct {
	// Current status of the launcher job.
	// +optional
	LauncherStatus MPIJobPodStatusType `json:"launcherStatus,omitempty"`

	// Current statuses of the worker replicas.
	// +optional
	ReplicaStatuses []MPIJobPodStatusType `json:"replicaStatuses,omitempty"`

	// Represents time when the job was acknowledged by the job controller.
	// It is not guaranteed to be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// Represents time when the job was completed. It is not guaranteed to
	// be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}
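
To make this more concrete, here is a rough sketch of how an MPIJob could be constructed using the processing-units path (rather than the mutually exclusive `Replicas` path). The job name, image, and the `int32Ptr` helper below are hypothetical and purely illustrative; the snippet assumes the v1alpha2 types above are in scope, and the `LauncherSpec` is omitted for brevity.

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// int32Ptr is a small helper for the *int32 fields in the spec (hypothetical).
func int32Ptr(i int32) *int32 { return &i }

// exampleMPIJob asks for 8 GPUs in total, with at most 4 GPUs per node,
// using the proposed v1alpha2 fields. All values are illustrative.
func exampleMPIJob() MPIJob {
	return MPIJob{
		ObjectMeta: metav1.ObjectMeta{Name: "tensorflow-benchmarks"},
		Spec: MPIJobSpec{
			ProcessingUnits:        int32Ptr(8),
			ProcessingUnitsPerNode: int32Ptr(4),
			ProcessingResourceType: "nvidia.com/gpu",
			BackoffLimit:           int32Ptr(6),
			WorkerSpec: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "mpioperator/tensorflow-benchmarks:latest", // hypothetical image
					}},
				},
			},
			// LauncherSpec omitted for brevity.
		},
	}
}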

cc: @rongou @anfeng @jlewi @everpeace @gaocegege @Nivedita-V @madhukarkm @ywskycn @ScorpioCPH @jian-he @cheyang @richardsliu

Feel free to tag others who might be interested.

@richardsliu

For things like JobStatus, we should aim to have a common implementation across operators. Please see https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/common/v1beta2/common_types.go.

The PyTorch operator is a great example of using the common types and libraries.

@johnugeorge

@johnugeorge
Member

Yes. As @richardsliu suggested, it would be better to use the common JobStatus, and it would be easier to implement at this point. We are aiming to reach a point where all operators share a common JobStatus type so that other components can make use of it.

For the PyTorch operator, see https://github.com/kubeflow/pytorch-operator/blob/master/pkg/apis/pytorch/v1beta1/types.go#L41
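
Roughly, the shared status types in that common package look like the following (a simplified sketch; the linked common_types.go is the authoritative definition and may differ in detail):

// JobStatus represents the current observed state of a training job.
type JobStatus struct {
	// Conditions is an array of current observed job conditions.
	Conditions []JobCondition `json:"conditions"`

	// ReplicaStatuses specifies the status of each replica type (e.g. launcher, worker).
	ReplicaStatuses map[ReplicaType]*ReplicaStatus `json:"replicaStatuses"`

	// Represents the time when the job was acknowledged by the job controller.
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// Represents the time when the job was completed.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}

// ReplicaStatus holds aggregate pod counts for one replica type.
type ReplicaStatus struct {
	// The number of actively running pods.
	Active int32 `json:"active,omitempty"`

	// The number of pods that reached phase Succeeded.
	Succeeded int32 `json:"succeeded,omitempty"`

	// The number of pods that reached phase Failed.
	Failed int32 `json:"failed,omitempty"`
}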

@Jeffwan
Member

Jeffwan commented Feb 15, 2019

+1 on a common job status. I have recently been working on kubebench and found that a common job status makes it easier to orchestrate workflows without extra logic for tracking job state. Otherwise, I have to define the job-completion condition separately for each DL framework operator's JobStatus.

@terrytangyuan
Member Author

terrytangyuan commented Feb 18, 2019

Thanks @richardsliu @johnugeorge @Jeffwan for the suggestion. I totally agree that we can reuse JobStatus. Should we extract the common JobStatus into a shared location instead of importing it from tf-operator? I don't think it makes sense for pytorch-operator and mpi-operator to depend on tf-operator. I am more than happy to help move these types out to a common repo as part of the process.

@madhukarkm

+1 on using the common job status and spec. Looking more closely: (a) ReplicaStatus in both only lists the numbers of Active/Running, Succeeded, and Failed replicas; what about other states such as Pending, when some replicas are waiting for resources to be scheduled? (b) BackoffLimit would be a good candidate to move into the common RestartPolicy (but that is probably a broader change).
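
For example, suggestion (a) could be as small as adding a Pending counter next to the existing fields (purely illustrative; this is not part of the current common types):

// Illustrative only: an extended ReplicaStatus with a Pending counter
// for replicas that are waiting for resources to be scheduled.
type ReplicaStatus struct {
	Active    int32 `json:"active,omitempty"`
	Succeeded int32 `json:"succeeded,omitempty"`
	Failed    int32 `json:"failed,omitempty"`

	// Pending counts pods that have been created but not yet scheduled.
	Pending int32 `json:"pending,omitempty"`
}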

@johnugeorge
Member

@terrytangyuan we have been thinking about this for some time. It hasn't happened yet because of the effort involved. We also have a JobController that was designed to share features across operators; it should also be moved to a common repo. (See https://github.com/kubeflow/tf-operator/tree/master/pkg/common/jobcontroller )

@richardsliu Please add your thoughts too

@jian-he
Contributor

jian-he commented Feb 20, 2019

+1 to separating the common module out.
We also have a project (a DL framework) that is considering reusing the common job controller logic.

@terrytangyuan
Member Author

terrytangyuan commented Mar 5, 2019

Thanks for everyone's feedback! I've created PR #95 with the initial v1alpha2 MPIJob API spec based on the discussion here. Please take a look and let me know if there's anything else that needs to be addressed.

Note that I copied the common types from tf-operator for now since I don't believe mpi-operator should depend on tf-operator. We can switch to using a common repo once it's ready. We should continue this discussion. @richardsliu @jlewi please also add your thoughts on this.

@jlewi
Contributor

jlewi commented Mar 9, 2019

I'll defer to @richardsliu @johnugeorge since they have largely been driving operators these days.

@johnugeorge
Member

We can take this up after the 0.5 release.
