MPI Operator v1alpha2 API Design Proposal #92

Closed · terrytangyuan opened this issue Feb 11, 2019 · 10 comments

@terrytangyuan
Member

terrytangyuan commented Feb 11, 2019

Hi community,

I am proposing the design for the v1alpha2 API version of the MPI Operator. You are very welcome to join the discussion here if you have any questions, comments, concerns, or suggestions. Once we have a consensus from the community, we can start working on the individual items.

Here are the main API changes before we dive into the detailed API spec (not including specific implementations):

Below is the proposed API spec for v1alpha2:

type MPIJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              MPIJobSpec   `json:"spec,omitempty"`
	Status            MPIJobStatus `json:"status,omitempty"`
}

type MPIJobList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata"`
	Items           []MPIJob `json:"items"`
}

type MPIJobSpec struct {

	// Specifies the desired number of processing units the MPIJob should run on.
	// Mutually exclusive with the `Replicas` field.
	// +optional
	ProcessingUnits *int32 `json:"processingUnits,omitempty"`

	// The maximum number of processing units available per node.
	// Note that this will be ignored if the processing resources are explicitly
	// specified in the MPIJob pod spec.
	// +optional
	ProcessingUnitsPerNode *int32 `json:"processingUnitsPerNode,omitempty"`

	// The processing resource type, e.g. 'nvidia.com/gpu' or 'cpu'.
	// Defaults to 'nvidia.com/gpu'.
	// +optional
	ProcessingResourceType string `json:"processingResourceType,omitempty"`

	// Specifies the number of slots per worker used in hostfile.
	// Defaults to the number of processing units per worker.
	// +optional
	SlotsPerWorker *int32 `json:"slotsPerWorker,omitempty"`

	// Run the launcher on the master.
	// Defaults to false.
	// +optional
	LauncherOnMaster bool `json:"launcherOnMaster,omitempty"`

	// Specifies the number of retries before marking this job failed.
	// Defaults to 6.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

	// Specifies the duration in seconds relative to the start time that
	// the job may be active before the system tries to terminate it.
	// Note that this takes precedence over `BackoffLimit` field.
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`

	// Specifies the desired number of replicas the MPIJob should run on.
	// The `PodSpec` should specify the number of processing units.
	// Mutually exclusive with the `ProcessingUnits` field.
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// Describes the launcher pod that will be created when executing an MPIJob.
	LauncherSpec corev1.PodTemplateSpec `json:"launcherSpec,omitempty"`

	// Describes the worker pods that will be created when executing an MPIJob.
	WorkerSpec corev1.PodTemplateSpec `json:"workerSpec,omitempty"`
}

type MPIJobPodStatusType string

// The current observed state of the corresponding pod (either launcher or worker pods).
const (
	// Active means the corresponding pod is actively running.
	Active MPIJobPodStatusType = "Active"
	// Succeeded means the corresponding pod has succeeded.
	Succeeded MPIJobPodStatusType = "Succeeded"
	// Failed means the corresponding pod has failed its execution.
	Failed MPIJobPodStatusType = "Failed"
)


type MPIJobStatus struct {
	// Current status of the launcher job.
	// +optional
	LauncherStatus MPIJobPodStatusType `json:"launcherStatus,omitempty"`

	// Current statuses of the worker replicas.
	// +optional
	ReplicaStatuses []MPIJobPodStatusType `json:"replicaStatuses,omitempty"`

	// Represents time when the job was acknowledged by the job controller.
	// It is not guaranteed to be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// Represents time when the job was completed. It is not guaranteed to
	// be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}
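
To make this more concrete, here is a rough sketch of how an MPIJob could be constructed using the processing-units path (rather than the mutually exclusive `Replicas` path). The job name, image, and the `int32Ptr` helper below are hypothetical and purely illustrative; the snippet assumes the v1alpha2 types above are in scope, and the `LauncherSpec` is omitted for brevity.

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// int32Ptr is a small helper for the *int32 fields in the spec (hypothetical).
func int32Ptr(i int32) *int32 { return &i }

// exampleMPIJob asks for 8 GPUs in total, with at most 4 GPUs per node,
// using the proposed v1alpha2 fields. All values are illustrative.
func exampleMPIJob() MPIJob {
	return MPIJob{
		ObjectMeta: metav1.ObjectMeta{Name: "tensorflow-benchmarks"},
		Spec: MPIJobSpec{
			ProcessingUnits:        int32Ptr(8),
			ProcessingUnitsPerNode: int32Ptr(4),
			ProcessingResourceType: "nvidia.com/gpu",
			BackoffLimit:           int32Ptr(6),
			WorkerSpec: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "mpioperator/tensorflow-benchmarks:latest", // hypothetical image
					}},
				},
			},
			// LauncherSpec omitted for brevity.
		},
	}
}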

cc: @rongou @anfeng @jlewi @everpeace @gaocegege @Nivedita-V @madhukarkm @ywskycn @ScorpioCPH @jian-he @cheyang @richardsliu

Feel free to tag others who might be interested.

@richardsliu

For things like JobStatus, we should aim to have a common implementation across operators. Please see https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/common/v1beta2/common_types.go.

The PyTorch operator is a great example of using the common types and libraries.

@johnugeorge

@johnugeorge
Member

Yes. As @richardsliu suggested, it would be better to use the common JobStatus, and it would be easier to implement at this point. We are aiming to reach a point where all operators share a common JobStatus type so that other components can make use of it.

For the PyTorch operator, see https://github.com/kubeflow/pytorch-operator/blob/master/pkg/apis/pytorch/v1beta1/types.go#L41
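
Roughly, the shared status types in that common package look like the following (a simplified sketch; the linked common_types.go is the authoritative definition and may differ in detail):

// JobStatus represents the current observed state of a training job.
type JobStatus struct {
	// Conditions is an array of current observed job conditions.
	Conditions []JobCondition `json:"conditions"`

	// ReplicaStatuses specifies the status of each replica type (e.g. launcher, worker).
	ReplicaStatuses map[ReplicaType]*ReplicaStatus `json:"replicaStatuses"`

	// Represents the time when the job was acknowledged by the job controller.
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// Represents the time when the job was completed.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}

// ReplicaStatus holds aggregate pod counts for one replica type.
type ReplicaStatus struct {
	// The number of actively running pods.
	Active int32 `json:"active,omitempty"`

	// The number of pods that reached phase Succeeded.
	Succeeded int32 `json:"succeeded,omitempty"`

	// The number of pods that reached phase Failed.
	Failed int32 `json:"failed,omitempty"`
}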

@Jeffwan
Member

Jeffwan commented Feb 15, 2019

+1 on a common job status. I have recently been working on kubebench and found that a common job status makes it easier to orchestrate workflows without extra logic for tracking job state. Otherwise, I have to define the job-completion condition separately for each DL framework operator's JobStatus.

@terrytangyuan
Member Author

terrytangyuan commented Feb 18, 2019

Thanks @richardsliu @johnugeorge @Jeffwan for the suggestion. I totally agree that we can reuse JobStatus. Should we extract the common JobStatus into a shared location instead of importing it from tf-operator? I don't think it makes sense for pytorch-operator and mpi-operator to depend on tf-operator. I am more than happy to help move these types out to a common repo as part of the process.

@madhukarkm

+1 on using the common job status and spec. Looking more closely: (a) ReplicaStatus in both only lists the numbers of Active/Running, Succeeded, and Failed replicas; what about other states such as Pending, when some replicas are waiting for resources to be scheduled? (b) BackoffLimit would be a good candidate to move into the common RestartPolicy (but that is probably a broader change).
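
For example, suggestion (a) could be as small as adding a Pending counter next to the existing fields (purely illustrative; this is not part of the current common types):

// Illustrative only: an extended ReplicaStatus with a Pending counter
// for replicas that are waiting for resources to be scheduled.
type ReplicaStatus struct {
	Active    int32 `json:"active,omitempty"`
	Succeeded int32 `json:"succeeded,omitempty"`
	Failed    int32 `json:"failed,omitempty"`

	// Pending counts pods that have been created but not yet scheduled.
	Pending int32 `json:"pending,omitempty"`
}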

@johnugeorge
Member

@terrytangyuan we have been thinking about this for some time. It hasn't happened yet because of the effort involved. We also have a JobController that was designed to share features across operators; it should also be moved to a common repo. (See https://github.com/kubeflow/tf-operator/tree/master/pkg/common/jobcontroller )

@richardsliu Please add your thoughts too

@jian-he
Contributor

jian-he commented Feb 20, 2019

+1 to separating the common module out.
We also have a project (a DL framework) that is considering reusing the common job controller logic.

@terrytangyuan
Member Author

terrytangyuan commented Mar 5, 2019

Thanks for everyone's feedback! I've created PR #95 with the initial v1alpha2 MPIJob API spec based on the discussion here. Please take a look and let me know if there's anything else that needs to be addressed.

Note that I copied the common types from tf-operator for now since I don't believe mpi-operator should depend on tf-operator. We can switch to using a common repo once it's ready. We should continue this discussion. @richardsliu @jlewi please also add your thoughts on this.

@jlewi
Contributor

jlewi commented Mar 9, 2019

I'll defer to @richardsliu @johnugeorge since they have largely been driving operators these days.

@johnugeorge
Member

We can take this up after the 0.5 release.
