Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully consolidate tf-operator to training-operator #1727

Closed
tenzen-y opened this issue Jan 17, 2023 · 11 comments · Fixed by #1850
Closed

Fully consolidate tf-operator to training-operator #1727

tenzen-y opened this issue Jan 17, 2023 · 11 comments · Fixed by #1850
Assignees

Comments

@tenzen-y
Copy link
Member

tenzen-y commented Jan 17, 2023

We must fully consolidate tf-operator to training-operator since we use duplicated functions with kubeflow/common.

For example:

// In order to minimize the changes, we copy TFController's logic here to override kubeflow/commons reconcile logic
// This should be removed later unless TF has specific logics there
// reconcilePods checks and updates pods for each given TFReplicaSpec.
// It will requeue the tfjob in case of an error while creating/deleting pods.
func (r *TFJobReconciler) ReconcilePods(

@tenzen-y
Copy link
Member Author

/kind feature

@johnugeorge
Copy link
Member

Related: #1714

@tenzen-y
Copy link
Member Author

tenzen-y commented May 7, 2023

/assign

@tenzen-y
Copy link
Member Author

tenzen-y commented May 7, 2023

In the ReconcilePods function, we have 3 different behaviors between TFJob and a common repository:

1. Conditions whether the controller restart failed pods

  • TFJob: restart failed pods in the following cases:
    1. a pod is failed, the RestartPolicy is ExitCode, and the exit code is retryable.
    2. a pod is failed, and the RestartPolicy is OnFailure.
    3. a pod is failed, and the RestartPolicy is Always.

// Check if the pod is retryable.
if pod.Status.Phase == v1.PodFailed &&
(spec.RestartPolicy == commonv1.RestartPolicyExitCode && train_util.IsRetryableExitCode(exitCode) ||
spec.RestartPolicy == commonv1.RestartPolicyOnFailure ||
spec.RestartPolicy == commonv1.RestartPolicyAlways) {

  • common repository: restart failed pods only when a pod is failed, the RestartPolicy is ExitCode, and the exit code is retryable.

https://github.com/kubeflow/common/blob/fdb9739e01be18ce57ad6db140d75a82315aa60c/pkg/reconciler.v1/common/pod.go#L182-L184

2. Report metrics to Prometheus when pods are restarted

  • TFJob: increase RestartedCount.

trainingoperatorcommon.RestartedJobsCounterInc(tfJob.Namespace, kubeflowv1.TFJobFrameworkName)

  • common repository: increase FailedCount.

https://github.com/kubeflow/common/blob/fdb9739e01be18ce57ad6db140d75a82315aa60c/pkg/reconciler.v1/common/pod.go#L185

3. Update Job Condition when the controller restarts failed pods

  • TFJob: updates Job condition.

// with common library framework, we have to handle restart status here
// or we won't know which replica has been restarted in updateJobStatus after reconciling all replicas
msg := fmt.Sprintf("TFJob %s is restarting because %s replica(s) failed.",
tfJob.Name, rtype)
r.Recorder.Event(tfJob, corev1.EventTypeWarning, tfJobRestartingReason, msg)
err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRestarting, tfJobRestartingReason, msg)

  • common repository: doesn't update Job condition.

None

@tenzen-y
Copy link
Member Author

tenzen-y commented May 7, 2023

I think that the TFJob behavior is correct in all the different points.

@kubeflow/wg-training-leads @zw0610 WDYT?

@tenzen-y
Copy link
Member Author

@kubeflow/wg-training-leads @zw0610 Please let me know what you think.

@johnugeorge
Copy link
Member

johnugeorge commented May 15, 2023

@tenzen-y Sorry for late response.

Thanks for pointing out. Yes, TFJob implementation is the correct one for all 3 points. What changes do you propose for common? This needs to be fixed. Either before or after merge of common to training operator repo

@tenzen-y
Copy link
Member Author

@johnugeorge Thank you for the confirmation.

What changes do you propose for common? This needs to be fixed. Either before or after merge of common to training operator repo

Yes, we should fix that. I would adopt the difference of 3 points to TFJob implementations in the common repo.
For example, I will change https://github.com/kubeflow/common/blob/fdb9739e01be18ce57ad6db140d75a82315aa60c/pkg/reconciler.v1/common/pod.go#L182-L184 to

// Check if the pod is retryable.
if pod.Status.Phase == v1.PodFailed &&
(spec.RestartPolicy == commonv1.RestartPolicyExitCode && train_util.IsRetryableExitCode(exitCode) ||
spec.RestartPolicy == commonv1.RestartPolicyOnFailure ||
spec.RestartPolicy == commonv1.RestartPolicyAlways) {
.

However, I'm on the fence about whether we should fix that before the merge of common.

After the merge of common fixing that might be more safety.

What do you think?

@johnugeorge
Copy link
Member

Sounds good. I have started working on common merge changes.

@tenzen-y
Copy link
Member Author

Sounds good. I have started working on common merge changes.

Great! Please let me know if you create a PR to merge the common into this repository. I will review the PR.

@tenzen-y
Copy link
Member Author

tenzen-y commented Jul 5, 2023

Through more work, I found the ReconcilePods of the JobController have only 2. Report metrics to Prometheus when pods are restarted and 3. Update Job Condition when the controller restarts failed pods differences.

Regarding 1. Conditions whether the controller restart failed pods, it is only for the ReconcilePods of the JobReconciler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants