Fully consolidate tf-operator to training-operator #1727

tenzen-y · 2023-01-17T05:51:38Z

We should fully consolidate tf-operator into training-operator, since the TFJob controller currently duplicates functions that already exist in kubeflow/common.

For example:

training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go

Lines 712 to 716 in ddf372c

    
    // In order to minimize the changes, we copy TFController's logic here to override kubeflow/commons reconcile logic
    // This should be removed later unless TF has specific logics there
    // reconcilePods checks and updates pods for each given TFReplicaSpec.
    // It will requeue the tfjob in case of an error while creating/deleting pods.
    func (r *TFJobReconciler) ReconcilePods(

tenzen-y · 2023-01-17T06:07:58Z

/kind feature

johnugeorge · 2023-01-17T06:59:56Z

Related: #1714

tenzen-y · 2023-05-07T15:06:43Z

/assign

tenzen-y · 2023-05-07T16:34:54Z

The ReconcilePods function behaves differently in TFJob and in the common repository in 3 ways:

1. Conditions under which the controller restarts failed pods

TFJob: restarts a failed pod in any of the following cases:
1. The RestartPolicy is ExitCode and the exit code is retryable.
2. The RestartPolicy is OnFailure.
3. The RestartPolicy is Always.

training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go

Lines 802 to 806 in 8a066f9

    
    // Check if the pod is retryable.
    if pod.Status.Phase == v1.PodFailed &&
        (spec.RestartPolicy == commonv1.RestartPolicyExitCode && train_util.IsRetryableExitCode(exitCode) ||
            spec.RestartPolicy == commonv1.RestartPolicyOnFailure ||
            spec.RestartPolicy == commonv1.RestartPolicyAlways) {

common repository: restarts a failed pod only when the RestartPolicy is ExitCode and the exit code is retryable.

https://github.com/kubeflow/common/blob/fdb9739e01be18ce57ad6db140d75a82315aa60c/pkg/reconciler.v1/common/pod.go#L182-L184
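To make the difference in point 1 concrete, here is a minimal, self-contained sketch of the two restart predicates. The types and the `isRetryableExitCode` helper are local stand-ins written for illustration, not the real `commonv1` or `train_util` code, and the rule that exit codes of 128 and above are retryable is an assumption.

```go
package main

import "fmt"

// RestartPolicy is a local stand-in for the policy names used in this thread,
// not the real commonv1 type.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// isRetryableExitCode is a hypothetical stand-in for train_util.IsRetryableExitCode.
// Assumption for illustration: exit codes of 128 and above are retryable.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}

// tfJobShouldRestart follows the TFJob condition quoted above.
func tfJobShouldRestart(podFailed bool, policy RestartPolicy, exitCode int32) bool {
	return podFailed &&
		((policy == RestartPolicyExitCode && isRetryableExitCode(exitCode)) ||
			policy == RestartPolicyOnFailure ||
			policy == RestartPolicyAlways)
}

// commonShouldRestart follows the common repository behavior as described above.
func commonShouldRestart(podFailed bool, policy RestartPolicy, exitCode int32) bool {
	return podFailed && policy == RestartPolicyExitCode && isRetryableExitCode(exitCode)
}

func main() {
	policies := []RestartPolicy{
		RestartPolicyAlways, RestartPolicyOnFailure, RestartPolicyNever, RestartPolicyExitCode,
	}
	// Compare both predicates for a failed pod with a retryable exit code.
	for _, p := range policies {
		fmt.Printf("policy=%-9s tfjob=%v common=%v\n",
			p, tfJobShouldRestart(true, p, 137), commonShouldRestart(true, p, 137))
	}
}
```

Under these assumptions the two predicates disagree only for OnFailure and Always, which is exactly the gap described in point 1.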

2. Report metrics to Prometheus when pods are restarted

TFJob: increments the restarted-jobs counter (RestartedJobsCounterInc).

training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go

Line 822 in 8a066f9

    
           trainingoperatorcommon.RestartedJobsCounterInc(tfJob.Namespace, kubeflowv1.TFJobFrameworkName)

common repository: increments FailedCount instead.

https://github.com/kubeflow/common/blob/fdb9739e01be18ce57ad6db140d75a82315aa60c/pkg/reconciler.v1/common/pod.go#L185
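The effect of point 2 on the metrics can be sketched with plain in-memory counters. Everything here is hypothetical: `jobMetrics` stands in for the Prometheus counters, the two `recordRestart*` functions only model the behavior described above, and the framework name "tensorflow" is an assumed value for `kubeflowv1.TFJobFrameworkName`.

```go
package main

import "fmt"

// jobKey identifies one counter series by namespace and framework.
type jobKey struct {
	Namespace string
	Framework string
}

// jobMetrics is a hypothetical in-memory stand-in for the Prometheus counters.
type jobMetrics struct {
	restarted map[jobKey]int
	failed    map[jobKey]int
}

func newJobMetrics() *jobMetrics {
	return &jobMetrics{
		restarted: make(map[jobKey]int),
		failed:    make(map[jobKey]int),
	}
}

// recordRestartTFJob models the TFJob behavior: a restart increments the restarted counter.
func recordRestartTFJob(m *jobMetrics, namespace string) {
	// "tensorflow" is an assumed framework name used only for this sketch.
	m.restarted[jobKey{Namespace: namespace, Framework: "tensorflow"}]++
}

// recordRestartCommon models the common repository behavior described above:
// the same event increments the failed counter.
func recordRestartCommon(m *jobMetrics, namespace, framework string) {
	m.failed[jobKey{Namespace: namespace, Framework: framework}]++
}

func main() {
	m := newJobMetrics()
	recordRestartTFJob(m, "default")
	recordRestartCommon(m, "default", "tensorflow")

	k := jobKey{Namespace: "default", Framework: "tensorflow"}
	fmt.Printf("restarted=%d failed=%d\n", m.restarted[k], m.failed[k])
}
```

With the common behavior, a job that is restarted and later succeeds would still show up in the failed series, which is why the TFJob variant looks like the more accurate one.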

3. Update Job Condition when the controller restarts failed pods

TFJob: updates Job condition.

training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go

Lines 812 to 817 in 8a066f9

    
    // with common library framework, we have to handle restart status here
    // or we won't know which replica has been restarted in updateJobStatus after reconciling all replicas
    msg := fmt.Sprintf("TFJob %s is restarting because %s replica(s) failed.",
        tfJob.Name, rtype)
    r.Recorder.Event(tfJob, corev1.EventTypeWarning, tfJobRestartingReason, msg)
    err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRestarting, tfJobRestartingReason, msg)

common repository: does not update the Job condition.

None
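As a rough illustration of point 3, the sketch below records a Restarting condition on a job status when a replica is restarted. The types, the `setCondition` helper, and the reason string are simplified assumptions made for this example; the real `commonutil.UpdateJobConditions` does more than replace-or-append. Only the message format is taken from the snippet above.

```go
package main

import "fmt"

// JobConditionType and JobCondition are simplified local stand-ins,
// not the real commonv1 types.
type JobConditionType string

const JobRestarting JobConditionType = "Restarting"

type JobCondition struct {
	Type    JobConditionType
	Reason  string
	Message string
}

type JobStatus struct {
	Conditions []JobCondition
}

// restartingReason is an assumed value for tfJobRestartingReason.
const restartingReason = "TFJobRestarting"

// setCondition replaces an existing condition of the same type or appends a new one.
// This is a deliberately simplified stand-in for commonutil.UpdateJobConditions.
func setCondition(status *JobStatus, cond JobCondition) {
	for i := range status.Conditions {
		if status.Conditions[i].Type == cond.Type {
			status.Conditions[i] = cond
			return
		}
	}
	status.Conditions = append(status.Conditions, cond)
}

// markRestarting builds the restart message and stores it as a Restarting condition.
func markRestarting(status *JobStatus, jobName, rtype string) string {
	msg := fmt.Sprintf("TFJob %s is restarting because %s replica(s) failed.", jobName, rtype)
	setCondition(status, JobCondition{Type: JobRestarting, Reason: restartingReason, Message: msg})
	return msg
}

func main() {
	status := &JobStatus{}
	markRestarting(status, "mnist", "Worker")
	for _, c := range status.Conditions {
		fmt.Printf("%s (%s): %s\n", c.Type, c.Reason, c.Message)
	}
}
```

Without a step like `markRestarting`, the status would carry no record of which replica type triggered the restart, which matches the concern in the code comment above.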

tenzen-y · 2023-05-07T16:37:11Z

I think the TFJob behavior is the correct one on all three points.

@kubeflow/wg-training-leads @zw0610 WDYT?

tenzen-y · 2023-05-15T17:21:53Z

@kubeflow/wg-training-leads @zw0610 Please let me know what you think.

johnugeorge · 2023-05-15T19:55:50Z

@tenzen-y Sorry for the late response.

Thanks for pointing this out. Yes, the TFJob implementation is the correct one on all 3 points. What changes do you propose for common? This needs to be fixed, either before or after common is merged into the training-operator repo.

tenzen-y · 2023-05-15T20:04:50Z

@johnugeorge Thank you for the confirmation.

> What changes do you propose for common? This needs to be fixed, either before or after common is merged into the training-operator repo.

Yes, we should fix that. I would bring the common repo in line with the TFJob implementation on all 3 points.
For example, I will change https://github.com/kubeflow/common/blob/fdb9739e01be18ce57ad6db140d75a82315aa60c/pkg/reconciler.v1/common/pod.go#L182-L184 to the following:

training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go

Lines 802 to 806 in 8a066f9

    
    // Check if the pod is retryable.
    if pod.Status.Phase == v1.PodFailed &&
        (spec.RestartPolicy == commonv1.RestartPolicyExitCode && train_util.IsRetryableExitCode(exitCode) ||
            spec.RestartPolicy == commonv1.RestartPolicyOnFailure ||
            spec.RestartPolicy == commonv1.RestartPolicyAlways) {

However, I'm on the fence about whether we should fix that before the merge of common.

Fixing it after common is merged might be safer.

What do you think?

johnugeorge · 2023-05-15T20:07:44Z

Sounds good. I have started working on common merge changes.

tenzen-y · 2023-05-15T20:10:46Z

> Sounds good. I have started working on common merge changes.

Great! Please let me know when you create a PR to merge common into this repository. I will review the PR.

tenzen-y · 2023-07-05T18:38:18Z

After more work, I found that the ReconcilePods of the JobController differs from TFJob only in points 2 (Report metrics to Prometheus when pods are restarted) and 3 (Update Job Condition when the controller restarts failed pods).

Point 1 (Conditions under which the controller restarts failed pods) applies only to the ReconcilePods of the JobReconciler.

google-oss-prow bot added the kind/feature label Jan 17, 2023

tenzen-y mentioned this issue May 2, 2023

Pod name using generated name kubeflow/common#215

Closed

google-oss-prow bot assigned tenzen-y May 7, 2023

tenzen-y mentioned this issue Jul 5, 2023

Fully consolidate tfjob-operator to training-operator #1850

Merged

1 task

google-oss-prow bot closed this as completed in #1850 Jul 5, 2023
