fix: tfjob with restartPolicy=ExitCode not work #1562

Merged
29 changes: 22 additions & 7 deletions pkg/controller.v1/tensorflow/tfjob_controller.go
@@ -409,6 +409,20 @@ func (r *TFJobReconciler) UpdateJobStatus(job interface{}, replicas map[commonv1
r.WorkQueue.AddAfter(tfJobKey, time.Duration(*tfJob.Spec.RunPolicy.ActiveDeadlineSeconds)*time.Second)
}
}

// If jobStatus already has a restarting condition and a running condition is appended,
// the restarting condition will be removed from jobStatus by commonv1.filterOutCondition(),
// so we need to record the existing restarting condition for later use.
var existingRestartingCondition *commonv1.JobCondition
for _, condition := range jobStatus.Conditions {
if condition.Type == commonv1.JobRestarting {
existingRestartingCondition = &commonv1.JobCondition{
Reason: condition.Reason,
Message: condition.Message,
}
}
}

// iterate the replica spec based on this order
allTypes := []commonv1.ReplicaType{
tensorflowv1.TFReplicaTypeChief,
@@ -499,14 +513,15 @@ func (r *TFJobReconciler) UpdateJobStatus(job interface{}, replicas map[commonv1
}

if failed > 0 {
restart := false
for _, condition := range jobStatus.Conditions {
if condition.Type == commonv1.JobRestarting {
restart = true
// If jobStatus already has a restarting condition and a new running condition is appended,
// the restarting condition will be removed from jobStatus by commonv1.filterOutCondition(),
// so we need to append the restarting condition back to jobStatus.
if existingRestartingCondition != nil {
err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRestarting, existingRestartingCondition.Reason, existingRestartingCondition.Message)

Contributor
Should we reset the value of existingRestartingCondition after this line?

Member Author
Hi @richardsliu, I'm not sure we can reset it. I record existingRestartingCondition outside the loop, so it captures the condition for the current reconcile(). Inside the loop, while iterating over all pods, if multiple workers, masters, etc. are running and only one worker has failed, then by the logic inside updateJobConditions() the restarting + running combination makes the running condition overwrite the restarting one.
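
A minimal, self-contained sketch of that overwrite behavior, using simplified stand-ins for the commonv1 types and for filterOutCondition (not the actual common library code):

package main

import "fmt"

// Simplified stand-ins for commonv1.JobConditionType / JobCondition / JobStatus.
type JobConditionType string

const (
	JobRunning    JobConditionType = "Running"
	JobRestarting JobConditionType = "Restarting"
)

type JobCondition struct {
	Type   JobConditionType
	Reason string
}

type JobStatus struct {
	Conditions []JobCondition
}

// appendCondition mimics the behavior described above: appending a Running
// condition filters out an existing Restarting condition (a rough sketch of
// what commonv1.filterOutCondition does, not its real implementation).
func appendCondition(status *JobStatus, c JobCondition) {
	var kept []JobCondition
	for _, existing := range status.Conditions {
		if existing.Type == c.Type {
			continue // the new condition replaces the old one of the same type
		}
		if c.Type == JobRunning && existing.Type == JobRestarting {
			continue // Running overwrites Restarting
		}
		kept = append(kept, existing)
	}
	status.Conditions = append(kept, c)
}

func main() {
	status := &JobStatus{}
	// One failed worker adds a Restarting condition first...
	appendCondition(status, JobCondition{Type: JobRestarting, Reason: "PodFailed"})
	// ...then a still-running replica adds a Running condition in the same reconcile.
	appendCondition(status, JobCondition{Type: JobRunning, Reason: "PodsRunning"})
	// Only the Running condition is left, which is why the PR records
	// existingRestartingCondition before iterating over the replicas.
	fmt.Println(status.Conditions)
}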

@richardsliu is this the right way to deal with it? If one of the pods is restarting, do we mark the whole job as restarting?

If we do something along the lines of the following, would it achieve the same thing?

if condition.Type == commonv1.JobRunning || condition.Type == commonv1.JobRestarting {
  restart = true
}

Member Author
Hi @pavanky, I would like to hear more advice :) For now, I just follow the current convention. See https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L762. The current strategy is that if one of the pods has failed, the whole job status is updated to failed :/
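
A toy illustration of that convention (illustrative names only, not the linked controller code): any replica type reporting a failed pod marks the whole job as failed.

package main

import "fmt"

// anyReplicaFailed mirrors the convention described above: if any replica type
// reports at least one failed pod, the whole job is treated as failed.
func anyReplicaFailed(failedPerReplicaType map[string]int32) bool {
	for _, failed := range failedPerReplicaType {
		if failed > 0 {
			return true
		}
	}
	return false
}

func main() {
	failed := map[string]int32{"Chief": 0, "Worker": 1, "PS": 0}
	fmt.Println(anyReplicaFailed(failed)) // true: a single failed worker fails the job
}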

@pavanky (Mar 25, 2022)
Wouldn't that technically mean we are updating the job status to restart and incrementing the count two times for the same failure ?

Member Author (@cheimu, Mar 25, 2022)
IMO, we can see it from two points of view.

  1. If jobStatus is running and one of the roles has failed pods, then the whole job will be updated to failed; in the next reconcile the pods will be deleted, so jobStatus should have a restarting condition (I'm not sure I understand the code base correctly: we don't have specific restarting logic but use the general reconcile logic).
  2. Technically we just set jobStatus's field twice without sending a request to the apiServer, so it doesn't really get updated twice. But yeah, the current approach is not elegant. 🤔

if err != nil {
commonutil.LoggerForJob(tfJob).Infof("Append tfjob condition error: %v", err)
return err
}
}

if restart {
// job is restarting, no need to set it failed
// we know it because we update the status condition when reconciling the replicas
trainingoperatorcommon.RestartedJobsCounterInc(tfJob.Namespace, tensorflowv1.FrameworkName)