This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

When restartPolicy is ExitCode and a pod is deleted (137), the entire TFJob will still be marked as failed. #186

Closed
rllin opened this issue Mar 21, 2022 · 10 comments

Comments

@rllin

rllin commented Mar 21, 2022

Expectation: if a replica has restartPolicy=ExitCode, then deleting a pod (which produces exit code 137) should cause that pod to restart without triggering a TFJob failure.
Reality: the entire TFJob fails.

However, the correct behavior happens with OnFailure, where the pod is properly restarted without the entire job failing.
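
For context, exit code 137 is 128 + 9 (SIGKILL), which is what a deleted container typically reports; an ExitCode restart policy is expected to treat such signal-induced exit codes as retryable. A minimal, self-contained sketch of that classification (illustrative only; the helper name and cutoff are assumptions, not the operator's actual code):

package main

import "fmt"

// isRetryableExitCode is an illustrative helper (not the operator's real
// implementation): exit codes >= 128 mean the container was killed by a
// signal (137 = 128 + SIGKILL), which an ExitCode restart policy should
// treat as retryable rather than as a permanent failure.
func isRetryableExitCode(exitCode int32) bool {
    return exitCode >= 128
}

func main() {
    fmt.Println(isRetryableExitCode(137)) // true: restart the pod
    fmt.Println(isRetryableExitCode(1))   // false: permanent error, fail the replica
}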

How to reproduce:

  1. Take the spec below [0] and apply it.
  2. Delete the evaluator pod.
  3. See that the entire job fails.
  4. Replace the Evaluator's restartPolicy with OnFailure, repeat steps 1-2, and see that the pod restarts without failing the job.

Suspected issue (may not be the core issue):

  • Every replica type is checked for failures
  • Regardless of replica type, if the replica has a failure and the job is not restarting, the job will be marked as failed
  • It seems like that L513 block should also NOT trigger if jobStatus.Conditions says the job is running? (see the sketch below)
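
A minimal sketch of the guard being suggested (illustrative only; the types and helpers below are assumptions for the example, not the actual tfjob_controller.go code): before a failed replica marks the whole job as failed, check whether the job-level conditions still report Running or Restarting.

package main

import "fmt"

// JobConditionType and JobCondition mirror the shape of the job conditions
// being discussed; all names here are illustrative.
type JobConditionType string

const (
    JobRunning    JobConditionType = "Running"
    JobRestarting JobConditionType = "Restarting"
)

type JobCondition struct {
    Type   JobConditionType
    Status bool
}

// hasCondition reports whether a condition of the given type is present and true.
func hasCondition(conds []JobCondition, t JobConditionType) bool {
    for _, c := range conds {
        if c.Type == t && c.Status {
            return true
        }
    }
    return false
}

func main() {
    conds := []JobCondition{{Type: JobRunning, Status: true}}
    failedPods := 1

    // The suggested guard: do not fail the whole job for a failed replica
    // while the job-level conditions still say Running (or Restarting).
    if failedPods > 0 && !hasCondition(conds, JobRunning) && !hasCondition(conds, JobRestarting) {
        fmt.Println("mark the TFJob as Failed")
    } else {
        fmt.Println("keep the TFJob running and restart the pod")
    }
}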

[0] TFJob spec.

apiVersion: kubeflow.org/v1
kind: TFJob
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          affinity: {}
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
    Evaluator:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
@rllin rllin changed the title from "When restartPolicy is ExitCode, the entire TFJob will still be marked as failed for deleted pods (137)" to "When restartPolicy is ExitCode and a pod is deleted (137), the entire TFJob will still be marked as failed." Mar 21, 2022
@rllin
Author

rllin commented Mar 21, 2022

extra information:

Chief works as expected:

  1. set restartPolicy of Chief to ExitCode
  2. delete chief pod
  3. job is still running and chief pod comes back as intended

Worker, however, does not work either; it behaves like Evaluator.

@gaocegege
Member

/cc @zw0610 @kubeflow/wg-training-leads

@cheimu
Member

cheimu commented Mar 23, 2022

Hi @rllin,
For busybox, I used /bin/sh to make it run; otherwise the pod gets a ContainerCannotRun error.

It is unrelated to the main issue.

@rllin
Author

rllin commented Mar 23, 2022

@cheimu sorry, that's my bad. I had to replace an internal command with the sleep, so I hadn't double-checked the command properly.

@cheimu
Member

cheimu commented Mar 23, 2022

Well, it's a very tricky problem. It took me a very long time to figure it out.

The call chain is as follows:

  1. jc.Controller.ReconcilePods(metaObject, &jobStatus, pods, rtype, spec, replicas) -> commonutil.UpdateJobConditions(): at this point the TFJob has a Restarting condition.
  2. jc.Controller.UpdateJobStatus(job, replicas, &jobStatus) -> commonutil.UpdateJobConditions(): HERE is where we hit the problem.

There is a for loop here; for @rllin's case it iterates twice, Chief first and then Evaluator. At iteration 1 the existing conditions are Created and Restarting and rtype is Chief, so it tries to "append" a new Running condition to the job, as shown in the figure. HOWEVER, in this case it is not really an append at all.
[screenshot omitted]

  3. commonutil.UpdateJobConditions() -> setCondition(jobStatus, condition) -> newConditions := filterOutCondition(status.Conditions, condition.Type):
    In our case the existing conditions are Created and Restarting and the new one is Running, so the Restarting condition gets skipped with continue (as shown in the figure) and is lost. The new job conditions become Created and Running, and then at iteration 2 (for Evaluator), because there is a failed pod while the job condition is Running, the job status becomes Failed as well.

[screenshot omitted]

Here the job is set to failed.
[screenshot omitted]
https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L526 is where the whole job is marked as failed....
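
A minimal, self-contained sketch of the filtering behavior described above (simplified and not copied from kubeflow/common; the names are abbreviated for the example): when a Running condition is set, an existing Restarting condition is skipped and therefore dropped.

package main

import "fmt"

type condType string

const (
    created    condType = "Created"
    running    condType = "Running"
    restarting condType = "Restarting"
)

// filterOut mimics the behavior described for filterOutCondition: while
// building the new condition list it skips ("continues" over) an existing
// Restarting condition when the condition being set is Running, so the
// Restarting state is lost.
func filterOut(conds []condType, newCond condType) []condType {
    var out []condType
    for _, c := range conds {
        if newCond == running && c == restarting {
            continue // the Restarting condition is dropped here
        }
        if c == newCond {
            continue // replaced by the updated condition appended below
        }
        out = append(out, c)
    }
    return append(out, newCond)
}

func main() {
    conds := []condType{created, restarting}
    fmt.Println(filterOut(conds, running)) // [Created Running]: Restarting is gone
}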

@zw0610 @gaocegege Hi experts, am I correct about filterOutCondition()? I think it is confusing...

@pavanky

pavanky commented Mar 23, 2022

This is in line with what we think the issue is. But we weren't sure whether the bugfix should go in UpdateJobConditions or whether line 504 in the second loop needs to check for JobRunning.

@cheimu
Member

cheimu commented Mar 23, 2022

This is in line with what we think the issue is. But we weren't sure whether the bugfix should go in UpdateJobConditions or whether line 504 in the second loop needs to check for JobRunning.

Yeah, probably.
I've tried recording the Restarting condition first and then appending it back; it passed the unit tests and worked for this situation. Please discuss whether there is a more elegant way to handle it! :D
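
A rough, illustrative shape of that workaround (not the actual patch; it builds on the simplified filterOut sketch above): remember whether a Restarting condition was present before the update and append it back afterwards.

package main

import "fmt"

type condType string

const (
    created    condType = "Created"
    running    condType = "Running"
    restarting condType = "Restarting"
)

// updateKeepingRestart is an illustrative take on the workaround: capture
// any existing Restarting condition before the update, run the (lossy)
// filtering, then append Restarting back if it was there before.
func updateKeepingRestart(conds []condType, newCond condType) []condType {
    hadRestarting := false
    for _, c := range conds {
        if c == restarting {
            hadRestarting = true
        }
    }

    var out []condType
    for _, c := range conds {
        if c == newCond || (newCond == running && c == restarting) {
            continue // same lossy filtering as before
        }
        out = append(out, c)
    }
    out = append(out, newCond)

    if hadRestarting && newCond != restarting {
        out = append(out, restarting) // restore the dropped condition
    }
    return out
}

func main() {
    conds := []condType{created, restarting}
    fmt.Println(updateKeepingRestart(conds, running)) // [Created Running Restarting]
}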

@gaocegege
Member

/cc @jian-he

Seems that the code is from @jian-he; would you mind having a look at this issue?

@rllin
Author

rllin commented Mar 24, 2022

@cheimu thanks for the quick turnaround!

@johnugeorge
Member

Fixed by kubeflow/training-operator#1562
