When restartPolicy is ExitCode and a pod is deleted (137), the entire TFJob will still be marked as failed. #1560

Closed
rllin opened this issue Mar 22, 2022 · 2 comments · Fixed by #1562

rllin commented Mar 22, 2022

Expectation: if a replica has restartPolicy: ExitCode, deleting one of its pods should restart just that pod without failing the TFJob. A deleted pod's container exits with code 137 (128 + SIGKILL), and under ExitCode, exit codes of 128 and above indicate a retryable error, so the controller should restart the pod.
Reality: the entire TFJob is marked as failed.

With restartPolicy: OnFailure the behavior is correct: the pod is restarted without the entire job failing.

How to reproduce (see the kubectl sketch after this list):

  1. Apply the TFJob spec below [0].
  2. Delete the Evaluator pod.
  3. Observe that the entire job fails.
  4. Change the Evaluator's restartPolicy to OnFailure, repeat steps 1-2, and observe that the pod restarts without failing the job.
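
A minimal kubectl sketch of the reproduction, assuming the spec in [0] is saved as tfjob.yaml with metadata.name: tfjob-exitcode-repro; the label key and the <job-name>-<replica-type>-<index> pod naming are assumptions that may differ by operator version:

kubectl apply -f tfjob.yaml
kubectl get pods -l training.kubeflow.org/job-name=tfjob-exitcode-repro    # wait until Running
kubectl delete pod tfjob-exitcode-repro-evaluator-0                        # container exits with code 137
kubectl get tfjob tfjob-exitcode-repro -o jsonpath='{.status.conditions}'  # job is reported Failed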

Suspected issue (may not be the core issue); see the Go sketch after this list:

  • Every replica type is checked for failures.
  • Regardless of replica type, if a replica has a failure and the job is not in a restarting state, the whole job is marked as failed.
  • It seems that the L513 block should also NOT trigger when jobStatus.Conditions says the job is still running?
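
For illustration only, here is a minimal, runnable Go sketch of the suspected control flow. The types and function names are hypothetical simplifications, not the actual kubeflow/common API:

package main

import "fmt"

// Hypothetical, simplified model of the failure check around L513 of the
// kubeflow/common job controller; names and fields are illustrative only.
type ReplicaStatus struct {
	Type   string
	Failed int // number of failed pods for this replica type
}

type JobStatus struct {
	Running    bool // a JobRunning condition is present
	Restarting bool // a JobRestarting condition is present
}

// reconcile mirrors the suspected logic: any failed replica fails the whole
// job unless the job is already restarting.
func reconcile(replicas []ReplicaStatus, status JobStatus) string {
	for _, r := range replicas {
		if r.Failed == 0 {
			continue
		}
		if status.Restarting {
			continue // a restart is in flight; do not fail the job
		}
		// Suspected bug: this branch fires even for restartPolicy: ExitCode
		// with a retryable exit code (137) while the job is still Running.
		// The proposed guard would also skip it when status.Running is true.
		return "Failed"
	}
	return "Running"
}

func main() {
	status := JobStatus{Running: true, Restarting: false}
	replicas := []ReplicaStatus{{Type: "Evaluator", Failed: 1}}
	fmt.Println(reconcile(replicas, status)) // prints "Failed": the whole job fails
}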

[0] TFJob spec.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-exitcode-repro  # example name, required for kubectl apply; any name works
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          affinity: {}
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
    Evaluator:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi

Extra information:

Chief works as expected:

  1. Set restartPolicy of Chief to ExitCode.
  2. Delete the Chief pod.
  3. The job is still running and the Chief pod comes back as intended.

Worker, however, misbehaves the same way as Evaluator.

rllin commented Mar 22, 2022

replicated from kubeflow/common#186

cheimu commented Mar 23, 2022

updated in kubeflow/common#186 (comment)
