When restartPolicy is ExitCode and a pod is deleted (137), the entire TFJob will still be marked as failed. #1560

Closed
rllin opened this issue Mar 22, 2022 · 2 comments · Fixed by #1562

rllin commented Mar 22, 2022

Expectation: if a replica has restartPolicy: ExitCode, deleting one of its pods should restart just that pod without failing the TFJob. A deleted pod's container exits with code 137 (128 + SIGKILL), and under ExitCode, exit codes of 128 and above indicate a retryable error, so the controller should restart the pod.
Reality: the entire TFJob is marked as failed.

With restartPolicy: OnFailure the behavior is correct: the pod is restarted without the entire job failing.

How to reproduce (see the kubectl sketch after this list):

  1. Apply the TFJob spec below [0].
  2. Delete the Evaluator pod.
  3. Observe that the entire job fails.
  4. Change the Evaluator's restartPolicy to OnFailure, repeat steps 1-2, and observe that the pod restarts without failing the job.
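
A minimal kubectl sketch of the reproduction, assuming the spec in [0] is saved as tfjob.yaml with metadata.name: tfjob-exitcode-repro; the label key and the <job-name>-<replica-type>-<index> pod naming are assumptions that may differ by operator version:

kubectl apply -f tfjob.yaml
kubectl get pods -l training.kubeflow.org/job-name=tfjob-exitcode-repro    # wait until Running
kubectl delete pod tfjob-exitcode-repro-evaluator-0                        # container exits with code 137
kubectl get tfjob tfjob-exitcode-repro -o jsonpath='{.status.conditions}'  # job is reported Failed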

Suspected issue (may not be the core issue); see the Go sketch after this list:

  • Every replica type is checked for failures.
  • Regardless of replica type, if a replica has a failure and the job is not in a restarting state, the whole job is marked as failed.
  • It seems that the L513 block should also NOT trigger when jobStatus.Conditions says the job is still running?
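
For illustration only, here is a minimal, runnable Go sketch of the suspected control flow. The types and function names are hypothetical simplifications, not the actual kubeflow/common API:

package main

import "fmt"

// Hypothetical, simplified model of the failure check around L513 of the
// kubeflow/common job controller; names and fields are illustrative only.
type ReplicaStatus struct {
	Type   string
	Failed int // number of failed pods for this replica type
}

type JobStatus struct {
	Running    bool // a JobRunning condition is present
	Restarting bool // a JobRestarting condition is present
}

// reconcile mirrors the suspected logic: any failed replica fails the whole
// job unless the job is already restarting.
func reconcile(replicas []ReplicaStatus, status JobStatus) string {
	for _, r := range replicas {
		if r.Failed == 0 {
			continue
		}
		if status.Restarting {
			continue // a restart is in flight; do not fail the job
		}
		// Suspected bug: this branch fires even for restartPolicy: ExitCode
		// with a retryable exit code (137) while the job is still Running.
		// The proposed guard would also skip it when status.Running is true.
		return "Failed"
	}
	return "Running"
}

func main() {
	status := JobStatus{Running: true, Restarting: false}
	replicas := []ReplicaStatus{{Type: "Evaluator", Failed: 1}}
	fmt.Println(reconcile(replicas, status)) // prints "Failed": the whole job fails
}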

[0] TFJob spec.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-exitcode-repro  # example name, required for kubectl apply; any name works
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          affinity: {}
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
    Evaluator:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi

Extra information:

Chief works as expected:

  1. Set restartPolicy of Chief to ExitCode.
  2. Delete the Chief pod.
  3. The job is still running and the Chief pod comes back as intended.

Worker, however, misbehaves the same way as Evaluator.

rllin commented Mar 22, 2022

replicated from kubeflow/common#186

cheimu commented Mar 23, 2022

updated in kubeflow/common#186 (comment)
