This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

When restartPolicy is ExitCode and a pod is deleted (137), the entire TFJob will still be marked as failed. #186

Closed
rllin opened this issue Mar 21, 2022 · 10 comments

Comments

@rllin

rllin commented Mar 21, 2022

Expectation: if a replica has restartPolicy=ExitCode, then deleting a pod (which produces exit code 137) should cause that pod to restart without triggering a TFJob failure.
Reality: the entire TFJob fails.

However, the correct behavior happens with OnFailure, where the pod is properly restarted without the entire job failing.
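
For context, exit code 137 is 128 + 9 (SIGKILL), which is what a deleted container typically reports; an ExitCode restart policy is expected to treat such signal-induced exit codes as retryable. A minimal, self-contained sketch of that classification (illustrative only; the helper name and cutoff are assumptions, not the operator's actual code):

package main

import "fmt"

// isRetryableExitCode is an illustrative helper (not the operator's real
// implementation): exit codes >= 128 mean the container was killed by a
// signal (137 = 128 + SIGKILL), which an ExitCode restart policy should
// treat as retryable rather than as a permanent failure.
func isRetryableExitCode(exitCode int32) bool {
    return exitCode >= 128
}

func main() {
    fmt.Println(isRetryableExitCode(137)) // true: restart the pod
    fmt.Println(isRetryableExitCode(1))   // false: permanent error, fail the replica
}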

How to reproduce:

  1. Take the spec below [0] and apply it.
  2. Delete the evaluator pod.
  3. See that the entire job fails.
  4. Replace the Evaluator's restartPolicy with OnFailure, repeat steps 1-2, and see that the pod restarts without failing the job.

Suspected issue (may not be the core issue):

  • Every replica type is checked for failures
  • Regardless of replica type, if the replica has a failure and the job is not restarting, the job will be marked as failed
  • It seems like that L513 block should also NOT trigger if jobStatus.Conditions says the job is running? (see the sketch below)
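
A minimal sketch of the guard being suggested (illustrative only; the types and helpers below are assumptions for the example, not the actual tfjob_controller.go code): before a failed replica marks the whole job as failed, check whether the job-level conditions still report Running or Restarting.

package main

import "fmt"

// JobConditionType and JobCondition mirror the shape of the job conditions
// being discussed; all names here are illustrative.
type JobConditionType string

const (
    JobRunning    JobConditionType = "Running"
    JobRestarting JobConditionType = "Restarting"
)

type JobCondition struct {
    Type   JobConditionType
    Status bool
}

// hasCondition reports whether a condition of the given type is present and true.
func hasCondition(conds []JobCondition, t JobConditionType) bool {
    for _, c := range conds {
        if c.Type == t && c.Status {
            return true
        }
    }
    return false
}

func main() {
    conds := []JobCondition{{Type: JobRunning, Status: true}}
    failedPods := 1

    // The suggested guard: do not fail the whole job for a failed replica
    // while the job-level conditions still say Running (or Restarting).
    if failedPods > 0 && !hasCondition(conds, JobRunning) && !hasCondition(conds, JobRestarting) {
        fmt.Println("mark the TFJob as Failed")
    } else {
        fmt.Println("keep the TFJob running and restart the pod")
    }
}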

[0] TFJob spec.

apiVersion: kubeflow.org/v1
kind: TFJob
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          affinity: {}
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
    Evaluator:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
@rllin rllin changed the title from "When restartPolicy is ExitCode, the entire TFJob will still be marked as failed for deleted pods (137)" to "When restartPolicy is ExitCode and a pod is deleted (137), the entire TFJob will still be marked as failed." Mar 21, 2022
@rllin
Author

rllin commented Mar 21, 2022

extra information:

Chief works as expected:

  1. set restartPolicy of Chief to ExitCode
  2. delete chief pod
  3. job is still running and chief pod comes back as intended

Worker, however, does not work either; it behaves like Evaluator.

@gaocegege
Member

/cc @zw0610 @kubeflow/wg-training-leads

@cheimu
Member

cheimu commented Mar 23, 2022

Hi @rllin,
For busybox, I used /bin/sh to make it run; otherwise the pod gets a ContainerCannotRun error.

It is unrelated to the main issue.

@rllin
Author

rllin commented Mar 23, 2022

@cheimu sorry, that's my bad. I had to replace an internal command with the sleep, so I hadn't double-checked the command properly.

@cheimu
Member

cheimu commented Mar 23, 2022

Well, it's a very tricky problem. It took me a very long time to figure it out.

The call chain is as follows:

  1. jc.Controller.ReconcilePods(metaObject, &jobStatus, pods, rtype, spec, replicas) -> commonutil.UpdateJobConditions(): at this point the TFJob has a Restarting condition.
  2. jc.Controller.UpdateJobStatus(job, replicas, &jobStatus) -> commonutil.UpdateJobConditions(): HERE is where we hit the problem.

There is a for loop here; for @rllin's case it iterates twice, Chief first and then Evaluator. At iteration 1 the existing conditions are Created and Restarting and rtype is Chief, so it tries to "append" a new Running condition to the job, as shown in the figure. HOWEVER, in this case it is not really an append at all.
[screenshot omitted]

  3. commonutil.UpdateJobConditions() -> setCondition(jobStatus, condition) -> newConditions := filterOutCondition(status.Conditions, condition.Type):
    In our case the existing conditions are Created and Restarting and the new one is Running, so the Restarting condition gets skipped with continue (as shown in the figure) and is lost. The new job conditions become Created and Running, and then at iteration 2 (for Evaluator), because there is a failed pod while the job condition is Running, the job status becomes Failed as well.

[screenshot omitted]

Here the job is set to failed.
[screenshot omitted]
https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L526 is where the whole job is marked as failed....
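
A minimal, self-contained sketch of the filtering behavior described above (simplified and not copied from kubeflow/common; the names are abbreviated for the example): when a Running condition is set, an existing Restarting condition is skipped and therefore dropped.

package main

import "fmt"

type condType string

const (
    created    condType = "Created"
    running    condType = "Running"
    restarting condType = "Restarting"
)

// filterOut mimics the behavior described for filterOutCondition: while
// building the new condition list it skips ("continues" over) an existing
// Restarting condition when the condition being set is Running, so the
// Restarting state is lost.
func filterOut(conds []condType, newCond condType) []condType {
    var out []condType
    for _, c := range conds {
        if newCond == running && c == restarting {
            continue // the Restarting condition is dropped here
        }
        if c == newCond {
            continue // replaced by the updated condition appended below
        }
        out = append(out, c)
    }
    return append(out, newCond)
}

func main() {
    conds := []condType{created, restarting}
    fmt.Println(filterOut(conds, running)) // [Created Running]: Restarting is gone
}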

@zw0610 @gaocegege Hi experts, am I correct about filterOutCondition()? I think it is confusing...

@pavanky

pavanky commented Mar 23, 2022

This is in line with what we think the issue is. But we weren't sure whether the bugfix should go in UpdateJobConditions or whether line 504 in the second loop needs to check for JobRunning.

@cheimu
Member

cheimu commented Mar 23, 2022

This is in line with what we think the issue is. But we weren't sure whether the bugfix should go in UpdateJobConditions or whether line 504 in the second loop needs to check for JobRunning.

Yeah, probably.
I've tried recording the Restarting condition first and then appending it back; it passed the unit tests and worked for this situation. Please discuss whether there is a more elegant way to handle it! :D
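
A rough, illustrative shape of that workaround (not the actual patch; it builds on the simplified filterOut sketch above): remember whether a Restarting condition was present before the update and append it back afterwards.

package main

import "fmt"

type condType string

const (
    created    condType = "Created"
    running    condType = "Running"
    restarting condType = "Restarting"
)

// updateKeepingRestart is an illustrative take on the workaround: capture
// any existing Restarting condition before the update, run the (lossy)
// filtering, then append Restarting back if it was there before.
func updateKeepingRestart(conds []condType, newCond condType) []condType {
    hadRestarting := false
    for _, c := range conds {
        if c == restarting {
            hadRestarting = true
        }
    }

    var out []condType
    for _, c := range conds {
        if c == newCond || (newCond == running && c == restarting) {
            continue // same lossy filtering as before
        }
        out = append(out, c)
    }
    out = append(out, newCond)

    if hadRestarting && newCond != restarting {
        out = append(out, restarting) // restore the dropped condition
    }
    return out
}

func main() {
    conds := []condType{created, restarting}
    fmt.Println(updateKeepingRestart(conds, running)) // [Created Running Restarting]
}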

@gaocegege
Member

/cc @jian-he

Seems that the code is from @jian-he; would you mind having a look at this issue?

@rllin
Author

rllin commented Mar 24, 2022

@cheimu thanks for the quick turnaround!

@johnugeorge
Member

Fixed by kubeflow/training-operator#1562
