When `restartPolicy` is `ExitCode` and a pod is deleted (137), the entire TFJob will still be marked as failed. #186
/cc @zw0610 @kubeflow/wg-training-leads
Hi @rllin, it is unrelated to the main issue.
@cheimu sorry, that's my bad, I had to replace an internal command with the
Well, it's a very tricky problem. It took me a very long time to figure it out. The call chain is as follows:
We have a for loop here; for @rllin's case, it iterates 2 times,
Here the job is set to
@zw0610 @gaocegege Hi experts, am I correct about
This is in line with what we think is the issue. But we weren't sure whether the bugfix should be in UpdateJobConditions or whether line 504 in the second loop needs to check for
Yeah, probably.
@cheimu thanks for the quick turnaround!
Fixed by kubeflow/training-operator#1562
Expectations: if a replica has `restartPolicy=ExitCode`, then a pod deletion (which triggers exit code `137`) should cause that pod to restart without triggering TFJob failure.

Reality: The entire TFJob fails.

However, the correct behavior happens for `OnFailure`, where the pod is properly restarted without the entire job failing.

How to reproduce:
With `restartPolicy` set to `OnFailure`, repeat steps 1 and 2 and see that the pod restarts without failing the job.

Suspected issue (may not be core issue):

[0] TFJob spec.
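
To make the expected behavior concrete, here is a minimal sketch of the decision this issue asks for. It is written in Python for illustration only and is not the operator's actual Go code. The names `expected_action`, `is_retryable_exit_code`, and `Action` are hypothetical. The set of retryable exit codes is an assumption based on the shell convention that a process killed by a signal exits with 128 plus the signal number, so 137 corresponds to SIGKILL.

```python
from enum import Enum

# Assumption: exit codes of the form 128 + signal number are treated as
# transient (retryable). 130 = SIGINT, 137 = SIGKILL, 143 = SIGTERM.
RETRYABLE_EXIT_CODES = {130, 137, 143}


class Action(Enum):
    """Hypothetical outcomes for a failed pod in a replica."""
    RESTART_POD = "restart_pod"
    FAIL_JOB = "fail_job"


def is_retryable_exit_code(exit_code: int) -> bool:
    """Return True if the exit code looks like a transient, signal-induced exit."""
    return exit_code in RETRYABLE_EXIT_CODES


def expected_action(restart_policy: str, exit_code: int) -> Action:
    """Sketch of the behavior the issue expects for a pod that exited non-zero.

    - OnFailure / Always: the pod is restarted and the job keeps running.
    - ExitCode: the pod is restarted only for retryable exit codes such as 137.
    - Anything else (e.g. Never): the job is marked as failed.
    """
    if restart_policy in ("OnFailure", "Always"):
        return Action.RESTART_POD
    if restart_policy == "ExitCode" and is_retryable_exit_code(exit_code):
        return Action.RESTART_POD
    return Action.FAIL_JOB


if __name__ == "__main__":
    # A deleted pod (exit code 137) under ExitCode should be restarted,
    # matching what already happens under OnFailure.
    print(expected_action("ExitCode", 137))
    print(expected_action("OnFailure", 137))
    # A non-retryable exit code under ExitCode is expected to fail the job.
    print(expected_action("ExitCode", 1))
```

Under this sketch, the reported bug corresponds to the job being marked as failed even though `expected_action("ExitCode", 137)` says the pod should simply be restarted.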