Expectations: if a replica has `restartPolicy=ExitCode`, then a pod deletion (which triggers exit code `137`) should cause that pod to restart without triggering TFJob failure.

Reality: the entire TFJob fails.

However, the correct behavior happens for `OnFailure`, where the pod is properly restarted without the entire job failing.
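
For context on why a restart is expected: exit code `137` is 128 + 9, i.e. the container was killed by SIGKILL, and as far as I can tell the TFJob documentation lists it among the retryable exit codes for `ExitCode`. Below is a minimal sketch of that rule. The function names and the exact retryable set are my assumptions for illustration, not the operator's actual code; newer operator versions may instead treat every code from 128 upward as retryable.

```python
from typing import Optional

# Assumed retryable set, based on my reading of the TFJob docs:
# 130 (SIGINT), 137 (SIGKILL), 143 (SIGTERM), 138 (SIGUSR1, user-defined).
RETRYABLE_EXIT_CODES = frozenset({130, 137, 138, 143})


def signal_number(exit_code: int) -> Optional[int]:
    """Return the signal number encoded in a 128+N exit code, else None."""
    if 128 < exit_code <= 255:
        return exit_code - 128
    return None


def is_retryable_exit_code(exit_code: int) -> bool:
    """Hypothetical helper: True if the pod should be restarted under ExitCode."""
    return exit_code in RETRYABLE_EXIT_CODES


print(signal_number(137))           # 9, i.e. SIGKILL
print(is_retryable_exit_code(137))  # True
print(is_retryable_exit_code(1))    # False
```

Under this reading, deleting the evaluator pod should land in the retryable branch rather than fail the job.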
How to reproduce:

1. Take the spec below [0] and apply it.
2. Delete the evaluator pod.
3. See that the entire job fails.
4. Replace the evaluator pod's `restartPolicy` with `OnFailure`, repeat steps 1 and 2, and see that the pod restarts without failing the job.
Suspected issue (may not be core issue):
[0] TFJob spec.
Extra information:

- `Chief` works as expected when the `restartPolicy` of `Chief` is set to `ExitCode`.
- `Worker`, however, does not work either; it fails in the same way as `Evaluator`.