Worker failed without exit code #2124

Closed
w1uo01 opened this issue May 21, 2024 · 2 comments

w1uo01 commented May 21, 2024

I noticed that a TFJob can fail without an obvious reason. More details are as follows:

  1. From the training-operator log, I can see that a worker is ignored while checking active pods: Ignoring inactive pod kubeflow/tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8-worker-54 in state Running, deletion time 2024-05-20 21:10:28 +0000 UTC
  2. From the log I can see that the pod is actually still running, but it has a deletion timestamp (see the sketch after this list).
  3. The worker's restartPolicy is OnFailure and backoffLimit is 5, but I don't think this case is counted against the backoff.
  4. Because of this worker failure, we then see the following log: TFJob=kubeflow/tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8, ReplicaType=Worker expected=70, running=69, failed=1" job=kubeflow.tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8 uid=0e8f6d8e-9db0-4658-9064-dd1c3de246e0
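
To illustrate what I think is happening (this is only my guess at the logic, not the operator's actual code): a pod whose deletionTimestamp is set seems to be treated as inactive even while its phase is still Running, roughly like this:

```go
// Rough sketch of the kind of active-pod check I assume is involved;
// the function name is illustrative, not the operator's real code.
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// isActive would return false for our worker-54 pod: its phase is
// still Running, but DeletionTimestamp is already set, so it ends up
// being counted as inactive/failed.
func isActive(pod *corev1.Pod) bool {
	return pod.Status.Phase != corev1.PodSucceeded &&
		pod.Status.Phase != corev1.PodFailed &&
		pod.DeletionTimestamp == nil
}
```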

Can anyone help me understand why one worker pod could get a deletion timestamp? The job is managed by the training-operator and still has the status Running. We are definitely not deleting this pod manually. Not sure if it is related, but we also set the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: false on the pod.
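
For anyone trying to inspect the same thing, this is roughly the check I run to find pods that are still Running but already have a deletion timestamp (a minimal client-go sketch, assuming the default kubeconfig and the kubeflow namespace; adjust as needed):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location; adjust for in-cluster use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pods, err := cs.CoreV1().Pods("kubeflow").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Print every pod that already has a deletion timestamp, along with its
	// phase and the safe-to-evict annotation we set on the job's pods.
	for _, p := range pods.Items {
		if p.DeletionTimestamp != nil {
			fmt.Printf("pod %s: phase=%s deletionTimestamp=%s safe-to-evict=%q\n",
				p.Name, p.Status.Phase, p.DeletionTimestamp,
				p.Annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"])
		}
	}
}
```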

In my experience this issue can happen at any time while the job is running; sometimes it doesn't show up until ~10 hours in, and sometimes it happens within about an hour. PS replicas can hit this issue too, e.g. I also see this log: Ignoring inactive pod kubeflow/tfjob-d1938a95-76fe-4af8-b6b4-4a347247f3c4-ps-19 in state Running, deletion time 2024-05-20 22:57:04 +0000 UTC

Could the training-operator add a deletion timestamp to a running pod? To be honest, I doubt it, since that doesn't sound reasonable to me. If not the training-operator, could this be related to the cluster itself? We are running our jobs on GKE.

Thanks for any input!

/kind question
/community question
/help wanted


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


github-actions bot commented Sep 9, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@github-actions github-actions bot closed this as completed Sep 9, 2024