Worker failed without exit code #2124

Closed
w1uo01 opened this issue May 21, 2024 · 2 comments

w1uo01 commented May 21, 2024

I noticed that a TFJob can fail without an obvious reason. More details are as follows:

  1. From the training-operator log, I can see that a worker is ignored while checking active pods: Ignoring inactive pod kubeflow/tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8-worker-54 in state Running, deletion time 2024-05-20 21:10:28 +0000 UTC
  2. From the log I can see that the pod is actually still running, but it has a deletion timestamp (see the sketch after this list).
  3. The worker's restartPolicy is OnFailure and backoffLimit is 5, but I don't think this case is counted against the backoff.
  4. Because of this worker failure, we then see the following log: TFJob=kubeflow/tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8, ReplicaType=Worker expected=70, running=69, failed=1" job=kubeflow.tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8 uid=0e8f6d8e-9db0-4658-9064-dd1c3de246e0
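
To illustrate what I think is happening (this is only my guess at the logic, not the operator's actual code): a pod whose deletionTimestamp is set seems to be treated as inactive even while its phase is still Running, roughly like this:

```go
// Rough sketch of the kind of active-pod check I assume is involved;
// the function name is illustrative, not the operator's real code.
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// isActive would return false for our worker-54 pod: its phase is
// still Running, but DeletionTimestamp is already set, so it ends up
// being counted as inactive/failed.
func isActive(pod *corev1.Pod) bool {
	return pod.Status.Phase != corev1.PodSucceeded &&
		pod.Status.Phase != corev1.PodFailed &&
		pod.DeletionTimestamp == nil
}
```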

Can anyone help me understand why one worker pod could get a deletion timestamp? The job is managed by the training-operator and still has the status Running. We are definitely not deleting this pod manually. Not sure if it is related, but we also set the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: false on the pod.
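
For anyone trying to inspect the same thing, this is roughly the check I run to find pods that are still Running but already have a deletion timestamp (a minimal client-go sketch, assuming the default kubeconfig and the kubeflow namespace; adjust as needed):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig location; adjust for in-cluster use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pods, err := cs.CoreV1().Pods("kubeflow").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Print every pod that already has a deletion timestamp, along with its
	// phase and the safe-to-evict annotation we set on the job's pods.
	for _, p := range pods.Items {
		if p.DeletionTimestamp != nil {
			fmt.Printf("pod %s: phase=%s deletionTimestamp=%s safe-to-evict=%q\n",
				p.Name, p.Status.Phase, p.DeletionTimestamp,
				p.Annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"])
		}
	}
}
```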

In my experience this issue can happen at any time while the job is running; sometimes it doesn't show up until ~10 hours in, and sometimes it happens within about an hour. PS replicas can hit this issue too, e.g. I also see this log: Ignoring inactive pod kubeflow/tfjob-d1938a95-76fe-4af8-b6b4-4a347247f3c4-ps-19 in state Running, deletion time 2024-05-20 22:57:04 +0000 UTC

Could the training-operator add a deletion timestamp to a running pod? To be honest, I doubt it, since that doesn't sound reasonable to me. If not the training-operator, could this be related to the cluster itself? We are running our jobs on GKE.

Thanks for any input!

/kind question
/community question
/help wanted


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


github-actions bot commented Sep 9, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@github-actions github-actions bot closed this as completed Sep 9, 2024