I noticed that a TFJob can fail for no apparent reason. More details are as follows:
From the training-operator log, I can see that a worker is ignored while active pods are being counted:

Ignoring inactive pod kubeflow/tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8-worker-54 in state Running, deletion time 2024-05-20 21:10:28 +0000 UTC

From this log I can see that the pod is actually still running, but it has a deletion timestamp.
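For anyone who wants to reproduce the check, something like this should print the deletion timestamp and any finalizers directly from the pod object (pod name taken from the log above):

```sh
# Print the pod's deletion timestamp and any finalizers that may be
# delaying its removal (pod name taken from the operator log above).
kubectl get pod tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8-worker-54 -n kubeflow \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```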
The workers' restartPolicy is OnFailure and the job's backoffLimit is 5, but I don't think this failure is counted against the backoff limit in this case.
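For reference, those settings can be confirmed on the job object roughly as follows; the field paths (spec.tfReplicaSpecs.Worker.restartPolicy and spec.runPolicy.backoffLimit) follow the training-operator v1 API, so adjust them if your version differs:

```sh
# Print the configured restart policy and backoff limit from the TFJob
# (field paths assume the training-operator v1 API; job name from the logs above).
kubectl get tfjob tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8 -n kubeflow \
  -o jsonpath='{.spec.tfReplicaSpecs.Worker.restartPolicy}{"\n"}{.spec.runPolicy.backoffLimit}{"\n"}'
```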
Because of this worker failure, we then see the following log:

TFJob=kubeflow/tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8, ReplicaType=Worker expected=70, running=69, failed=1" job=kubeflow.tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8 uid=0e8f6d8e-9db0-4658-9064-dd1c3de246e0
Can anyone help me understand why one worker pod could get a deletion timestamp? The job is managed by the training-operator and still has status Running, and we are definitely not deleting the pod manually. Not sure if it is related, but we also set the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: false on the pods.
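In case it helps narrow things down, recent events on the affected pod can be listed as below; an eviction, preemption, or node drain would normally leave a trace here:

```sh
# List recent events for the affected pod, sorted by time; evictions,
# preemptions, and node drains usually show up here.
kubectl get events -n kubeflow --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8-worker-54
```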
In my experience this issue can happen at any time while the job is running: sometimes it doesn't show up until ~10 hours in, while other times it happens within about an hour. PS pods can hit it too; for example, I also see this log:

Ignoring inactive pod kubeflow/tfjob-d1938a95-76fe-4af8-b6b4-4a347247f3c4-ps-19 in state Running, deletion time 2024-05-20 22:57:04 +0000 UTC
Could the training-operator itself add a deletion timestamp to a running pod? Honestly, I suspect that is not what happens, since it doesn't sound reasonable to me. If it isn't the training-operator, could this be related to the cluster itself? We are running our jobs on GKE.
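Since the cluster is on GKE, one way to find out which identity issued the delete is to query the Kubernetes API audit logs in Cloud Logging; a query along these lines should surface the DELETE request, with the caller recorded in protoPayload.authenticationInfo.principalEmail (filter values are illustrative, so adjust the pod name and time window):

```sh
# Search GKE's Kubernetes API audit logs for the DELETE request against the
# worker pod; the matching entry records which identity issued the deletion.
gcloud logging read '
  resource.type="k8s_cluster"
  protoPayload.methodName="io.k8s.core.v1.pods.delete"
  protoPayload.resourceName:"tfjob-0a9ddb52-192a-4186-83de-ee90153c26d8-worker-54"
' --limit 5 --format json
```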
Thanks for any input!
/kind question
/community question
/help wanted
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.