Handling cases where pod is stuck in Terminating state #576

Open
qGentry opened this issue May 22, 2024 · 10 comments
Comments

@qGentry

qGentry commented May 22, 2024

Hi, I was wondering how to properly handle cases where a worker pod is stuck in the Terminating state.
In my experience, this can happen in several cases:

  • The node was shut down during pod deletion
  • A kernel hang on the node
  • GPU problems

From my quick experiments with JobSet, if a worker pod gets stuck in the Terminating state, JobSet will not trigger a restart, because it waits for the underlying pods to be terminated.
A quick workaround might be something like a CronJob that periodically force-deletes JobSet-controlled pods that have been stuck in the Terminating state for more than N minutes. This is suboptimal, though, because you can no longer manually investigate what actually happened to the pod and why it got stuck in the Terminating state.
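
A minimal sketch of the selection logic such a CronJob could run, assuming pod objects shaped like the JSON returned by `kubectl get pods -o json`. The function names and the threshold are hypothetical, and the sketch only picks candidates; the force-delete call itself is left out. It measures age from `metadata.deletionTimestamp`, which (as far as I understand) already includes the pod's grace period.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp in the form the Kubernetes API serializes it."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_terminating(pods, max_age: timedelta, now: datetime):
    """Return names of pods whose deletionTimestamp lies more than max_age in the past.

    `pods` is a list of dicts with the usual `metadata` layout (assumption).
    """
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        deleted_at = meta.get("deletionTimestamp")
        if deleted_at is None:
            continue  # deletion was never requested for this pod
        if now - parse_k8s_time(deleted_at) > max_age:
            stuck.append(meta["name"])
    return stuck


if __name__ == "__main__":
    now = datetime(2024, 5, 22, 12, 0, 0, tzinfo=timezone.utc)
    pods = [
        {"metadata": {"name": "worker-0"}},
        {"metadata": {"name": "worker-1", "deletionTimestamp": "2024-05-22T11:30:00Z"}},
    ]
    print(stuck_terminating(pods, timedelta(minutes=10), now))
```

Keeping the selection separate from the deletion makes it possible to log or taint first, which matters given the investigation concern above.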

It would be great if I could specify something like "podTerminationTimeout", after which JobSet would create a new Job without waiting for the previous pods to be terminated.

@kannon92
Contributor

kannon92 commented May 22, 2024

We created the PodReplacementPolicy in the job api for this reason.

It’s a beta feature in 1.29 and will only recreate a pod once it is fully terminated.

https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated/README.md

Rereading this, I'm not sure this KEP would help here. It sounds like you want the Job to be marked as failed if a pod goes into Terminating.

@qGentry
Author

qGentry commented May 22, 2024

Yeah, I don't see how this KEP would help.
Actually, after a quick reading of this KEP, I think it would introduce the same problem I'm experiencing with JobSet in the vanilla k8s Job – it will never restart if a pod is stuck in the Terminating state.

@danielvegamyhre
Contributor

Yeah, I have experienced pods staying in the Terminating state for a long time when training on TPUs as well. One way we could get around this is to set a timeout on the Job foreground deletion call, and then forcibly delete all pods once we hit that timeout.

However, this is not great, since forcibly deleting the pod objects from etcd doesn't guarantee that the underlying container process has been cleaned up. A problematic container process could still be holding a GPU/TPU resource, for example, preventing a newly scheduled pod from using it.

@qGentry
Author

qGentry commented May 26, 2024

Totally agree with you. I'm currently using a hand-crafted Argo workflow to launch multi-node training, and it also has to force-delete pods stuck in the Terminating state. That only deletes them from etcd and often leaves nodes silently misbehaving. I ended up tainting nodes before force-deleting pods, which kind of works but is a really dirty hack.

That was actually the main reason I was looking for an alternative (like JobSet) for synchronous jobs, hoping that this problem would already be solved :)

One possible implementation that comes to mind (without needing to force-delete workers) is to include the attempt number in the name of each Job created by JobSet, like
pytorch-workers-0-attempt-0/pytorch-workers-0-attempt-1/pytorch-workers-0-attempt-2/... (instead of pytorch-workers-0 for every attempt), and to provide a way to set a timeout in the JobSet spec for a Job's workers to terminate (defaulting to infinity for backward compatibility). If the time runs out, we just create a new Job with the attempt count increased by one and leave the previous Job hanging for further investigation, while the new workers can be scheduled onto free nodes and continue training.
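
The proposal could be sketched roughly as follows. This is only an illustration of the decision rule described here, not JobSet code: `attempt_job_name`, `next_attempt`, and the `termination_timeout` parameter are hypothetical names, and `None` stands in for the "infinity" default.

```python
from typing import Optional


def attempt_job_name(base: str, attempt: int) -> str:
    """Build a per-attempt Job name such as pytorch-workers-0-attempt-1."""
    return f"{base}-attempt-{attempt}"


def next_attempt(
    base: str,
    attempt: int,
    seconds_terminating: float,
    termination_timeout: Optional[float],
) -> Optional[str]:
    """Return the name of the replacement Job to create, or None to keep waiting.

    termination_timeout=None models the backward-compatible default:
    wait indefinitely for the previous Job's pods to terminate.
    """
    if termination_timeout is None:
        return None
    if seconds_terminating < termination_timeout:
        return None
    # Timeout exceeded: leave the old Job in place and move to a new name.
    return attempt_job_name(base, attempt + 1)


if __name__ == "__main__":
    print(next_attempt("pytorch-workers-0", 0, 600.0, 300.0))
```

Because the old Job keeps its own name, nothing has to be force-deleted for the new attempt to be created.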

But at least one important problem I see here is the headless service. Since the pods for each attempt will be named differently, we would have to force users to handle this in user code.

One possible approach would be to add an env var, similar to rank, looking like this:

              - name: MASTER_NAME
                value: "pytorch-workers-0-0"
              - name: MASTER_CONTAINER
                value: "pytorch"
              - name: ATTEMPT
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['jobset.sigs.k8s.io/restart-attempt']

and setting torchrun --master_addr=$MASTER_NAME-attempt-$ATTEMPT.$MASTER_CONTAINER as the command.
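
For user code that prefers to build the address itself instead of relying on shell expansion, a small sketch of the same pattern. It assumes the three env vars from the snippet above are set on the container, and it reads the name from `MASTER_NAME`, the variable that snippet defines; `master_addr` is a hypothetical helper.

```python
import os
from typing import Mapping


def master_addr(env: Mapping[str, str] = os.environ) -> str:
    """Compose <MASTER_NAME>-attempt-<ATTEMPT>.<MASTER_CONTAINER> from env vars."""
    return f"{env['MASTER_NAME']}-attempt-{env['ATTEMPT']}.{env['MASTER_CONTAINER']}"


if __name__ == "__main__":
    example_env = {
        "MASTER_NAME": "pytorch-workers-0-0",
        "MASTER_CONTAINER": "pytorch",
        "ATTEMPT": "2",
    }
    print(master_addr(example_env))
```

A missing variable raises `KeyError`, which fails fast instead of producing a half-built hostname.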

@kannon92
Contributor

If I understand this correctly, it sounds like you want the Job to be marked as failed as soon as a pod goes into Terminating. I see that we could either implement recreation in JobSet or allow a way to mark a Job as failed as soon as a pod goes into Terminating.

@mimowo @alculquicondor any ideas here? JobSet only recreates Jobs once they have failed.

@mimowo
Contributor

mimowo commented May 29, 2024

I think a Pod stuck in Terminating is something we should eliminate in the first place. Or, at least, we need to understand what the scenario is in order to propose the best approach.

Underneath JobSet, the Pod is managed by the batch/Job controller, and there have been some fixes in recent k8s versions. For example, when the node is gone, the pod phase should be transitioned from Running to Failed by PodGC in k8s 1.26+.

What is your k8s version? Also, can you share your JobSet yaml, and the yaml for the stuck pod?

@mimowo
Contributor

mimowo commented May 29, 2024

Also, what does Terminating actually mean in this case? Is the pod in phase Running and unable to transition to Failed, or is it already in Failed, with a finalizer blocking the final deletion from the API server?
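
One way to tell these cases apart from a dumped pod object is sketched below. It assumes a dict shaped like `kubectl get pod -o json` output; the function name and the returned labels are made up for illustration.

```python
def classify_terminating(pod: dict) -> str:
    """Distinguish the 'Terminating' variants discussed above.

    A pod is displayed as Terminating once metadata.deletionTimestamp is set;
    what blocks its removal differs by phase and finalizers.
    """
    meta = pod.get("metadata", {})
    phase = pod.get("status", {}).get("phase")
    if "deletionTimestamp" not in meta:
        return "not-terminating"
    if phase in ("Succeeded", "Failed"):
        # Already terminal: only the API object is left.
        if meta.get("finalizers"):
            return "terminal-blocked-by-finalizer"
        return "terminal-awaiting-removal"
    # Deletion requested, but the pod has not reached a terminal phase.
    return "not-yet-terminal"


if __name__ == "__main__":
    pod = {
        "metadata": {
            "name": "worker-0",
            "deletionTimestamp": "2024-05-29T10:00:00Z",
            "finalizers": ["batch.kubernetes.io/job-tracking"],
        },
        "status": {"phase": "Failed"},
    }
    print(classify_terminating(pod))
```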

@alculquicondor

It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in Jobset.

@alculquicondor

cc @SergeyKanzhelev

@danielvegamyhre
Contributor

danielvegamyhre commented Jul 1, 2024

> It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in Jobset.

Agreed. I want to prioritize this because it is actually particularly problematic for large scale distributed ML training workloads, as it can substantially increase e2e failure recovery latency. We use foreground deletion when deleting failed Jobs, to prevent exponential backoff of pod creation attempts when the pods from the previous Job iteration still exist. So when pods stay in terminating state, this prevents the JobSet controller from creating a new replacement Job until all pods are finally cleaned up, and only then can the rescheduling of all the new pods begin.

For the cases I've seen, I think it may be due to SIGTERM signal handlers in the training code that trigger auto-checkpointing logic on graceful shutdown, so at least terminationGracePeriodSeconds seconds pass before the pod objects are actually deleted from etcd.
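
To illustrate the pattern being described, here is a minimal sketch of a training loop that checkpoints on SIGTERM. It is not taken from any particular training framework: `ShutdownFlag`, `install`, `train`, and the callback parameters are hypothetical. The point is that the process keeps running after SIGTERM until the checkpoint is written, so the pod stays in Terminating for that long.

```python
import signal


class ShutdownFlag:
    """Records that SIGTERM arrived; the handler itself does no heavy work."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        self.requested = True


def install(flag: ShutdownFlag) -> None:
    """Register the flag as the SIGTERM handler (call from the main thread)."""
    signal.signal(signal.SIGTERM, flag.handle)


def train(steps: int, flag: ShutdownFlag, train_step, save_checkpoint) -> int:
    """Run up to `steps` steps; on shutdown request, checkpoint and stop early.

    Returns the index of the first step that was not executed.
    """
    for step in range(steps):
        if flag.requested:
            # The slow part: the pod remains Terminating while this runs.
            save_checkpoint(step)
            return step
        train_step(step)
    return steps


if __name__ == "__main__":
    flag = ShutdownFlag()
    install(flag)
    train(3, flag, lambda s: None, lambda s: print(f"checkpoint at step {s}"))
```

If the checkpoint takes longer than the grace period, the kubelet sends SIGKILL, so the checkpoint duration is worth measuring against terminationGracePeriodSeconds.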

I also wonder if the container process is not releasing the accelerator chip cleanly/quickly for some reason.

I will talk with some folks in SIG Node to get their take on this and try to drive a long-term solution for it.
