Handling cases where pod is stuck in Terminating state #576

qGentry · 2024-05-22T16:28:11Z

Hi, I was wondering how to properly handle cases where a worker pod is stuck in the Terminating state.
From my experience, this can happen in several cases:

Node was shut down during pod deletion
Kernel hang on node
GPU problems

From my quick experiments with JobSet, if a worker pod is stuck in the Terminating state, JobSet will not trigger a restart because it waits for the underlying pods to be terminated.
A quick workaround might be something like a CronJob that periodically force deletes JobSet-controlled pods that have been stuck in the Terminating state for more than N minutes, but this is suboptimal because you can no longer manually investigate what actually happened to the pod and why it got stuck in the Terminating state.
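
As a rough illustration of the selection step such a CronJob would need, here is a minimal sketch in Python. It works on plain dicts shaped like Pod JSON and does not call the Kubernetes API. The label key in `JOBSET_LABEL`, the helper names, and the idea of measuring "stuck" from `metadata.deletionTimestamp` are assumptions made for this example, not something JobSet provides.

```python
from datetime import datetime, timedelta, timezone

# Assumption: pods owned by a JobSet carry this label; adjust for your cluster.
JOBSET_LABEL = "jobset.sigs.k8s.io/jobset-name"


def parse_utc(ts):
    """Parse a UTC timestamp such as 2024-05-22T16:28:11Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_stuck_pods(pods, now, max_minutes):
    """Return names of labeled pods whose deletionTimestamp is older than max_minutes."""
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        deletion_ts = meta.get("deletionTimestamp")
        if deletion_ts is None:
            continue  # the pod is not being deleted
        if JOBSET_LABEL not in (meta.get("labels") or {}):
            continue  # not a JobSet-controlled pod (by our assumed label)
        if now - parse_utc(deletion_ts) > timedelta(minutes=max_minutes):
            stuck.append(meta["name"])
    return stuck
```

The function only reports candidates. Whether to force delete them, taint the node first, or leave them for investigation is the trade-off described above.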

It would be great if I could specify something like "podTerminationTimeout", after which JobSet would create a new Job without waiting for the previous pods to be terminated.

kannon92 · 2024-05-22T16:43:02Z

We created the PodReplacementPolicy in the job api for this reason.

It's a beta feature in 1.29 and will only recreate a pod once it is fully terminated.

https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated/README.md

Rereading this, I'm not sure this KEP would help here. It sounds like you want the Job to be marked as failed as soon as a pod goes to terminating.

qGentry · 2024-05-22T17:34:45Z

Yeah, I don't see how this KEP would help.
Actually, after a quick reading of this KEP, I think it would introduce the same problem I'm experiencing with JobSet in the vanilla k8s Job – it will never restart if a pod is stuck in the Terminating state.

danielvegamyhre · 2024-05-25T16:01:42Z

Yeah I have experienced pods staying in terminating state for a long time when doing training on TPUs as well. One way we could get around this is setting some timeout on the Job foreground deletion call, and then forcibly delete all pods once we hit that timeout.

However, this is not great since forcibly deleting the pod objects from etcd doesn't guarantee the underlying container process has been cleaned up - a problematic container process could still be holding a GPU/TPU resource for example, preventing a newly scheduled pod from using it.

qGentry · 2024-05-26T09:42:48Z

Totally agree with you. I'm currently using a hand-crafted Argo workflow for launching multi-node training, which also requires force deleting pods stuck in the Terminating state. That just deletes them from etcd and often leads to nodes silently behaving weirdly. I ended up tainting nodes before force deleting pods, which kind of works but is a really dirty hack.

That was actually the main reason why I wanted to find an alternative (like JobSet) for synchronous jobs, hoping that this problem would be solved already :)

One possible implementation that comes to mind (without the need to forcefully delete workers) is to name the Jobs created by JobSet with the attempt number, like
pytorch-workers-0-attempt-0/pytorch-workers-0-attempt-1/pytorch-workers-0-attempt-2/... (instead of pytorch-workers-0 for each attempt), and to provide a way to set a timeout in the JobSet spec for a Job's workers to terminate (defaulting to infinity for backward compatibility). If the time runs out, we just create a new Job with the attempt count increased by one and leave the previous Job hanging for further investigation, while the new workers can be scheduled onto free nodes and continue training.
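
To make the proposal concrete, here is a minimal sketch of the naming and timeout decision in Python. Both function names and the `termination_timeout` parameter are hypothetical. Nothing like this exists in the JobSet API as far as this thread establishes.

```python
def attempt_job_name(base_name, attempt):
    """Name a Job after its attempt, e.g. pytorch-workers-0-attempt-1."""
    return f"{base_name}-attempt-{attempt}"


def next_attempt(attempt, seconds_terminating, termination_timeout=None):
    """Decide which attempt number should be running.

    termination_timeout=None means "wait forever", which matches the
    current behavior and keeps the default backward compatible.
    """
    if termination_timeout is None:
        return attempt
    if seconds_terminating > termination_timeout:
        # Stop waiting: leave the old Job in place and move to a new attempt.
        return attempt + 1
    return attempt
```

For example, `attempt_job_name("pytorch-workers-0", next_attempt(0, 900, termination_timeout=600))` yields `pytorch-workers-0-attempt-1`, while the same call without a timeout keeps attempt 0.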

But at least one important problem I see here is the headless service. Since the pods for each attempt will be named differently, we have to force users to handle this in user code.

One possible approach would be to expose an env var, similar to rank, looking like this:

              - name: MASTER_NAME
                value: "pytorch-workers-0-0"
              - name: MASTER_CONTAINER
                value: "pytorch"
              - name: ATTEMPT
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['jobset.sigs.k8s.io/restart-attempt']

and setting torchrun --master_addr=$MASTER_NAME-attempt-$ATTEMPT.$MASTER_CONTAINER as the command.
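
The same string can also be assembled in user code instead of the shell. The sketch below is in Python and reads `MASTER_NAME`, `MASTER_CONTAINER`, and `ATTEMPT`, the three variables defined in the YAML fragment. The function name is hypothetical, and whether the resulting hostname resolves depends on how the headless service names pods, which is not verified here.

```python
import os


def build_master_addr(env=None):
    """Build the attempt-specific master address from environment variables."""
    env = os.environ if env is None else env
    name = env["MASTER_NAME"]            # e.g. "pytorch-workers-0-0"
    container = env["MASTER_CONTAINER"]  # e.g. "pytorch"
    attempt = env["ATTEMPT"]             # restart attempt, injected via fieldRef
    return f"{name}-attempt-{attempt}.{container}"
```

With the values from the fragment and `ATTEMPT=2`, this returns `pytorch-workers-0-0-attempt-2.pytorch`.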

kannon92 · 2024-05-29T14:34:31Z

If I understand this correctly, it sounds like you want the Job to be failed as soon as a pod goes into terminating. I see that we could implement recreation in JobSet, or we could allow a way to mark a Job as failed as soon as a pod goes to terminating.

@mimowo @alculquicondor any ideas here? JobSet only recreates Jobs once they have failed.

mimowo · 2024-05-29T14:46:50Z

I think a Pod stuck in terminating is something we should eliminate in the first place. Or, at least, we need to understand what the scenario is in order to propose the best approach.

Underneath JobSet the Pod is managed by the batch/Job controller, and there have been some fixes in recent k8s versions. For example, when the node is gone, the pod phase should be transitioned from Running to Failed by PodGC in k8s 1.26+.

What is your k8s version? Also, can you share your JobSet yaml, and the yaml for the stuck pod?

mimowo · 2024-05-29T14:48:38Z

Also, what does terminating actually mean in this case? Is the pod in phase Running and unable to transition to Failed, or is it already Failed, but with a finalizer that blocks the final deletion from the API server?
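
One way to answer this for a given stuck pod is to look at three fields of its JSON: `metadata.deletionTimestamp`, `status.phase`, and `metadata.finalizers`. The Python sketch below distinguishes the cases. The function name and the returned labels are made up for this example.

```python
def classify_terminating(pod):
    """Classify a pod dict (shaped like Pod JSON) by why it may look stuck."""
    meta = pod.get("metadata", {})
    if meta.get("deletionTimestamp") is None:
        return "not-terminating"

    phase = pod.get("status", {}).get("phase")
    finalizers = meta.get("finalizers") or []

    if phase in ("Failed", "Succeeded"):
        if finalizers:
            # Already terminal, but a finalizer blocks removal from the API server.
            return "terminal-blocked-by-finalizer"
        return "terminal-awaiting-removal"
    # Deletion was requested, but the pod has not reached a terminal phase.
    return "still-running"
```

Feeding it the stuck pod's JSON (for example from `kubectl get pod <name> -o json`) shows which of the two situations applies.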

alculquicondor · 2024-05-29T17:04:38Z

It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in JobSet.

alculquicondor · 2024-05-29T17:05:04Z

cc @SergeyKanzhelev

danielvegamyhre · 2024-07-01T00:16:33Z

> It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in Jobset.

Agreed. I want to prioritize this because it is actually particularly problematic for large scale distributed ML training workloads, as it can substantially increase e2e failure recovery latency. We use foreground deletion when deleting failed Jobs, to prevent exponential backoff of pod creation attempts when the pods from the previous Job iteration still exist. So when pods stay in terminating state, this prevents the JobSet controller from creating a new replacement Job until all pods are finally cleaned up, and only then can the rescheduling of all the new pods begin.

For the cases I've seen, I think it may be due to SIGTERM signal handlers in the training code which trigger auto-checkpointing logic on graceful shutdown, so at least terminationGracePeriodSeconds seconds can pass before the pod objects are actually deleted from etcd.
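
For readers unfamiliar with the pattern, a checkpoint-on-SIGTERM handler typically looks something like the Python sketch below. It is a generic illustration, not code from any of the affected workloads. `GracefulShutdown`, `train`, and `save_checkpoint` are hypothetical names.

```python
import signal


class GracefulShutdown:
    """Records that SIGTERM arrived so the training loop can react."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the slow work elsewhere.
        self.requested = True

    def install(self):
        signal.signal(signal.SIGTERM, self.handle)


def train(steps, shutdown, save_checkpoint):
    """Run up to `steps` steps, checkpointing and exiting early on shutdown."""
    for step in range(steps):
        if shutdown.requested:
            save_checkpoint(step)  # this is the part that can take a long time
            return step
        # ... one training step would run here ...
    return steps
```

The time spent inside `save_checkpoint` counts against the pod's grace period, which is why a slow checkpoint can keep the pod in the Terminating state for a while.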

I also wonder if the container process is not releasing the accelerator chip cleanly/quickly for some reason.

I will talk with some folks in SIG Node to get their take on this and try to drive a long-term solution for it.
