Enhancement: Marking a pending pod as failed after a certain timeout #113211

Open
kannon92 opened this issue Oct 20, 2022 · 52 comments
Labels
kind/feature, needs-triage, sig/apps, sig/node, wg/batch

Comments

@kannon92
Contributor

kannon92 commented Oct 20, 2022

What would you like to be added?

We would like to add configurable timeouts to allow pending pods to transition to failed.

It would be ideal if it could be configurable based on container events or messages rather than a catch all timeout.

A possible API could be as follows:

      - regexp: "Failed to pull image.*desc = failed to pull and unpack image"            # Suggests genuine problem with image name, no point in waiting around too long.
        gracePeriod: 90s
      - regexp: "Failed to pull image.*code = Unknown desc = Error response from daemon"  # Seen when image doesn't exist, no point in waiting around too long.
        gracePeriod: 90s

This can allow certain events/statuses to transition to failed.

Why is this needed?

In the Armada project, we found that for batch users it is useful to control how long Pods stay in Pending before they are marked as failed. This allows our scheduler to remove those pods from the cluster, making room for new pods to be scheduled. It also allows users to be notified of invalid pods so they can resubmit them with correct configurations.

We have discussed this idea in conjunction with the non-goals of https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures#non-goals and @alculquicondor requested that I create an issue.

@kannon92 added the kind/feature label Oct 20, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Oct 20, 2022
@k8s-ci-robot
Contributor

@kannon92: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kannon92
Contributor Author

/sig node
/wg batch

@k8s-ci-robot added the sig/node and wg/batch labels and removed the needs-sig label Oct 20, 2022
@alculquicondor
Member

/sig apps

regex is certainly not an API we would like to encourage, as the messages can change from one release to the next.

Is it fair to say that you would like to target these specific Pending scenarios?

  • Unschedulable pod (there is already a condition PodScheduled=false that can be targeted)
  • Pull image failures. We can probably add one. I only found one preceding issue about this, but there are probably others Fail pod with invalid image #111788

Are there other Pending scenarios you think are useful to target?

The additional requirement for the API is where to define the timeouts.

@k8s-ci-robot added the sig/apps label Oct 20, 2022
@alculquicondor
Member

maybe @smarterclayton has given this a thought already :)

@kannon92
Contributor Author

/sig apps

regex is certainly not an API we would like to encourage, as the messages can change from one release to the next.

Is it fair to say that you would like to target these specific Pending scenarios?

  • Unschedulable pod (there is already a condition PodScheduled=false that can be targeted)
  • Pull image failures. We can probably add one. I only found one preceding issue about this, but there are probably others Fail pod with invalid image #111788

Are there other Pending scenarios you think are useful to target?

The additional requirement for the API is where to define the timeouts.

So I asked around. Other areas we found useful to target are invalid ConfigMaps, invalid Secrets, and bad PVCs. You covered a lot of them in the Non-Goals of the KEP. We also have issues with the GPU plugin failing to initialize.

@alculquicondor
Member

I guess we can cover most of them in a single PodCondition FailedToStartup. We need sig-node input. We can try to get some feedback at kubecon or the next sig-node meeting.

@kannon92
Contributor Author

@alculquicondor Could you explain a bit more about why this couldn't be the responsibility of the Job Controller? The one thing that comes to mind is that this is a bigger problem than just for users scheduling Jobs. But I don't quite understand the distinction between the responsibilities of sig-node and sig-apps. Is it primarily because the kubelet is responsible for determining whether a Pod has failed?

@alculquicondor
Member

Is it primarily because the kubelet is responsible for determining whether a Pod has failed?

Yes.
Also, the Job failure policy is based on PodConditions, and there is no pod condition for all the scenarios you have described.

@sftim
Contributor

sftim commented Nov 7, 2022

Could we provide this functionality as an out-of-tree add-on to Kubernetes? Is there anything about this enhancement proposal that requires a change to Kubernetes itself?

@alculquicondor
Member

It depends. I'm not sure whether, in all the cases that @kannon92 mentioned, the Pod is stuck in the Pending phase. If so, yes, I think an out-of-tree controller could implement the functionality.

If the kubelet makes a transition of the pods to Failed, then we need kubelet changes, because we need to add a Pod condition atomically.

@alculquicondor
Member

Although, an important question is how can an out-of-tree controller detect these failures. Is there a stable Pod condition type or reason that can be monitored?

If the controller has to do regex on error messages, that can easily break from one release to another. So at the very least in upstream we should have consistent Pod conditions.

@kannon92
Contributor Author

kannon92 commented Nov 9, 2022

@alculquicondor and I followed up offline. I created a repo that goes through the common cases our batch users experience with the pending status.

I created a document to walk through these common use cases and to demonstrate what conditions, statuses and events these states give.

https://github.com/kannon92/PendingPodChecks/blob/main/README.md

@kannon92
Contributor Author

I did a little POC on this in my own fork: the PRs are kannon92#2 and kannon92#1.

In those PRs I added the ability for the Job controller to fail pods based on conditions for pending Pods. A brief review from @alculquicondor explained that we will need a much stricter API to deal with transient conditions. Transient conditions are ones that transition False -> True -> False throughout a pod's lifecycle. I don't think there is any guarantee that Pod conditions get set once and never touched again, so an API needs to take this into consideration.

Our hope is to bring up to sig-node the possibility of adding a new API to the kubelet that can terminate pods based on conditions. This API will need to match on conditions but also take transient condition setting into account.

In general I see two distinct feature requests to achieve this:

  1. A general condition that tells us whether the pod is stuck. https://github.com/kannon92/PendingPodChecks/blob/main/README.md

  2. An API that allows termination of Pods if they match a condition value for x amount of time.

Conditions already exist, so I believe 2 can be addressed separately from 1. Some of these conditions could be UnableToSchedule or PodHasNetwork (beta in 1.26), so it is possible to at least flesh out an API with these conditions; a rough sketch of what the check in 2 could look like is below.
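
As a rough illustration of request 2 (not a proposed Kubernetes API), the check itself can be expressed with existing client-go types by comparing a condition's LastTransitionTime against a timeout. The PodHasNetwork condition type and the 5-minute timeout used in main are assumptions for the example, and the condition name may differ in later releases.

```go
// Sketch only: check whether a Pod condition has held a given status for
// longer than a timeout, using types from k8s.io/api/core/v1.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// conditionHeldFor reports whether the pod has a condition of the given type
// and status whose last transition happened more than timeout ago.
func conditionHeldFor(pod *corev1.Pod, condType corev1.PodConditionType, status corev1.ConditionStatus, timeout time.Duration) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == condType && cond.Status == status {
			return time.Since(cond.LastTransitionTime.Time) > timeout
		}
	}
	return false
}

func main() {
	var pod corev1.Pod // in a real controller this would come from an informer/lister
	// "PodHasNetwork" is the beta condition mentioned above (illustrative).
	if conditionHeldFor(&pod, corev1.PodConditionType("PodHasNetwork"), corev1.ConditionFalse, 5*time.Minute) {
		fmt.Println("pod has lacked a network for more than 5 minutes")
	}
}
```

Note that LastTransitionTime resets whenever the condition flips, so a brief False -> True -> False flap restarts the clock, which is exactly the transient-condition concern described above.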

@aojea
Member

aojea commented Jan 20, 2023

/cc @thockin

@thockin
Member

thockin commented Jan 20, 2023

This is an interesting idea, but I wonder if it's a sub-case of a more general capability. Something like "tolerance" (bad name because it's too close to tolerations) - which allows a pod to express bounded-duration situations which it can or cannot deal with.

"I can accept up to 30 seconds of node-unready; after that, kill this pod"

"I can accept up to 5 minutes of pending; after that, kill this pod"

@kannon92
Contributor Author

This is an interesting idea, but I wonder if it's a sub-case of a more general capability. Something like "tolerance" (bad name because it's too close to tolerations) - which allows a pod to express bounded-duration situations which it can or cannot deal with.

"I can accept up to 30 seconds of node-unready; after that, kill this pod"

"I can accept up to 5 minutes of pending; after that, kill this pod"

Your suggestion makes it sound like some kind of readiness check: sort of a check to make sure the Pod can continue, rather than a test of the container's ability to be ready.

For this general case, what kind of fields should we decide to watch for "tolerance"?

Conditions seem obvious to me because they roughly track the progress of the pod, but I'm not sure if there are other items I should think of.

@thockin
Member

thockin commented Jan 24, 2023

An "environmental suitability check" :)

I don't think you want to define it exclusively (or maybe at all?) in terms of conditions, but I haven't thought about that.

The one that came up in the past was "node unready" and/or "node unreachable". Some apps have no state and want to be killed more "eagerly" if the node seems down. Other apps have local state that may be expensive to rebuild, and would prefer to wait longer.

I could see "disk full" being another or "node OOM rate".

@thockin
Member

thockin commented Jan 24, 2023

@msau42 wrt stateful stuff

@kannon92
Contributor Author

I don't think you want to define it exclusively (or maybe at all?) in terms of conditions, but I haven't thought about that.

In general I started thinking about this issue because users can submit invalid pods (invalid secrets, volumes mounted from nonexistent secrets, wrong taints/tolerations for nodes), and all of those states occur while a pod is Pending. There are various ways to deduce what went wrong, but it is difficult to programmatically handle all these states in a third-party controller.

In the project I work on, we use the reason field (in ContainerStateWaiting) for cases related to invalid image names and invalid ConfigMaps, the UnableToSchedule condition for scheduling issues, and events for volume-mounting issues.

I wrote up a lot of these cases in https://github.com/kannon92/PendingPodChecks/blob/main/README.md if you are curious.

One condition that is being added to Kubernetes is PodHasNetwork and that would also represent a case where I think the Pod is stuck. So that is why I was thinking of conditions.
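
For reference, a minimal sketch of the ContainerStateWaiting-based detection described here; the specific Reason strings checked are an assumption drawn from common kubelet/runtime behaviour rather than a stable contract.

```go
// Sketch only: flag pods whose containers are waiting for reasons that, in
// practice, usually indicate a misconfigured pod rather than a transient delay.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// pendingFailureReason returns the first suspicious waiting reason found on
// the pod's containers, or "" if none matched.
func pendingFailureReason(pod *corev1.Pod) string {
	suspect := map[string]bool{
		"ErrImagePull":               true, // bad image name or registry auth
		"ImagePullBackOff":           true,
		"InvalidImageName":           true,
		"CreateContainerConfigError": true, // e.g. missing ConfigMap or Secret key
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && suspect[cs.State.Waiting.Reason] {
			return cs.State.Waiting.Reason
		}
	}
	return ""
}

func main() {
	var pod corev1.Pod // would come from a watch/informer in a real controller
	if reason := pendingFailureReason(&pod); reason != "" {
		fmt.Printf("pod %s looks stuck in Pending: %s\n", pod.Name, reason)
	}
}
```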

@alculquicondor
Member

@thockin we also wrote Job failures policies to match Conditions, but this is limited to Pods that are also marked as Failed https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

@kannon92
Contributor Author

In the interest of not having multiple conversations going on in this thread: I originally thought of two specific KEPs related to achieving this feature request. It sounds like there is some interest in creating general conditions for common configuration errors. Would there be any objection to my creating a KEP in k/enhancements so we could continue that discussion? I would be open to leading that KEP. I created a POC in my fork that adds these conditions, so I feel pretty comfortable writing something up.

My KEP would be about adding a general condition for configuration errors. There would be no mention of terminating based on conditions on that.

@alculquicondor
Member

+1 on a KEP for a condition.

It would probably have to spill into the next release, but it doesn't hurt to start early.

@kannon92
Contributor Author

kannon92 commented Feb 1, 2023

+1 on a KEP for a condition.

It would probably have to spill into the next release, but it doesn't hurt to start early.

Yea that is understandable. I haven't done a KEP before so I think it may take me a little time to learn the process.

I created kubernetes/enhancements#3815 and kubernetes/enhancements#3816 for the discussion on adding conditions for pending pods.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 1, 2023
@kannon92
Contributor Author

kannon92 commented Jun 2, 2023

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Jun 2, 2023
@kannon92
Contributor Author

Is it possible to simply have a controller (like the Job Controller) delete a pod if it stays Pending (or Running with containers in crashloop) beyond a specified timeout? In other words, what would be some scenarios where the controller would prefer a pod to stay Pending (or Running with containers in crashloop) without terminating it?

I am going to try this idea actually. I have a POC where I will create a controller that will monitor pods and terminate based on values in ContainerStatuses or ContainerConditions. If I make good progress on this, maybe I can float this as a subproject for sig-apps or at least present at a sig-apps meeting.
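
A very rough sketch of what such a POC controller could look like, simplified to key off the Pending phase and pod age rather than specific container statuses; the pendingTimeout value, the one-minute poll, and the blunt delete are all placeholder choices, not a recommended design.

```go
// Sketch only: periodically list Pending pods and delete any that have been
// Pending longer than a configurable timeout.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const pendingTimeout = 10 * time.Minute // illustrative default

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
			FieldSelector: "status.phase=Pending",
		})
		if err != nil {
			log.Printf("listing pods: %v", err)
		} else {
			for _, pod := range pods.Items {
				if time.Since(pod.CreationTimestamp.Time) > pendingTimeout {
					log.Printf("deleting %s/%s: Pending for more than %v", pod.Namespace, pod.Name, pendingTimeout)
					_ = client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(time.Minute)
	}
}
```

An in-cluster deployment would use rest.InClusterConfig() instead of a kubeconfig, and an informer-based implementation would replace the polling loop.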

@alculquicondor
Member

I wouldn't start a subproject just for this. There might be enough interest to push this down, even to the kubelet #113211 (comment)

@alculquicondor
Member

OTOH, this sounds like something within the realm of https://github.com/kubernetes-sigs/descheduler
A question that this brings up is what to do with pods which are not yet scheduled. Should those have a timeout as well? I'm not too sure.

@kannon92
Contributor Author

I wouldn't start a subproject just for this. There might be enough interest to push this down, even to the kubelet #113211 (comment)

It seems to me that there is already more than enough code in the kubelet, and if it is possible to handle this as a separate out-of-tree controller (or at least outside the kubelet), wouldn't that be beneficial?

I don't really see why it has to live in the kubelet, and it does seem there was enough of a push to move something like this forward.

Either way, I think I will create a POC controller to get a better understanding of how this would work. I may end up using this for Armada, so I figured I would float the idea.

@alculquicondor
Member

alculquicondor commented Jun 20, 2023

A good thing about doing it in the kubelet is that it can avoid scenarios where the Pod is starting at the same time the deadline is reached. Although maybe this is not so much of a problem? The pod would be marked as Failed either way.

But it's worth checking with sig-node about their thoughts.

@damemi
Contributor

damemi commented Jul 5, 2023

OTOH, this sounds like something within the realm of https://github.com/kubernetes-sigs/descheduler. A question that this brings up is what to do with pods which are not yet scheduled. Should those have a timeout as well? I'm not too sure.

Commented in kubernetes-sigs/descheduler#1183 (comment) downstream, but I think this falls between needing work from the scheduler and descheduler. The descheduler currently only looks at pods that are on a node, which we could change. But making a deterministic decision about eviction would be better with some standard condition on the pod.

IMO, a core controller such as the scheduler should be the one to mutate the pod to add that condition. Then downstream components like the descheduler could act on it. The scheduler should also know to remove those pods from the scheduling queue.

@alculquicondor
Member

The scheduler already adds:

  • PodScheduled condition with False status when a pod failed to schedule
  • DisruptionTarget condition when a Pod is about to be preempted.

Is there anything else needed?
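
For illustration, a downstream component could already key off the first of those conditions using the constants in k8s.io/api/core/v1; the 15-minute threshold below is an arbitrary assumption.

```go
// Sketch only: select pods the scheduler has reported as unschedulable
// (PodScheduled=False, reason Unschedulable) for longer than a threshold.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

func unschedulableLongerThan(pod *corev1.Pod, d time.Duration) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			return time.Since(cond.LastTransitionTime.Time) > d
		}
	}
	return false
}

func main() {
	var pods []corev1.Pod // from a lister/informer in a real component
	for i := range pods {
		if unschedulableLongerThan(&pods[i], 15*time.Minute) {
			fmt.Printf("candidate for eviction: %s/%s\n", pods[i].Namespace, pods[i].Name)
		}
	}
}
```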

@damemi
Contributor

damemi commented Jul 5, 2023

Sorry, I haven't read through this whole thread, but if all this takes is evicting pods with PodScheduled=False, then yeah, that would be pretty straightforward if the descheduler considers pods that aren't on a node.

@kannon92
Contributor Author

kannon92 commented Jul 7, 2023

This thread is long, but the main point we are discussing is evicting pods that are stuck in the Pending phase. I don't think it is as simple as PodScheduled=False (though that is actually one of the cases!).

It can mean that PodScheduled=False, but I see other cases where the scheduler finds a node and the pod then gets stuck for a variety of reasons.

In the KEP I tried to summarize common cases that our users get tripped up on:

https://github.com/kubernetes/enhancements/blob/32eaed4e7e6deb00b4df620cb003718f68dcd312/keps/sig-node/3815-pod-condition-for-configuration-errors/README.md#pending-pod-state-review

And when I created the KEP I wasn't aware of the PodReadyToStartContainers condition, so I wanted to target more complicated cases like:

https://github.com/kubernetes/enhancements/blob/32eaed4e7e6deb00b4df620cb003718f68dcd312/keps/sig-node/3815-pod-condition-for-configuration-errors/README.md#missing-secret-when-mounting.

So in most of the cases that I have found, the Pod gets scheduled and then gets stuck after the scheduling stage.

@dberardo-com

I am assuming that this functionality is not available yet, right? Is there any workaround using kubelet configurations, perhaps?

Is it the same on k3s distributions?

Is there maybe any sort of workaround, such as a CronJob that regularly checks for Pending pods and terminates the ones that have been Pending for too long?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 24, 2024
@thockin removed the lifecycle/stale label Feb 1, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 1, 2024
@alculquicondor
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 2, 2024