Enhancement: Marking a pending pod as failed after a certain timeout #113211

Open
kannon92 opened this issue Oct 20, 2022 · 52 comments
Labels
kind/feature, needs-triage, sig/apps, sig/node, wg/batch

Comments

@kannon92
Contributor

kannon92 commented Oct 20, 2022

What would you like to be added?

We would like to add configurable timeouts to allow pending pods to transition to failed.

It would be ideal if it could be configurable based on container events or messages rather than a catch all timeout.

A possible API could be as follows:

      - regexp: "Failed to pull image.*desc = failed to pull and unpack image"            # Suggests genuine problem with image name, no point in waiting around too long.
        gracePeriod: 90s
      - regexp: "Failed to pull image.*code = Unknown desc = Error response from daemon"  # Seen when image doesn't exist, no point in waiting around too long.
        gracePeriod: 90s

This can allow certain events/statuses to transition to failed.

Why is this needed?

In the Armada project, we found that for batch users it is useful to control how long Pods stay in Pending before they are marked as failed. This allows our scheduler to remove those pods from the cluster, making room for new pods to be scheduled. It also allows users to be notified of invalid pods so they can resubmit them with correct configurations.

We have discussed this idea in conjunction with the non-goals of https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures#non-goals and @alculquicondor requested that I create an issue.

@kannon92 added the kind/feature label Oct 20, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Oct 20, 2022
@k8s-ci-robot
Contributor

@kannon92: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kannon92
Contributor Author

/sig node
/wg batch

@k8s-ci-robot added the sig/node and wg/batch labels and removed the needs-sig label Oct 20, 2022
@alculquicondor
Member

/sig apps

regex is certainly not an API we would like to encourage, as the messages can change from one release to the next.

Is it fair to say that you would like to target these specific Pending scenarios?

  • Unschedulable pod (there is already a condition PodScheduled=false that can be targeted)
  • Pull image failures. We can probably add one. I only found one preceding issue about this, but there are probably others Fail pod with invalid image #111788

Are there other Pending scenarios you think are useful to target?

The additional requirement for the API is where to define the timeouts.

@k8s-ci-robot added the sig/apps label Oct 20, 2022
@alculquicondor
Member

maybe @smarterclayton has given this a thought already :)

@kannon92
Contributor Author

/sig apps

regex is certainly not an API we would like to encourage, as the messages can change from one release to the next.

Is it fair to say that you would like to target these specific Pending scenarios?

  • Unschedulable pod (there is already a condition PodScheduled=false that can be targeted)
  • Pull image failures. We can probably add one. I only found one preceding issue about this, but there are probably others Fail pod with invalid image #111788

Are there other Pending scenarios you think are useful to target?

The additional requirement for the API is where to define the timeouts.

So I asked around. Other areas we found useful to target are invalid ConfigMaps, invalid Secrets, and bad PVCs. You covered a lot of them in the Non-Goals of the KEP. We also have issues with the GPU plugin failing to initialize.

@alculquicondor
Member

I guess we can cover most of them in a single PodCondition FailedToStartup. We need sig-node input. We can try to get some feedback at kubecon or the next sig-node meeting.

@kannon92
Contributor Author

@alculquicondor Could you explain a bit more about why this couldn't be the responsibility of the Job Controller? The one thing that comes to mind is that this is a bigger problem than just for users scheduling Jobs. But I don't quite understand the distinction between the responsibilities of sig-node and sig-apps. Is it primarily because the kubelet is responsible for determining whether a Pod has failed?

@alculquicondor
Member

Is it primarily because the kubelet is responsible for determining whether a Pod has failed?

Yes.
Also, the Job failure policy is based on PodConditions, and there is no pod condition for all the scenarios you have described.

@sftim
Contributor

sftim commented Nov 7, 2022

Could we provide this functionality as an out-of-tree add-on to Kubernetes? Is there anything about this enhancement proposal that requires a change to Kubernetes itself?

@alculquicondor
Member

It depends. I'm not sure whether, in all the cases that @kannon92 mentioned, the Pod is stuck in the Pending phase. If so, yes, I think an out-of-tree controller could implement the functionality.

If the kubelet makes a transition of the pods to Failed, then we need kubelet changes, because we need to add a Pod condition atomically.

@alculquicondor
Member

Although, an important question is how can an out-of-tree controller detect these failures. Is there a stable Pod condition type or reason that can be monitored?

If the controller has to do regex on error messages, that can easily break from one release to another. So at the very least in upstream we should have consistent Pod conditions.

@kannon92
Contributor Author

kannon92 commented Nov 9, 2022

@alculquicondor and I followed up offline. I created a repo that goes through the common cases our batch users experience with the pending status.

I created a document to walk through these common use cases and to demonstrate what conditions, statuses and events these states give.

https://github.com/kannon92/PendingPodChecks/blob/main/README.md

@kannon92
Contributor Author

I did a little POC on this in my own fork: the PRs are kannon92#2 and kannon92#1.

In those PRs I added the ability for the Job controller to fail pods based on conditions for pending Pods. A brief review from @alculquicondor explained that we will need a much stricter API to deal with transient conditions. Transient conditions are ones that transition False -> True -> False throughout a pod's lifecycle. I don't think there is any guarantee that Pod conditions get set once and never touched again, so an API needs to take this into consideration.

Our hope is to bring up to sig-node the possibility of adding a new API to the kubelet that can terminate pods based on conditions. This API will need to match on conditions but also take transient condition setting into account.

In general I see two distinct feature requests to achieve this:

  1. A general condition that tells us whether the pod is stuck. https://github.com/kannon92/PendingPodChecks/blob/main/README.md

  2. An API that allows termination of Pods if they match a condition value for x amount of time.

Conditions already exist, so I believe 2 can be addressed separately from 1. Some of these conditions could be UnableToSchedule or PodHasNetwork (beta in 1.26), so it is possible to at least flesh out an API with these conditions; a rough sketch of what the check in 2 could look like is below.
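
As a rough illustration of request 2 (not a proposed Kubernetes API), the check itself can be expressed with existing client-go types by comparing a condition's LastTransitionTime against a timeout. The PodHasNetwork condition type and the 5-minute timeout used in main are assumptions for the example, and the condition name may differ in later releases.

```go
// Sketch only: check whether a Pod condition has held a given status for
// longer than a timeout, using types from k8s.io/api/core/v1.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// conditionHeldFor reports whether the pod has a condition of the given type
// and status whose last transition happened more than timeout ago.
func conditionHeldFor(pod *corev1.Pod, condType corev1.PodConditionType, status corev1.ConditionStatus, timeout time.Duration) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == condType && cond.Status == status {
			return time.Since(cond.LastTransitionTime.Time) > timeout
		}
	}
	return false
}

func main() {
	var pod corev1.Pod // in a real controller this would come from an informer/lister
	// "PodHasNetwork" is the beta condition mentioned above (illustrative).
	if conditionHeldFor(&pod, corev1.PodConditionType("PodHasNetwork"), corev1.ConditionFalse, 5*time.Minute) {
		fmt.Println("pod has lacked a network for more than 5 minutes")
	}
}
```

Note that LastTransitionTime resets whenever the condition flips, so a brief False -> True -> False flap restarts the clock, which is exactly the transient-condition concern described above.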

@aojea
Member

aojea commented Jan 20, 2023

/cc @thockin

@thockin
Member

thockin commented Jan 20, 2023

This is an interesting idea, but I wonder if it's a sub-case of a more general capability. Something like "tolerance" (bad name because it's too close to tolerations) - which allows a pod to express bounded-duration situations which it can or cannot deal with.

"I can accept up to 30 seconds of node-unready; after that, kill this pod"

"I can accept up to 5 minutes of pending; after that, kill this pod"

@kannon92
Contributor Author

This is an interesting idea, but I wonder if it's a sub-case of a more general capability. Something like "tolerance" (bad name because it's too close to tolerations) - which allows a pod to express bounded-duration situations which it can or cannot deal with.

"I can accept up to 30 seconds of node-unready; after that, kill this pod"

"I can accept up to 5 minutes of pending; after that, kill this pod"

Your suggestion makes it sound like some kind of readiness check: sort of a check to make sure the Pod can continue, rather than a test of the container's ability to be ready.

For this general case, what kind of fields should we decide to watch for "tolerance"?

Conditions seem obvious to me because they roughly track the progress of the pod, but I'm not sure if there are other items I should think of.

@thockin
Member

thockin commented Jan 24, 2023

An "environmental suitability check" :)

I don't think you want to define it exclusively (or maybe at all?) in terms of conditions, but I haven't thought about that.

The one that came up in the past was "node unready" and/or "node unreachable". Some apps have no state and want to be killed more "eagerly" if the node seems down. Other apps have local state that may be expensive to rebuild, and would prefer to wait longer.

I could see "disk full" being another or "node OOM rate".

@thockin
Member

thockin commented Jan 24, 2023

@msau42 wrt stateful stuff

@kannon92
Contributor Author

I don't think you want to define it exclusively (or maybe at all?) in terms of conditions, but I haven't thought about that.

In general I started thinking about this issue because users can submit invalid pods (invalid secrets, volumes mounted from nonexistent secrets, wrong taints/tolerations for nodes), and all of those states occur while a pod is Pending. There are various ways to deduce what went wrong, but it is difficult to programmatically handle all these states in a third-party controller.

In the project I work on, we use the reason field (in ContainerStateWaiting) for cases related to invalid image names and invalid ConfigMaps, the UnableToSchedule condition for scheduling issues, and events for volume-mounting issues.

I wrote up a lot of these cases in https://github.com/kannon92/PendingPodChecks/blob/main/README.md if you are curious.

One condition that is being added to Kubernetes is PodHasNetwork and that would also represent a case where I think the Pod is stuck. So that is why I was thinking of conditions.
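
For reference, a minimal sketch of the ContainerStateWaiting-based detection described here; the specific Reason strings checked are an assumption drawn from common kubelet/runtime behaviour rather than a stable contract.

```go
// Sketch only: flag pods whose containers are waiting for reasons that, in
// practice, usually indicate a misconfigured pod rather than a transient delay.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// pendingFailureReason returns the first suspicious waiting reason found on
// the pod's containers, or "" if none matched.
func pendingFailureReason(pod *corev1.Pod) string {
	suspect := map[string]bool{
		"ErrImagePull":               true, // bad image name or registry auth
		"ImagePullBackOff":           true,
		"InvalidImageName":           true,
		"CreateContainerConfigError": true, // e.g. missing ConfigMap or Secret key
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && suspect[cs.State.Waiting.Reason] {
			return cs.State.Waiting.Reason
		}
	}
	return ""
}

func main() {
	var pod corev1.Pod // would come from a watch/informer in a real controller
	if reason := pendingFailureReason(&pod); reason != "" {
		fmt.Printf("pod %s looks stuck in Pending: %s\n", pod.Name, reason)
	}
}
```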

@alculquicondor
Member

@thockin we also wrote Job failures policies to match Conditions, but this is limited to Pods that are also marked as Failed https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy

@kannon92
Contributor Author

In the interest of not having multiple conversations going on in this thread: I originally thought of two specific KEPs related to achieving this feature request. It sounds like there is some interest in creating general conditions for common configuration errors. Would there be any objection to my creating a KEP in k/enhancements so we could continue that discussion? I would be open to leading that KEP. I created a POC in my fork that adds these conditions, so I feel pretty comfortable writing something up.

My KEP would be about adding a general condition for configuration errors. There would be no mention of terminating based on conditions on that.

@alculquicondor
Member

+1 on a KEP for a condition.

It would probably have to spill into the next release, but it doesn't hurt to start early.

@kannon92
Contributor Author

kannon92 commented Feb 1, 2023

+1 on a KEP for a condition.

It would probably have to spill into the next release, but it doesn't hurt to start early.

Yea that is understandable. I haven't done a KEP before so I think it may take me a little time to learn the process.

I created kubernetes/enhancements#3815 and kubernetes/enhancements#3816 for the discussion on adding conditions for pending pods.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 1, 2023
@kannon92
Contributor Author

kannon92 commented Jun 2, 2023

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Jun 2, 2023
@kannon92
Contributor Author

Is it possible to simply have a controller (like the Job Controller) delete a pod if it stays Pending (or Running with containers in crashloop) beyond a specified timeout? In other words, what would be some scenarios where the controller would prefer a pod to stay Pending (or Running with containers in crashloop) without terminating it?

I am going to try this idea actually. I have a POC where I will create a controller that will monitor pods and terminate based on values in ContainerStatuses or ContainerConditions. If I make good progress on this, maybe I can float this as a subproject for sig-apps or at least present at a sig-apps meeting.
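
A very rough sketch of what such a POC controller could look like, simplified to key off the Pending phase and pod age rather than specific container statuses; the pendingTimeout value, the one-minute poll, and the blunt delete are all placeholder choices, not a recommended design.

```go
// Sketch only: periodically list Pending pods and delete any that have been
// Pending longer than a configurable timeout.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const pendingTimeout = 10 * time.Minute // illustrative default

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
			FieldSelector: "status.phase=Pending",
		})
		if err != nil {
			log.Printf("listing pods: %v", err)
		} else {
			for _, pod := range pods.Items {
				if time.Since(pod.CreationTimestamp.Time) > pendingTimeout {
					log.Printf("deleting %s/%s: Pending for more than %v", pod.Namespace, pod.Name, pendingTimeout)
					_ = client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(time.Minute)
	}
}
```

An in-cluster deployment would use rest.InClusterConfig() instead of a kubeconfig, and an informer-based implementation would replace the polling loop.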

@alculquicondor
Member

I wouldn't start a subproject just for this. There might be enough interest to push this down, even to the kubelet #113211 (comment)

@alculquicondor
Member

OTOH, this sounds like something within the realm of https://github.com/kubernetes-sigs/descheduler
A question that this brings up is what to do with pods which are not yet scheduled. Should those have a timeout as well? I'm not too sure.

@kannon92
Contributor Author

I wouldn't start a subproject just for this. There might be enough interest to push this down, even to the kubelet #113211 (comment)

It seems to me that there is already more than enough code in the kubelet, and if it is possible to handle this as a separate out-of-tree controller (or at least outside the kubelet), wouldn't that be beneficial?

I don't really see why it has to live in the kubelet, and it does seem there was enough of a push to move something like this forward.

Either way, I think I will create a POC controller to get a better understanding of how this would work. I may end up using this for Armada, so I figured I would float the idea.

@alculquicondor
Member

alculquicondor commented Jun 20, 2023

A good thing about doing it in the kubelet is that it can avoid scenarios where the Pod is starting at the same time the deadline is reached. Although maybe this is not so much of a problem? The pod would be marked as Failed either way.

But it's worth checking with sig-node about their thoughts.

@damemi
Contributor

damemi commented Jul 5, 2023

OTOH, this sounds like something within the realm of https://github.com/kubernetes-sigs/descheduler. A question that this brings up is what to do with pods which are not yet scheduled. Should those have a timeout as well? I'm not too sure.

Commented in kubernetes-sigs/descheduler#1183 (comment) downstream, but I think this falls between needing work from the scheduler and descheduler. The descheduler currently only looks at pods that are on a node, which we could change. But making a deterministic decision about eviction would be better with some standard condition on the pod.

IMO, a core controller such as the scheduler should be the one to mutate the pod to add that condition. Then downstream components like the descheduler could act on it. The scheduler should also know to remove those pods from the scheduling queue.

@alculquicondor
Member

The scheduler already adds:

  • PodScheduled condition with False status when a pod failed to schedule
  • DisruptionTarget condition when a Pod is about to be preempted.

Is there anything else needed?
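
For illustration, a downstream component could already key off the first of those conditions using the constants in k8s.io/api/core/v1; the 15-minute threshold below is an arbitrary assumption.

```go
// Sketch only: select pods the scheduler has reported as unschedulable
// (PodScheduled=False, reason Unschedulable) for longer than a threshold.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

func unschedulableLongerThan(pod *corev1.Pod, d time.Duration) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			return time.Since(cond.LastTransitionTime.Time) > d
		}
	}
	return false
}

func main() {
	var pods []corev1.Pod // from a lister/informer in a real component
	for i := range pods {
		if unschedulableLongerThan(&pods[i], 15*time.Minute) {
			fmt.Printf("candidate for eviction: %s/%s\n", pods[i].Namespace, pods[i].Name)
		}
	}
}
```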

@damemi
Contributor

damemi commented Jul 5, 2023

Sorry, I haven't read through this whole thread, but if all this takes is evicting pods with PodScheduled=False, then yeah, that would be pretty straightforward if the descheduler considers pods that aren't on a node.

@kannon92
Contributor Author

kannon92 commented Jul 7, 2023

This thread is long, but the main point we are discussing is evicting pods that are stuck in the Pending phase. I don't think it is as simple as PodScheduled=False (though that is actually one of the cases!).

It can mean that PodScheduled=False, but I see other cases where the scheduler finds a node and the pod then gets stuck for a variety of reasons.

In the KEP I tried to summarize common cases that our users get tripped up on:

https://github.com/kubernetes/enhancements/blob/32eaed4e7e6deb00b4df620cb003718f68dcd312/keps/sig-node/3815-pod-condition-for-configuration-errors/README.md#pending-pod-state-review

And when I created the KEP I wasn't aware of the PodReadyToStartContainers condition, so I wanted to target more complicated cases like:

https://github.com/kubernetes/enhancements/blob/32eaed4e7e6deb00b4df620cb003718f68dcd312/keps/sig-node/3815-pod-condition-for-configuration-errors/README.md#missing-secret-when-mounting.

So in most of the cases that I have found, the Pod gets scheduled and then gets stuck after the scheduling stage.

@dberardo-com

I am assuming that this functionality is not available yet, right? Is there any workaround using kubelet configurations, perhaps?

Is it the same on k3s distributions?

Is there maybe any sort of workaround, such as a CronJob that regularly checks for Pending pods and terminates the ones that have been Pending for too long?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 24, 2024
@thockin removed the lifecycle/stale label Feb 1, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 1, 2024
@alculquicondor
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 2, 2024