
KEP-3815 : Add Condition for Pending Pods that are stuck due to configuration errors #3816

Closed

Conversation

kannon92
Contributor

@kannon92 kannon92 commented Feb 1, 2023

  • One-line PR description:
    Add a new condition for pods that are stuck in pending due to a configuration error.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Feb 1, 2023
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 1, 2023
@kannon92
Contributor Author

kannon92 commented Feb 1, 2023

/do-not-merge

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 1, 2023
@kannon92 kannon92 changed the title KEP-3815 : Add Condition for Pending Pods that are stuck due to configuration errors WIP KEP-3815 : Add Condition for Pending Pods that are stuck due to configuration errors Feb 1, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 1, 2023
@kannon92 kannon92 force-pushed the pending-pod-condition-kep branch 2 times, most recently from 11369bd to 59a199f Compare February 1, 2023 20:15
Member

@thockin thockin left a comment

First pass


### Non-Goals
- Will not modify behavior for ImagePullBackOff
Member

I don't know why we would exclude this?

Contributor Author

My thought here is that there isn't enough information in image pulling to distinguish between the different failures:

  1. If the registry is overloaded, I have seen cases where ImagePullBackOff occurs.
  2. If a docker image doesn't exist, you get ImagePullBackOff.
  3. If you have invalid image pull secrets, you also get ImagePullBackOff.

There could be an interest in spelling these cases out more, but it isn't as important a case for us at the moment. I wanted to focus on cases where the error requires human intervention to get things working again.

In the first case I could see image pulls working again once the registry is no longer overloaded. Cases 2-3 do require human intervention, but I don't know enough about how we pull images to distinguish between these different cases.

Member

We can tell the difference at SOME point, right? Whether the registry returns 503 or 404, for example.

But in either case, the pod is still "failing to start".

Contributor Author

@kannon92 kannon92 Feb 1, 2023

Yeah, I wasn't really going at it from a k8s dev perspective but from what is present in the pod YAML. I'm not sure the registry's REST API errors are exposed to a user when submitting pods.

Member

They are not, but we know them somewhere deep down.

Member

I agree with Tim, we should set the condition while the pod doesn't start, even if it could be a transient error that doesn't require human intervention.

I also would recommend reading the API conventions:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
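For what it's worth, a minimal sketch of what such a condition could look like if it followed those typical status properties. The `FailingToStart` type name and the reason string are placeholders from this discussion, not existing API constants:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical only: "FailingToStart" is the name floated in this KEP,
	// not an existing constant in k8s.io/api.
	cond := corev1.PodCondition{
		Type:               corev1.PodConditionType("FailingToStart"),
		Status:             corev1.ConditionTrue,
		Reason:             "InvalidImageName", // machine-readable, CamelCase, per the conventions
		Message:            `container "app" has an image name that cannot be parsed`,
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%+v\n", cond)
}
```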

@kannon92 kannon92 force-pushed the pending-pod-condition-kep branch 4 times, most recently from 7506620 to bd3ec8d Compare February 2, 2023 00:11
Member

@thockin thockin left a comment

Will circle back later - this isn't queued for 1.27, right?

```go
// ErrInvalidImageName - Unable to parse the image name.
ErrInvalidImageName = errors.New("InvalidImageName")
```
- Should we include all these errors as potential failure cases for pending pods? I can replicate ErrImageNeverPull and InvalidImageName, but I'm not sure how to reproduce the other errors.
Contributor Author

Question for the reviewers: I'd love to hear some community input on whether any of these errors cause problems in a consistent way. If so, I would be open to adding them as part of this KEP; I'd just like to understand how to reproduce them.
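For reference, the two I mentioned being able to replicate come from specs along these lines (a sketch only; the image names are made up):

```go
package repro

import corev1 "k8s.io/api/core/v1"

var (
	// imagePullPolicy: Never with an image that is not preloaded on the node
	// leaves the container waiting with reason ErrImageNeverPull.
	neverPulled = corev1.Container{
		Name:            "never-pulled",
		Image:           "registry.example.com/app:v1",
		ImagePullPolicy: corev1.PullNever,
	}

	// An image reference that cannot be parsed (uppercase repository name)
	// leaves the container waiting with reason InvalidImageName.
	badName = corev1.Container{
		Name:  "bad-name",
		Image: "registry.example.com/APP:v1",
	}
)
```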

```go
ErrPostStartHook = errors.New("PostStartHookError")
)
```
- If these are set, can we assume that the container failed? I'm not sure how to recreate failures for states other than ErrCreateContainerConfig.
Contributor Author

Another question for a reviewer: Are these internal errors that we should catch for a condition? Or do they occur often?

@kannon92
Contributor Author

kannon92 commented Feb 2, 2023

Will circle back later - this isn't queued for 1.27, right?

Correct, just getting this open now. I have some questions for the community on this, so I figured it would be helpful to get some comments on it. I don't see a reason to push this into 1.27, and since I missed the KEP deadline anyway, 1.28 it is.

Contributor

@mimowo mimowo left a comment

@kannon92 thanks for starting up the KEP. I think it could play nicely with the podFailurePolicy KEP and with the fix I'm proposing here: #3757. However, there are details to figure out.

First, IIUC the flow could work like this (I think it would be good to clarify it here):

  • Kubelet would add a FailingToStart condition to a Pending and scheduled Pod; the Reason would indicate more details about the scenario.
  • An external controller inspects the Pod and may send a DELETE request for the Pod.
  • The new code in Kubelet will transition the Pod into the Failed phase (Update for second Beta with GA criteria for "KEP-3329: Retriable and non-retriable Pod failures for Jobs" #3757).
  • The Job controller can match the failed pod against the configured podFailurePolicy. The most common use case, I assume, would be to use the "FailJob" or "Count" action for the FailingToStart condition.

Another thought: does Kubelet use any knowledge that isn't already surfaced in the pod status? If not, then maybe we develop an external component that adds the condition based on inspecting the Pod status. Maybe developing this idea as an external pluggable component makes sense, so that we can gain more confidence in the solution over time? Then, the flow would be:

Next, I’m wondering if we can really easily separate the configuration errors from transient issues. If we can, then I suggest a more concrete name for the condition, such as “ConfigurationError”.

It probably can be done in the cases you enumerate, but some corner cases can be tricky to tell apart, such as ImagePullBackOff. Let's even consider "Image not found locally when ImagePullPolicy is set to Never." Not sure if such a scenario is possible, but what if the container runtime (containerd) crashed and was restarting when asked for the image? Also, potentially, someone could be loading the images onto the nodes while the pod is in the Pending phase. I think it would require a rigorous review to really tell them apart, and I'm not sure we want to go into this, but if we do, then we could have two conditions: "ConfigurationErrors" (the permanent ones) and "FailingToStart" for the transient ones.

If we relax the semantics of the FailingToStart condition, then its name is generic enough that it could also cover transient issues, such as ImagePullBackOff (it could be a Reason for the condition). Kubelet would add the condition FailingToStart=True, but it would be in a position to reset the condition to FailingToStart=False if the image is pulled eventually. It would be up to the external controller to decide if FailingToStart with Reason=ImagePullBackOff should trigger a DELETE request. It could do so, for example, after waiting 2 minutes to gain more confidence that the issue is permanent.
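To make that last point concrete, here is a rough sketch of the external-controller side (assuming the condition ends up named FailingToStart, which is not settled):

```go
package controller

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maybeDeleteStuckPod deletes a pending pod only once the FailingToStart
// condition has been true for two minutes, so transient reasons such as
// ImagePullBackOff get a chance to resolve on their own.
func maybeDeleteStuckPod(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) error {
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodConditionType("FailingToStart") || cond.Status != corev1.ConditionTrue {
			continue
		}
		if time.Since(cond.LastTransitionTime.Time) < 2*time.Minute {
			return nil // still inside the grace window
		}
		return c.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}
	return nil
}
```

Keying the wait off LastTransitionTime is what would make the relaxed FailingToStart semantics workable for transient reasons.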

Member

@alculquicondor alculquicondor left a comment

cc @ddebroy @agamdua @aojea as people involved in PodHasNetwork/`PodReadyToStartContainers`


As a batch user, it is very likely that you will experience configuration errors when submitting pods to Kubernetes. We would like to propose a new condition so that users can determine if a pod has an error in startup. A new condition will provide a quick way to verify whether a pod detected an unrecoverable configuration error. Many of these configuration errors are fixable via user interaction (i.e., creating a secret, changing the image name), but without any modifications the pods will stay stuck in the Pending state.
Member

I don't think this is specific to batch users.

"has an error" -> "is stuck"? Error seems to imply a terminal state, which is not the case here.

wouldn't it rather be a "recoverable" error?

"any modifications" -> "user intervention"


Batch users may not have deep knowledge of Kubernetes. It is a common occurrence to find users submitting Pods with configuration errors (invalid images, wrong image names, missing volumes, missing secrets, wrong volume/secret for a config map). These cases cause the Pod to be stuck in Pending, and removing these pods requires either controller code or user input.

As a batch system developer, we find that higher-level controllers have to write code to handle Pending Pod cases. Since the pods are not considered failed, the higher-level controller needs to transition this state to failed in order to enforce retries. For example, the job controller will run these pods and the Pod will stay in Pending. The job controller does not transition this state to failed even if the pod will never start running.
Member

An additional problem both for end-users and system developers is that the errors are surfaced in inconsistent ways: some are in Container statuses, some in the reason or message fields. A standard condition would provide a single place to look at.

However, this paragraph is a little confusing when talking about transitioning to Failed state. IIUC, this KEP is not proposing to add a transition mechanism.

Contributor Author

Yeah, good call out. I'll make sure to explicitly put that in there, and will clean up this paragraph so it isn't mentioning transitioning to a failed state.



For CNCF projects: in the Armada project, we implement logic that determines retryability based on events and statuses. Events are not standardized, so we expect issues when updating, and we have discovered various states that don't have a representative status.
Member

Events are also best-effort, so they can easily be missed.


### Goals
- Add a new condition FailingToStart to detect failures in configuration.
Member

Would it make sense to have something like Started=False instead? Hopefully it's different enough from the "PodHasNetwork" condition that is changing names soon.

Contributor Author

There is already a Ready condition, so I find it could be a bit confusing to explain the difference between Ready and Started. But naming is always hard, so I'll leave the name up for debate.

Member

+1 that naming is hard :-) If a negative like Failing (and implying ready/success through a double negative - i.e. Failing = False) can be avoided, that would be nice.



Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].

#### Alpha
Member

ok to leave this out while the KEP is marked as "provisional", but you'll need it for alpha.


###### Will enabling / using this feature result in increasing size or count of the existing API objects?
No
Member

Yes, it's a new condition

As an end-user, if I create a pod with an ImagePullPolicy of Never but the image doesn't exist on the node, I would expect to see a condition that states that the pod failed to start due to a configuration error.
#### Story 3
As an end-user, if I create a pod with a missing config map, I would expect to see a condition that states that the pod failed to start due to a configuration error.
### Notes/Constraints/Caveats (Optional)
Member

Explain the relationship/difference with PodHasNetwork

Contributor Author

Will do.

Contributor Author

Added a point to explain the difference.

@kannon92 kannon92 changed the title WIP KEP-3815 : Add Condition for Pending Pods that are stuck due to configuration errors KEP-3815 : Add Condition for Pending Pods that are stuck due to configuration errors Feb 3, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2023
@kannon92 kannon92 force-pushed the pending-pod-condition-kep branch 2 times, most recently from cfb0a8a to 7c41d13 Compare February 3, 2023 19:50
@alculquicondor
Member

Given kubernetes/kubernetes#115736 (comment) (@smarterclayton), we are back to square one.

I think the main questions we need to answer are:

@thockin
Member

thockin commented Apr 21, 2023

I can buy Clayton's point - that's a bit of history I either forgot or didn't know. :)

The larger problem is still real, though, right? Getting useful, detailed information back is still too hard...

@SergeyKanzhelev
Member

/cc


### Goals
Member

@SergeyKanzhelev SergeyKanzhelev May 5, 2023

Would this cover scenarios like these:

  • an init container that keeps running and never returns
  • an init container on the pod with restartPolicy=Always that keeps failing and being restarted
  • a postStart hook of the first container that never returns and prevents the second container from running
  • a startup probe that never returned success, so the container keeps being "started"

Contributor Author

It could. I'd have to investigate each of these cases.

Member

SysctlForbidden is a case that I have run into before: the kubelet did not have the sysctl in its safe list, but the apiserver accepted the pod.
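For context, that case can be reproduced with a spec along these lines (a sketch; the sysctl and image name are arbitrary):

```go
package repro

import corev1 "k8s.io/api/core/v1"

// net.core.somaxconn is not in the kubelet's safe-sysctl list, so unless the
// node runs with --allowed-unsafe-sysctls=net.core.somaxconn, the API server
// accepts this pod but the kubelet rejects it.
var podWithUnsafeSysctl = corev1.PodSpec{
	SecurityContext: &corev1.PodSecurityContext{
		Sysctls: []corev1.Sysctl{
			{Name: "net.core.somaxconn", Value: "1024"},
		},
	},
	Containers: []corev1.Container{
		{Name: "app", Image: "registry.example.com/app:v1"},
	},
}
```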

Contributor Author

Would this cover scenarios like these:

  • an init container that keeps running and never returns
  • an init container on the pod with restartPolicy=Always that keeps failing and being restarted
  • a postStart hook of the first container that never returns and prevents the second container from running
  • a startup probe that never returned success, so the container keeps being "started"

I don't think this would cover init containers, as I was aiming at cases where pods are explicitly in Pending. It doesn't have to support them, but this brings up the point that this lifecycle handling could be useful for init containers, ephemeral containers, and sidecars. Currently I was assuming that all init containers would have passed successfully before you move on to the main containers.

Contributor Author

@SergeyKanzhelev Do we think this should be considered a non-goal, or should we consider supporting failures in init containers?

Generally these seem to be handled a lot better, because init containers have a time limit on them and any issues in pulling/running init containers block the entire pod lifecycle. They are also covered by the Initialized condition, so it is easy to tell that the pod is stuck initializing (Initialized=False).

@pacoxu
Member

pacoxu commented Jun 14, 2023

Any update or new review? Is this targeted for v1.28 or v1.29?

@kannon92
Contributor Author

Any update or new review? Is this targeted for v1.28 or v1.29?

Sorry I missed this. I obviously missed v1.28 but I'd like to get back to this in v1.29.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kannon92
Contributor Author

Given kubernetes/kubernetes#115736 (comment) (@smarterclayton), we are back to square one.

I think the main questions we need to answer are:

To answer your first point, this is mainly a single case that we can cover with a condition. We can match cases like InvalidImageName in the container status and bubble up a condition. I don't think we want specific conditions for each case, but rather a more general one for programmatic detection.

I'm coming back to this second one. My thought is that they are somewhat similar but different. I really want 3085 to be promoted to beta as the cases that this feature covers are causes of frustration in our batch platform. I am taking over this issue so we can get it promoted to beta. Generally I view these two conditions are separate information about the stage of pod creation but generally help answer the same question: How can I figure out easily if pods have failed?

In terms of this feature, I view the PodReadyToStartContainers condition as a specific condition for sandbox creation. Once the sandbox is created, we can continue to the other steps in the container lifecycle. If a Pod is stuck due to image issues, config maps, or container hook issues, then it would be stuck in Pending with a PodReadyToStartContainers condition of true.

Scheduled -> PodReadyToStartContainers (true) -> FailingToStart (true) -> Ready

This is at least how I think of the differences. We should add a condition to determine failures once sandbox creation is complete and the pods have progressed past it.
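As a rough sketch of the "bubble up" idea under those assumptions (the FailingToStart name and the reason set are placeholders from this discussion, not an implementation):

```go
package conditions

import corev1 "k8s.io/api/core/v1"

// stuckReasons is an illustrative, non-exhaustive set of container waiting
// reasons discussed in this KEP as candidates for the condition.
var stuckReasons = map[string]bool{
	"InvalidImageName":           true,
	"ErrImageNeverPull":          true,
	"CreateContainerConfigError": true, // e.g. a bad key reference into an existing ConfigMap
}

// failingToStartCondition derives the proposed condition from the container
// statuses of a pod that is Pending but already past sandbox creation
// (PodReadyToStartContainers=true). Returns nil if nothing looks stuck.
func failingToStartCondition(pod *corev1.Pod) *corev1.PodCondition {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting == nil || !stuckReasons[cs.State.Waiting.Reason] {
			continue
		}
		return &corev1.PodCondition{
			Type:    corev1.PodConditionType("FailingToStart"), // placeholder name
			Status:  corev1.ConditionTrue,
			Reason:  cs.State.Waiting.Reason,
			Message: cs.State.Waiting.Message,
		}
	}
	return nil
}
```

The point is only that controllers would then have a single condition to check instead of scraping waiting reasons themselves.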

- Invalid Image Name
- Image not found locally when ImagePullPolicy is set to never.
- Existing config map but invalid key reference
- Container lifecycle hook failures

Wondering if we are going to cover the following cases:

  • Incorrect node selector
  • Untolerated taints
  • Requested resource too high for the selected node (cpu, memory)

Member

Those are already covered by the Scheduled=False condition.


Thanks for the clarification

@dejanzele
Contributor

Hi,
What is the latest status of this KEP? I see it hasn't had recent activity.
@kannon92 are you still actively working on this?

@kannon92
Contributor Author

Hi, What is the latest status of this KEP? I see it hasn't had recent activity. @kannon92 are you still actively working on this?

#3816 (comment)

This is the last item I did. I aimed to promote 3085 to beta, as there is a strong overlap between the two.

The main question I have been thinking about is: should we have a dedicated condition for image issues and another one for lifecycle issues?

@thockin
Member

thockin commented Jan 23, 2024

@kannon92 what's the word on this one? Push ahead, abandon, replace?

@kannon92
Contributor Author

@kannon92 what's the word on this one? Push ahead, abandon, replace?

So I think the main hope was to use some of this for another KEP that @alculquicondor was thinking of.

kubernetes/kubernetes#122300 (comment) has my findings from this KEP.

I think the idea is useful, but I am unable to push this in my current role at the moment. I would like to focus on concrete use cases rather than a general condition. So I was hoping that maybe 122300 could provide a use case, and a condition could be implemented to help with that. But I'm not sure where that feature was left off, and it's lacking an owner at the moment.

@thockin
Member

thockin commented Jan 24, 2024

Understood. Should we close this?

@kannon92
Contributor Author

Yeah, I messaged my team to see what they say about it, but I'll close it for now.
/close

SIG Node PR Triage automation moved this from Needs Reviewer to Done Jan 25, 2024
@k8s-ci-robot
Contributor

@kannon92: Closed this PR.

In response to this:

Yeah, I messaged my team to see what they say about it, but I'll close it for now.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
