KEP for adding initializationFailureThreshold to health probes #860
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/assign @RobertKrawitz
@matthyx: GitHub didn't allow me to assign the following users: RobertKrawitz. Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @derekwaynecarr
The target release is 1.15
### Goals

Add an elegant and self-explanatory solution to running slow starting containers in Kubernetes with health probes enabled.
Wording: something like
Allow slow starting containers to run safely during startup with health probes enabled.
"Elegant" and "self-explanatory" are opinion; the goal should be objective.
I think you also need to answer comments from @thockin @liggitt @yujuhong on kubernetes/kubernetes#71449 and update this based on your decisions on that.
Add an elegant and self-explanatory solution to running slow starting containers in Kubernetes with health probes enabled.

By using `maxInitialFailureCount` * `periodSeconds` > **T** you have the following benefits:
This belongs under the proposal. The proposal is specifically for `maxInitialFailureCount`, so start by noting that, then describe how it works. The description of the mechanism needs to be more specific: failures up to a point (`maxInitialFailureCount`) are ignored until the container has successfully started. Once a probe has ever succeeded, the container is considered to have been initialized, and the existing criteria are used to determine failure.
For example, if `periodSeconds` is 10, `maxInitialFailureCount` is 20, and `failureThreshold` is 3:
- The kubelet will allow the container 200 seconds to start (20 probes, spaced 10 seconds apart).
- If a probe succeeds at any time during that interval, the container is considered to have started, and `failureThreshold` is used thereafter.
For example:
- The container fails 20 probes at startup. It is considered to have failed, and is terminated.
- The container fails 10 probes at startup, succeeds once, and fails 3 more probes. The container is considered to have started at 100 seconds, and even though it hits the failure condition of 3 consecutive failures only 140 seconds into its lifetime (less than the 200 seconds permitted by `maxInitialFailureCount`), it is considered to have failed and is terminated.
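To make the example above concrete, a pod spec using these values might look like the following sketch. This is illustrative only: `maxInitialFailureCount` is the field proposed in this KEP (not part of the current API), and the pod name, image, port, and health endpoint are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-example                      # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/slow-app:1.0    # placeholder image
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz                          # placeholder health endpoint
        port: 8080
      periodSeconds: 10             # probe every 10 seconds
      failureThreshold: 3           # once started, 3 consecutive failures => restart
      maxInitialFailureCount: 20    # proposed field: up to 20 failures (~200s) tolerated before the first success
```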
Also, what happens if `maxInitialFailureCount` is set to less than `failureThreshold` (which needs to be addressed in the code as well)?
I am going to add clear numerical examples...
If the `kubelet` is restarted, and `maxInitialFailureCount` is set very high, the effect would be the same as previously having a high `failureThreshold`, potentially making the container downtime much longer if a deadlock occurs during initialization.

Documentation must clearly state that this feature should only be used in particular cases.
This isn't a risk. It's either something that needs to be done, in which case it's a goal, or that you won't do anything, in which case it's a non-goal.
There is potentially a risk involved in someone setting an extremely high threshold, and therefore stuck pods not getting killed in a timely manner, but that risk exists today.
I probably need to rewrite the issue on stateless kubelet
@RobertKrawitz sorry, I am in the middle of editing... I have just marked the points already done as resolved to keep track of the things I have to finish later. (I am doing this during my lunch break and coffee breaks)
@RobertKrawitz you can review again whenever you have time, thanks for your patience!
## Summary

Slow starting containers are difficult to address with the current status of health probes: they are either killed before being up, or could be left dead-locked for a very long time before being killed.
Nit: dead-locked => deadlocked
## Motivation

The main problem with slow starting containers is that you have to leave them enough time to start before having `livenessProbe` fail `failureThreshold` times (otherwise `kubelet` kills them before they have a chance to be up).
Nit: again, don't write in the second person.
More substantively, I'm not quite sure what the "problem" is here from the way it's written. I can see a few problems, but you need to be explicit about what you're trying to solve here.
## Motivation

The main problem with slow starting containers is that you have to leave them enough time to start before having `livenessProbe` fail `failureThreshold` times (otherwise `kubelet` kills them before they have a chance to be up).
This is even worse when load on the nodes can increase this long startup time, since you have to take into account the worst-case scenario (let's call it **T**).
This proposal doesn't address that, and it's hard for me to see how it could without much deeper changes (the pod creator cannot know what the worst case scenario is). That's OK; this does help, but I would like you to add as an explicit non-goal that this proposal does not attempt to take into account system load; it simply allows a longer grace period at the start.
You can handle that today with various strategies:

- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **T**). While this ensures that the `livenessProbe` will not fail during the startup period, it also prevents the container from being killed early in case of a deadlock. Also, since the `livenessProbe` isn't run during that delay, there is no feedback loop on the container state.
How does setting `initialDelaySeconds` prevent the container from being killed early if it deadlocks any more than this proposal does? Is the concern about the container coming up (so that a liveness probe would see it succeed) and only after that deadlock? If that's what you mean, say so. If it's something else, please state that as well.
You can handle that today with various strategies:

- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **T**). While this ensures that the `livenessProbe` will not fail during the startup period, it also prevents the container from being killed early in case of a deadlock. Also, since the `livenessProbe` isn't run during that delay, there is no feedback loop on the container state.
- Increase the allowed number of `livenessProbe` failures until `kubelet` kills the container (set `failureThreshold` so that `failureThreshold` times `periodSeconds` is greater than **T**). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner in case of a deadlock.
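For illustration, here are sketches of these two workarounds expressed with today's probe API, assuming a worst-case startup time **T** of roughly 300 seconds (the endpoint and numbers are placeholders, not recommendations):

```yaml
# Workaround 1: delay the first probe until past the worst-case startup time.
# No probes run during the first 300s, so there is no feedback and no early kill.
livenessProbe:
  httpGet:
    path: /healthz            # placeholder endpoint
    port: 8080
  initialDelaySeconds: 300
  periodSeconds: 10
  failureThreshold: 3
---
# Workaround 2: tolerate enough consecutive failures to cover the startup time.
# 30 * 10s = 300s, but this budget also applies after startup, delaying detection of a later hang.
livenessProbe:
  httpGet:
    path: /healthz            # placeholder endpoint
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
```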
Killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.
- The container fails 10 probes at startup, starts successfully, and after a long time fails 3 probes. The container is considered to have failed and is terminated after 30 seconds of downtime.
- The container fails 10 probes at startup, succeeds once, and fails 3 more probes. The container is considered to have started at 100 seconds, and even though it is still within the first 200 seconds of its lifetime covered by `maxInitialFailureCount`, it is considered to have failed because of `failureThreshold` and the fact that it had initially started successfully, therefore it is terminated after 30 seconds of downtime.

You can see the implementation in PR [#71449].
"This is implemented in PR..." -- again, avoid the second person.
- The container is running: `hasInitialized` is reverted to false until the next probe, which will succeed and reset it to true immediately after.
- The container is deadlocked: `hasInitialized` is reverted to false, which means the container will have a maximum downtime of `periodSeconds` times `maxInitialFailureCount`. This is also the case today with a high `failureThreshold`.

#### maxInitialFailureCount is set smaller than failureThreshold
This isn't a risk. This is a case that you need to decide how you want to handle, and state it in the description.
Take this out of the risks and add whatever you decide to do to the description.
The following test cases can be covered by calling the `fakeExecProber` a number of times to verify:

- the container is killed after `maxInitialFailureCount` if it has never initialized (emulated by always calling `fakeExecProber{probe.Failure, nil}`)
- the container is killed after `failureThreshold` once it has initialized (emulated by calling `fakeExecProber{probe.Success, nil}` once) with total probes > `maxInitialFailureCount`
I don't know that you need to test this; what's important in this regard is simply that it fails after `failureThreshold` failures after it has initialized successfully. But the case of it failing with the total number of probes < `maxInitialFailureCount` is simply a more interesting corner case.
I think this is important to check for regressions concerning the `hasInitialized` boolean. If it is not properly set after the first probe success, the container will be killed when total probes surpasses `maxInitialFailureCount` regardless of `failureThreshold`.
- the container is killed after `failureThreshold` once it has initialized (emulated by calling `fakeExecProber{probe.Success, nil}` once) with total probes > `maxInitialFailureCount`
- the container is killed after `failureThreshold` once it has initialized (emulated by calling `fakeExecProber{probe.Success, nil}` once) with total probes < `maxInitialFailureCount`

## Implementation History
You've also implemented (and tested?) code in the PR, correct? You should note the prototype that is concurrently under review.
yes, works well in my case - this is why I am pushing for this feature
### Risks and Mitigations

#### Stateless kubelet
This is really a limitation rather than a risk; you don't have a way at present of persisting `hasInitialized`. But OK, it's reasonable enough to put it here.
@RobertKrawitz I have edited the KEP to reflect your excellent review, thanks!
- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **startup time**). While this ensures no `livenessProbe` will run and fail during the startup period (triggering a kill), it also delays deadlock detection if the container starts faster than `initialDelaySeconds`. Also, since the `livenessProbe` isn't run at all during startup, there is no feedback loop on the actual startup time of the container.
- Increase the allowed number of `livenessProbe` failures until `kubelet` kills the container (set `failureThreshold` so that `failureThreshold` times `periodSeconds` is greater than **startup time**). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.

However, none of these strategies give an acceptable protection against deadlocks, which is the primary use case of `livenessProbe`.
This proposal allows faster detection of deadlocks or otherwise unresponsive pods after they have started up, which is a good thing, but it doesn't actually protect against deadlocks.
- long data initialization: only the first startup takes a lot of time
- heavy workload: every startup takes a lot of time
- underpowered/overloaded node: startup times depend on external factors
This is really a different situation; this problem is extrinsic to the pod.
True, but the result at the pod level is also a delayed startup... let's remove it for now, and I might come back to the subject in a new proposal.
This is a genuine cause of slow startup. List it, but note that this proposal is not intended to solve that problem.
### Non-Goals

- Make probe parameters (`initialDelaySeconds`, `failureThreshold`, `periodSeconds`, ...) auto-tunable depending on node performance (only provide a way to allow a longer grace period at the start of the container).
More broadly, this does not address the issue of pod load affecting startup (or any other liveness probe that may be delayed due to load). This proposal is strictly at the pod level, not the node level.
- The container is running: `hasInitialized` is reverted to false until the next probe, which will succeed and reset it to true immediately after.
- The container is deadlocked: `hasInitialized` is reverted to false, which means the container will have a maximum downtime of `periodSeconds` times `maxInitialFailureCount`. This is also the case today with a high `failureThreshold`.

#### maxInitialFailureCount is set smaller than failureThreshold
Take this out of the risks and add whatever you decide to do to the description.
@@ -74,7 +74,7 @@ There are various strategies to handle this situation with the current API:
- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **startup time**). While this ensures no `livenessProbe` will run and fail during the startup period (triggering a kill), it also delays deadlock detection if the container starts faster than `initialDelaySeconds`. Also, since the `livenessProbe` isn't run at all during startup, there is no feedback loop on the actual startup time of the container.
- Increase the allowed number of `livenessProbe` failures until `kubelet` kills the container (set `failureThreshold` so that `failureThreshold` times `periodSeconds` is greater than **startup time**). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.

However, none of these strategies give an acceptable protection against deadlocks, which is the primary use case of `livenessProbe`.
However, none of these strategies provide a timely answer to slow starting containers stuck in a deadlock, which in theory is the primary reason for setting up a `livenessProbe`.
No need for "in theory".
/lgtm
@RobertKrawitz: changing LGTM is restricted to assignees, and only kubernetes/enhancements repo collaborators may be assigned issues. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@derekwaynecarr can you review it as well?
@RobertKrawitz after working again on the implementation, I wonder if it wouldn't make more sense to rename the new parameter to `initializationFailureThreshold`.
I agree, that will be more obvious.
done
Can we try to merge it in order to launch the review process initiated by an issue in enhancements?
cc @derekwaynecarr @dchen1107 unless more review is needed this probably should get merged.
/lgtm
/assign @derekwaynecarr
@derekwaynecarr @dchen1107 now that code thaw for 1.14 has passed, can we move on to this KEP?
I think this is fine for merging as provisional.
Before it transitions to implementable, we should make clear the graduation criteria for the feature. I assume this needs to be mapped to feature gates, and we should have some documentation on how we handle apiserver and node version skew.
### Risks and Mitigations

#### Stateless kubelet
The KEP should include a section on graduation criteria?
I assume this new field would have to go through the normal process of feature gates.
There is already an API review here: https://github.com/orgs/kubernetes/projects/13#card-19727773
I am quite new to the KEP process, do you mean something else?
Alright, now I see what you mean by graduation criteria... well if you look at the checklist, we have other points to validate first.
I am fine w/ merging and iterating in follow-on PRs adding missing items before it goes implementable. /approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: derekwaynecarr, matthyx, RobertKrawitz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/assign @kubernetes/sig-node-proposals