
KEP for adding initializationFailureThreshold to health probes #860

Merged
merged 1 commit into kubernetes:master on Apr 10, 2019

Conversation

matthyx
Contributor

@matthyx matthyx commented Feb 28, 2019

/assign @kubernetes/sig-node-proposals

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/design Categorizes issue or PR as related to design. labels Feb 28, 2019
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 28, 2019
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 28, 2019
@matthyx
Contributor Author

matthyx commented Mar 5, 2019

/assign @RobertKrawitz

@k8s-ci-robot
Contributor

@matthyx: GitHub didn't allow me to assign the following users: RobertKrawitz.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @RobertKrawitz

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@matthyx
Contributor Author

matthyx commented Mar 5, 2019

/assign @derekwaynecarr
@derekwaynecarr, can you merge this KEP now that I have the green light from sig-node and @RobertKrawitz as reviewer?

@matthyx
Contributor Author

matthyx commented Mar 5, 2019

The target release is 1.15


### Goals

Add an elegant and self-explanatory solution to running slow starting containers in Kubernetes with health probes enabled.
Contributor

Wording: something like

Allow slow starting containers to run safely during startup with health probes enabled.

"Elegant" and "self-explanatory" are opinion; the goal should be objective.

Contributor

I think you also need to answer comments from @thockin @liggitt @yujuhong on kubernetes/kubernetes#71449 and update this based on your decisions on that.


Add an elegant and self-explanatory solution to running slow starting containers in Kubernetes with health probes enabled.

By using `maxInitialFailureCount` * `periodSeconds` > **T** you have the following benefits:
Contributor

This belongs under the proposal. The proposal is specifically for maxInitialFailureCount, so start by noting that, then describe how it works. The description of the mechanism needs to be more specific: failures up to a point (maxInitialFailureCount) are ignored until the container has successfully started. Once a probe has ever succeeded, the container is considered to have been initialized, and the existing criteria are used to determine failure.

For example, if periodSeconds is 10, maxInitialFailureCount is 20, and failureThreshold is 3:

  • The kubelet will allow the container 200 seconds to start (20 probes, spaced 10 seconds apart).
  • If a probe succeeds at any time during that interval, the container is considered to have started, and failureThreshold is used thereafter.

For example:

  • The container fails 20 probes at startup. It is considered to have failed, and is terminated.
  • The container fails 10 probes at startup, succeeds once, and fails 3 more probes. The container is considered to have started at 100 seconds, and even though it hits the failure condition of 3 consecutive failures 140 seconds into its lifetime (less than the 200 seconds permitted by maxInitialFailureCount), it is considered to have failed and is terminated.
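For readers skimming the thread, here is a minimal Go sketch of the decision logic this comment describes. The `probeWorker` type, its field names, and the `observe` helper are illustrative assumptions, not the actual kubelet implementation.

```go
package main

import "fmt"

// probeWorker is a hypothetical sketch of the per-container probe state
// described above; the names are illustrative only.
type probeWorker struct {
	failureThreshold       int  // existing: consecutive failures tolerated once started
	maxInitialFailureCount int  // proposed: failures tolerated before the first success
	hasInitialized         bool // set once any probe has succeeded
	consecutiveFailures    int
}

// observe records one probe result and reports whether the container
// should be killed under the semantics described in the comment above.
func (w *probeWorker) observe(success bool) (kill bool) {
	if success {
		w.hasInitialized = true
		w.consecutiveFailures = 0
		return false
	}
	w.consecutiveFailures++
	if w.hasInitialized {
		// Once a probe has ever succeeded, the existing criterion applies.
		return w.consecutiveFailures >= w.failureThreshold
	}
	// Before the first success, only the larger initial budget applies.
	return w.consecutiveFailures >= w.maxInitialFailureCount
}

func main() {
	// periodSeconds=10, maxInitialFailureCount=20, failureThreshold=3:
	// 10 startup failures, one success, then 3 failures.
	w := &probeWorker{failureThreshold: 3, maxInitialFailureCount: 20}
	killed := false
	for i := 0; i < 10; i++ {
		killed = w.observe(false)
	}
	w.observe(true) // first success: the container has initialized
	for i := 0; i < 3 && !killed; i++ {
		killed = w.observe(false)
	}
	fmt.Println("killed:", killed) // true: failureThreshold applies after init
}
```

Running it reproduces the second example above: the container is considered failed on the 14th probe, roughly 140 seconds into its lifetime.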

Contributor

Also, what happens if maxInitialFailureCount is set to less than failureThreshold (which needs to be addressed in the code as well)?

Contributor Author

I am going to add clear numerical examples...


If the `kubelet` is restarted and `maxInitialFailureCount` is set very high, the effect would be the same as previously having a high `failureThreshold`, potentially making the container downtime much longer if a deadlock occurs during initialization.

Documentation must clearly state that this feature should only be used in particular cases.
Contributor

This isn't a risk. It's either something that needs to be done, in which case it's a goal, or something you won't do, in which case it's a non-goal.

There is potentially a risk involved in someone setting an extremely high threshold, and therefore stuck pods not getting killed in a timely manner, but that risk exists today.

Contributor Author

I probably need to rewrite the issue on stateless kubelet
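A minimal sketch of the limitation being discussed, assuming (purely for illustration) that the proposed `hasInitialized` flag lives only in the probe worker's memory; the KEP does not prescribe how or whether it is stored:

```go
package main

import "fmt"

// Hypothetical sketch only: the proposed startup flag kept purely in the
// probe worker's memory. Nothing is persisted, so a kubelet restart
// recreates the worker with the zero value and the startup grace period
// effectively restarts.
type worker struct {
	hasInitialized bool
}

func main() {
	w := &worker{} // state right after a kubelet restart
	fmt.Println("treated as uninitialized again:", !w.hasInitialized)
	// Worst case: a container that deadlocks after the restart can survive
	// up to periodSeconds * maxInitialFailureCount before being killed,
	// the same exposure as an equally high failureThreshold today.
}
```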

@matthyx
Contributor Author

matthyx commented Mar 6, 2019

@RobertKrawitz sorry, I am in the middle of editing... I have just marked the points already done as resolved to keep track of the things I have to finish later. (I am doing this during my lunch break and coffee breaks)

@matthyx
Contributor Author

matthyx commented Mar 6, 2019

@RobertKrawitz you can review again whenever you have time, thanks for your patience!


## Summary

Slow starting containers are difficult to address with the current status of health probes: they are either killed before being up, or could be left dead-locked for a very long time before being killed.
Contributor

Nit: dead-locked => deadlocked


## Motivation

The main problem with slow starting containers is that you have to leave them enough time to start before having `livenessProbe` fail `failureThreshold` times (otherwise `kubelet` kills them before they have a chance to be up).
Contributor

Nit: again, don't write in the second person.

More substantively, I'm not quite sure what the "problem" is here from the way it's written. I can see a few problems, but you need to be explicit about what you're trying to solve here.

## Motivation

The main problem with slow starting containers is that you have to leave them enough time to start before having `livenessProbe` fail `failureThreshold` times (otherwise `kubelet` kills them before they have a chance to be up).
This is even worse when load on the nodes can increase this long startup time, since you have to take into account the worst-case scenario (let's call it **T**).
Contributor

This proposal doesn't address that, and it's hard for me to see how it could without much deeper changes (the pod creator cannot know what the worst case scenario is). That's OK; this does help, but I would like you to add as an explicit non-goal that this proposal does not attempt to take into account system load; it simply allows a longer grace period at the start.


You can handle that today with various strategies:

- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **T**). While this ensures that the `livenessProbe` will not fail during the startup period, it also prevents the container from being killed early in case of a deadlock. Also, since the `livenessProbe` isn't run during that delay, there is no feedback loop on the container state.
Contributor

How does setting initialDelaySeconds prevent the container from being killed early if it deadlocks any more than this proposal does? Is the concern about the container coming up (so that a liveness probe would see it succeed) and only after that deadlock? If that's what you mean, say so. If it's something else, please state that as well.

You can handle that today with various strategies:

- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **T**). While this ensures that the `livenessProbe` will not fail during the startup period, it also prevents the container from being killed early in case of a deadlock. Also, since the `livenessProbe` isn't run during that delay, there is no feedback loop on the container state.
- Increase the allowed number of `livenessProbe` failures until `kubelet` kills the container (set `failureThreshold` so that `failureThreshold` times `periodSeconds` is greater than **T**). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner in case of a deadlock.
Contributor

Killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.
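To make the trade-off concrete, here is a small Go sketch of the arithmetic behind the second workaround; the startup time and probe period below are assumed values, not figures from the KEP:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Illustrative numbers only; they are not taken from the KEP.
	worstCaseStartupSeconds := 300.0 // T: worst-case startup time
	periodSeconds := 10.0

	// Workaround: raise failureThreshold so that
	// failureThreshold * periodSeconds > T.
	failureThreshold := int(math.Ceil(worstCaseStartupSeconds/periodSeconds)) + 1

	// Cost: after the container has come up, a post-startup deadlock is
	// only detected after the same long window.
	worstCaseDetection := float64(failureThreshold) * periodSeconds

	fmt.Printf("failureThreshold=%d -> deadlock detected after up to %.0fs\n",
		failureThreshold, worstCaseDetection)
}
```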

- The container fails 10 probes at startup, starts successfully, and after a long time fails 3 probes. The container is considered to have failed and is terminated after 30 seconds of downtime.
- The container fails 10 probes at startup, succeeds once, and fails 3 more probes. The container is considered to have started at 100 seconds, and even though it is still within the first 200 seconds of its lifetime covered by `maxInitialFailureCount`, it is considered to have failed because of `failureThreshold`, since it had initially started successfully; it is therefore terminated after 30 seconds of downtime.

You can see the implementation in PR [#71449].
Contributor

"This is implemented in PR..." -- again, avoid the second person.

- The container is running: `hasInitialized` is reverted to false until the next probe, which will succeed and reset it to true immediately afterwards.
- The container is dead-locked: `hasInitialized` is reverted to false, which means the container will have a maximum downtime of `periodSeconds` times `maxInitialFailureCount`. This is also the case today with a high `failureThreshold`.

#### maxInitialFailureCount is set smaller than failureThreshold
Contributor

This isn't a risk. This is a case that you need to decide how you want to handle, and state it in the description.

Contributor

Take this out of the risks and add whatever you decide to do to the description.
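Purely as an illustration of one way this open question could be resolved (the KEP has not decided it, and the rule below is hypothetical), the combination could be rejected at validation time:

```go
package main

import (
	"errors"
	"fmt"
)

// validateProbe sketches one possible answer to the open question above:
// reject specs whose initial budget is smaller than the regular threshold.
// Purely illustrative; the KEP has not decided how this case is handled.
func validateProbe(maxInitialFailureCount, failureThreshold int) error {
	if maxInitialFailureCount != 0 && maxInitialFailureCount < failureThreshold {
		return errors.New("maxInitialFailureCount must be >= failureThreshold")
	}
	return nil
}

func main() {
	fmt.Println(validateProbe(2, 3))  // rejected under this hypothetical rule
	fmt.Println(validateProbe(20, 3)) // accepted
}
```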

The following test cases can be covered by calling the `fakeExecProber` a number of times to verify:

- the container is killed after `maxInitialFailureCount` if it has never initialized (emulated by always calling `fakeExecProber{probe.Failure, nil}`)
- the container is killed after `failureThreshold` once it has initialized (emulated by calling `fakeExecProber{probe.Success, nil}` once) with total probes > `maxInitialFailureCount`
Contributor

I don't know that you need to test this; what's important in this regard is simply that it fails after failureThreshold failures after it has initialized successfully. But the case of it failing with the total number of probes < maxInitialFailureCount is simply a more interesting corner case.

Contributor Author

I think this is important to check for regressions concerning the hasInitialized boolean. If it is not properly set after the first probe success, the container will be killed when the total number of probes surpasses maxInitialFailureCount, regardless of failureThreshold.
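A hedged sketch of how these cases could be exercised, using a plain boolean sequence as a stand-in for `fakeExecProber` results; the helper names and the `simulate` function are illustrative, not the actual kubelet test code:

```go
package main

import "fmt"

// simulate runs a sequence of probe results through the decision logic and
// reports after how many probes the container is killed (0 = never killed).
func simulate(results []bool, maxInitialFailureCount, failureThreshold int) int {
	hasInitialized := false
	consecutiveFailures := 0
	for i, success := range results {
		if success {
			hasInitialized = true
			consecutiveFailures = 0
			continue
		}
		consecutiveFailures++
		limit := maxInitialFailureCount
		if hasInitialized {
			limit = failureThreshold
		}
		if consecutiveFailures >= limit {
			return i + 1
		}
	}
	return 0
}

func repeat(v bool, n int) []bool {
	out := make([]bool, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	// Case 1: never initialized -> killed once the initial budget is spent.
	never := repeat(false, 25)
	fmt.Println("never initialized, killed at probe", simulate(never, 20, 3)) // 20

	// Case 2: initialized after 10 failures -> killed 3 failures later.
	once := append(append(repeat(false, 10), true), repeat(false, 3)...)
	fmt.Println("initialized once, killed at probe", simulate(once, 20, 3)) // 14
}
```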

- the container is killed after `failureThreshold` once it has initialized (emulated by calling `fakeExecProber{probe.Success, nil}` once) with total probes > `maxInitialFailureCount`
- the container is killed after `failureThreshold` once it has initialized (emulated by calling `fakeExecProber{probe.Success, nil}` once) with total probes < `maxInitialFailureCount`

## Implementation History
Contributor

You've also implemented (and tested?) code in the PR, correct? You should note the prototype that is concurrently under review.

Contributor Author

Yes, it works well in my case - this is why I am pushing for this feature.


### Risks and Mitigations

#### Stateless kubelet
Contributor

This is really a limitation rather than a risk; you don't have a way at present of persisting hasInitialized. But OK, it's reasonable enough to put it here.

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Mar 7, 2019
@matthyx matthyx force-pushed the master branch 2 times, most recently from 9e13b9d to 568f257 on March 7, 2019 10:38
@matthyx
Contributor Author

matthyx commented Mar 7, 2019

@RobertKrawitz I have edited the KEP to reflect your excellent review, thanks!

- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **startup time**). While this ensures no `livenessProbe` will run and fail during the startup period (triggering a kill), it also delays deadlock detection if the container starts faster than `initialDelaySeconds`. Also, since the `livenessProbe` isn't run at all during startup, there is no feedback loop on the actual startup time of the container.
- Increase the allowed number of `livenessProbe` failures until `kubelet` kills the container (set `failureThreshold` so that `failureThreshold` times `periodSeconds` is greater than **startup time**). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.

However, none of these strategies give an acceptable protection against deadlocks, which is the primary use case of `livenessProbe`.
Contributor

This proposal allows faster detection of deadlocks or otherwise unresponsive pods after they have started up, which is a good thing, but it doesn't actually protect against deadlocks.


- long data initialization: only the first startup takes a lot of time
- heavy workload: every startup takes a lot of time
- underpowered/overloaded node: startup times depend on external factors
Contributor

This is really a different situation; this problem is extrinsic to the pod.

Contributor Author

True, but the result at the pod level is also a delayed startup... let's remove it for now, and I might come back to the subject in a new proposal.

Contributor

This is a genuine cause of slow startup. List it, but note that this proposal is not intended to solve that problem.


### Non-Goals

- Make probe parameters (`initialDelaySeconds`, `failureThreshold`, `periodSeconds`, ...) auto-tunable depending on node performance (only provide a way to allow a longer grace period at the start of the container).
Contributor

More broadly, this does not address the issue of pod load affecting startup (or any other liveness probe that may be delayed due to load). This proposal is strictly at the pod level, not the node level.


@@ -74,7 +74,7 @@ There are various strategies to handle this situation with the current API:
- Delay the initial `livenessProbe` sufficiently to permit the container to start up (set `initialDelaySeconds` greater than **startup time**). While this ensures no `livenessProbe` will run and fail during the startup period (triggering a kill), it also delays deadlock detection if the container starts faster than `initialDelaySeconds`. Also, since the `livenessProbe` isn't run at all during startup, there is no feedback loop on the actual startup time of the container.
- Increase the allowed number of `livenessProbe` failures until `kubelet` kills the container (set `failureThreshold` so that `failureThreshold` times `periodSeconds` is greater than **startup time**). While this gives enough time for the container to start up and allows a feedback loop, it prevents the container from being killed in a timely manner if it deadlocks or otherwise hangs after it has initially successfully come up.

However, none of these strategies give an acceptable protection against deadlocks, which is the primary use case of `livenessProbe`.
However, none of these strategies provide a timely answer to slow starting containers stuck in a deadlock, which in theory is the primary reason for setting up a `livenessProbe`.
Contributor

No need for "in theory".

@RobertKrawitz
Contributor

/lgtm
This looks good to me at this point.

@k8s-ci-robot
Contributor

@RobertKrawitz: changing LGTM is restricted to assignees, and only kubernetes/enhancements repo collaborators may be assigned issues.

In response to this:

/lgtm
This looks good to me at this point.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@matthyx
Contributor Author

matthyx commented Mar 8, 2019

@derekwaynecarr can you review it as well?

@matthyx
Contributor Author

matthyx commented Mar 8, 2019

@RobertKrawitz after working again on the implementation, I wonder if it wouldn't make more sense to rename the new parameter to InitializationFailureThreshold, to make its relationship with FailureThreshold clearer. What do you think?

@RobertKrawitz
Contributor

I agree, that will be more obvious.

@matthyx
Contributor Author

matthyx commented Mar 8, 2019

I agree, that will be more obvious.

done

@matthyx matthyx changed the title KEP for adding maxInitialFailureCount to health probes KEP for adding initializationFailureThreshold to health probes Mar 8, 2019
@matthyx
Contributor Author

matthyx commented Mar 11, 2019

Can we try to merge it in order to launch the review process initiated by an issue in enhancements?

@RobertKrawitz
Contributor

cc @derekwaynecarr @dchen1107: unless more review is needed, this probably should get merged.

@RobertKrawitz
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2019
@RobertKrawitz
Contributor

/assign @derekwaynecarr

@matthyx
Contributor Author

matthyx commented Mar 21, 2019

@derekwaynecarr @dchen1107 now that code thaw for 1.14 has passed, can we move on to this KEP?

Member

@derekwaynecarr derekwaynecarr left a comment

I think this is fine for merging as provisional.

Before it transitions to implementable, we should make clear the graduation criteria for the feature. I assume this needs to be mapped to feature gates, and we should have some documentation on how we handle apiserver and node version skew.


### Risks and Mitigations

#### Stateless kubelet
Member

the kep should include a section on graduation criteria?

i assume this new field would have to go through the normal process of feature gates.

Contributor Author

There is already an API review here: https://github.com/orgs/kubernetes/projects/13#card-19727773
I am quite new to the KEP process; do you mean something else?

@matthyx
Contributor Author

matthyx commented Apr 9, 2019

Alright, now I see what you mean by graduation criteria... well if you look at the checklist, we have other points to validate first.
Maybe we could merge first, then look at the points in order?
First step would be to open an issue in release milestones, but I need the merge first.

@derekwaynecarr
Member

I am fine w/ merging and iterating in follow-on PRs adding missing items before it goes implementable.

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, matthyx, RobertKrawitz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2019
@k8s-ci-robot k8s-ci-robot merged commit 62727f3 into kubernetes:master Apr 10, 2019