
Pod Termination handling kicks in before the ingress controller has had time to process #106476

Open
nirnanaaa opened this issue Nov 17, 2021 · 22 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@nirnanaaa

What happened?

When a pod enters its Terminating state, it receives a SIGTERM asking it to finish up its work, after which Kubernetes proceeds with deleting the pod.

At the same time that the pod starts terminating, the ingress controller receives the updated endpoints object and starts removing the pod from the load balancer's list of targets that traffic can be sent to.

Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the list of endpoints - are decoupled from one another, so the SIGTERM may already have been handled before, or at the same time as, the target in the target group is being deregistered.

As a result, the load balancer might still send traffic to targets that are still in its endpoint list but have already shut down properly. This can result in dropped connections, as the LB keeps trying to send requests to the already-terminated pod, and the LB will in turn reply with 5xx responses.

What did you expect to happen?

No traffic should be dropped during shutdown.

The SIGTERM should only be sent after the ingress controller/LB has removed the target from the target group. Readiness gates work pretty well for pod startup/rollout but have no equivalent during pod deletion.
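For comparison, the startup side already has readiness gates, roughly like this (a minimal sketch; the conditionType is a placeholder, controllers such as the AWS LB controller inject their own condition types):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  readinessGates:
    # The pod only counts as Ready once a controller sets this condition to
    # True, e.g. after the target has been registered and is healthy in the LB.
    - conditionType: "target-health.example.com/my-target-group"
  containers:
    - name: app
      image: nginx
```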

How can we reproduce it (as minimally and precisely as possible)?

This is a timing-dependent problem, which makes it hard to reproduce reliably:

  • Provision an ingress controller (the AWS Load Balancer controller, for example)
  • Create an Ingress
  • Create a Service and pods (multiple pods via a Deployment work best) for this Ingress
  • (Add some delay/load to the cluster so that the LB synchronization becomes slower or delayed)
  • Start an HTTP benchmark to produce some artificial load
  • Roll out a change to the Deployment, or just evict some pods (see the command sketch below)
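A rough sketch of those steps (assuming the AWS Load Balancer controller is already installed; the manifest file names, the deployment name `example`, and the `hey` load tool are placeholders):

```sh
# Create the workload and expose it through the load balancer.
kubectl apply -f deployment.yaml   # Deployment + Service with several replicas
kubectl apply -f ingress.yaml      # Ingress backed by the Service above

# Produce sustained artificial load against the LB hostname.
LB_HOST=$(kubectl get ingress example -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
hey -z 5m "http://${LB_HOST}/" &

# While the benchmark is running, trigger a rolling update and watch for 5xx responses.
kubectl rollout restart deployment/example
```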

Anything else we need to know?

We've been relying on Pod-Graceful-Drain, which unfortunately intercepts and breaks k8s internals.

You can also achieve a reasonably good result using a sleep as a preStop hook, but that's not reliable at all - it's just a guessing game whether your traffic will actually be drained after X seconds - and it requires either a statically linked sleep binary mounted into each container or the existence of sleep in the container image.
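For context, the sleep workaround usually looks like this (a minimal sketch; the 30-second value is an arbitrary guess, which is exactly the unreliable part):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      # Must be larger than the preStop delay plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: nginx
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM and hope the LB has deregistered the target by
                # then; requires a sleep binary inside the container image.
                command: ["sleep", "30"]
```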

I also opened an issue on the ingress controller's repo.

Kubernetes version

$ kubectl version
v1.18.20

Cloud provider

AWS/EKS

OS version

# On Linux:
sh-4.2$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

$ uname -a
Linux xxx 4.14.252-195.483.amzn2.x86_64 #1 SMP Mon Nov 1 20:58:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

- [https://github.com/kubernetes-sigs/aws-load-balancer-controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller)

Container runtime (CRI) and version (if applicable)

Docker version 20.10.7, build f0df350

Related plugins (CNI, CSI, ...) and versions (if applicable)

@nirnanaaa nirnanaaa added the kind/bug Categorizes issue or PR as related to a bug. label Nov 17, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 17, 2021
@nirnanaaa nirnanaaa changed the title Pod Termination handling kicks in before the load balancer has had time to remove target Pod Termination handling kicks in before the ingress controller has had time to process Nov 17, 2021
@nirnanaaa
Author

/sig provider-aws
/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Nov 17, 2021
@k8s-ci-robot
Contributor

@nirnanaaa: The label(s) sig/provider-aws cannot be applied, because the repository doesn't have them.

In response to this:

/sig provider-aws
/sig network

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 17, 2021
@nirnanaaa
Author

/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Nov 17, 2021
@uablrek
Contributor

uablrek commented Dec 6, 2021

A blog post that describes the problem: https://medium.com/flant-com/kubernetes-graceful-shutdown-nginx-php-fpm-d5ab266963c2

I especially like the chart. It illuminates the problem in K8s: there is no kube-proxy<->kubelet communication (and there shouldn't be!). They both react independently to kube-api updates.

@nirnanaaa
Author

Indeed. (The part of their article about SIGTERM is very misleading, because it doesn't matter how well your app handles SIGTERM if new traffic still arrives after your app's shutdown has completed - that's the "Practice. Potential problems with graceful shutdown" section of their article.)

I totally get that there is not supposed to be a link - don't get me wrong, most of these errors could, and probably should, be solved on the client side - but unfortunately it's very hard for users to understand the behavior and mechanics. They're simply used to these kinds of dependencies.

I just feel a sleep is not a guarantee. Even though it might be the simplest approach and solves the majority of cases, it drives the people crazy who are trying to figure out the remaining errors.

I do think the preStop mechanic is the right way to approach this, but with a more event-based release. wdyt?

@uablrek
Contributor

uablrek commented Dec 6, 2021

I don't think this can be done "by means of the kube-api", so to speak. A pod can get closer by monitoring its own endpoints, but besides being quite complex, it would still not know whether all kube-proxies have updated the actual load balancing. Introducing some status for (removed) endpoints that is updated once all kube-proxy instances have updated their load balancing is something I find horrible. Just imagine the cluster-wide sync and all the possible fault cases 😧

IMO this falls into the "service mesh" domain. They monitor actual connections I think (circuit breaking?), but I must admit I am not very familiar with service meshes.

@cheftako
Member

cheftako commented Dec 8, 2021

/cc @kishorj

@dcbw
Member

dcbw commented Dec 9, 2021

/assign rikatz

@dcbw
Member

dcbw commented Dec 9, 2021

/assign bowei

@nckturner
Contributor

/triage-accepted

@andrewsykim
Member

The EndpointSlice API now supports the terminating condition; I wonder if ingress controllers can be updated to leverage this for graceful termination of endpoints?
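For illustration, an endpoint for a terminating pod is published in an EndpointSlice roughly like this (a trimmed sketch with made-up names and addresses); a controller could keep such an endpoint registered for draining while no longer routing new traffic to it:

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc12
  labels:
    kubernetes.io/service-name: example
addressType: IPv4
ports:
  - name: http
    port: 8080
    protocol: TCP
endpoints:
  - addresses:
      - "10.0.1.23"
    conditions:
      ready: false       # no longer ready because the pod is being deleted
      serving: true      # still capable of serving in-flight/draining traffic
      terminating: true  # the pod has a deletion timestamp
```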

@thockin
Member

thockin commented Jan 20, 2022

As @uablrek described, syncing state from all kube-proxies on every change to every endpoint of every service isn't feasible. I'm not sure terminating endpoints helps much, if the core problem is that the Ingress controller or proxy is out-to-lunch or otherwise not getting updates.

@nirnanaaa
Author

nirnanaaa commented Jan 21, 2022

I feel that the actual problem is that pods just get their termination signals way too early in this process, right? I'm not sure what ingress controllers or kube-proxy should do about that. If a container process is down, it doesn't matter whether kube-proxy or the controller stops sending traffic there eventually - all traffic in between these events will still end up hitting an already dead target.

@aojea
Member

aojea commented Jan 21, 2022

I feel that the actual problem is that pods just get their termination signals way too early in this process, right? I'm not sure what ingress controllers or kube-proxy should do about that. If a container process is down, it doesn't matter whether kube-proxy or the controller stops sending traffic there eventually - all traffic in between these events will still end up hitting an already dead target.

the process is not parallel, it is sequential and async

kill pod -> pod goes not ready -> endpoints controller receives the event that the pod is not ready -> updates the endpoints -> ingress controller/kube-proxy receive an event that the endpoints have changed -> ...

@nirnanaaa
Author

true, but shouldn't the pod be able to stay alive until that event has been propagated and processed (just as a preStop hook kinda does)? This would ensure that the container doesn't even receive SIGTERM until kube-proxy/$controller has had time to remove the endpoint.

@aojea
Member

aojea commented Jan 21, 2022

true, but shouldn't the pod be able to stay alive until that event has been propagated and processed (just as a preStop hook kinda does)? This would ensure that the container doesn't even receive SIGTERM until kube-proxy/$controller has had time to remove the endpoint.

how does the pod know that? :)

syncing state from all kube-proxies on every change to every endpoint of every service isn't feasible.

I see your point, you want to make the process completely synchronous, but that is not how Kubernetes works (https://kubernetes.io/docs/concepts/architecture/controller/#controller-pattern). Oversimplifying, you have a bunch of controllers that sync the current state to the desired state, and they are eventually consistent 🦄

@andrewsykim
Member

(fixing #106476 (comment))

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 2, 2022
@mbyio

mbyio commented Apr 27, 2022

I've spent a lot of time debugging this issue in AWS. One particularly problematic use case is with network load balancers. They always take at least 2 minutes to deregister a pod in IP mode, counted from when they receive the request via the AWS API to remove the pod. During this time, the NLB will still send new TCP connections to the pod.

The AWS Load Balancer controller would be able to detect when a pod is actually completely deregistered. So would other ingress controllers. So I would love it if we had a "termination gate" or similar, which could be added by a controller, analogous to readiness gates. Actually, the AWS Load Balancer controller already makes use of readiness gates, because NLB/ALB registration is just as slow. It's odd that Kubernetes doesn't do anything to help with deregistration as well.

If we had a termination gate, this would fix the problem, because we would complete the loop: k8s would say it wants to terminate a pod, controllers would have time to update and stop sending traffic, and then the pod would be told "okay, it's time to clean up".
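A hypothetical sketch of what such a gate could look like in the pod spec, mirroring today's readinessGates (this field does not exist in Kubernetes; it is shown only to illustrate the idea):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  # HYPOTHETICAL: terminationGates is not a real Kubernetes field.
  # The idea: the kubelet would delay SIGTERM until a controller (e.g. the
  # AWS Load Balancer controller) sets this condition to True after the
  # target has been fully deregistered from the load balancer.
  terminationGates:
    - conditionType: "target-deregistered.example.com/my-target-group"
  containers:
    - name: app
      image: nginx
```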

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@mbyio

mbyio commented Jul 26, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@bowei
Member

bowei commented Jul 26, 2022

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 26, 2022
@Arcadiyyyyyyyy

/remove-lifecycle stale
