
Pod Termination handling kicks in before the ingress controller has had time to process #106476

Open
nirnanaaa opened this issue Nov 17, 2021 · 22 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@nirnanaaa

What happened?

When a pod enters its Terminating state, it receives a SIGTERM asking it to finish up its work, after which Kubernetes proceeds with deleting the pod.

At the same time that the pod starts terminating, the ingress controller receives the updated endpoints object and starts removing the pod from the load balancer's list of targets that traffic can be sent to.

Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the list of endpoints - are decoupled from one another, so the SIGTERM may already have been handled before, or at the same time as, the target in the target group is being deregistered.

As a result, the load balancer might still send traffic to targets that are still in its endpoint list but have already shut down properly. This can result in dropped connections, as the LB keeps trying to send requests to the already-terminated pod, and the LB will in turn reply with 5xx responses.

What did you expect to happen?

No traffic should be dropped during shutdown.

The SIGTERM should only be sent after the ingress controller/LB has removed the target from the target group. Readiness gates work pretty well for pod startup/rollout but have no equivalent during pod deletion.
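For comparison, the startup side already has readiness gates, roughly like this (a minimal sketch; the conditionType is a placeholder, controllers such as the AWS LB controller inject their own condition types):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  readinessGates:
    # The pod only counts as Ready once a controller sets this condition to
    # True, e.g. after the target has been registered and is healthy in the LB.
    - conditionType: "target-health.example.com/my-target-group"
  containers:
    - name: app
      image: nginx
```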

How can we reproduce it (as minimally and precisely as possible)?

This is a timing-dependent problem, which makes it hard to reproduce reliably:

  • Provision an ingress controller (the AWS Load Balancer controller, for example)
  • Create an Ingress
  • Create a Service and pods (multiple pods via a Deployment work best) for this Ingress
  • (Add some delay/load to the cluster so that the LB synchronization becomes slower or delayed)
  • Start an HTTP benchmark to produce some artificial load
  • Roll out a change to the Deployment, or just evict some pods (see the command sketch below)
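A rough sketch of those steps (assuming the AWS Load Balancer controller is already installed; the manifest file names, the deployment name `example`, and the `hey` load tool are placeholders):

```sh
# Create the workload and expose it through the load balancer.
kubectl apply -f deployment.yaml   # Deployment + Service with several replicas
kubectl apply -f ingress.yaml      # Ingress backed by the Service above

# Produce sustained artificial load against the LB hostname.
LB_HOST=$(kubectl get ingress example -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
hey -z 5m "http://${LB_HOST}/" &

# While the benchmark is running, trigger a rolling update and watch for 5xx responses.
kubectl rollout restart deployment/example
```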

Anything else we need to know?

We've been relying on Pod-Graceful-Drain, which unfortunately intercepts and breaks k8s internals.

You can also achieve a reasonably good result using a sleep as a preStop hook, but that's not reliable at all - it's just a guessing game whether your traffic will actually be drained after X seconds - and it requires either a statically linked sleep binary mounted into each container or the existence of sleep in the container image.
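For context, the sleep workaround usually looks like this (a minimal sketch; the 30-second value is an arbitrary guess, which is exactly the unreliable part):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      # Must be larger than the preStop delay plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: nginx
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM and hope the LB has deregistered the target by
                # then; requires a sleep binary inside the container image.
                command: ["sleep", "30"]
```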

I also opened an issue on the ingress controller's repo.

Kubernetes version

$ kubectl version
v1.18.20

Cloud provider

AWS/EKS

OS version

# On Linux:
sh-4.2$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

$ uname -a
Linux xxx 4.14.252-195.483.amzn2.x86_64 #1 SMP Mon Nov 1 20:58:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

- [https://github.com/kubernetes-sigs/aws-load-balancer-controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller)

Container runtime (CRI) and version (if applicable)

Docker version 20.10.7, build f0df350

Related plugins (CNI, CSI, ...) and versions (if applicable)

@nirnanaaa nirnanaaa added the kind/bug Categorizes issue or PR as related to a bug. label Nov 17, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 17, 2021
@nirnanaaa nirnanaaa changed the title Pod Termination handling kicks in before the load balancer has had time to remove target Pod Termination handling kicks in before the ingress controller has had time to process Nov 17, 2021
@nirnanaaa
Author

/sig provider-aws
/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Nov 17, 2021
@k8s-ci-robot
Contributor

@nirnanaaa: The label(s) sig/provider-aws cannot be applied, because the repository doesn't have them.

In response to this:

/sig provider-aws
/sig network

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 17, 2021
@nirnanaaa
Author

/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Nov 17, 2021
@uablrek
Contributor

uablrek commented Dec 6, 2021

A blog post that describes the problem: https://medium.com/flant-com/kubernetes-graceful-shutdown-nginx-php-fpm-d5ab266963c2

I especially like the chart. It illuminates the problem in K8s: there is no kube-proxy<->kubelet communication (and there shouldn't be!). They both react independently to kube-api updates.

@nirnanaaa
Author

Indeed. (The part of their article about SIGTERM is very misleading, because it doesn't matter how well your app handles SIGTERM if new traffic still arrives after your app's shutdown has completed - that's the "Practice. Potential problems with graceful shutdown" section of their article.)

I totally get that there is not supposed to be a link - don't get me wrong, most of these errors could, and probably should, be solved on the client side - but unfortunately it's very hard for users to understand the behavior and mechanics. They're simply used to these kinds of dependencies.

I just feel a sleep is not a guarantee. Even though it might be the simplest approach and solves the majority of cases, it drives the people crazy who are trying to figure out the remaining errors.

I do think the preStop mechanic is the right way to approach this, but with a more event-based release. wdyt?

@uablrek
Contributor

uablrek commented Dec 6, 2021

I don't think this can be done "by means of the kube-api", so to speak. A pod can get closer by monitoring its own endpoints, but besides being quite complex, it would still not know whether all kube-proxies have updated the actual load balancing. Introducing some status for (removed) endpoints that is updated once all kube-proxy instances have updated their load balancing is something I find horrible. Just imagine the cluster-wide sync and all the possible fault cases 😧

IMO this falls into the "service mesh" domain. They monitor actual connections I think (circuit breaking?), but I must admit I am not very familiar with service meshes.

@cheftako
Member

cheftako commented Dec 8, 2021

/cc @kishorj

@dcbw
Member

dcbw commented Dec 9, 2021

/assign rikatz

@dcbw
Member

dcbw commented Dec 9, 2021

/assign bowei

@nckturner
Contributor

/triage-accepted

@andrewsykim
Member

The EndpointSlice API now supports the terminating condition; I wonder if ingress controllers can be updated to leverage this for graceful termination of endpoints?
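For illustration, an endpoint for a terminating pod is published in an EndpointSlice roughly like this (a trimmed sketch with made-up names and addresses); a controller could keep such an endpoint registered for draining while no longer routing new traffic to it:

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc12
  labels:
    kubernetes.io/service-name: example
addressType: IPv4
ports:
  - name: http
    port: 8080
    protocol: TCP
endpoints:
  - addresses:
      - "10.0.1.23"
    conditions:
      ready: false       # no longer ready because the pod is being deleted
      serving: true      # still capable of serving in-flight/draining traffic
      terminating: true  # the pod has a deletion timestamp
```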

@thockin
Member

thockin commented Jan 20, 2022

As @uablrek described, syncing state from all kube-proxies on every change to every endpoint of every service isn't feasible. I'm not sure terminating endpoints helps much, if the core problem is that the Ingress controller or proxy is out-to-lunch or otherwise not getting updates.

@nirnanaaa
Author

nirnanaaa commented Jan 21, 2022

I feel that the actual problem is that pods just get their termination signals way too early in this process, right? I'm not sure what ingress controllers or kube-proxy should do about that. If a container process is down, it doesn't matter whether kube-proxy or the controller stops sending traffic there eventually - all traffic in between these events will still end up hitting an already dead target.

@aojea
Member

aojea commented Jan 21, 2022

I feel that the actual problem is that pods just get their termination signals way too early in this process, right? I'm not sure what ingress controllers or kube-proxy should do about that. If a container process is down, it doesn't matter whether kube-proxy or the controller stops sending traffic there eventually - all traffic in between these events will still end up hitting an already dead target.

the process is not parallel, it is sequential and async

kill pod -> pod goes not ready -> endpoints controller receives the event that the pod is not ready -> updates the endpoints -> ingress controller/kube-proxy receive an event that the endpoints have changed -> ...

@nirnanaaa
Author

true, but shouldn't the pod be able to stay alive until that event has been propagated and processed (just as a preStop hook kinda does)? This would ensure that the container doesn't even receive SIGTERM until kube-proxy/$controller has had time to remove the endpoint.

@aojea
Member

aojea commented Jan 21, 2022

true, but shouldn't the pod be able to stay alive until that event has been propagated and processed (just as a preStop hook kinda does)? This would ensure that the container doesn't even receive SIGTERM until kube-proxy/$controller has had time to remove the endpoint.

how does the pod know that? :)

syncing state from all kube-proxies on every change to every endpoint of every service isn't feasible.

I see your point, you want to make the process completely synchronous, but that is not how Kubernetes works (https://kubernetes.io/docs/concepts/architecture/controller/#controller-pattern). Oversimplifying, you have a bunch of controllers that sync the current state to the desired state, and they are eventually consistent 🦄

@andrewsykim
Member

(fixing #106476 (comment))

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 2, 2022
@mbyio

mbyio commented Apr 27, 2022

I've spent a lot of time debugging this issue in AWS. One particularly problematic use case is with network load balancers. They always take at least 2 minutes to deregister a pod in IP mode, counted from when they receive the request via the AWS API to remove the pod. During this time, the NLB will still send new TCP connections to the pod.

The AWS Load Balancer controller would be able to detect when a pod is actually completely deregistered. So would other ingress controllers. So I would love it if we had a "termination gate" or similar, which could be added by a controller, analogous to readiness gates. Actually, the AWS Load Balancer controller already makes use of readiness gates, because NLB/ALB registration is just as slow. It's odd that Kubernetes doesn't do anything to help with deregistration as well.

If we had a termination gate, this would fix the problem, because we would complete the loop: k8s would say it wants to terminate a pod, controllers would have time to update and stop sending traffic, and then the pod would be told "okay, it's time to clean up".
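A hypothetical sketch of what such a gate could look like in the pod spec, mirroring today's readinessGates (this field does not exist in Kubernetes; it is shown only to illustrate the idea):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  # HYPOTHETICAL: terminationGates is not a real Kubernetes field.
  # The idea: the kubelet would delay SIGTERM until a controller (e.g. the
  # AWS Load Balancer controller) sets this condition to True after the
  # target has been fully deregistered from the load balancer.
  terminationGates:
    - conditionType: "target-deregistered.example.com/my-target-group"
  containers:
    - name: app
      image: nginx
```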

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@mbyio

mbyio commented Jul 26, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@bowei
Member

bowei commented Jul 26, 2022

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 26, 2022
@Arcadiyyyyyyyy

/remove-lifecycle stale
