pod termination might cause dropped connections #2366
@nirnanaaa In my mind, it can be handled in a protocol-specific way by the client/server, if you have control over the server's code.
Once we implement the above, we only need to set a large terminationGracePeriodSeconds. I have a sample app implementing this: https://github.com/M00nF1sh/ReInvent2019CON310R/blob/master/src/server.py#L24
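As a sketch of that approach (all names and values below are illustrative, not taken from the linked sample), the pod-spec side pairs a generous grace period with a readiness probe that the app starts failing after SIGTERM, while it keeps serving in-flight and newly arriving requests:

```yaml
# Hypothetical Deployment fragment: the app keeps serving after SIGTERM
# but starts failing its readiness probe, while a large grace period
# leaves time for the LB to drain before the pod is killed.
spec:
  terminationGracePeriodSeconds: 300   # must cover deregistration + drain time
  containers:
    - name: server                     # illustrative name
      image: example.com/server:latest # illustrative image
      readinessProbe:
        httpGet:
          path: /healthz               # app returns non-2xx here once SIGTERM arrives
          port: 8080
        periodSeconds: 5
```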
Hey @M00nF1sh, thanks for your input. I fear this is exactly what I thought was supposed to fix it. But thinking about it: if the server does all of this, what prevents the LB from still sending traffic to the target, even for a brief period? Unfortunately we're hitting this exact scenario quite often.
Am I the only one who sees this as a problem? A preStop hook is not very reliable IMO, as it's just eyeballing the timing issue.
We also ran into this issue. The preStop hack generally works, but it's still a hack, and it still seems to fail to synchronize the "deregistration before SIGTERM" properly. Even with this in place, we've seen intermittent cases where the container still gets a SIGTERM before the target is deregistered (possibly because the DeregisterTargets API call is getting throttled). This is a pretty serious issue for anyone trying to do zero-downtime deploys on top of Kubernetes with ALBs.
@nirnanaaa For this case, a sleep is indeed needed, since we don't have any information available on when the LB actually stops sending traffic to targets. (Even when the targets show as draining after the controller made the DeregisterTargets call, it still takes time for the change to actually propagate to ELB's data plane.)
I was just wondering whether this should be solved at a lower level; that's why I also opened an issue on k/k. This is not something limited to this controller (although in-cluster LBs are less likely to have this issue, it's still present). And you're absolutely right: client-side retries would probably solve this. There are even some protocols like gRPC which could work around this problem, but the truth is that we cannot really control what's being run on the cluster itself, hence also my doubts about the use of sleep. I thought about maybe having something more sophisticated.
This is indeed a problem that exists in Kubernetes generally, not just with ALB. We ran into this a lot using classic load balancers in proxy mode to nginx ingress with externalTrafficPolicy: Local to get real IPs. Using the v2 ELBs with IP target groups is a big improvement over the externalTrafficPolicy: Local mechanism, which requires health-check failures to get the node out of the instance list. That being said, this can still happen. Sleeps and preStop hooks are really the only game in town. I'm not aware of any kind of preStop gate like the readiness gate this controller can inject. Is there a community binary that does this already OOTB? If not, that'd be a good little project. Another alternative is to use a reverse proxy like ingress-nginx as the target for your ALB ingress instead of your application. Then the container lifecycle events for your ALB targets will be much, much less frequent.
@BrianKopp we've thought about running a sidecar which provides a statically linked binary via a shared in-memory volume that the main container could use as preStop. I just couldn't come up with a generic solution to check for "is this pod still in any target group" without either running into API throttling (and potentially blocking the main operation of the aws-lb-controller) or completely breaking single-responsibility principles.
IIRC, a sidecar for this sort of thing is a trap, since a preStop hook delays the SIGTERM for its own container only, not all containers. Your HTTP container would get its SIGTERM immediately. I've actually begun thinking about starting a project to address this. My thought is to have an HTTP service inside the cluster. What do you think? Is this worth making a thing?
Oh, I was not talking about processing the SIGTERM in the sidecar. If you read closely, I only spoke about providing the binary ;) For our use case, even a simple HTTP query will not work unless done through some statically linked binary, since we cannot even rely on libc being present.
OK, I see what you were suggesting. If the requirement is that we cannot place any runtime demands on the HTTP container, then yes, we would need some kind of sidecar to be injected to provide that functionality, along with a mutating webhook to add the container and the preStop hook in case one wasn't already present. That part seems like a bit of a minefield. I was thinking about putting together something that would work for now, until such a solution is possible, if at all. If one didn't want to place any dependency requirements on the main HTTP container (e.g. curl), a preStop hook could wait for a signal over a shared volume from a lightweight curling sidecar.
I think this is best addressed by extending the Pod API. Yes, that's not a trivial change. However, the Pod API is what the kubelet pays attention to. If you want the cluster to hold off sending SIGTERM, then there needs to be a way in the API to explain why that's happening. This controller doesn't have the levers to pull to make things change the way we'd need. Another option is to redefine SIGTERM to mean "get ready to shut down but keep serving". I don't think that's a helpful interpretation of SIGTERM, though.
Question for those who've dealt with this: would you agree that a correct configuration to handle this issue looks like this?
I think that's correct. This is generally what we've used in Helm charts:

```yaml
# ingress
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds={{ .Values.deregistrationDelay }}
# pod
terminationGracePeriodSeconds: {{ add .Values.deregistrationDelay .Values.deregistrationGracePeriod 30 }}
command: ["sleep", "{{ add .Values.deregistrationDelay .Values.deregistrationGracePeriod }}"]
```
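For concreteness, with illustrative values of deregistrationDelay=60 and deregistrationGracePeriod=30 (these numbers are assumptions, not from the chart), that template would render to roughly:

```yaml
# Rendered example with illustrative values
# (deregistrationDelay=60, deregistrationGracePeriod=30)
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=60
terminationGracePeriodSeconds: 120   # 60 + 30 + 30s of headroom
command: ["sleep", "90"]             # 60 + 30, used as the preStop command
```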
For NLBs in IP mode (everything should be similar for other LBs and modes):
That's an interesting finding. Is there some documentation you can point to that highlights this (assuming it's on the AWS side)?
I believe it is all on AWS's side, yes. No, I couldn't find any documentation about this. With all due respect to AWS engineers, their ELB documentation is horrible, missing a lot of important information. I found this by writing some programs that use raw TCP connections (so I could monitor every aspect) and manually triggering deregistration in various ELB configurations to record the timing. It consistently took 2 minutes.
In your testing, did the target group show the IP as draining while it was still receiving packets? Did it receive packets after the target dropped out of the target group completely? I've got a project I'm working on to add a delay in the preStop hook that waits until the IP is out of the target group. Would that be helpful here?
Yes, it did correctly show the IP on the target list as draining as soon as I requested deregistration. And when it was removed from the target list, it also stopped receiving new connections. So, if a tool monitored the target list and used that to delay sending SIGTERM to a pod until the pod is out of the target list (e.g. using preStop), that would solve the problem. It would be the opposite of a readiness gate. One difficulty is that AWS has some restrictive rate limits, so depending on your scale, I don't think you can just have every pod hitting the API in its preStop hook.
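A rough sketch of that kind of delay (assuming the aws CLI is present in the image, the pod has IAM permission for elasticloadbalancing:DescribeTargetHealth, and TARGET_GROUP_ARN / POD_IP are injected via the environment; these names are illustrative) could use the CLI's built-in waiter in a preStop hook:

```yaml
# Hypothetical preStop hook: block until this pod's IP has left the target group.
# The "target-deregistered" waiter polls DescribeTargetHealth, so at scale this
# runs into the API rate limits mentioned above.
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - aws elbv2 wait target-deregistered --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$POD_IP"
```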
We were told this by AWS support:
Sleeping for 180s (3m) still hasn't been reliable for us though, so we're currently at 240s (4m) 😭
We've actually started sharding our services across multiple AWS accounts/EKS clusters just to lower the number of pending/throttled API requests and to increase the speed at which each controller can operate. But then again, the probability of this error happening is still not zero. On a new AMI release we frequently experience dropped packets (even with X seconds of sleep as preStop).
I think the controller should automatically add a preStop hook to pods, which waits until the controller indicates it is safe to start terminating. That would nip this in the bud once and for all.
We came across this issue and found ours to be a combination of the Cluster Autoscaler (CA) and the Load Balancer Controller working in tandem when externalTrafficPolicy was set to Cluster. We'd have nodes getting ready to be scaled in (being pretty much empty except for kube-proxy), then the CA would kick in, taint the node, and terminate it well before it was deregistered by the LB Controller. This would result in in-flight requests being dropped from kube-proxy to another node (that ran our HAProxy pods). It appears now though that the latest CA introduces a flag for this. This would still require adequate pod-termination handling for whatever service gets the traffic, but most handle that perfectly well already. It seems like this is a pretty prevalent problem for what I assume is a very popular setup. If it does fix the issue, we really should call out in the documentation that administrators need to be cognizant of the Cluster Autoscaler and LB Controller potentially dropping requests in externalTrafficPolicy: Cluster mode.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
It doesn't seem to be working though, as the pod dies immediately instead of shutting down gracefully; we had to remove it.
This automatic preStop hook was a suggestion for what they should implement; it isn't implemented yet.
If you have a preStop hook that waits for shutdown, then it was not added by the controller.
When I say "wait for the shutdown" I just mean a simple `sleep`.
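That plain-sleep pattern, as discussed throughout this thread, is just a fixed pause sized by guesswork. A minimal sketch (the 120s is an illustrative guess at drain time, not a value from this thread):

```yaml
# Fixed-sleep workaround: SIGTERM is delayed by the preStop sleep, so
# terminationGracePeriodSeconds must exceed the sleep duration.
terminationGracePeriodSeconds: 150
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 120"]
```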
I can't think of a good way for the LBC to communicate that to a pod.
I think the issue here is just with long-lived connections, because the controller does not sever them when the target goes into draining mode, so clients keep reusing them even after the sleep/grace period is over, hence the errors. I don't know if there is a way to force the long-lived connections to be severed when entering draining mode, so that when the clients reconnect to the ingress, it would not pick the pods in Terminating state.
@Ghilteras no, the pod knows about its long-lived connections, is perfectly capable of closing them itself, and knows when they have gone away. The problem is the new connections that keep coming in from the load balancer. The pod does not know when the load balancer has finished deregistering it, and thus when there will no longer be any new incoming connections.
I'm no longer working at the company where I needed this day-to-day. However, we would have been willing to deal with a lot of setup, including service accounts etc., in order to have an automated fix for this bug. It was a major pain point. As you can see in these comments, it is also hard for many people to understand, and therefore hard to work around.
How do you close and re-open a gRPC connection from the client to force it to dial another, non-Terminating pod?
@Ghilteras You may be looking at the wrong GitHub issue. This issue is about a case where AWS load balancers can send new requests to terminating pods that have already stopped accepting new requests. NGINX is not involved. Edit: oh, I see, you mentioned the nginx ingress controller having the same problem. This issue is regarding connections from a load balancer to server pods. If you're using gRPC to connect to a load balancer, NGINX or an ALB, then that load balancer is intercepting the connections and doing the load balancing. So there isn't really anything you can do on the client side to fix this.
@Ghilteras I'm saying the terminating server pod is capable of shutting down the long-lived connection.
I don't see any way for this ingress controller (or others; as the OP said, this is not an issue specific to this controller) to be capable of shutting down long-lived connections. If you are aware of a way of doing this, please share.
I wasn't talking about this ingress controller shutting down the connection.
The gRPC server is already sending GOAWAYs; the issue is that when the clients reconnect, they just end up going back to the Terminating pod, which should not be getting new traffic through the ingress, but it does, and that's when we see the server throwing hundreds of errors.
@Ghilteras and thus the problem is with new connections coming in from the load balancer, as I previously stated.
For anyone still struggling with the interaction between NLB and ingress-nginx-controller, this is the setup that seems to best mitigate the issues for us for HTTP traffic. ingress-nginx chart values snippet:
Also, with this setup, rolling updates are reliable enough, albeit really slow, mostly due to having to account for #1834 (comment). Hope it helps someone, and if anyone detects any issues, improvements, or recommendations, it would be good to know. P.S.: I have yet to play around with pod readiness gates. Other issues referenced:
Thanks @Roberdvs, this is huge. I just came across the problem in my cluster while testing an upgrade, and I was able to fix it within an hour by referencing your work. I have been running statuscake.com with 30-second checks to help verify your fix works as expected. Prior to the change I had a minute or two of downtime. After the change, I appear to be at a real 100% uptime. For anyone else coming across this, I implemented the following. I am using the default NLB deployment method, given I don't have the AWS LB Controller installed. Please see @Roberdvs's comment above if you are using the LB Controller.

```yaml
controller:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    type: "LoadBalancer"
    externalTrafficPolicy: "Local"
  # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366
  # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1788923154
  updateStrategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  # Add a pause to make time for the pod to be registered in the AWS NLB target group before proceeding with the next
  # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724
  # https://alexklibisz.com/2021/07/20/speed-limits-for-rolling-restarts-in-kubernetes#round-3-set-minreadyseconds-maxunavailable-to-0-and-maxsurge-to-1
  minReadySeconds: 180
  # Add sleep on preStop to allow for graceful shutdown with AWS NLB
  # https://github.com/kubernetes/ingress-nginx/issues/6928
  # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1118312709
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "sleep 240; /wait-shutdown"]
```

Note: I removed

My versions:
Was thinking about this again... has anyone thought about modifying the ALB controller to set a finalizer on the pods within the TG, and using that as a hook to prevent termination until the pod has been successfully removed from the TG? This would hinge on the SIGTERM not getting sent until after all finalizers have been removed, which I would expect to be true but would need to verify.
@michaelsaah That sounds like an interesting approach to me. I think it is worth someone trying it out. As you said, it really just depends on whether k8s will wait to send the SIGTERM until the finalizers are removed.
Any update? I have the same problem, and I think that it can also be related to the autoscaler. Please can you take a look at kubernetes/autoscaler#6679?
Describe the bug
When a pod is Terminating, it receives a SIGTERM signal asking it to finish up its work, after which the pod is deleted. At the same time that the pod starts terminating, the aws-load-balancer-controller receives the updated object, which triggers it to remove the pod from the target group and initiate draining.
Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the TG - are decoupled from one another, and the SIGTERM might be handled before, or at the same time as, the target in the target group starts draining.
As a result, the pod might become unavailable before the target group has even started its own draining process. This can result in dropped connections, as the LB is still trying to send requests to the already-shut-down pod. The LB will in turn reply with 5xx responses.
Steps to reproduce
Expected outcome
Environment
All our ingresses have
v2.2.4
Additional Context:
We've been relying on Pod-Graceful-Drain, which unfortunately forks this controller and intercepts and breaks k8s controller internals.
You can achieve a pretty good result as well using a `sleep` as `preStop`, but that's not reliable at all - due to the fact that it's just a guessing game whether your traffic will be drained after X seconds - and it requires statically linked binaries to be mounted in each container, or the existence of sleep in the operating system. I believe this is not only an issue with this controller, but with k8s in general. So any hints and already existing tickets would be very welcome.