
pod termination might cause dropped connections #2366

Open
nirnanaaa opened this issue Nov 16, 2021 · 60 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@nirnanaaa

nirnanaaa commented Nov 16, 2021

Describe the bug
When a pod is Terminating, it receives a SIGTERM signal asking it to finish up its work, after which the kubelet proceeds with deleting the pod. At the same time that the pod starts terminating, the aws-load-balancer-controller receives the updated object and starts removing the pod from the target group, initiating draining.

Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the target group - are decoupled from one another, and the SIGTERM might be handled before, or at the same time as, the target in the target group starts draining.
As a result, the pod might become unavailable before the target group has even started its own draining process. This can result in dropped connections, as the LB is still trying to send requests to a pod that has already shut down cleanly. The LB will in turn reply with 5xx responses.

Steps to reproduce

  • Provision an ingress with an ALB attached
      alb.ingress.kubernetes.io/certificate-arn: xxx
      alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
      alb.ingress.kubernetes.io/healthcheck-path: /
      alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
      alb.ingress.kubernetes.io/healthy-threshold-count: "2"
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=30
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/security-groups: xxx
      alb.ingress.kubernetes.io/tags: xxx
      alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=60
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"
  • Create a service and pods (multiple ones through a deployment work best) for this ingress
  • (add some delay/load to the cluster so that the AWS API requests become slower or have to be retried)
  • start an HTTP benchmark to produce some artificial load
  • roll out a change to the deployment, or just evict some pods

Expected outcome

  • The SIGTERM should only be sent after the ALB has removed the target from the target group, or the ALB should at least not send any new traffic to the pod after the pod has received the SIGTERM signal

Environment

  • an HTTP service

All our ingresses have the annotations listed above.

  • AWS Load Balancer controller version v2.2.4
  • Kubernetes version
    Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-eks-8c579e", GitCommit:"8c579edfc914f013ff48b2a2b2c1308fdcacc53f", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:13Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
    
  • Using EKS (yes), if so version? 1.18 eks.8

Additional Context:

We've been relying on Pod-Graceful-Drain, which unfortunately forks this controller and intercepts and breaks k8s controller internals.

You can also achieve a pretty good result using a sleep as a preStop hook, but that's not reliable at all - it's just a guessing game whether your traffic will have been drained after X seconds - and it requires either statically linked binaries to be mounted into each container or the existence of sleep in the container's operating system.
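
For reference, a minimal sketch of that sleep-based workaround (values and image are illustrative only; the image has to actually contain a sleep binary, which is exactly the limitation mentioned above):

spec:
  terminationGracePeriodSeconds: 90     # must exceed the preStop sleep plus the app's own shutdown time
  containers:
    - name: app
      image: example.com/app:latest     # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "60"]    # a guess at how long deregistration takes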

I believe this is not only an issue with this controller, but with k8s in general. So any hints or pointers to existing tickets would be very welcome.

@nirnanaaa nirnanaaa changed the title draining might cause dropped connections pod termination might cause dropped connections Nov 16, 2021
@M00nF1sh
Collaborator

M00nF1sh commented Nov 17, 2021

@nirnanaaa
I'm not aware of any general solution that is better than a preStop hook. After all, the connection is controlled by both ends of the application (client/server) rather than by the LB.
There are also ideas like propagating the LB's connection-draining status back to the pod/controller, but that is not widely supported and not reliable either (a finished draining state doesn't mean all existing connections are closed).

In my mind, it can be handled in a protocol-specific way by the client/server if you have control over the server's code.
e.g. if the server receives a SIGTERM, it can

  • continue listening for incoming connections (only a limited number of new connections will come in, since the LB will start deregistering the target soon)
  • tell existing long-running clients to migrate to new connections in a protocol-specific manner, e.g.:
    • http: Connection: close header in responses
    • websocket: GOAWAY message.
  • wait for all existing connections to terminate, then exit

Once the above is implemented, we only need to set a large terminationGracePeriodSeconds.

I have a sample app implementing the above: https://github.com/M00nF1sh/ReInvent2019CON310R/blob/master/src/server.py#L24
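
As a rough sketch, the Kubernetes side of that approach only needs a generous grace period, assuming the application itself implements the SIGTERM handling described above (the value is illustrative):

spec:
  terminationGracePeriodSeconds: 300    # assumed value; must cover the worst-case drain time
  containers:
    - name: app
      image: example.com/app:latest     # hypothetical image whose server keeps listening on SIGTERM,
                                        # sends Connection: close / GOAWAY, and exits once existing
                                        # connections have finished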

@nirnanaaa
Author

nirnanaaa commented Nov 18, 2021

Hey @M00nF1sh, thanks for your input. I fear this is exactly what I thought was supposed to fix it. But thinking about it: if the server does all of this, what prevents the LB from still sending traffic to the target - even for a brief period? Unfortunately we're hitting this exact scenario quite often.

So

  • the server starts draining its own connections properly (like you described, and as most frameworks handle automatically nowadays if signal propagation works correctly)
  • the LB controller gets the request to remove the target
  • the server is done draining connections (these don't even have to be long-running connections) and is terminated
  • the LB controller sends out the AWS DeregisterTargets request
  • requests are still coming in, but the server is already down
  • the LB has finished processing and only now stops sending requests to the backend

Am I the only one who sees this as a problem? A preStop hook is not very reliable IMO, as it just eyeballs the timing issue - much like a setTimeout in JavaScript doesn't fix a race condition, it only makes it less likely.

@ejholmes

We also ran into this issue. The preStop hack generally works, but it's still a hack, and it still seems to fail to synchronize the "deregistration before SIGTERM" ordering properly. Even with this in place, we've seen intermittent cases where the container still gets a SIGTERM before the target is deregistered (possibly because the DeregisterTargets API call is getting throttled).

This is a pretty serious issue for anyone trying to do zero-downtime deploys on top of Kubernetes with ALBs.

@M00nF1sh
Collaborator

@nirnanaaa
I see. The remaining gap is that the pod itself finishes draining too fast, before the LB deregisters it.

For this case, a sleep is indeed needed, since we don't have any information about when the LB actually stops sending traffic to the targets (even once a target shows as draining after the controller makes the DeregisterTargets call, it still takes time for the change to actually propagate to ELB's data plane).
If ELB had support for server-side retry, it could handle this nicely, but I'm not aware of when that will be available.

@nirnanaaa
Author

nirnanaaa commented Nov 19, 2021

I was just wondering whether this should be solved at a lower level - that's why I also opened an issue on k/k. This is not something limited to this controller (although in-cluster LBs are less likely to have this issue, it's still present).

And you're absolutely right. Client-side retries would probably solve this. There are even some protocols, like gRPC, that could work around this problem, but the truth is that we cannot really control what's being run on the cluster itself - hence also my doubts about the use of sleep.

I thought about maybe having a more sophisticated preStop binary, statically linked and mounted through a sidecar, that could delay the signal until the target has been removed from the LB, but I fear it's also a hacky workaround that makes things even worse (especially considering API rate limits).

@nirnanaaa
Author

nirnanaaa commented Nov 25, 2021

 Shutdown behavior

I've also drawn up a convenient picture detailing the issue further - the big orange box is where things happen decoupled from one another.

@BrianKopp
Contributor

This is indeed a problem that exists in Kubernetes generally, not just with ALBs. We ran into this a lot using classic load balancers in proxy mode in front of nginx ingress, with externalTrafficPolicy: Local to get real client IPs.

Using the v2 ELBs with IP target groups is a big improvement over the externalTrafficPolicy: Local mechanism, which requires health check failures to get the node out of the instance list.

That being said, this can still happen. Sleeps and preStop hooks are really the only game in town. I'm not aware of any kind of preStop gate like the readiness gate this controller can inject.

Is there a community binary that does this already out of the box? If not, that'd be a good little project.

Another alternative is to use a reverse proxy like ingress-nginx as the target for your ALB ingress instead of your application. Then the container lifecycle events for your ALB targets will be much, much less frequent.

@nirnanaaa
Author

@BrianKopp we've thought about running a sidecar that provides a statically linked binary via a shared in-memory volume, which the main container could then use as its preStop hook. I just couldn't come up with a generic way to check "is this pod still in any target group" without either running into API throttling (and potentially blocking the main operation of the aws-lb-controller) or completely breaking single-responsibility principles.

@BrianKopp
Contributor

BrianKopp commented Nov 29, 2021

IIRC, a sidecar for this sort of thing is a trap, since a preStop hook only delays the SIGTERM for its own container, not for all containers. Your HTTP container would get its SIGTERM immediately.

I've actually begun thinking about starting a project to address this. My thought is to have an HTTP service inside the cluster, say at drain-waiter. You could call http://drain-waiter/drain-delay?ip=1.2.3.4&ingress=your-ingress-name&namespace=your-namespace&max-wait=60, and then have a preStop hook curling that URL. We can get the hostname from the ingress object, and therefore filter the target groups very easily.
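
As a sketch of how that would be consumed (everything here is hypothetical - the drain-waiter service, its path and query parameters, and the assumption that curl is available in the container and POD_IP is injected via the downward API):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - 'curl -s "http://drain-waiter/drain-delay?ip=${POD_IP}&ingress=your-ingress-name&namespace=your-namespace&max-wait=60"'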

What do you think? Is this worth making a thing?

@nirnanaaa
Author

Oh, I was not talking about processing the SIGTERM in the sidecar. If you read closely, I only spoke about providing the binary ;)

For our use case even a simple HTTP query will not work unless it's done through a statically linked binary, since we cannot even rely on libc being present.

@BrianKopp
Contributor

OK, I see what you were suggesting. If the requirement is that we cannot place any runtime demands on the HTTP container, then yes, we would need some kind of sidecar to be injected to provide that functionality, along with a mutating webhook to add the container and the preStop hook in case one wasn't already present. That part seems like a bit of a minefield.

I was thinking about putting together something that would work for now, until such a solution becomes possible - if it's possible at all.

If one didn't want to place any dependency requirements on the main HTTP container (e.g. curl), a preStop hook could wait for a signal over a shared volume from a lightweight curling sidecar.

@sftim
Contributor

sftim commented Dec 21, 2021

I think this is best addressed by extending the Pod API. Yes, that's not a trivial change. However, the Pod API is what the kubelet pays attention to. If you want the cluster to hold off sending SIGTERM then there needs to be a way in the API to explain why that's happening. This code doesn't have the levers to pull to make things change how we'd need.

Another option is to redefine SIGTERM to mean “get ready to shutdown but keep serving”. I don't think that's a helpful interpretation of SIGTERM though.

@michaelsaah
Contributor

question for those who've dealt with this: would you agree that a correct configuration to handle this issue looks like deregistrationDelay < sleep duration < terminationGracePeriod?

@ejholmes

ejholmes commented Mar 17, 2022

I think that's correct. This is generally what we've used in helm charts:

# ingress
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds={{ .Values.deregistrationDelay }}
# pod
terminationGracePeriodSeconds: {{ add .Values.deregistrationDelay .Values.deregistrationGracePeriod 30 }}
command: ["sleep", "{{ add .Values.deregistrationDelay .Values.deregistrationGracePeriod }}"]

Where deregistrationGracePeriod provides a buffer for DeregisterTargets getting rate limited. We still have issues with this, but the buffer period does help.
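
For concreteness, plugging in hypothetical values deregistrationDelay=60 and deregistrationGracePeriod=30, the template above renders to roughly:

# ingress
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=60
# pod
terminationGracePeriodSeconds: 120   # 60 + 30 + 30
command: ["sleep", "90"]             # 60 + 30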

@mbyio

mbyio commented Apr 26, 2022

For NLBs in IP mode (everything should be similar for other LBs and modes):

  • deregistration takes at least 2 minutes regardless of the deregistration delay setting
  • during deregistration, the pod may still receive new requests/connections even though it is in the process of deregistering
  • the preStop hook therefore has to sleep for at least 2 minutes
  • and the termination grace period has to be 2 minutes + the maximum time for any request to complete + time for any background processes to clean up + a short grace period in case there is a delay (eg. deregistration was rate limited)

@ejholmes

deregistration takes at least 2 minutes regardless of the deregistration delay setting

That's an interesting finding. Is there some documentation you can point to that highlights this (assuming it's AWS side)?

@mbyio

mbyio commented Apr 27, 2022

I believe it is all on AWS' side, yes. No, I couldn't find any documentation about this. With all due respect to AWS engineers, their ELB documentation is horrible and is missing a lot of important information. I found this by writing some programs that use raw TCP connections (so I could monitor every aspect) and manually triggering deregistration in various ELB configurations to record the timing. It consistently took 2 minutes.

@BrianKopp
Contributor

In your testing, did the target group state for the target show the IP as draining while it was still receiving packets? Did it receive packets after the target dropped completely out of the target group? I've got a project I'm working on that adds a delay in the preStop hook to wait until the IP is out of the target group. Would that be helpful here?

@mbyio

mbyio commented Apr 28, 2022

Yes, it did correctly show the IP on the target list as draining as soon as I requested deregistration. And when it was removed from the target list, it also stopped receiving new connections. So, if a tool monitored the target list and used that to delay sending SIGTERM to a pod until the pod is out of the target list (e.g. using preStop), that would solve the problem. It would be the opposite of a readiness gate. I think one difficulty is that AWS has some restrictive rate limits, so depending on your scale I don't think you can just have every pod hitting the API in its preStop hook.
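
A sketch of what that polling could look like in a preStop hook (heavily hedged: it assumes the aws CLI and the necessary IAM permissions inside the container, and that TG_ARN and POD_IP are injected by some other mechanism; "unused" is the state ELB reports once a target is no longer registered, and the rate limits mentioned above still apply):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Wait until this pod's IP is no longer registered in the target group,
          # with an upper bound of 60 * 5s = 5 minutes.
          for i in $(seq 1 60); do
            state=$(aws elbv2 describe-target-health \
              --target-group-arn "$TG_ARN" \
              --targets Id="$POD_IP" \
              --query 'TargetHealthDescriptions[0].TargetHealth.State' \
              --output text 2>/dev/null)
            if [ "$state" = "unused" ]; then
              exit 0
            fi
            sleep 5
          done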

@dcarley

dcarley commented May 5, 2022

We were told this by AWS support:

Similarly, when you deregister a target from your Network Load Balancer, it is expected to take 90-180 seconds to process the requested deregistration, after which it will no longer receive new connections. During this time the Elastic Load Balancing API will report the target in 'draining' state. The target will continue to receive new connections until the deregistration processing has completed. At the end of the configured deregistration delay, the target will not be included in the describe-target-health response for the Target Group, and will return 'unused' with reason 'Target.NotRegistered' when querying for the specific target.

Sleeping for 180s (3m) still hasn't been reliable for us though, so we're currently at 240s (4m) 😭

@nirnanaaa
Author

We've actually started sharding our services across multiple AWS accounts/EKS clusters, just to lower the number of pending/throttled API requests and to increase the speed at which each controller can operate.

But then again: the probability of this error happening is still not zero. On a new AMI release we frequently see dropped packets (even with X seconds of sleep as preStop).

@mbyio

mbyio commented May 5, 2022

I think the controller should automatically add a preStop hook to pods, which waits until the controller indicates it is safe to start terminating. That would nip this in the bud once and for all.

@mattjamesaus

mattjamesaus commented Mar 31, 2023

We came across this issue and found ours to be a combination of the Cluster Autoscaler (CA) and the Load Balancer Controller working in tandem when externalTrafficPolicy was set to Cluster. We'd have nodes getting ready to be scaled in (pretty much empty except for kube-proxy), then the CA would kick in, taint the node, and terminate it well before it was deregistered by the LB Controller.

This would result in in-flight requests being dropped between kube-proxy and another node (which ran our HAProxy pods). It appears, though, that the latest CA introduces a --node-delete-delay-after-taint flag, which waits a number of seconds before killing the node. I'm about to test this, but my assumption is that if this delay is set to just longer than your deregistration timeout, the node will be kept around long enough to be pulled out of the target group and to gracefully terminate all the in-flight connections going through kube-proxy.
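
For illustration, assuming the standard cluster-autoscaler helm chart's extraArgs convention (the value is hypothetical; the point is that the delay should exceed the target group deregistration time):

# cluster-autoscaler helm values (sketch)
extraArgs:
  node-delete-delay-after-taint: 3m0s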

This would still require adequate pod termination handling for whatever service gets the traffic, but most services handle that perfectly well already.

It seems like this is a pretty prevalent problem for what I assume is a very popular setup - if this does fix the issue, we really should call out in the documentation that administrators need to be cognizant of the cluster autoscaler and the LB controller potentially dropping requests in externalTrafficPolicy: Cluster mode.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2023
@johndistasio

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2023
@Ghilteras

Ghilteras commented Jul 13, 2023

I think the controller should automatically add a preStop hook to pods, which waits until the controller indicates it is safe to start terminating. That would nip this in the bud once and for all.

Doesn't seem to be working though, as the pod dies immediately instead of shutting down gracefully; we had to remove the preStop hook and sleep 30s instead to make all upstream gRPC errors go away

@mbyio

mbyio commented Jul 14, 2023

I think the controller should automatically add a preStop hook to pods, which waits until the controller indicates it is safe to start terminating. That would nip this in the bud once and for all.

Doesn't seem to be working though as the pod dies immediately instead of gracefully shutdown, we had to remove the preStop hook and sleep 30s instead to make all upstream GRPC errors go away

That automatic preStop hook was a suggestion for what they should implement; it isn't implemented yet.

@mbyiounderdog

If you have a preStop hook that waits for shutdown, then it was not added by the controller.

@Ghilteras

When I say "wait for the shutdown" I just mean a simple sleep, and sure, there are fewer errors using the sleep preStop, but as the OP said it's a hack. There's no way to make the controller gracefully terminate, and the OP is also correct in saying that this is not a specific aws-load-balancer-controller issue, because we have exactly the same problem with the nginx ingress controller deployed on bare metal.

@johngmyers
Contributor

I can't think of a good way for LBC to communicate to a preStop that the pod has been deregistered. The pod would have to be in a ServiceAccount that has RBAC to watch some resource that LBC would update or there'd have to be some network connection opened up between LBC and the pod.

@Ghilteras

Ghilteras commented Jul 18, 2023

I think the issue here is just with long-lived connections, because the controller does not sever them when it goes into draining mode, so clients keep reusing them even after the sleep/grace period is over, hence the errors.

I don't know if there is a way to force the controller to sever long-lived connections when entering draining mode, so that when the clients reconnect to the ingress, the controller would not pick pods in the Terminating state and the newly re-established connections would be healthy.

@johngmyers
Contributor

@Ghilteras no, the pod knows about long-lived connections, is perfectly capable of closing them itself, and knows when they have gone away. The problem is new connections that keep coming in from the load balancer. The pod does not know when the load balancer has finished deregistering it and thus there will no longer be any more new incoming connections.

@mbyio

mbyio commented Jul 18, 2023

The pod would have to be in a ServiceAccount that has RBAC to watch some resource that LBC would update or there'd have to be some network connection opened up between LBC and the pod.

I'm no longer working at the company where I needed this day-to-day. However, we would have been willing to deal with a lot of setup, including service accounts etc, in order to have an automated fix for this bug. It was a major pain point. As you can see in these comments, it is also hard for many people to understand, and therefore, hard to work around.

@Ghilteras

Ghilteras commented Jul 18, 2023

@Ghilteras no, the pod knows about long-lived connections, is perfectly capable of closing them itself, and knows when they have gone away. The problem is new connections that keep coming in from the load balancer. The pod does not know when the load balancer has finished deregistering it and thus there will no longer be any more new incoming connections.

How do you close and re-open a gRPC connection from the client to force it to dial another, non-Terminating pod? I don't see anything like this in the gRPC libraries. If your client gets a GOAWAY, all it can do is retry, which will reuse the existing long-lived connection to the Terminating pod, which means it will keep failing. There is no way to handle this automatically from the client without wrapping it in custom logic that severs the connection on GOAWAY/draining errors from servers. All this because NGINX cannot close the long-lived connection on its side when it's gracefully shutting down? I think I'm missing something here

@mbyio

mbyio commented Jul 18, 2023

@Ghilteras You may be looking at the wrong Github issue. This issue is about a case where AWS load balancers can send new requests to terminating pods that have already stopped accepting new requests. NGINX is not involved.

Edit - oh I see, you mentioned the nginx ingress controller having the same problem. This issue is regarding connections from a load balancer to server pods. If you're using GRPC to connect to a load balancer, NGINX or an ALB, then that load balancer is intercepting the connections and doing load balancing. So there isn't really anything you can do on the client side to fix this.

@johngmyers
Contributor

@Ghilteras I'm saying the terminating server pod is capable of shutting down the long-lived connection.

@Ghilteras

@Ghilteras I'm saying the terminating server pod is capable of shutting down the long-lived connection.

I don't see any way for this ingress controller (or others - as the OP said, this is not a specific issue of this controller) to shut down long-lived connections; if you are aware of a way of doing this, please share.

@johngmyers
Contributor

I wasn't talking about this ingress controller shutting down the connection.

@Ghilteras

The gRPC server is already sending GOAWAYs; the issue is that when the clients reconnect they just end up going back to the Terminating pod, which should not be getting new traffic through the ingress - but it does, and that's when we see the server throwing hundreds of "connection reset by peer" errors

@johngmyers
Contributor

@Ghilteras and thus the problem is with new connections coming in from the load balancer, as I previously stated.

@Roberdvs

Roberdvs commented Nov 1, 2023

For anyone still struggling with the interaction between NLB and ingress-nginx-controller, this is the setup that seems to better mitigate the issues for us for HTTP traffic.

ingress-nginx chart values snippet:

service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: "deregistration_delay.timeout_seconds=0"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"

# Maintain the existing number of replicas at all times
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
# Add a pause to make time for the pod to be registered in the AWS NLB target group before proceeding with the next
# https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724
# https://alexklibisz.com/2021/07/20/speed-limits-for-rolling-restarts-in-kubernetes#round-3-set-minreadyseconds-maxunavailable-to-0-and-maxsurge-to-1
minReadySeconds: 180
# Add sleep on preStop to allow for graceful shutdown with AWS NLB
# https://github.com/kubernetes/ingress-nginx/issues/6928
# https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1118312709
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 240; /wait-shutdown"]
# -- Maximum unavailable pods set in PodDisruptionBudget.
maxUnavailable: 1

Also, externalTrafficPolicy on the service of type LoadBalancer is set to Cluster.

With this setup rolling updates are reliable enough, albeit really slow, mostly due to having to account for #1834 (comment)

Hope it helps someone, and if anyone detects any issues, improvements, or recommendations, it would be good to know.

P.S.: I have yet to play around with pod readiness gates

Other issues referenced:

@venkatamutyala

venkatamutyala commented Nov 15, 2023

Thanks @Roberdvs, this is huge. I just came across the problem in my cluster while testing an upgrade, and I was able to fix it within an hour by referencing your work. I have been running statuscake.com with 30-second checks to help verify the fix works as expected. Prior to the change I had a minute or two of downtime; after the change, I appear to be at a real 100% uptime. For anyone else coming across this, I implemented the following. I am using the default NLB deployment method, given I don't have the AWS LB Controller installed. Please see @Roberdvs's comment above if you are using the LB Controller.

        controller:
          service:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
            type: "LoadBalancer"
            externalTrafficPolicy: "Local"
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1788923154
          updateStrategy:
            rollingUpdate:
              maxSurge: 1
              maxUnavailable: 0
          # Add a pause to make time for the pod to be registered in the AWS NLB target group before proceeding with the next
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834#issuecomment-781530724
          # https://alexklibisz.com/2021/07/20/speed-limits-for-rolling-restarts-in-kubernetes#round-3-set-minreadyseconds-maxunavailable-to-0-and-maxsurge-to-1
          minReadySeconds: 180
          # Add sleep on preStop to allow for graceful shutdown with AWS NLB
          # https://github.com/kubernetes/ingress-nginx/issues/6928
          # https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2366#issuecomment-1118312709
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 240; /wait-shutdown"]

Note: I removed the maxUnavailable setting from @Roberdvs's example, as it appeared to already be a default in my helm chart

My versions:
chart repo: https://kubernetes.github.io/ingress-nginx
name: ingress-nginx
chart version: v4.8.3

@michaelsaah
Contributor

michaelsaah commented Dec 5, 2023

I was thinking about this again... has anyone thought about modifying the ALB controller to set a finalizer on the pods within the TG, and using that as a hook to prevent termination until the pod has been successfully removed from the TG? This would hinge on the SIGTERM not being sent until after all finalizers have been removed, which I would expect to be true but would need to verify.

@mbyio

mbyio commented Dec 6, 2023

@michaelsaah That sounds like an interesting approach to me. I think it is worth someone trying it out. As you said, it really just depends on if k8s will wait to send the SIGTERM until finalizers are removed.

@ibalat

ibalat commented Apr 15, 2024

any update? I have the same problem and I think it can also be related to the autoscaler. Please can you take a look: kubernetes/autoscaler#6679
