NLB IP Target registration is extremely slow #1834
The NLB target registration can take from 90 to 180 seconds to complete. After registration, the targets are marked healthy only after the configured health check passes. This delay is from the AWS NLB and is currently expected. It is not due to the pod readiness gate configuration. In case of rolling updates to your application, the pod readiness gate helps mitigate the effects of this delay by making sure the existing pods will not be terminated until the newly registered targets show up as healthy. |
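For reference, the pod readiness gate mentioned above is enabled per namespace. A minimal sketch, assuming the controller's documented namespace label (the namespace name is a placeholder):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app  # placeholder namespace name
  labels:
    # Tells the AWS Load Balancer Controller to inject pod readiness
    # gates into pods created in this namespace.
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```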
Ah, thank you. Is there anything at all I can do to help speed that up? |
@abatilo From the controller's perspective, we will add a note about this limitation to our docs. /kind documentation |
@abatilo - I would encourage anyone encountering this issue to reach out to AWS support. They are aware of the issue. AFAIK, it has been an issue for 3+ years (per stackoverflow). The more people that contact them, the more likely it will get fixed. ;) |
Can confirm that I've observed the same behaviour when testing NLB ingress with IP targets @abatilo. |
I can confirm this is still present... feels a bit like AWS is letting people down by delaying a fix for it for so long... |
I don't think they see it as a bug :) This is not related to k8s or the load balancer controller and probably doesn't belong here. If you want NLB to take less than 3 minutes to register targets, tell your AWS support rep! |
@jbg I did, they mentioned this thread in their response... |
If you're contacting AWS support about this, it's probably advisable to demonstrate the issue with an NLB provisioned manually or via CloudFormation, so that first-level support can't point the finger at aws-load-balancer-controller or k8s as the source of the delay. |
@jbg I did, they mentioned they will add a note with my case to the existing issue on the NLB. |
@paul-lupu From the controller's point of view, we cannot do much until NLB improves it. If the registration time is a concern, you can use NLB instance mode (supported by newer versions of this controller as well). If spot instances are being used, you can use node selectors (service.beta.kubernetes.io/aws-load-balancer-target-node-labels) to use non-spot instances as the NLB backend. |
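A hedged sketch of the two mitigations suggested above, using this controller's documented Service annotations (the node label is a placeholder; pick one that actually matches your non-spot nodes):

```yaml
metadata:
  annotations:
    # Instance mode: registers nodes (via NodePort) instead of pod IPs.
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
    # Only register nodes carrying this label as NLB backends.
    service.beta.kubernetes.io/aws-load-balancer-target-node-labels: lifecycle=on-demand  # placeholder label
```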
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
I have the same issue with an NLB fronting ECS containers. Very frustrating. The container is up and receives and responds to health checks in seconds, yet the target group takes minutes to recognize the container as healthy. If containers go down for some reason and need to be re-run, this could leave a major gap in service availability. It makes the NLB problematic to use, but I have to use it in order to do the TCP passthrough I need to the ECS container so that we can implement mTLS at the container. Very frustrating delay. Is there a general NLB ticket/issue that anyone might have a link to that I can help pile on with? |
@bennettellis since it is a problem with AWS internal implementation rather than any open-source component, the best place to "pile on" is your AWS support |
I worked around this with traefik by setting the NLB deregistration timeout, deployment pod grace period, and container grace period to 5 minutes. I also needed to ensure the container health check continued to report success during this in-between time (using a traefik setting). This leaves the old pod in a "terminating but still running" state for 5 minutes to give the NLB time to complete the registration process for the new pod.

NLB service:

```yaml
service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=300
```

If your app doesn't have a feature like this, I think the container delay could also be accomplished with a container lifecycle preStop hook:

```yaml
containers:
  - name: application
    lifecycle:
      preStop:
        exec:
          command: [
            "sh", "-c",
            # Introduce a delay to the shutdown sequence to wait for the
            # pod eviction event to propagate. Then, gracefully shut down.
            "sleep 300 && killall -SIGTERM application",
          ]
```
|
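The "deployment pod grace period" part of the workaround above isn't shown; a minimal sketch, assuming the standard terminationGracePeriodSeconds field is what was meant:

```yaml
spec:
  template:
    spec:
      # Keep the terminating pod around long enough to cover the
      # 5-minute preStop sleep and NLB deregistration delay.
      terminationGracePeriodSeconds: 300
```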
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
Unfortunately, 5 minutes is too long when using spot. /remove-lifecycle stale |
Does anyone have any update on this? Is there a way to determine when the registration is actually completed and the pod starts receiving requests from the NLB? |
@nbourdeau - use pod readiness gates. They are super easy to set up, and essentially your pod will not be considered ready until the LB sees the target as healthy. Unfortunately, deployments are slow, but at least with pod readiness gates they are stable. |
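For illustration, once readiness gate injection is enabled (see the namespace sketch earlier in the thread), the controller adds a condition like the following to each matching pod. The condition type suffix below is hypothetical; the real one is derived from the TargetGroupBinding name:

```yaml
spec:
  readinessGates:
    # Injected by the controller; the suffix here is a hypothetical example.
    - conditionType: target-health.elbv2.k8s.aws/k8s-example-tgb
```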
Well, in my use case this is not really usable, because it is a singleton deployment with an EBS volume mounted and I cannot have 2 pods with the same volume running at the same time... But the strange thing is that the NLB target is marked healthy in the target group, yet there is still a delay before the pod actually starts receiving requests... will that even work with pod readiness gates? |
Is there a way to improve this time? It's taking more than 5 minutes when I use an NLB with the TCP protocol. |
Seems like the answer is no... I contacted AWS support and the answer was: we are working on improving the delay... |
I will comment with more details when time permits, but note you can reduce deregistration to around 45s (from over 3 minutes) by using HTTP(S) instead of TCP health checks, which helps a lot for spot (given the only 2 minutes' notice you get there for reclaim; i.e., you can now get an error-free spot drain). |
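A hedged sketch of switching the NLB health check from TCP to HTTP(S) via this controller's annotations (the path and port are placeholders):

```yaml
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz  # placeholder path
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "8080"    # placeholder port
```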
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
Hi @nicc777, I just tried your changes and the registration doesn't appear to be any faster, so to anyone else who tries it, YMMV. I tested it on v2.5.2 of the aws lb controller under chart version v1.5.3. I tried twice, and both times it still took about ~3 minutes for the readiness gate to pass. |
@abatilo - yes, I was afraid it may have been something specific to just our environment. We are on controller v2.4.6 with Helm chart version 1.4.8 (Kubernetes/EKS v1.24) |
Thank you for sharing anyways! |
I am using the same stack, but changing those settings didn't help. Are you sure you're using an NLB and not an ALB? |
@dorsegal Yes, NLB - I have about 150 of them at peak loads (EKS), each with between 3 and 9 target groups. After further analysis, it seems our problem was that we were in a private VPC and therefore had no connectivity to the WAF API. So it appears our problem was not related at all to what is described here. Provisioning from the EKS AWS Load Balancer Controller takes roughly 3 seconds (on average) per target group - so it's now blistering fast for us. The actual registration of the targets obviously depends on the service itself, but it's acceptable for us (a couple of seconds). |
@nicc777 Care to share your service annotations? |
@dorsegal - here are the annotations we use (service type is `LoadBalancer`):

```yaml
service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: namespace=REDACTED,desiredHostName=REDACTED
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
service.beta.kubernetes.io/aws-load-balancer-name: REDACTED
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/load-balancer-source-ranges: REDACTED
```
|
We're also experiencing unexpectedly long deregistration delays. We have also noticed that the deregistration time depends on how many targets are deregistered concurrently. If we deregister 4 targets at the same time, it takes approximately 2 minutes to complete the deregistration. |
Any ETA on progress or a fix?! It's horrible - what are we spending money for? We must be able to count on reliability! |
I just contacted support - this is not a bug, it's a feature :D:D
Our solution is to put a stable load balancer (e.g. HAProxy) in front of our pods and use DNS service discovery. It also has many features which the NLB lacks. |
Thanks for the feedback @kovaxur. We have also now rolled out our own load balancer. The moment you need more than 120 or so load balancers from EKS (using the AWS Load Balancer Controller), the wheels really start to come off. ELBv2 API calls go through the roof - so much so that API calls start to time out or get throttled (this starts to happen after a couple of days). Our own load balancer turned out to be the path of least resistance. AWS really tripped on this one. |
Having nodes with public IPs to do one's own load balancing is really not a good solution (imo) - ditto reinventing the wheel. So we need this addressed, especially considering the spot termination notice. I have yet to hear a technical explanation for why it takes that long. |
It seems that the NLB doesn't start registering new pods until they pass the Kubernetes-level pod readiness check. If I set a pod readiness check of 300 seconds, new pods don't seem to start the NLB registration process until after Kubernetes has declared them ready. Is there a way to get the NLB to start registering new pods in a deployment before the Kubernetes pod readiness check passes? |
No, it makes no sense, your pods should not receive traffic unless they are ready. |
yeah, now that you mention it, that totally makes sense. I just spent hours trying to figure out how to spin up all the new pods right away so they can start the registration process while also keeping the old pods around. I played with |
hi @nicc777 and @abatilo, we made some improvements to the podReadinessGate in v2.6.1, as mentioned in the release notes. Would you be able to help verify whether the newer version improves your use cases? Thanks |
I still get pretty slow initialization times for the NLB and pod readiness gate - just over 3 minutes with v2.6.2 |
still the same |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
I've configured an NLB with IP targets to point to an instance of Traefik 2 and noticed that when I have pod readiness gates enabled, it might take upwards of 5 minutes for a single target to register and be considered healthy. Is this normal/expected?