
AWS Load Balancer Controller to register new pods first and then deregister old pods #3621

Open
dilipkraghupatruni opened this issue Mar 21, 2024 · 5 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@dilipkraghupatruni

Is your feature request related to a problem?
We are using Argo rollouts for application deployments to EKS.

  • Argo Rollouts performs a blue-green flip by updating the selectorLabels on a Service object.
  • When "promote" is triggered, Argo switches the rollout template hash on the active Service to the pod-rollout-template-hash of the new ReplicaSet (the promoted rollout).
  • When this happens, the aws-load-balancer-controller does a complete swap of all members of the target group.

During the blue-green switch, we are noticing issues related to the AWS Load Balancer Controller when used with an NLB. The NLB takes a lot more time to register new targets compared to an ALB. Because of the current logic in the aws-load-balancer-controller, the old pods are deregistered first, and then the load balancer controller tries to register the new pods. However, the lag between these two steps is significant enough with an NLB to cause application impact. The order of operations is the same with an ALB, but the ALB performs the operations much faster than the NLB, and hence we don't notice any impact.
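To make the ordering concrete, here is a heavily simplified sketch of the behavior described above. The helper names and types are made up for illustration; this is not the controller's actual code from resource_manager.go.

```go
// Package sketch is illustration only; nothing here is the controller's real code.
package sketch

import "context"

// target is a placeholder for a target group member (pod IP and port).
type target struct {
	IP   string
	Port int32
}

// Hypothetical stand-ins for the controller's register/deregister calls.
var (
	deregisterOldTargets = func(ctx context.Context, tgARN string, targets []target) error { return nil }
	registerNewEndpoints = func(ctx context.Context, tgARN string, targets []target) error { return nil }
)

// reconcileTargets mirrors the current order described above: the old pods are
// deregistered first, and only then are the new pods registered. With an NLB,
// the new targets can take minutes to pass health checks, so the target group
// can sit at zero healthy targets for that whole window.
func reconcileTargets(ctx context.Context, tgARN string, oldTargets, newTargets []target) error {
	if err := deregisterOldTargets(ctx, tgARN, oldTargets); err != nil {
		return err
	}
	return registerNewEndpoints(ctx, tgARN, newTargets)
}
```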

Describe the solution you'd like
We would like a feature toggle/config in the load balancer controller to swap these operations, i.e. register the new pods first and then deregister the old pods. https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go#L143-L152
We would also like to be able to configure how long to wait between the two operations, as well as the maximum amount of time to wait on deregistration.
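For illustration only, a sketch of the kind of opt-in toggle and wait configuration being requested could look like the following. The option names and helpers are hypothetical and do not exist in the controller today; with the toggle off, the current order would be preserved.

```go
// Illustration only; these options and helpers do not exist in the controller.
package sketch

import (
	"context"
	"time"
)

// target is a placeholder for a target group member (pod IP and port).
type target struct {
	IP   string
	Port int32
}

// reorderOptions are hypothetical knobs; zero values keep today's behavior.
type reorderOptions struct {
	RegisterBeforeDeregister bool          // swap the order of operations
	WaitBetweenOperations    time.Duration // pause after registering before deregistering
	MaxDeregistrationWait    time.Duration // upper bound on time spent deregistering
}

// Hypothetical stand-ins for the controller's register/deregister calls.
var (
	registerNewEndpoints = func(ctx context.Context, tgARN string, targets []target) error { return nil }
	deregisterOldTargets = func(ctx context.Context, tgARN string, targets []target) error { return nil }
)

// reconcileTargets sketches the requested behavior: register the new pods
// first, wait a configurable amount of time, then deregister the old pods
// under a bounded deadline. With the toggle off, the current order is kept.
func reconcileTargets(ctx context.Context, tgARN string, oldTargets, newTargets []target, opts reorderOptions) error {
	if !opts.RegisterBeforeDeregister {
		// Current behavior: deregister first, then register.
		if err := deregisterOldTargets(ctx, tgARN, oldTargets); err != nil {
			return err
		}
		return registerNewEndpoints(ctx, tgARN, newTargets)
	}

	// Requested behavior: register the new pods first...
	if err := registerNewEndpoints(ctx, tgARN, newTargets); err != nil {
		return err
	}
	// ...wait the configured amount of time (simplified; a real controller
	// would likely requeue rather than block the reconcile loop)...
	time.Sleep(opts.WaitBetweenOperations)
	// ...then deregister the old pods, never waiting longer than the cap.
	deregCtx, cancel := context.WithTimeout(ctx, opts.MaxDeregistrationWait)
	defer cancel()
	return deregisterOldTargets(deregCtx, tgARN, oldTargets)
}
```

Keeping the defaults equal to today's behavior would mean only workloads that opt in (for example NLB-backed blue-green rollouts) pay for the extra wait.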

Describe alternatives you've considered
We have implemented a canary deployment strategy instead of blue-green, and that helped us reduce the impact. But as part of the debugging, we figured out the problem is not specific to NLB. We deal with thousands of applications, the majority of which use a blue-green deployment strategy, so we cannot implement canary for everyone. We would like to fix the problem at its core, which is inside the aws-load-balancer-controller.

@jwenz723
Contributor

The NLB takes a lot more time to register new targets compared to an ALB. Because of the current logic in the aws-load-balancer-controller, the old pods are deregistered first, and then the load balancer controller tries to register the new pods. However, the lag between these two steps is significant enough with an NLB to cause application impact. The order of operations is the same with an ALB, but the ALB performs the operations much faster than the NLB, and hence we don't notice any impact.

Just wanted to clarify that the amount of time between when deregistration executes and when registration executes is negligible. The deregistration and registration events are essentially occurring at the same time (probably just a few milliseconds between these events).

The problem is that the old targets/pods enter a deregistering state in the NLB Target Group at the same time that the new targets/pods enter an initial state in the NLB Target Group, and NLB Target Group registration takes 3 to 5 minutes (see this comment). This means that for 3 to 5 minutes there are 0 healthy targets in the NLB Target Group.

Regarding ALBs, I've observed during my testing that ALB Target Group registration typically takes 10-15 seconds. So during a blue-green deploy with an ALB there is a 10-15 second period when there are 0 healthy targets in the ALB Target Group.

Having a configurable amount of time to wait after target registration and before target deregistration, as mentioned by @dilipkraghupatruni, should help maintain healthy targets in the NLB/ALB Target Group during a blue-green flip.
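One way such a wait could be implemented (in the controller or in an external tool) is to poll the target group until the newly registered targets report healthy before starting deregistration. Below is a minimal sketch using the AWS SDK for Go v2; the function name, polling interval, timeout, and target group ARN are placeholders, not anything that exists in the controller.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
)

// waitForHealthyTargets polls DescribeTargetHealth until at least minHealthy
// targets in the target group report the "healthy" state, or the context expires.
func waitForHealthyTargets(ctx context.Context, client *elbv2.Client, tgARN string, minHealthy int) error {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		out, err := client.DescribeTargetHealth(ctx, &elbv2.DescribeTargetHealthInput{
			TargetGroupArn: aws.String(tgARN),
		})
		if err != nil {
			return err
		}
		healthy := 0
		for _, d := range out.TargetHealthDescriptions {
			if d.TargetHealth != nil && string(d.TargetHealth.State) == "healthy" {
				healthy++
			}
		}
		if healthy >= minHealthy {
			return nil // safe to start deregistering the old targets now
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for healthy targets: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	// NLB target group registration can take 3-5 minutes, so allow up to 10.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := elbv2.NewFromConfig(cfg)

	// Placeholder target group ARN for illustration.
	tgARN := "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/0123456789abcdef"
	if err := waitForHealthyTargets(ctx, client, tgARN, 1); err != nil {
		panic(err)
	}
	fmt.Println("new targets are healthy; old targets can be deregistered")
}
```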

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2024
@sara-reddy-cb

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 28, 2024
@M00nF1sh M00nF1sh added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 15, 2024
@M00nF1sh
Collaborator

The problem here is that it's hard for the controller to decide whether an old target (one that is no longer a valid target) should be removed. In your case, the target should be kept for 2~4 minutes (for an NLB), while in most other cases such targets should be deregistered as soon as possible (as they may belong to other applications or may have already died).

@millermatt

The problem here is that it's hard for the controller to decide whether an old target (one that is no longer a valid target) should be removed. In your case, the target should be kept for 2~4 minutes (for an NLB), while in most other cases such targets should be deregistered as soon as possible (as they may belong to other applications or may have already died).

@M00nF1sh: part of @dilipkraghupatruni's suggestion was to add config to control this behavior. That would take the decision-making out of the controller's hands; instead, the controller would perform the operations in the order specified in the config and at the intervals specified in the config.
