Handle ELB instance deregistration #316
Comments
Same issue here. In addition, when the termination handler cordons a node, the node is marked unschedulable and is immediately removed from the load balancer.
Relevant issues: …, and a partial bug fix in 1.19.
I'm definitely interested in looking into this more. I've asked @kishorj, who works on the aws-load-balancer-controller, for his thoughts, since there needs to be a careful dance between the LB controller and NTH in the draining process. There might be more we can do in that controller without involving NTH as much, but if we need to add this logic to NTH, then I'm not opposed.
Hi Brandon, thank you for the quick response. I think an external tool such as NTH is suitable for handling this logic. Even if the Kubernetes contributors solve it internally, it won't cover all cases, such as draining due to spot interruptions, AZ rebalance, or spot recommendations. The current bug of removing cordoned nodes from the load balancer immediately is four years old; even if the service controller is enhanced someday, it could take a long time until we can use it. I really hope to see this functionality in NTH.
Linking the taint-effect issue, since I think that would mitigate this: #273
I'm not sure it would really do what we need. The problem is that draining instances from an ELBv2 load balancer is quite slow (usually 4-5 minutes in our experience), and, at least for our nodes, draining the containers is much, much faster. lifecycle-manager is nice because it polls to make sure the instance is removed from the load balancer before it continues. If I'm reading the taint-effect issue right, it would apply a taint, which could cause an ELB drain to start, but there's nothing that then waits for the drain to finish before the instance is terminated.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
Can we get an update on this? This would be a cool feature!
We are trying to find a solution to the same problem for cluster autoscaler. In 1.18 and earlier Kubernetes versions, cordoning a node used to remove it from LBs. We want to retain similar behaviour on 1.19+; one option is to have cluster autoscaler add the `node.kubernetes.io/exclude-from-external-load-balancers` label to the worker node, or delete the worker node, before terminating it.
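For illustration only (not from the comment above), this is roughly what a node looks like with that well-known exclusion label applied; the node name is a placeholder:

```yaml
# Sketch: a Node carrying the well-known exclusion label.
# The in-tree service controller (1.19+) and the AWS Load Balancer Controller
# skip nodes with this label when registering load balancer targets.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal   # placeholder node name
  labels:
    node.kubernetes.io/exclude-from-external-load-balancers: "true"
```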
With the custom termination policy supported by EC2 Auto Scaling, you can specify a Lambda function that drains the node as well as deregisters it from an ELB. This can be a solution until ELB deregistration is natively supported. See the EC2 Auto Scaling documentation on custom termination policies for more details.
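A rough sketch of that mechanism (not from the comment above; all names, IDs, and the Lambda ARN are placeholders): an Auto Scaling group can list a Lambda function ARN in its termination policies, and that function is invoked during scale-in.

```yaml
# Sketch (CloudFormation): an ASG whose termination policy is a custom Lambda.
# During scale-in, Auto Scaling invokes the function, which decides which
# instances may terminate and (per the comment above) can also drain the node
# and deregister it from its ELB/target groups first.
Resources:
  WorkerAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      MaxSize: "10"
      VPCZoneIdentifier:
        - subnet-aaaa1111                           # placeholder subnet IDs
        - subnet-bbbb2222
      LaunchTemplate:
        LaunchTemplateId: lt-0123456789abcdef0      # placeholder launch template
        Version: "1"
      TerminationPolicies:
        - arn:aws:lambda:us-east-1:111122223333:function:drain-and-deregister  # placeholder ARN
```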
Interested to hear from contributors here whether the solution in #582, which adds the `node.kubernetes.io/exclude-from-external-load-balancers` label, covers this. Does that solve your problem, or do we need to do additional work to support your use cases?
It does not, in our case. The problem is that all the pods can be drained off the node faster than the node can be deregistered from the load balancer. So something like this happens: the pods are all evicted, NTH signals that the node is ready to terminate, and the instance is terminated while the load balancer is still deregistering the target, so requests are routed to an instance that is already gone.
In our experience, and after working with AWS support, the shortest duration we've been able to get load balancer deregistration down to is 2-3 minutes. Meanwhile we can usually evict all pods in less than 1 minute.
@sarahhodne admittedly I haven't done very comprehensive tests, but what I have observed is that if a target in a target group is draining before the associated instance is terminated, then there is a much higher chance that the termination will not result in request errors. In fact, I was not able to cause any request errors in my testing this way. I use the …
We updated cluster autoscaler to add the `node.kubernetes.io/exclude-from-external-load-balancers` label. In addition to that, we also have an ASG lifecycle hook that waits for 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.
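A rough sketch of that setup (illustrative only; the ASG name, ports, and IDs below are placeholders, not from the comment above):

```yaml
# Sketch (CloudFormation): a termination lifecycle hook that holds the instance
# for up to 300 seconds, paired with a target group whose deregistration
# (connection draining) delay is also 300 seconds.
Resources:
  DrainLifecycleHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: worker-asg              # placeholder ASG name
      LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
      HeartbeatTimeout: 300                         # keep the instance alive while draining
      DefaultResult: CONTINUE
  WorkerTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 30080                                   # placeholder NodePort
      Protocol: TCP
      VpcId: vpc-0123456789abcdef0                  # placeholder VPC ID
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: "300"
```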
Do you run it in IP or Instance mode, @tjs-intel?
@kristofferahl I switched from Instance to IP mode because of the general lack of support for node draining by NTH and brupop.
Thanks @tjs-intel! We use IP mode as well, so I was wondering if you could explain your setup a bit further, as it seems you're not having any issues with dropped requests when using aws-load-balancer-controller and NTH? How do you achieve draining before the target/underlying instance is terminated?
Currently, when Karpenter drains and then deletes a Node from the cluster, if that node is registered in a Target Group for an ALB/NLB the corresponding EC2 instance is not removed. This leads to the potential for increased errors when deleting nodes via Karpenter. In order to help resolve this issue, this change adds the well-known `node.kubernetes.io/exclude-from-external-load-balancers` label, which will cause the AWS LB controller to remove the node from the Target Group while Karpenter is draining the node. This is similar to how the AWS Node Termination Handler works (see aws/aws-node-termination-handler#316). In future, Karpenter might be enhanced to wait for a configurable period before deleting the Node and terminating the associated instance, as currently there's a race condition between the Pods being drained off of the Node and the EC2 instance being removed from the target group.
@sarahhodne I think "Remove nodes with Cluster Autoscaler taint from LB backends in service controller" (#105946) fixes the issue.
We found a pretty nice way to handle this with Graceful Node Shutdown and preStop hooks on daemonsets. Essentially you set the kubelet parameters (in our case we use Karpenter, so we specified userData in the EC2NodeClass) as follows:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  userData: |
    #!/bin/bash -xe
    # enable Graceful Node Shutdown by setting the kubelet shutdown grace periods
    echo "$(jq '.shutdownGracePeriod="400s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq '.shutdownGracePeriodCriticalPods="100s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
```

and then deploy a daemonset on all Karpenter nodes with a high terminationGracePeriodSeconds and a preStop hook:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: karpenter-termination-waiter
  namespace: kube-system
  labels:
    k8s-app: karpenter-termination-waiter
spec:
  selector:
    matchLabels:
      name: karpenter-termination-waiter
  template:
    metadata:
      labels:
        name: karpenter-termination-waiter
    spec:
      nodeSelector:
        karpenter.sh/registered: "true"
      containers:
        - name: alpine
          image: alpine:latest
          command: ["sleep", "infinity"]
          # wait for the node to be completely deregistered from the load balancer
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "300"]
          resources:
            limits:
              cpu: 5m
              memory: 10Mi
            requests:
              cpu: 2m
              memory: 5Mi
      priorityClassName: high-priority
      terminationGracePeriodSeconds: 300
```

The node is still running aws-node and kube-proxy behind the scenes, so it can properly direct requests from the load balancer until it's completely drained. It's important that the grace period and the preStop sleep are larger than the deregistration delay on the ALB, so the node isn't terminated before being fully drained.
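For reference (not part of the comment above), if the target groups are managed by the AWS Load Balancer Controller, that deregistration delay can be set via target group attributes; the Ingress and Service names and the values here are illustrative only:

```yaml
# Sketch: setting the target group deregistration (draining) delay on an
# AWS Load Balancer Controller managed Ingress, so it stays below the
# daemonset's terminationGracePeriodSeconds / preStop sleep.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app            # placeholder Ingress name
  annotations:
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=240
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app    # placeholder Service name
                port:
                  number: 80
```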
@TaylorChristie a similar issue with Karpenter is being discussed at aws/karpenter-provider-aws#4673. In your workaround, Karpenter removes the node from the LB during the draining time, and then all pods get deleted. We are waiting for an out-of-the-box solution from Karpenter, but your workaround makes sense to try and use until there is one.
Yep, because of the …
@infa-ddeore, is there any official PR/fix to the CA for adding the `node.kubernetes.io/exclude-from-external-load-balancers` label?
There isn't an official PR for this; our devs made these changes and provided us a custom cluster autoscaler image. I haven't tested this for ALB or with the aws-load-balancer-controller, but I feel the ALB controller must also honor the label. You can try adding the label manually to see whether the node gets removed from the ALB's target group or not.
@infa-ddeore I checked that the ALB is indeed removing the node from the target group when I set the `node.kubernetes.io/exclude-from-external-load-balancers` label.
Hi @deepakdeore2004 and all, I'm writing up my findings here after I was able to resolve the issue without any code changes. Explanation: … To summarize, the parameter effectively causes a delay between the "DeregisterTargets" API call and the "TerminateInstances" API call, letting the ALB gracefully drain the connections.
Thanks for the details @oridool. I see the AWS LB controller understands this, but we use the in-tree controller, which doesn't understand this taint, so the cluster autoscaler customization is needed on our side.
We've noticed in our production environment that we have a need for something to deregister nodes from load balancers as part of the draining procedure, before the instance is terminated. We're currently using lifecycle-manager for this, but it would be nice if this was handled by the AWS Node Termination Handler instead.
The reason this is needed is that if the instance is terminated before it's deregistered from an ELB, a number of connections will fail until the health check starts failing. This is particularly noticeable on ELBv2 (NLB+ALB), which seem to take several minutes to react, so we need to have fairly high timeout times on the health checks.
The behaviour we're looking for is that the node termination handler finds a list of classic ELBs and target groups that it's a member of, sends a deregistration request and then waits for the deregistration to finish before marking the instance as being ready to terminate.