NTH should issue lifecycle heartbeats #493
Comments
Interesting, I hadn't expected nodes to take hours to drain. We can look into adding the heartbeat to cover the long draining case. Thanks for reporting!
I would ask that this be made configurable so that installs that intentionally use the heartbeat timeout as the limiting factor (which is how NTH works now) can still do so. For cases where NTH can't reliably determine if progress is being made (which would probably be most of the time?), the heartbeat would be counterproductive if you want to time out before the global timeout.
We'd be interested in this as well. We have some applications that we run with a graceful termination threshold of up to 3 hours. In most cases the pods terminate gracefully well before the 3 hours are up, but there are cases where it can take close to 3 hours.
Yes, a configurable max period for sending heartbeats would be helpful. NTH could be configured to send heartbeats for a given period (3 hours in our case) and the lifecycle heartbeat timeout could be set to 5 minutes or so. This way the node would be terminated anyway once the heartbeats stop.
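The bounded-heartbeat idea in the comment above could look roughly like the following. This is a minimal Python sketch under stated assumptions, not NTH's implementation (per this thread, NTH does not send heartbeats). The function `heartbeat_while_draining` and its parameters `drain_done`, `send_heartbeat`, `interval_s`, and `max_period_s` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def heartbeat_while_draining(
    drain_done: Callable[[], bool],
    send_heartbeat: Callable[[], None],
    interval_s: float,
    max_period_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Send a heartbeat every interval_s until the drain finishes or
    max_period_s has elapsed. Returns the number of heartbeats sent."""
    start = clock()
    sent = 0
    while not drain_done():
        if clock() - start >= max_period_s:
            # Stop heartbeating; the hook's own heartbeat timeout then
            # expires and the instance is terminated anyway.
            break
        send_heartbeat()
        sent += 1
        sleep(interval_s)
    return sent
```

For the numbers mentioned above, `max_period_s` would be 3 hours (10800) and `interval_s` would need to be comfortably shorter than the 5 minute heartbeat timeout, for example 120. The `sleep` and `clock` parameters are injectable only so the loop can be exercised without real waiting.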
It looks like this issue has a good amount of interest. We would absolutely be open to accepting a PR for this, but right now we are focusing on the next version of NTH (V2). In V2 we hope to eliminate the need for adding so many additional configurations and solve a number of other issues on this repository.
Is there any documentation on V2, and specifically for this issue how it would handle heartbeats?
@stevehipwell Currently, v2 does not issue heartbeats.
@cjerad is that a conscious decision or is it something you'd like to do if you had the resources to implement it?
@stevehipwell It just hasn't been investigated yet.
@cjerad are you looking for contributions?
bump? it's been years since there was a mention of V2
any update on this? 🙏🏽
@bwagner5 This problem affects not only pods that take a long time to terminate but also any workload with many pods where podAntiAffinity and a PDB prevent terminating multiple pods at once. This is a pretty serious limitation that has to be considered before using NTH in large clusters.
I noticed that if NTH receives ASG lifecycle events, spot interruptions and EC2 state change events via SQS, it may still leave spot instances in By implementing sending of heartbeats, we'd get the advantage of setting a low heartbeat timeout (= instances not lingering around forever if NTH isn't running or didn't get an event) but also make use of the ASG lifecycle hook "global timeout" to set the maximum time NTH can keep an instance alive. Not sure if the above spot+ASG case could be much improved as well.
I've been using NTH in queue processor mode. This implementation uses a lifecycle hook associated with the node instance to trigger NTH to cordon and drain the node. Lifecycle hooks support two timeouts: the global timeout (max 48 hours) and the heartbeat timeout (max 7200 seconds).
https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_LifecycleHook.html
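To make the relationship between the two timeouts concrete, here is a small Python sketch. It assumes, based on my reading of the linked LifecycleHook reference, that the heartbeat timeout ranges from 30 to 7200 seconds and that the global timeout is the smaller of 48 hours and 100 times the heartbeat timeout; the 100x rule is not stated in this issue, so please verify it against the reference. The function name `effective_global_timeout` is hypothetical.

```python
MAX_GLOBAL_TIMEOUT_S = 172_800   # 48 hours
MIN_HEARTBEAT_TIMEOUT_S = 30     # assumed lower bound from the API reference
MAX_HEARTBEAT_TIMEOUT_S = 7_200  # 2 hours


def effective_global_timeout(heartbeat_timeout_s: int) -> int:
    """Longest time an instance can stay in the wait state, even if
    heartbeats keep arriving, for a given heartbeat timeout."""
    if not MIN_HEARTBEAT_TIMEOUT_S <= heartbeat_timeout_s <= MAX_HEARTBEAT_TIMEOUT_S:
        raise ValueError("heartbeat timeout must be between 30 and 7200 seconds")
    # Assumption: the global timeout is capped at 100x the heartbeat timeout.
    return min(MAX_GLOBAL_TIMEOUT_S, 100 * heartbeat_timeout_s)
```

Under these assumptions, a 7200 second heartbeat timeout allows the full 48 hours, while a 300 second heartbeat timeout caps the wait at 30000 seconds (a little over 8 hours).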
This means that if NTH doesn't issue lifecycle heartbeats during the draining process, the node will be terminated (assuming CONTINUE rather than ABANDON on timeout) as soon as the hook's heartbeat timeout expires, which is at most 7200 seconds.
This is problematic if you've got termination grace periods that can exceed 7200 seconds. The node will be terminated before the pod can be safely evicted.
If NTH issued lifecycle heartbeats during the node drain, this would effectively support grace periods that extend up to the 48 hour global timeout.
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
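For illustration, a single heartbeat call might look like the sketch below. It assumes a boto3 Auto Scaling client is passed in; `record_lifecycle_action_heartbeat` is the boto3 counterpart of the CLI command linked above. The wrapper name `record_heartbeat` and its argument names are hypothetical, and this is not how NTH is implemented.

```python
from typing import Any


def record_heartbeat(
    autoscaling_client: Any,
    asg_name: str,
    hook_name: str,
    instance_id: str,
) -> None:
    """Reset the heartbeat timeout for one instance's pending lifecycle action."""
    autoscaling_client.record_lifecycle_action_heartbeat(
        AutoScalingGroupName=asg_name,
        LifecycleHookName=hook_name,
        InstanceId=instance_id,
    )
```

With boto3 installed, `autoscaling_client` would typically be created with `boto3.client("autoscaling")`. Each successful call extends the wait by one heartbeat timeout; the global timeout still bounds how long the instance can be kept alive in total.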